
How to use AMD GPUs for model serving in OpenShift AI

October 8, 2024
Vaibhav Jain
Related topics: Artificial intelligence
Related products: Red Hat OpenShift AI

    As artificial intelligence and machine learning (AI/ML) workloads continue to grow, the demand for powerful, efficient hardware accelerators such as GPUs keeps rising. In this article, we will explore how to integrate and utilize AMD GPUs in Red Hat OpenShift AI for model serving. Specifically, we'll dive into how to set up and configure the AMD Instinct MI300X GPU with KServe in OpenShift AI.

    Note

    This is the third article in our series covering OpenShift AI capabilities on various AI accelerators. Catch up on the other parts:

    • How to fine-tune Llama 3.1 with Ray on OpenShift AI
    • How AMD GPUs accelerate model training and tuning with OpenShift AI

    AMD GPU devices used

    For this tutorial, we will focus on AMD's MI300X GPU, a powerful device designed to accelerate machine learning workloads.

    Create an accelerator profile in OpenShift AI

    To begin, you must create an accelerator profile that tells OpenShift AI about the AMD GPU. This profile will ensure that the system recognizes and utilizes the AMD GPU effectively for machine learning tasks.

    You can use the following YAML file to create an accelerator profile:

    apiVersion: dashboard.opendatahub.io/v1
    kind: AcceleratorProfile
    metadata:
      name: amd-gpu
      namespace: redhat-ods-applications
    spec:
      displayName: AMD GPU
      enabled: true
      identifier: amd.com/gpu
      tolerations:
        - effect: NoSchedule
          key: amd.com/gpu
          operator: Exists

    Parameters:

    • displayName: The name shown for the accelerator in the OpenShift AI dashboard (here, AMD GPU).
    • identifier: The resource identifier (amd.com/gpu) that workloads request in order to be scheduled on AMD GPUs.
    • tolerations: Allows the workload to be scheduled on OpenShift nodes that carry the amd.com/gpu taint.
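
    If you prefer to script this step rather than click through the dashboard, the profile can also be created with the Kubernetes Python client. The following is a minimal sketch, not the official workflow: it assumes the YAML above is saved as a hypothetical amd-gpu-accelerator-profile.yaml, that your kubeconfig context can create resources in redhat-ods-applications, and that the kubernetes and PyYAML packages are installed.

    # Minimal sketch: create and verify the AcceleratorProfile custom resource.
    import yaml
    from kubernetes import client, config

    config.load_kube_config()  # use config.load_incluster_config() inside a pod
    api = client.CustomObjectsApi()

    # Load the manifest shown above (hypothetical local file name).
    with open("amd-gpu-accelerator-profile.yaml") as f:
        profile = yaml.safe_load(f)

    api.create_namespaced_custom_object(
        group="dashboard.opendatahub.io",
        version="v1",
        namespace="redhat-ods-applications",
        plural="acceleratorprofiles",
        body=profile,
    )

    # Read it back to confirm the dashboard will see it.
    created = api.get_namespaced_custom_object(
        group="dashboard.opendatahub.io",
        version="v1",
        namespace="redhat-ods-applications",
        plural="acceleratorprofiles",
        name="amd-gpu",
    )
    print(created["spec"]["displayName"], "enabled:", created["spec"]["enabled"])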

    Select the accelerator profile in the dashboard

    Once the accelerator profile is created, you can select this profile in the OpenShift AI dashboard to leverage AMD GPUs for your workloads. Ensure that the profile is enabled and available for selection.

    Configure KServe serving runtimes with AMD GPUs

    In OpenShift AI, vLLM is supported only on the single-model serving platform, which is based on KServe.

    Let’s configure serving runtimes for deploying machine learning models with KServe, a model serving framework that makes it easy to deploy and manage models on Kubernetes clusters.

    HTTP is simpler and widely used for general web-based API requests, while gRPC offers better performance with lower latency and is ideal for high-throughput, real-time applications. You might choose HTTP for ease of integration or gRPC for performance efficiency, depending on your specific use case.

    HTTP ServingRuntime with AMD GPU

    To deploy models over HTTP using the AMD GPU, you can use the following ServingRuntime configuration:

    apiVersion: serving.kserve.io/v1alpha1
    kind: ServingRuntime
    metadata:
      annotations:
        opendatahub.io/recommended-accelerators: '["amd.com/gpu"]'
        openshift.io/display-name: vLLM AMD HTTP ServingRuntime for KServe
      name: vllm-amd-http-runtime
    spec:
      builtInAdapter:
        modelLoadingTimeoutMillis: 90000
      containers:
        - args:
            - '--port=8080'
            - '--model=/mnt/models'
            - '--served-model-name={{.Name}}'
            - '--distributed-executor-backend=mp'
            - '--chat-template=/app/data/template/template_chatml.jinja'
          command:
            - python3
            - '-m'
            - vllm.entrypoints.openai.api_server
          image: 'quay.io/modh/vllm@sha256:2e7f97b69d6e0aa7366ee6a841a7e709829136a143608bee859b1fe700c36d31'
          name: kserve-container
          ports:
            - containerPort: 8080
              name: http1
              protocol: TCP
      multiModel: false
      supportedModelFormats:
        - autoSelect: true
          name: pytorch

    gRPC ServingRuntime with AMD GPU

    For deploying models using gRPC, modify the ServingRuntime configuration as follows:

    apiVersion: serving.kserve.io/v1alpha1
    kind: ServingRuntime
    metadata:
      annotations:
        opendatahub.io/recommended-accelerators: '["amd.com/gpu"]'
        openshift.io/display-name: vLLM AMD GRPC ServingRuntime for KServe
      name: vllm-amd-grpc-runtime
      namespace: tgismodel-granite-8b-code
    spec:
      builtInAdapter:
        modelLoadingTimeoutMillis: 90000
      containers:
        - args:
            - '--port=8080'
            - '--model=/mnt/models'
            - '--served-model-name={{.Name}}'
            - '--distributed-executor-backend=mp'
            - '--chat-template=/app/data/template/template_chatml.jinja'
          command:
            - python3
            - '-m'
            - vllm_tgis_adapter
          image: 'quay.io/modh/vllm@sha256:2e7f97b69d6e0aa7366ee6a841a7e709829136a143608bee859b1fe700c36d31'
          name: kserve-container
          ports:
            - containerPort: 8033
              name: h2c
              protocol: TCP
      multiModel: false
      supportedModelFormats:
        - autoSelect: true
          name: pytorch

    Parameters:

    • supportedModelFormats: The configurations above support PyTorch models; adjust this to match your model format.
    • image: As part of the developer preview of AMD GPU support in the OpenShift AI serving stack, Red Hat published the container image quay.io/modh/vllm@sha256:2e7f97b69d6e0aa7366ee6a841a7e709829136a143608bee859b1fe700c36d31. In this example, we use vLLM for LLM inference and serving.
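
    If you keep these runtime manifests in version control, they can be applied programmatically in the same way as the accelerator profile. This is a minimal sketch under a few assumptions: both manifests are saved together in a hypothetical vllm-amd-runtimes.yaml (separated by ---), and tgismodel-granite-8b-code is the data science project where your models will be served.

    # Minimal sketch: register both vLLM ServingRuntimes in the target project.
    import yaml
    from kubernetes import client, config

    config.load_kube_config()
    api = client.CustomObjectsApi()

    with open("vllm-amd-runtimes.yaml") as f:  # hypothetical file name
        for runtime in yaml.safe_load_all(f):
            if not runtime:
                continue  # skip empty documents
            namespace = runtime.get("metadata", {}).get(
                "namespace", "tgismodel-granite-8b-code"
            )
            api.create_namespaced_custom_object(
                group="serving.kserve.io",
                version="v1alpha1",
                namespace=namespace,
                plural="servingruntimes",
                body=runtime,
            )
            print("Created ServingRuntime", runtime["metadata"]["name"])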

    Configure inference services

    Once your serving runtimes are set up, you can create an InferenceService for serverless or raw deployment modes. Below are the configurations for both modes.

    Serverless deployment automatically scales based on demand and is ideal for dynamic workloads with fluctuating traffic, while raw deployment offers more control over resource management and is better suited for stable, predictable workloads requiring fine-tuned configurations.

    Serverless InferenceService

    To create an inference service in serverless mode, use the following InferenceService configuration:

    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: '3000'
        serving.knative.openshift.io/enablePassthrough: 'true'
        serving.kserve.io/deploymentMode: Serverless
        sidecar.istio.io/inject: 'true'
      name: granite-8b-code
      namespace: tgismodel-granite-8b-code
    spec:
      predictor:
        minReplicas: 1
        model:
          env:
            - name: HF_HUB_CACHE
              value: /tmp
            - name: TRANSFORMERS_CACHE
              value: $(HF_HUB_CACHE)
            - name: DTYPE
              value: float16
          modelFormat:
            name: pytorch
          name: ''
          resources:
            limits:
              amd.com/gpu: '1'
            requests:
              memory: 40Gi
          runtime: vllm-runtime # replace with the name of the ServingRuntime created above, e.g., vllm-amd-http-runtime
          storageUri: 's3://ods-ci-wisdom/granite-8b-code-base'

    Raw deployment InferenceService

    To create an inference service for raw deployment, modify the InferenceService configuration as follows:

    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: '3000'
        serving.knative.openshift.io/enablePassthrough: 'true'
        serving.kserve.io/deploymentMode: RawDeployment
        sidecar.istio.io/inject: 'true'
      name: granite-8b-code
      namespace: tgismodel-granite-8b-code
    spec:
      predictor:
        minReplicas: 1
        model:
          env:
            - name: HF_HUB_CACHE
              value: /tmp
            - name: TRANSFORMERS_CACHE
              value: $(HF_HUB_CACHE)
            - name: DTYPE
              value: float16
          modelFormat:
            name: pytorch
          name: ''
          resources:
            limits:
              amd.com/gpu: '1'
            requests:
              memory: 40Gi
          runtime: vllm-runtime # replace with the name of the ServingRuntime created above, e.g., vllm-amd-http-runtime
          storageUri: 's3://ods-ci-wisdom/granite-8b-code-base'
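
    After applying either InferenceService, you can wait for KServe to report it as ready and read the endpoint URL from its status. The following is a minimal sketch using the Kubernetes Python client; the name and namespace match the manifests above, and the timeout is an arbitrary choice.

    # Minimal sketch: poll the InferenceService until its Ready condition is True,
    # then print the serving URL from status.url.
    import time
    from kubernetes import client, config

    config.load_kube_config()
    api = client.CustomObjectsApi()

    def wait_for_inference_service(name, namespace, timeout=600):
        deadline = time.time() + timeout
        while time.time() < deadline:
            isvc = api.get_namespaced_custom_object(
                group="serving.kserve.io",
                version="v1beta1",
                namespace=namespace,
                plural="inferenceservices",
                name=name,
            )
            conditions = isvc.get("status", {}).get("conditions", [])
            if any(c.get("type") == "Ready" and c.get("status") == "True"
                   for c in conditions):
                return isvc["status"].get("url")
            time.sleep(10)
        raise TimeoutError(f"{name} was not Ready after {timeout} seconds")

    print(wait_for_inference_service("granite-8b-code", "tgismodel-granite-8b-code"))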

    AMD runtime considerations

    Compared to NVIDIA-based runtimes, the AMD runtime typically requires more memory to run efficiently. In the examples above, note that 40Gi of memory is requested; the exact amount varies with model size and complexity.

    Tested models

    Here are some of the models we tested using AMD GPUs:

    • granite-8b-code-base
    • Meta-Llama-3.1-8B models
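
    Once one of these models is running behind the HTTP runtime, you can exercise it with a simple completion request. This sketch assumes the vLLM OpenAI-compatible API exposed by the HTTP ServingRuntime above; the base URL is a placeholder for the value reported in the InferenceService status, the model name is assumed to match the value that --served-model-name resolves to, and you may need an Authorization header or custom CA settings depending on how the endpoint is secured.

    # Minimal sketch: send a test prompt to the vLLM OpenAI-compatible endpoint.
    import requests

    # Placeholder: use the URL from the InferenceService status instead.
    base_url = "https://granite-8b-code-tgismodel-granite-8b-code.apps.example.com"

    response = requests.post(
        f"{base_url}/v1/completions",
        json={
            "model": "granite-8b-code",   # assumed value of --served-model-name
            "prompt": "def fibonacci(n):",
            "max_tokens": 64,
            "temperature": 0.2,
        },
        timeout=60,
    )
    response.raise_for_status()
    print(response.json()["choices"][0]["text"])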

    AMD GPU ServingRuntime image

    The vLLM serving runtime image for AMD GPUs is available at:

    quay.io/opendatahub/vllm@sha256:3a84d90113cb8bfc9623a3b4e2a14d4e9263e2649b9e2e51babdbaf9c3a6b1c8

    This image is tailored for serving models on AMD GPUs and includes optimizations for performance and compatibility.

    Conclusion

    Integrating AMD GPUs into your Red Hat OpenShift AI environment offers an efficient way to accelerate AI/ML workloads. By setting up the proper accelerator profiles, serving runtimes, and inference services, you can unlock the full potential of AMD’s MI300X GPUs. Whether you're deploying models over HTTP or gRPC, in serverless or raw deployment mode, the flexibility of KServe with AMD GPUs ensures smooth, scalable AI model serving.

    Happy GPU computing!

    For more information on Red Hat OpenShift AI, visit the OpenShift AI product page.
