vLLM V1: Accelerating multimodal inference for large language models

How vLLM V1 drives enhanced support for multimodal LLMs

February 27, 2025
Michael Goin, Addie Stevens, and Roger Wang (Senior Machine Learning Engineer, ML Platform at Roblox)
Related topics:
Artificial intelligence, Data Science, Open source
Related products:
Red Hat AI

    This blog recaps the February 6th vLLM Office Hour, where host Michael Goin was joined by Roger Wang, a vLLM committer from Roblox, to discuss the new multimodal capabilities in vLLM V1.

    In the AI space, efficient inference isn’t just about speed; it’s about flexibility, scalability, and the ability to seamlessly handle diverse data modalities—beyond just text. vLLM has emerged as the open source standard for serving language model inference, supporting models from Hugging Face and more across a wide array of hardware. With robust support for GPUs, TPUs, and even CPUs, vLLM is paving the way for next-generation multimodal applications.

    In this article, we dive into the innovations behind vLLM V1 (V1 Alpha), which addresses the challenges of multimodal inference encountered in V0. We’ll explore the design decisions that enhance performance, from encoder caching to optimized data processing, and share benchmark results that highlight the improvements. Finally, we’ll outline our vision for future work to further push the boundaries of efficient, scalable AI.

    About vLLM

    vLLM is the go-to open source model serving framework for LM inference. Its design emphasizes:

    • Speed and ease of use: vLLM works out-of-the-box with models from Hugging Face and supports dozens of key models.
    • Hardware versatility: Built on PyTorch, vLLM isn’t limited to NVIDIA GPUs. It extends support to AMD GPUs, Google TPUs, AWS Accelerators, Intel accelerators, and even CPUs.
    • Beyond text-only models: Today’s applications demand multimodal capabilities. vLLM now supports not only text but also images, audio, and video inputs—enabling tasks like document parsing, object recognition, video understanding, and computer use.
    • Advanced inference optimizations: With features like quantization, chunked prefill, and prefix caching, vLLM is continually optimized for both high-throughput and low-latency inference.
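
    To make this concrete, here is a minimal offline-inference sketch with vLLM's Python API. It is a hedged example: the model name is only an illustration, and any supported Hugging Face model (including multimodal ones) can be substituted.

    from vllm import LLM, SamplingParams

    # Load any supported Hugging Face model; this name is only an example.
    llm = LLM(model="Qwen/Qwen2-VL-7B-Instruct")
    sampling_params = SamplingParams(temperature=0.7, max_tokens=128)

    # Generate completions for a batch of prompts.
    outputs = llm.generate(
        ["Summarize chunked prefill in one sentence."],
        sampling_params,
    )
    for output in outputs:
        print(output.outputs[0].text)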

    Learn more: Meet vLLM: For faster, more efficient LLM inference and serving

    Overview of large multimodal models

    Modern large multimodal models typically leverage a decoder-only language model (LM) backbone paired with an encoder for non-text modalities. In practice, when you provide an image or audio clip, it’s first transformed into embeddings by a dedicated encoder. These embeddings are then merged with text embeddings and fed into the decoder LM.

    For example:

    • LLaVA: Uses CLIP to encode images into embeddings before merging them with text (see Figure 1).
    • Qwen2-audio: Uses a Whisper audio encoder to process audio inputs, which are then merged with text embeddings for decoding.
    Figure 1: LLaVA architecture.
    Source: https://encord.com/blog/llava-large-language-vision-assistant/

    vLLM’s flexible architecture now supports this diverse range of inputs, setting the stage for richer, more capable multimodal applications.
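
    As a rough illustration of that merge step (a hedged sketch, not vLLM's actual implementation), the decoder's input can be viewed as the text embeddings with the encoder outputs scattered into the placeholder positions:

    import torch

    def merge_multimodal_embeddings(text_emb: torch.Tensor,
                                    image_emb: torch.Tensor,
                                    placeholder_mask: torch.Tensor) -> torch.Tensor:
        """Scatter encoder outputs into the placeholder slots of the text sequence.

        text_emb:         (seq_len, hidden) embeddings of the tokenized prompt
        image_emb:        (num_patches, hidden) output of the vision/audio encoder
        placeholder_mask: (seq_len,) bool, True where placeholder tokens (e.g., <image>) sit
        """
        merged = text_emb.clone()
        merged[placeholder_mask] = image_emb  # requires num_patches == placeholder_mask.sum()
        return merged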

    What went wrong in vLLM V0

    While vLLM V0 set the foundation, it wasn’t without limitations, especially when dealing with multimodal inputs.

    Chunked prefill challenges

    Chunked prefill allows prompts to be partially prefilled so that long requests don’t block the entire decoding process of existing requests. For example, with three incoming requests (R1, R2, R3), R1 and R2 might be fully prefilled, while only a portion of R3 is prefilled initially. This staggered approach, illustrated in Figure 2, keeps latency in check.

    Figure 2: A simplified diagram of chunked prefill running 3 prompts (R1, R2, R3) under a 10-token budget, illustrating staggered prefill and embedding challenges.
    Source: https://docs.google.com/presentation/d/1SZOJ1lCOj6BpHcwqCMcRNfjCNvEPInv8/edit#slide=id.p1

    However, multimodal embeddings are continuous by nature and cannot be broken into discrete tokens to be incrementally produced. If an image produces 10 embeddings but only 2 tokens are reserved in a prefill chunk, a shape mismatch occurs. Early designs assumed a direct merge into text embeddings, which proved problematic.
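
    A toy sketch of the mismatch (hypothetical numbers, mirroring Figure 2's 10-token budget): a scheduler that splits prompts purely by token count leaves R3 with only a couple of slots, while its image expands into a block of continuous embeddings that cannot be emitted incrementally.

    # Toy illustration of the V0 problem under a 10-token prefill budget.
    TOKEN_BUDGET = 10

    # Text-token counts per request; R3 also carries an <image> worth 10 embeddings.
    requests = {"R1": 4, "R2": 4, "R3": 3 + 10}

    budget = TOKEN_BUDGET
    for name, length in requests.items():
        scheduled = min(length, budget)
        budget -= scheduled
        print(f"{name}: scheduled {scheduled}/{length} positions this step")

    # R1 and R2 are fully prefilled, but R3 gets only 2 of its 13 positions.
    # Its 10 image embeddings arrive as one continuous block, so they cannot be
    # split to fit those 2 slots -> the shape mismatch described above.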

    Prefix caching limitations

    In V0, prefix caching was based solely on token IDs. For multimodal inputs, where placeholder tokens (e.g., <image>) are identical across requests, this led to cache collisions. Different images sharing the same placeholder would mistakenly trigger cached results, compromising correctness.
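
    The failure mode can be sketched as follows (hypothetical hashing, not vLLM's internal code): with a key derived from token IDs alone, two requests with different images but the same <image> placeholder map to the same cache entry.

    import hashlib

    def v0_block_hash(token_ids):
        # V0-style key: token IDs only.
        return hashlib.sha256(repr(token_ids).encode()).hexdigest()

    # Different images, but the same <image> placeholder token ID (99 here).
    request_with_image_a = [1, 2, 99, 99, 99]
    request_with_image_b = [1, 2, 99, 99, 99]

    # Identical keys -> request B would wrongly reuse request A's cached results.
    assert v0_block_hash(request_with_image_a) == v0_block_hash(request_with_image_b)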

    Innovations in vLLM V1

    vLLM V1 introduces several key improvements to overcome these challenges.

    1. Encoder cache and encoder-aware scheduler

    The challenge: Repeatedly regenerating multimodal embeddings for every prefill operation can be inefficient, especially when a single image may generate thousands of embeddings (e.g., Pixtral produces 4096 embeddings for a single 1024x1024 image).

    The V1 solution:

    • Encoder cache: Multimodal embeddings are computed once and stored directly on the GPU.
    • Encoder-aware scheduler: The scheduler tracks the positions of multimodal embeddings within each request. When merging with text embeddings, it retrieves cached data, eliminating redundant encoder execution. See Figure 3.
    Figure 3: Flowchart illustrating how the encoder cache and scheduler work together to streamline multimodal inference.
    Source: https://docs.google.com/presentation/d/1SZOJ1lCOj6BpHcwqCMcRNfjCNvEPInv8/edit#slide=id.p1

    Why a GPU cache? 

    Transferring tensors to and from CPU memory is often more expensive than re-executing the encoder. Keeping the cache on the GPU minimizes latency.
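
    A hedged sketch of the idea (names and structure are illustrative, not vLLM's code): embeddings are computed once per multimodal input, kept on the GPU, and looked up by a content hash whenever the scheduler needs them.

    import torch

    class EncoderCache:
        """Illustrative on-GPU cache of multimodal embeddings, keyed by content hash."""

        def __init__(self, encoder, device="cuda"):
            self.encoder = encoder
            self.device = device
            self._cache = {}  # content hash -> embeddings tensor kept on the GPU

        def get(self, content_hash: str, pixel_values: torch.Tensor) -> torch.Tensor:
            if content_hash not in self._cache:
                with torch.no_grad():
                    self._cache[content_hash] = self.encoder(pixel_values.to(self.device))
            return self._cache[content_hash]  # no CPU round trip on reuse

        def free(self, content_hash: str) -> None:
            # Evict once the scheduler has consumed every position of this input.
            self._cache.pop(content_hash, None)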

    2. Enhanced prefix caching with metadata

    To address the shortcomings of token-ID–based caching, V1 incorporates additional metadata, such as hashes of images or audio chunks, into the caching mechanism (Figure 4). This ensures that even if placeholder tokens are identical, the underlying multimodal content is correctly distinguished.

    Figure 4: Schematic showing how metadata enhances prefix caching for multimodal data.
    Source: https://docs.google.com/presentation/d/1SZOJ1lCOj6BpHcwqCMcRNfjCNvEPInv8/edit#slide=id.p1
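
    Continuing the earlier sketch (again hypothetical code, not vLLM's internals), a V1-style key folds a content hash of the multimodal data into the block hash, so identical placeholders with different images no longer collide:

    import hashlib

    def v1_block_hash(token_ids, mm_hashes):
        # V1-style key: token IDs plus content hashes of the attached images/audio.
        payload = repr(token_ids).encode() + "".join(mm_hashes).encode()
        return hashlib.sha256(payload).hexdigest()

    image_a = hashlib.sha256(b"raw bytes of image A").hexdigest()
    image_b = hashlib.sha256(b"raw bytes of image B").hexdigest()

    # Same placeholder tokens, different images -> distinct cache entries.
    assert v1_block_hash([1, 2, 99, 99, 99], [image_a]) != \
           v1_block_hash([1, 2, 99, 99, 99], [image_b])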

    3. Optimized multimodal data processing

    In V0, converting raw data (e.g., PIL images) to tensors was a blocking CPU operation, often stalling GPU kernels. V1 tackles this by decoupling the processes:

    • Process 0 (CPU): Handles input processing and raw data conversion.
    • Process 1 (GPU): Executes the forward pass independently.

    This asynchronous pipeline ensures that heavy CPU operations do not block GPU performance, leading to significant latency reductions.
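
    The sketch below captures the spirit of that split (the worker functions are stand-ins, not vLLM's actual process architecture): a CPU process converts raw inputs and feeds a queue, while a separate process drains the queue and runs the forward pass.

    import multiprocessing as mp
    import time

    def convert_to_tensors(item):
        # Stand-in for CPU-heavy work such as PIL-image-to-tensor conversion.
        time.sleep(0.01)
        return {"pixels": item}

    def run_forward_pass(tensors):
        # Stand-in for the GPU forward pass.
        print("forward pass on", tensors["pixels"])

    def preprocess_worker(raw_inputs, queue):
        # Process 0 (CPU): input processing and raw data conversion.
        for item in raw_inputs:
            queue.put(convert_to_tensors(item))
        queue.put(None)  # sentinel: no more work

    def gpu_worker(queue):
        # Process 1 (GPU): runs independently, never blocked by preprocessing.
        while (tensors := queue.get()) is not None:
            run_forward_pass(tensors)

    if __name__ == "__main__":
        queue = mp.Queue(maxsize=8)
        p0 = mp.Process(target=preprocess_worker, args=(["img_0", "img_1", "img_2"], queue))
        p1 = mp.Process(target=gpu_worker, args=(queue,))
        p0.start(); p1.start()
        p0.join(); p1.join()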

    4. Multimodal feature caching

    Beyond prefix caching, V1 introduces feature caching for raw data conversion:

    • Dual mirror caches: Both CPU and GPU processes maintain mirrored caches on CPU memory, minimizing data transfers.
    • Efficient hashing: Using consistent hashes for raw data allows the system to skip redundant conversions, improving throughput in both online and offline scenarios.
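
    In spirit (a hedged sketch with a hypothetical convert callable), feature caching keys the processed tensors by a hash of the raw bytes, so a repeated image or audio clip skips conversion entirely:

    import hashlib

    _feature_cache = {}  # raw-data hash -> processed tensors (mirrored in both processes)

    def get_features(raw_bytes: bytes, convert):
        """Return processed features, converting only when the raw data is new."""
        key = hashlib.sha256(raw_bytes).hexdigest()
        if key not in _feature_cache:
            _feature_cache[key] = convert(raw_bytes)  # e.g., decode + resize + to-tensor
        return _feature_cache[key]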

    Benchmark results

    vLLM V1’s improvements have been validated across two key scenarios: online serving and offline inference.

    Online serving

    Using the Qwen2-VL 7B model on the VisionArena dataset—a real-world vision QA benchmark—vLLM V1 demonstrates:

    • Low latency at high QPS: While differences are subtle at low QPS, at higher throughput, V1 significantly outperforms V0, as shown in Figure 5.
    • Competitive edge: When compared with other open source alternatives, V1 maintains superior performance in high QPS regimes.
    Figure 5: Latency vs. QPS comparison for vLLM V0, V1, and a leading open source alternative.
    Source: https://docs.google.com/presentation/d/1SZOJ1lCOj6BpHcwqCMcRNfjCNvEPInv8/edit#slide=id.p1
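
    For readers who want to try a comparable online setup, the sketch below shows one common pattern (hedged assumptions: a server started separately, for example with vllm serve Qwen/Qwen2-VL-7B-Instruct, the default local endpoint, and a placeholder image URL). It is not the benchmark harness behind Figure 5, just a way to send a multimodal request to a running vLLM server via its OpenAI-compatible API.

    from openai import OpenAI

    # Assumes a vLLM OpenAI-compatible server is already running locally,
    # e.g. started with: vllm serve Qwen/Qwen2-VL-7B-Instruct
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    response = client.chat.completions.create(
        model="Qwen/Qwen2-VL-7B-Instruct",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                # Placeholder URL; replace with a real, reachable image.
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
            ],
        }],
    )
    print(response.choices[0].message.content)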

    Offline inference

    For offline inference, we benchmarked the Molmo-72B model on the MMMU Pro Vision dataset using 4x H100 GPUs. Figure 6 shows the following results:

    • Throughput gains: vLLM V1, even without caching, shows around a 40% performance boost over V0. 
    • Caching benefits: With both prefix and feature caching enabled, scenarios with repeated requests (up to 100% repeat) experience dramatic throughput improvements. Even for unique prompts, the overhead is minimal compared to the benefits.
    Figure 6: Throughput improvements in offline inference under varying request repeat conditions.
    Source: https://docs.google.com/presentation/d/1SZOJ1lCOj6BpHcwqCMcRNfjCNvEPInv8/edit#slide=id.p1

    Conclusion

    vLLM V1 marks a pivotal upgrade in serving large, multimodal language models. By addressing the challenges of chunked prefill, enhancing caching mechanisms, and optimizing data processing pipelines, V1 delivers lower latency, higher throughput, and robust performance across diverse hardware platforms.

    Neural Magic (now part of Red Hat) is proud to be a top commercial contributor to vLLM, driving these innovations forward and empowering the community with open, efficient, and scalable AI solutions. We invite you to explore vLLM V1, experiment with our open source tools, and join us in shaping the future of multimodal inference.

    For more information and to get started with vLLM, visit the GitHub repository. The vLLM documentation covers its multimodal model support in more detail.

    Feel free to reach out with questions or share your feedback on the vLLM Slack workspace as we continue to evolve vLLM!

    Last updated: March 31, 2025
