Fly Eagle(3) fly: Faster inference with vLLM & speculative decoding

Accelerate AI inference with vLLM's Eagle 3 integration

July 1, 2025
Alexandre Marques, Megan Flynn, Mark Kurtz
Related topics: Artificial intelligence
Related products: Red Hat AI


    Key insights

    • vLLM now supports Eagle 3 for speculative decoding, boosting inference performance by up to 2.5X across diverse scenarios.
    • Speculative decoding performance is highly dependent on both request content and request rate, with the most significant gains in synchronous use cases.
    • Only a few speculative models are currently available, and only for shorter context lengths; however, more are coming soon!

    Large language models (LLMs) generate tokens autoregressively: they run a forward pass, sample a token, append it to the input, and repeat until completion. This causes high latency because each step must wait for the previous token, and at low batch sizes in particular, memory movement can take more time than the computation itself.
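
    As a minimal illustration of that loop (using Hugging Face Transformers and a small model for clarity, not vLLM's optimized serving path, and with no KV cache, so every step redoes the full forward pass):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")                   # tiny model, purely for illustration
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    ids = tok("Speculative decoding is", return_tensors="pt").input_ids
    with torch.no_grad():
        for _ in range(16):                                       # one full forward pass per generated token
            logits = model(ids).logits
            next_id = logits[:, -1].argmax(dim=-1, keepdim=True)  # greedy: pick the most likely next token
            ids = torch.cat([ids, next_id], dim=-1)               # append it to the input and repeat
    print(tok.decode(ids[0]))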

    If we could parallelize the forward pass to generate multiple tokens simultaneously, we would perform roughly the same number of floating-point operations (FLOPs) per token while fully leveraging the available GPU compute, with roughly the same memory movement as a forward pass that only generates one token. However, simply parallelizing the token generation for a standard LLM will break the step-by-step dependencies, resulting in different predictions and lower accuracy.

    Speculative decoding: A solution for faster LLMs

    This is where speculative decoding comes in! We use a smaller, “speculative” model to generate k draft tokens, and then use the original "verifier" model to check whether those tokens are correct. We probabilistically accept or reject tokens according to the algorithm in Fast Inference from Transformers via Speculative Decoding, which guarantees that the generated tokens come from the same distribution as the original model. If the tokens we draft have a high probability of acceptance, we can achieve large speedups, especially if we can draft them at a low cost. 
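
    The accept/reject rule itself is simple. Here is a rough sketch (illustrative only, not vLLM's implementation), where p_target and q_draft stand for the verifier's and speculator's probability distributions at each drafted position:

    import numpy as np

    rng = np.random.default_rng(0)

    def accept_or_reject(drafted, p_target, q_draft):
        """Sketch of the speculative sampling rule: drafted[i] is the proposed token at draft
        position i; p_target[i] and q_draft[i] are the verifier's and speculator's distributions there."""
        accepted = []
        for i, tok in enumerate(drafted):
            if rng.random() < min(1.0, p_target[i][tok] / q_draft[i][tok]):
                accepted.append(tok)                               # kept tokens follow the verifier's distribution
            else:
                residual = np.maximum(p_target[i] - q_draft[i], 0.0)
                accepted.append(int(rng.choice(len(residual), p=residual / residual.sum())))
                break                                              # everything after the first rejection is discarded
        return accepted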

    However, if the verifier does not accept the tokens, then we waste compute both drafting and verifying tokens that are not used. Therefore, we have a tradeoff between the number of tokens we can potentially generate in parallel and the likelihood that those tokens will be accepted. The optimal point along that tradeoff depends heavily on the specific use case. 

    Eagle 3 is the current state-of-the-art method for speculative decoding. It uses a single transformer layer to generate multiple draft tokens autoregressively based on the original model's current state. Specifically, it reuses the feature outputs from specific layers of the original model and concatenates them with the token embeddings to augment its generation. This allows it to accurately predict the verifier's next tokens, even with a very small speculator. Because it depends on those internal features, however, an Eagle speculator only works with the model it was trained for.
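
    As a purely conceptual sketch of that design (made-up sizes and module choices, not the EAGLE-3 reference code or vLLM's implementation): fuse the hidden states reused from the verifier, concatenate them with the token embeddings, and run a single transformer layer to produce draft logits.

    import torch
    import torch.nn as nn

    class Eagle3StyleDrafter(nn.Module):
        """Conceptual sketch only: reuse verifier features plus embeddings, then one transformer layer."""
        def __init__(self, hidden=1024, vocab=128256, n_reused_layers=3, n_heads=8):
            super().__init__()
            self.fuse = nn.Linear(n_reused_layers * hidden, hidden)   # combine features reused from the verifier
            self.block = nn.TransformerEncoderLayer(d_model=2 * hidden, nhead=n_heads, batch_first=True)
            self.head = nn.Linear(2 * hidden, vocab)                  # logits for the next draft token

        def forward(self, reused_features, token_embeddings):
            # reused_features: [batch, seq, n_reused_layers * hidden], taken from the verifier's intermediate layers
            x = torch.cat([self.fuse(reused_features), token_embeddings], dim=-1)
            return self.head(self.block(x))                           # (causal masking omitted for brevity)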

    Eagle 3 with vLLM

    Research implementations are generally very limited, so the vLLM community has been working hard to introduce both Eagle 1 and Eagle 3 into vLLM as of version 0.8.5. The newest version (0.9.1) supports CUDA graphs for Eagle 1+3 and provides speculative decoding metrics, including the draft acceptance rate, per-position acceptance rates, and the mean acceptance length (the average number of tokens generated per forward pass of the verifier model).

    With the latest releases of vLLM, you can serve an Eagle 3 speculator for Llama 3.3 70B with the following command:

    VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.3-70B-Instruct \
    --seed 42 \
    -tp 4 \
    --speculative-config '{"model": "yuhuili/EAGLE3-LLaMA3.3-Instruct-70B", "num_speculative_tokens": 3, "method":"eagle3", "draft_tensor_parallel_size":1}'  
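
    Because vLLM serves an OpenAI-compatible API, you can then query the endpoint with a standard client. For example, this minimal chat request targets the server started above (the api_key value is unused by vLLM but required by the client library):

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    response = client.chat.completions.create(
        model="meta-llama/Llama-3.3-70B-Instruct",
        messages=[{"role": "user", "content": "Summarize speculative decoding in two sentences."}],
        max_tokens=128,
    )
    print(response.choices[0].message.content)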

    Everything else works exactly as before, just with a speed boost. As shown in Figure 1, incorporating Eagle 3 speculative decoding significantly reduces end-to-end text generation latency compared to the baseline model.

    Figure 1: A text generation example comparing a prompt run through a baseline model and the same model with Eagle 3 speculative decoding, showing lower end-to-end latency with Eagle 3.

    Real-world performance

    Eagle can provide significant speedups in various scenarios. Let's examine several variables that can influence performance in a specific use case.

    Varying request rates

    Some servers can have lower usage, meaning fewer requests per second, and prioritize the latency (the time it takes for a response) of the request, while others might be used more heavily with many requests per second and therefore require higher throughput. Speculative decoding is fundamentally a tradeoff—we spend a little bit of extra compute to reduce memory movement. At low request rates, we are memory-bound, so reducing memory movement can really help with latency. However, at higher throughputs or batch sizes, we are compute-bound, and speculative decoding can provide worse performance. 

    We use GuideLLM, a popular LLM benchmarking toolkit, to evaluate two Eagle speculators trained by the Eagle team, one for Llama 3.1 8B and one for Llama 3.3 70B, across a range of request rates on the MT Bench dataset. MT Bench comprises two-turn questions that emulate a chat environment, for example asking the model to write a travel blog about a trip to Hawaii.
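
    GuideLLM handles the request-rate sweeps, dataset loading, and reporting for these runs. As a rough illustration of what is being measured (not GuideLLM itself, just a minimal synchronous probe against the server started earlier, with a made-up prompt standing in for MT Bench), you could time requests like this:

    import time
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    prompts = ["Write a travel blog post about a recent trip to Hawaii."] * 8   # stand-in for MT Bench prompts

    latencies = []
    for prompt in prompts:                            # synchronous: one request at a time (the low-rate regime)
        start = time.perf_counter()
        client.chat.completions.create(
            model="meta-llama/Llama-3.3-70B-Instruct",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=1024,                          # cap the generation length for consistent timing
        )
        latencies.append(time.perf_counter() - start)

    print(f"mean request latency: {sum(latencies) / len(latencies):.2f} s")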

    For this test, the 8B model is served on a single A100, and the 70B model is served on four A100s. Additionally, we limit the generation length to 1024 tokens for consistent testing. As shown in Figure 2, for a given request rate (x-axis), the latency of each request is reduced by up to 1.8X for the 8B model. The latency for each request of the 70B model is reduced by up to 1.6X at low request rates, but latency increases at higher request rates due to compute saturation.

    Figure 2: Requests per second vs. request latency with and without Eagle 3 speculative decoding for Llama 3.1 8B (left) and Llama 3.3 70B (right), highlighting when speculative decoding helps and when it can hurt performance.

    Tailoring performance 

    The more accurately a speculative model can predict the next tokens, the greater the speedup. Eagle was primarily trained on chat data and tends to perform best on tasks that mimic its training data. Therefore, to test robustness, we evaluate the model on a variety of tasks from SpecBench, a standardized benchmark suite for comparing speculative decoding methods: math reasoning, retrieval-augmented generation (RAG), and translation, plus HumanEval to cover coding use cases.

    As shown in Figure 3, Eagle performs poorly on the German-to-English translation task. Conversely, it performs very well on tasks that are more similar to its training data, such as RAG and math reasoning, with up to 2.1X better latency thanks to more accurate next-token predictions. Eagle drafters could be trained on additional data, such as German-to-English translation pairs, to improve performance on those specific tasks.

    Figure 3: Requests per second vs. request latency with and without Eagle 3 speculative decoding for Llama 3.3 70B across various tasks, including coding (upper left), math (upper right), retrieval-augmented generation (lower left), and translation (lower right), highlighting the impact training data has on inference performance.

    Tuning performance 

    The optimal draft length for a given situation depends on both the request rate and the input data. As shown above, there is a tradeoff between memory and compute: a longer draft costs more compute, but offers the chance to accept more tokens per verifier forward pass. Since a draft token is only accepted if all of the tokens before it were accepted, the chance of acceptance drops the later a token sits in the draft. And as noted earlier, we only get a speedup from drafting tokens that match the verifier's.

    At low request rates, it is often better to create a longer draft, since we have extra compute even if there is a low probability that the last few tokens will be accepted. At high request rates, though, it is optimal to reduce the draft length to avoid drafting tokens that have a lower probability of being accepted.  

    The optimal length is also heavily task-dependent, since the task determines how likely each drafted token is to be accepted. For example, Figure 4 shows that on the translation task, Eagle performs so poorly that the optimal draft length is 1, or even 0. RAG tokens, by contrast, are easier for the Eagle speculator to predict, so draft lengths of even 5 tokens improve performance. It is therefore important to measure performance on your specific scenario to determine the optimal draft length (a minimal sweep is sketched after Figure 4).

    Figure 4: Requests per second vs. request latency with and without Eagle 3 speculative decoding for Llama 3.3 70B across various draft lengths for translation (left) and retrieval-augmented generation (right), highlighting the impact both data and draft length have on performance.
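
    To tune this on your own workload, one option is to sweep num_speculative_tokens offline and time each setting. The sketch below reuses the model pair from the serve command above; the speculative_config argument is intended to mirror the --speculative-config JSON, but argument names can differ across vLLM versions, so treat this as an assumption to check against your installed release:

    import time
    from vllm import LLM, SamplingParams

    prompts = ["Write a travel blog post about a recent trip to Hawaii."]   # swap in your own workload

    for k in (1, 3, 5):                               # candidate draft lengths to compare
        llm = LLM(
            model="meta-llama/Llama-3.3-70B-Instruct",
            tensor_parallel_size=4,
            speculative_config={
                "model": "yuhuili/EAGLE3-LLaMA3.3-Instruct-70B",
                "method": "eagle3",
                "num_speculative_tokens": k,          # the draft length being tuned
            },
        )
        start = time.perf_counter()
        llm.generate(prompts, SamplingParams(max_tokens=1024))
        print(f"draft length {k}: {time.perf_counter() - start:.2f} s")
        del llm                                       # release the engine before trying the next setting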

    Decoding with trees 

    For the results above, we focused on greedy decoding, where the speculator drafts a single sequence of tokens. Many recent academic works (such as those in 1, 2, 3, 4) focus instead on tree decoding, which builds a tree of candidate tokens (often 64 of them) with multiple options for the first token, second token, and so on in the draft. The branches of this tree are packed into a single input, and attention masking ensures that each branch only "sees" the tokens along its own path. The best branch is then selected with a single forward pass of the verifier model, which can increase the length of the accepted draft per forward pass.

    However, this involves spending a lot of extra compute for relatively small gains in the acceptance length. For example, with "greedy" decoding, we can increase the mean acceptance length slightly by drafting 1 or 2 more tokens. With tree decoding, we might go from drafting and verifying 5 tokens to drafting and verifying 64 tokens, even though this may only increase the accepted length by 1 or 2 tokens. This can be worthwhile in theory with lots of extra compute; however, the tradeoff is not particularly favorable in most practical deployment cases.  
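
    To make the masking idea concrete, here is a small, purely illustrative sketch (a toy tree, not a real implementation): each draft token is given a parent in the tree, and a token may only attend to itself and its ancestors, so every branch stays isolated within the shared input.

    import torch

    # Toy tree of 6 draft tokens described by parent indices (-1 = child of the verified prefix).
    # Positions 0-1 are two options for the first draft token; 2-3 extend 0 and 4-5 extend 1.
    parents = [-1, -1, 0, 0, 1, 1]

    n = len(parents)
    tree_mask = torch.zeros(n, n, dtype=torch.bool)
    for j in range(n):
        i = j
        while i != -1:               # walk up to the root: a token attends to itself and its ancestors only
            tree_mask[j, i] = True
            i = parents[i]

    print(tree_mask.int())           # rows = query positions, columns = key positions within the draft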

    We tested this empirically using an inference engine that supports tree decoding with Eagle. At very low request rates, specifically synchronous single-request use, tree decoding is the best option for reducing latency. Unfortunately, as request rates increase, it quickly becomes inefficient to verify so many extra tokens for a slight increase in accepted length, and tree decoding begins to significantly slow down generation, as seen in Figure 5. For this reason, vLLM does not currently support tree decoding.

    Figure 5: Requests per second vs. request latency with different types of decoding, including greedy and tree decoding, for Llama 3.1 8B, highlighting that tree decoding quickly gives worse performance beyond synchronous use cases.

    Conclusion

    Speculative decoding delivers impressive speedups in LLM inference, and vLLM’s Eagle 3 integration brings those gains to production-ready deployments, enabling faster and more efficient generation. However, today’s available speculative models come with limitations, such as a 2048 context length cap, and their performance can vary depending on the task and training data.

    To maximize gains, it’s crucial to benchmark speculative models on your own workloads and consider fine-tuning them on your data to overcome performance gaps.

    At Red Hat, we’re working to expand the ecosystem with more speculative models, longer context lengths, and production-grade tools that make speculative decoding easier and more powerful for enterprise use. Stay tuned for upcoming releases as we continue advancing the capabilities of fast, scalable AI!

    Ready to deploy faster, more scalable AI? Contact us to learn more about enterprise solutions or contribute to our open source journey today.

