Mark Kurtz
Mark Kurtz's contributions
Article
Fly Eagle(3) fly: Faster inference with vLLM & speculative decoding
Alexandre Marques and 2 others
Boost inference performance by up to 2.5X with vLLM's Eagle 3 speculative decoding integration. Discover how in this blog post.
Article
The hidden cost of large language models
Mark Kurtz
Discover how model compression slashes LLM deployment costs for technical practitioners, covering quantization, pruning, distillation, and speculative decoding.
Article
GuideLLM: Evaluate LLM deployments for real-world inference
Jenny Yi and 2 others
Learn how to evaluate the performance of your LLM deployments with the open source GuideLLM toolkit to optimize cost, reliability, and user experience.
Article
Axolotl meets LLM Compressor: Fast, sparse, open
Rahul Tuli and 3 others
Discover how to deploy compressed, fine-tuned models for efficient inference with the new Axolotl and LLM Compressor integration.
Article
Enable 3.5 times faster vision language models with quantization
Shubhra Pandit and 4 others
Learn how quantized vision-language models enable faster inference, lower costs, and scalable AI deployment without compromising capability.
Article
Deployment-ready reasoning with quantized DeepSeek-R1 models
Eldar Kurtić and 3 others
Explore new open source quantized reasoning models based on the DeepSeek-R1-Distill suite that deliver near-perfect accuracy and inference speed improvements.
Article
2:4 Sparse Llama: Smaller models for efficient GPU inference
Eldar Kurtić and 4 others
Discover Sparse Llama: a 50% pruned, GPU-optimized Llama 3.1 model with 2:4 sparsity, enabling faster, cost-effective inference without sacrificing accuracy.
Article
Multimodal model quantization support through LLM Compressor
Kyle Sayers and 3 others
Explore multimodal model quantization in LLM Compressor, a unified library for optimizing models for deployment with vLLM.