Mark Kurtz
Mark Kurtz's contributions
Article
Fly Eagle(3) fly: Faster inference with vLLM & speculative decoding
Alexandre Marques and 2 others
Boost inference performance by up to 2.5X with vLLM's Eagle 3 speculative decoding integration. Discover how in this blog post.
Article
The hidden cost of large language models
Mark Kurtz
Discover how model compression slashes LLM deployment costs for technical practitioners, covering quantization, pruning, distillation, and speculative decoding.
Article
GuideLLM: Evaluate LLM deployments for real-world inference
Jenny Yi and 2 others
Learn how to evaluate the performance of your LLM deployments with the open source GuideLLM toolkit to optimize cost, reliability, and user experience.
Article
Axolotl meets LLM Compressor: Fast, sparse, open
Rahul Tuli and 3 others
Discover how to deploy compressed, fine-tuned models for efficient inference with the new Axolotl and LLM Compressor integration.
Article
Enable 3.5 times faster vision language models with quantization
Shubhra Pandit and 4 others
Learn how quantized vision-language models enable faster inference, lower costs, and scalable AI deployment without compromising capability.
Article
Deployment-ready reasoning with quantized DeepSeek-R1 models
Eldar Kurtić and 3 others
Explore new open source quantized reasoning models based on the DeepSeek-R1-Distill suite that deliver near-perfect accuracy and inference speed improvements.
Article
2:4 Sparse Llama: Smaller models for efficient GPU inference
Eldar Kurtić and 4 others
Discover Sparse Llama: a 50% pruned, GPU-optimized Llama 3.1 model with 2:4 sparsity, enabling faster, cost-effective inference without sacrificing accuracy.
Article
Multimodal model quantization support through LLM Compressor
Kyle Sayers and 3 others
Explore multimodal model quantization in LLM Compressor, a unified library for optimizing models for deployment with vLLM.