
SparseGPT: Remove 100 billion parameters for free

March 21, 2023
Robert Shaw and Michael Goin
Related topics:
Artificial intelligence
Related products:
Red Hat AI


    Large language models (LLMs) solve natural language processing problems with astounding accuracy. However, these models are enormous and require a lot of storage, cost, and computation power to deploy. For example, the GPT-175B model has 175 billion parameters, requiring 320GB of storage and at least five A100 GPUs with 80GB of memory each for inference. This computation power is expensive, putting deployment of such models out of reach for small organizations and individuals.
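
    As a rough sanity check on those numbers, the storage figure follows directly from the parameter count (assuming 16-bit weights, which is our assumption for illustration):

        params = 175e9          # GPT-175B parameter count
        bytes_per_param = 2     # assuming FP16/BF16 weights

        weight_gb = params * bytes_per_param / 1024**3
        print(f"~{weight_gb:.0f} GB of weights")                   # ~326 GB
        print(f"~{weight_gb / 80:.1f} 80GB A100s, weights alone")  # ~4.1, hence at least 5 GPUs in practice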

    Several model compression techniques, such as quantization and pruning, have been proposed to compress LLMs. Previous solutions required extensive retraining to recover lost accuracy. This retraining also requires massive computation power, which is not economical for real-world deployments; at the moment, it is estimated that pruning a GPT-3-scale model this way would take several weeks. Therefore, a solution that can prune LLMs without retraining would be a game changer for deploying large language models. Enter SparseGPT.

    SparseGPT is a post-training pruning method for compressing large language models such as GPT-3 efficiently and accurately. The method can prune large language models in one shot with minimal accuracy loss. For example, you can use SparseGPT to prune OPT-175B to 50% sparsity with a 0.13 decrease in perplexity.

    Thus 100 billion weights from the model can be ignored at inference time, increasing the model's throughput while reducing latency. SparseGPT can be applied to a GPT model with 175B parameters on a single GPU in a couple of hours with negligible accuracy loss.

    In this article, you will learn about the SparseGPT algorithm and how it works. By the end of the article, you will be familiar with how this method reduces the size of large language models, making it possible to deploy them efficiently on commodity CPUs.

    Let’s dive in.

    Methods for pruning large language models

    To understand how SparseGPT works, you have to understand post-training pruning and layer-wise pruning. 

    Post-training pruning works by creating a compressed version of a model that satisfies a given compression predicate. The method has been used to reduce the computation cost of quantization-aware training. Existing post-training pruning work mostly focuses on CNN and transformer models with fewer than 100 million parameters. There is also existing work on quantization of large language models, but it mostly focuses on reducing the precision of the numerical representation of individual weights.

    Layer-wise pruning splits the model compression problem into layer-wise subproblems, each posed as a constrained optimization problem: for each layer, find a sparsity mask that satisfies the compression constraint, together with updated dense weights, minimizing the error between the outputs of the uncompressed and compressed layer. The sparsity mask of a tensor is a binary tensor of the same dimensions, with 0 at the indices of the sparsified entries and 1 at the other indices. After each of the layer-wise problems is solved, the model is rebuilt from the compressed layers. The reconstructed model preserves the accuracy of the dense model if the layer-wise errors are small.
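
    Concretely, for a layer with dense weights W, calibration inputs X, mask M, and reconstructed weights Ŵ, the objective is to minimize ||WX − (M ⊙ Ŵ)X||². Here is a toy PyTorch sketch of that layer-wise error; the shapes and the random 50% mask are arbitrary illustrative choices:

        import torch

        torch.manual_seed(0)
        W = torch.randn(64, 128)                 # dense layer weights
        X = torch.randn(128, 256)                # calibration inputs to the layer

        M = (torch.rand_like(W) > 0.5).float()   # a ~50% sparsity mask (1 = keep)
        W_hat = W.clone()                        # candidate reconstructed weights

        # Layer-wise objective: || W @ X - (M * W_hat) @ X ||^2
        err = torch.norm(W @ X - (M * W_hat) @ X) ** 2
        print(err.item())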

    Optimizing the sparsity mask and the weights jointly makes the problem NP-hard, so it can only be solved approximately. AdaPrune has been proposed for post-training pruning: it selects weights by magnitude and reconstructs the remaining weights with SGD fine-tuning. Other methods, such as Optimal Brain Compression, remove weights one at a time and fully reconstruct the remaining weights after each iteration. Unfortunately, these methods have prohibitively high runtimes on LLMs. This is where SparseGPT enters the scene.

    How does the SparseGPT algorithm work?

    Assume first that the pruning mask is fixed. Then the update of the remaining weights has a closed-form solution via the Hessian, and the problem splits into row-wise subproblems. Reconstructing the pruned weights is still expensive, because solving each row requires inverting a masked Hessian, and the inverse of a masked Hessian is not equal to the masked version of the full inverse. This problem is referred to as the Row-Hessian challenge. The task would be easier if all row masks were the same (for example, if pruning happened column-wise), because then only a single shared inverse would need to be computed. But introducing such a constraint into mask selection would mean sparsifying weights in entire columns, which is difficult and reduces the model's accuracy. SparseGPT solves the Row-Hessian challenge by reusing Hessians between rows with distinct pruning masks, leading to an accurate and efficient algorithm.

    Figure 1: Illustration of the Row-Hessian challenge.
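
    A tiny numeric check makes the point concrete (the matrix values here are arbitrary):

        import torch

        torch.manual_seed(0)
        d = 4
        X = torch.randn(d, 16)
        H = X @ X.T + 1e-3 * torch.eye(d)    # layer Hessian H = X X^T (dampened)

        keep = torch.tensor([True, False, True, True])   # mask for one row

        # Inverse of the masked Hessian (what the weight update actually needs)
        masked_then_inverted = torch.linalg.inv(H[keep][:, keep])

        # Masked version of the full inverse -- not the same matrix
        inverted_then_masked = torch.linalg.inv(H)[keep][:, keep]

        print(torch.allclose(masked_then_inverted, inverted_then_masked))   # False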

    SparseGPT aims to prune large language models at a fraction of the cost and time spent by previous methods, which makes it applicable to models with 100 billion parameters. It prunes by zeroing the weights outside the mask and updating the remaining weights to compensate.

    The SparseGPT algorithm works as follows, given a fixed pruning mask: 

    • Prune weights in each column of the weight matrix incrementally using a sequence of Hessian inverses
    • Update the remaining weights in those rows, located to the right of the column being processed

    The weights to the right of a pruned weight (dark blue) are updated to compensate for the pruning error, while unpruned weights (light blue) generate no updates.


    Figure 2: Visualization of the SparseGPT reconstruction algorithm.

    SparseGPT is local because it performs weight updates after each pruning step, maintaining the input-output relationship of each layer. The heavy parameterization of GPT models makes it possible to perform these updates without any global gradient information. The cost of the reconstruction process consists of computing the initial Hessian, iterating through the sequence of inverse Hessians, and pruning.

    The pruning mask is chosen adaptively during the reconstruction process, reaping the benefits that adaptive weight selection offers during pruning. Selection is done through iterative blocking: a pruning mask is chosen for a block of columns based on the reconstruction error, using the diagonal values of the inverse Hessian sequence, and the weights are updated before the mask for the next block is chosen. Iterating in this way enables non-uniform selection per column, using the corresponding Hessian information and prior weight updates.


    Figure 3: Mask selection.

    The SparseGPT algorithm pseudocode is shown below. 


    Figure 4: Full SparseGPT Algorithm Pseudocode.

    The pseudocode can be interpreted as follows; a simplified sketch in code appears after the list:

    1. Create a pruning mask M with zeros and ones.
    2. Construct a matrix E to store the block quantization errors.
    3. Calculate the inverse Hessian sequence information via Cholesky decomposition.
    4. Loop the blocks while updating the pruning mask and the error.
    5. Select the largest weights for each block and set their corresponding values in the pruning mask to 1.
    6. Prune the remaining weights and set their corresponding values in the pruning mask to 0.
    7. Compute the pruning error for each weight in the block and set the error in the E matrix.
    8. Freeze the weights that were not pruned.
    9. Update the weights in the block using the Cholesky decomposition of the inverse Hessian matrix.
    10. Update weights not updated in the previous loop after processing all the blocks.
    11. Set pruned weights to 0 by element-wise multiplying the weight matrix with the pruning mask.
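
    The following is a highly simplified PyTorch sketch of these steps for one linear layer, including the blocked adaptive mask selection described earlier. The function name, block size, damping value, and shapes are illustrative assumptions rather than the reference implementation:

        import torch

        def sparsegpt_prune(W, X, sparsity=0.5, blocksize=8, damp=0.01):
            """One-shot pruning of a linear layer's weights W (rows x cols),
            given calibration inputs X (cols x nsamples). Illustrative sketch."""
            W = W.clone()
            cols = W.shape[1]
            M = torch.ones_like(W)                    # step 1: pruning mask (1 = keep)

            # Step 3: inverse-Hessian sequence information via Cholesky decomposition
            H = X @ X.T + damp * torch.eye(cols)      # dampened layer Hessian
            Hinv = torch.linalg.cholesky(torch.linalg.inv(H), upper=True)

            for j1 in range(0, cols, blocksize):      # step 4: loop over blocks
                j2 = min(j1 + blocksize, cols)
                W1, Hinv1 = W[:, j1:j2], Hinv[j1:j2, j1:j2]
                E = torch.zeros_like(W1)              # step 2: block errors

                # Steps 5-6: keep the largest weights by error score, prune the rest
                scores = W1**2 / torch.diag(Hinv1).reshape(1, -1) ** 2
                cutoff = torch.sort(scores.flatten()).values[int(scores.numel() * sparsity)]
                mask1 = (scores > cutoff).float()

                for i in range(j2 - j1):
                    w, d = W1[:, i], Hinv1[i, i]
                    err = (w - w * mask1[:, i]) / d   # step 7: pruning error
                    # Steps 8-9: update only the weights to the right of column i
                    W1[:, i:] -= err.unsqueeze(1) * Hinv1[i, i:].unsqueeze(0)
                    E[:, i] = err

                M[:, j1:j2] = mask1
                W[:, j2:] -= E @ Hinv[j1:j2, j2:]     # step 10: propagate past the block

            return W * M, M                           # step 11: zero the pruned weights

    For example, sparsegpt_prune(torch.randn(512, 512), torch.randn(512, 2048)) returns a roughly 50%-sparse weight matrix plus its mask; the real implementation applies this layer by layer across a transformer, building each Hessian from calibration data.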

    SparseGPT is implemented in PyTorch using the Hugging Face Transformers library. Experiments on a single NVIDIA A100 GPU with 80GB of memory show that SparseGPT can fully sparsify 175-billion-parameter models in approximately four hours without fine-tuning.

    SparseGPT evaluation

    SparseGPT is evaluated using perplexity on the raw-WikiText2 test set, where it performs better than magnitude pruning.
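
    As a reminder of the metric: perplexity is the exponential of the average per-token negative log-likelihood, so lower is better, and a good pruned model should barely move it. A minimal sketch with made-up loss values:

        import torch

        # Perplexity = exp(mean per-token negative log-likelihood)
        token_nlls = torch.tensor([2.1, 1.8, 2.4, 2.0])   # toy per-token NLL values
        perplexity = torch.exp(token_nlls.mean())
        print(perplexity.item())                          # ~8.0 for these toy values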


    Figure 5: OPT perplexity results on raw-WikiText2.

    The researchers also evaluated SparseGPT and magnitude pruning on OPT-175B. The left graph in Figure 6 shows how the performance of both methods scales with the degree of sparsity. Magnitude pruning reaches only about 10% sparsity before the model's quality starts to degrade, while SparseGPT enables up to 60% sparsity.


    Figure 6: Comparison of SparseGPT against magnitude pruning on OPT-175B.

    Figure 7: Compressing the entire OPT model family (135M, 350M, ..., 66B, 175B) to different sparsity patterns using SparseGPT.

    For BLOOM-176B, magnitude pruning achieves 30% sparsity and SparseGPT 50%. SparseGPT can remove 100 billion weights from BLOOM and OPT models without significant loss in accuracy.


    Figure 8: Uniformly compressing BLOOM-176B to various sparsity levels with SparseGPT and magnitude pruning.

    The researchers obtained the following zero-shot results, which show that SparseGPT retains accuracy close to the original model while magnitude pruning does not.


    Figure 9: Zero-shot results on several datasets for sparsified variants of OPT-175B.

    Final thoughts

    Amazingly, 100 billion parameters of large language models can be ignored at inference time. This accomplishment paves the way for broader usage of these models because of their lower computational and storage requirements. The researchers conjecture that fine-tuning large models after pruning can recover additional accuracy while allowing the models to be pruned to between 80% and 90% sparsity.

    Source for all figures: "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot." arXiv, 19 January 2023, https://arxiv.org/pdf/2301.00774.pdf

    Last updated: March 25, 2025
