DeepSpeed
from Wikipedia
DeepSpeed
Original author: Microsoft Research
Developer: Microsoft
Initial release: May 18, 2020
Stable release: v0.16.5 / March 27, 2025
Repository: github.com/microsoft/DeepSpeed
Written in: Python, CUDA, C++
Type: Software library
License: Apache License 2.0
Website: deepspeed.ai

DeepSpeed is an open source deep learning optimization library for PyTorch.[1]

Library

The library is designed to reduce computing power and memory use and to train large distributed models with better parallelism on existing computer hardware.[2][3] DeepSpeed is optimized for low-latency, high-throughput training. It includes the Zero Redundancy Optimizer (ZeRO) for training models with 1 trillion or more parameters.[4] Features include mixed-precision training; single-GPU, multi-GPU, and multi-node training; and custom model parallelism. The DeepSpeed source code is licensed under the Apache License 2.0 and available on GitHub.[5]
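
As a brief illustration of how the library is typically used (a minimal sketch rather than an excerpt from the cited sources; the toy model and configuration values are hypothetical), a PyTorch model is wrapped with deepspeed.initialize together with a config that enables features such as mixed precision and ZeRO:

python

# Minimal sketch: wrapping a PyTorch model with DeepSpeed (toy model, illustrative config).
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)  # placeholder model for illustration

ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "fp16": {"enabled": True},            # mixed-precision training
    "zero_optimization": {"stage": 1},    # ZeRO stage 1 as an example
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

# deepspeed.initialize returns an engine that handles the distributed setup,
# optimizer construction, and mixed precision according to the config;
# training then uses model_engine.backward(loss) and model_engine.step().
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)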

The team claimed to achieve up to a 6.2x throughput improvement, 2.8x faster convergence, and 4.6x less communication.[6]

from Grokipedia
DeepSpeed is an open-source optimization library developed by Microsoft that enables efficient, scalable distributed training and inference for large-scale models, incorporating innovations such as memory optimization and parallelism to achieve unprecedented speed and resource efficiency. As part of Microsoft's AI at Scale initiative, DeepSpeed provides a suite of tools including DeepSpeed-Training, which features technologies like ZeRO (the Zero Redundancy Optimizer) for partitioning model states across GPUs to train trillion-parameter models without memory bottlenecks, and 3D parallelism combining data, tensor, and pipeline parallelism for extreme scalability. Additional training advancements include DeepSpeed-MoE for sparse mixture-of-experts architectures that reduce computational costs while maintaining performance, and ZeRO-Infinity, which extends training to massive supercomputers with thousands of GPUs by offloading parameters to CPU and NVMe storage. For inference, DeepSpeed-Inference optimizes transformer-based models through kernel fusions, parallelism strategies (including tensor, pipeline, and expert parallelism), and memory optimizations, delivering up to 6.9 times higher throughput for large models compared to standard frameworks. Complementing these, DeepSpeed-Compression employs methods like ZeroQuant for quantization and XTC for extreme compression, enabling faster inference and reduced model sizes for pre-trained transformers without significant accuracy loss. DeepSpeed integrates with popular frameworks such as Hugging Face Transformers, PyTorch Lightning, and MosaicML, supports models like MT-530B and BLOOM, and has powered breakthroughs in training the world's largest language models while democratizing access to high-performance AI for researchers and practitioners. Extensions include DeepSpeed-Ulysses for sequence parallelism, DeepSpeed-Chat for reinforcement learning from human feedback (RLHF), and DeepSpeed4Science for accelerating scientific simulations (all 2023), as well as Arctic for long sequence training (2025).

Overview

Purpose and Capabilities

DeepSpeed is an open-source optimization library developed by Microsoft, designed to integrate seamlessly with PyTorch to enable distributed training and inference of large-scale models. Its core purpose is to reduce the memory requirements of training, accelerate computation, and scale model training to trillion-parameter architectures across thousands of GPUs, thereby democratizing access to extreme-scale deep learning. At a high level, DeepSpeed supports models up to 10 times larger and trains them 3 to 5 times faster than standard implementations, with innovations like the ZeRO optimizer serving as a key enabler for memory efficiency. The library encompasses a suite of components tailored to different phases of the deep learning pipeline: DeepSpeed-Training for optimizing model training at massive scales, DeepSpeed-Inference for high-throughput deployment, and DeepSpeed-Compression for reducing model sizes through techniques like quantization.

Development History

DeepSpeed was developed by Microsoft Research as part of the AI at Scale initiative to overcome challenges in training large-scale models, particularly addressing memory and efficiency limitations encountered in projects like Turing-NLG, a 17-billion-parameter language model released in 2020. The project aimed to provide distributed training optimizations for PyTorch, focusing on scaling beyond existing frameworks. Led by researchers including Samyam Rajbhandari, the team included key contributors such as Jeff Rasley, Olatunji Ruwase, and Yuxiong He. The library was initially released in February 2020 as an open-source framework on GitHub, introducing core features like the ZeRO optimizer for efficient distributed training of models exceeding 100 billion parameters. In May 2020, an update added ZeRO-2, extending memory optimizations to support training of models with over 100 billion parameters while improving speed and scalability. This marked an early milestone in enabling larger model training with reduced hardware demands.

By October 2021, DeepSpeed powered the training of the Megatron-Turing NLG 530B model in collaboration with NVIDIA, demonstrating its capability for monolithic transformers at unprecedented scales using 3D parallelism strategies. In July 2022, it supported the training of the 176-billion-parameter BLOOM model through integration with Megatron-DeepSpeed, combining ZeRO sharding, pipeline parallelism, and tensor parallelism across international research efforts.

DeepSpeed's evolution expanded beyond training optimizations in 2022, incorporating inference capabilities with the launch of ZeRO-Inference, which enabled deployment of massive models on single GPUs by offloading weights to CPU or NVMe storage. That year, the DeepSpeed Data Efficiency library was released, introducing techniques for curriculum learning and data routing to reduce training data needs by 1.5-2x while enhancing model quality for tasks like GPT-3 pretraining and BERT fine-tuning. Compression methods were also integrated around this time, broadening applicability to resource-constrained environments.

In 2023, DeepSpeed introduced DeepSpeed-Chat, a framework for training chat models using reinforcement learning from human feedback (RLHF), and launched the DeepSpeed4Science initiative in September to develop AI system technologies for accelerating scientific discoveries in areas like structural biology and climate modeling. In 2024, advancements included FP6-LLM for efficient serving of large language models using FP6 quantization. By February 2025, DeepSpeed had joined the LF AI & Data Foundation to further advance open-source optimization efforts.

Throughout its development, DeepSpeed has been open-sourced on GitHub, fostering community contributions and continuous updates; by 2022 it had also integrated with Azure infrastructure to scale training workloads across thousands of GPUs, supporting models with up to 2 trillion parameters and achieving near-linear performance gains.

Core Technologies

ZeRO Optimizer

The Zero Redundancy Optimizer (ZeRO) is a core memory optimization technology in DeepSpeed that partitions the optimizer states, gradients, and model parameters across multiple GPUs during distributed data-parallel training, eliminating the redundant copies of these states that standard data parallelism replicates on every device. This partitioning allows each GPU to hold only a fraction of the total model states, significantly reducing per-GPU memory consumption and enabling the training of much larger models on hardware with limited memory, scaling to billion- or trillion-parameter language models without requiring full model replication on every device.

ZeRO operates in progressive stages to balance memory savings with communication overhead. In ZeRO-1, only the optimizer states (such as the momentum and variance terms in Adam) are partitioned across $N_d$ GPUs, where $N_d$ is the number of data-parallel devices; this achieves a 4x memory reduction compared to the baseline by storing optimizer states as $12\Psi/N_d$ per GPU, where $\Psi$ is the model size in parameters, while keeping parameters and gradients fully replicated. ZeRO-2 extends this by also partitioning gradients, yielding an 8x memory reduction with per-GPU usage of $2\Psi + 14\Psi/N_d$, and introduces reduce-scatter operations to aggregate gradients without increasing communication volume beyond standard data parallelism. ZeRO-3 further partitions the model parameters themselves using dynamic sharding, resulting in linear memory scaling where per-GPU memory approximates $16\Psi/N_d$ (accounting for parameters, gradients, and optimizer states, plus fixed overheads like activations), enabling up to an $N_d$-fold reduction (for instance, 64x on 64 GPUs), though it incurs a modest 50% increase in communication via all-gather for forward passes and reduce-scatter for backward passes.

These stages are complemented by extensions for even greater memory efficiency on resource-constrained systems. ZeRO-Offload builds on ZeRO-2 by offloading partitioned optimizer states and gradients to CPU memory while retaining parameters and core computations on the GPU, minimizing data movement to preserve speed and achieving up to 8x GPU memory savings; for example, it enables a 13 billion-parameter model on a single V100 GPU, compared to just 1.4 billion parameters without offloading. ZeRO-Infinity advances this further by integrating NVMe storage alongside CPU offloading for all model states in ZeRO-3, employing an infinity offload engine and memory-centric tiling to handle operators that exceed GPU limits, thus supporting "infinite" effective model sizes limited only by aggregate system memory (CPU and NVMe) rather than individual GPU capacity.

The primary benefits of ZeRO include up to 10x reductions in GPU memory usage across its stages and extensions, allowing efficient training of models exceeding 1 trillion parameters on clusters with hundreds of GPUs, such as fine-tuning a 1 trillion-parameter model on a single node or scaling to 32 trillion parameters across 512 GPUs with sustained throughput. When combined with other parallelism strategies, ZeRO forms hybrid approaches like 3D parallelism for further scalability. Implementation in DeepSpeed is seamless and requires no changes to the model code, integrating automatically through a configuration file that specifies the stage and options like offloading devices. For instance, a basic ZeRO-3 configuration with CPU offloading might look like:

json

{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu" },
    "offload_param": { "device": "cpu" },
    "reduce_bucket_size": 5e8,
    "stage3_max_live_parameters": 1e9
  }
}

This config partitions states across GPUs and offloads as specified, with DeepSpeed handling the underlying communication primitives. The partitioning logic can be outlined in pseudocode as follows, illustrating the core forward-backward step with all-gather and reduce-scatter:

# Pseudocode for ZeRO-3 partitioning (simplified)
for each training step:
    # Forward pass: all-gather the full parameters from their shards
    # (in practice, parameters are gathered and released layer by layer)
    full_parameters = all_gather(parameters_shard)
    output = model_forward(full_parameters, input)

    # Backward pass: compute gradients, then reduce-scatter them into shards
    gradients = model_backward(output, input)
    gradients_shard = reduce_scatter(gradients)  # aggregate and shard gradients

    # Optimizer step: each rank updates only its own parameter shard using its
    # local gradient and optimizer-state shards; the next forward pass
    # re-gathers the updated parameters
    parameters_shard = update_optimizer(parameters_shard, gradients_shard, optimizer_shard)

This keeps per-GPU memory overhead minimal while overlapping communication with computation for efficiency.
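
To make the memory arithmetic above concrete, the following sketch (an illustrative calculation based on the figures quoted above, not DeepSpeed code; the 7.5-billion-parameter model and 64-GPU count are example values) estimates per-GPU model-state memory for mixed-precision Adam training under each ZeRO stage:

python

# Back-of-the-envelope estimate of per-GPU model-state memory for
# mixed-precision Adam training, using the 2/2/12 bytes-per-parameter
# breakdown quoted above (fp16 params, fp16 grads, fp32 optimizer states).
def per_gpu_model_state_gb(num_params, num_gpus, stage):
    p, g, o = 2 * num_params, 2 * num_params, 12 * num_params  # bytes
    if stage == 0:      # plain data parallelism: everything replicated
        total = p + g + o
    elif stage == 1:    # ZeRO-1: optimizer states partitioned
        total = p + g + o / num_gpus
    elif stage == 2:    # ZeRO-2: optimizer states + gradients partitioned
        total = p + (g + o) / num_gpus
    else:               # ZeRO-3: parameters partitioned as well
        total = (p + g + o) / num_gpus
    return total / 1e9  # GB; activations and temporary buffers not included

# Example: a hypothetical 7.5B-parameter model on 64 GPUs.
for stage in range(4):
    print(f"stage {stage}: {per_gpu_model_state_gb(7.5e9, 64, stage):.1f} GB per GPU")

The resulting figures (roughly 120 GB, 31 GB, 17 GB, and 1.9 GB per GPU) reproduce the 4x, 8x, and $N_d$-fold reductions described above.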

Parallelism Strategies

DeepSpeed implements parallelism strategies to distribute model and data across multiple GPUs, enabling efficient scaling of large-scale training. A key approach is 3D parallelism, which combines data parallelism (DP), tensor parallelism (TP), and pipeline parallelism (PP) to address memory and compute limitations in training massive models. In data parallelism, the model is replicated across GPUs, with input data partitioned into mini-batches processed independently; gradients are then synchronized via all-reduce operations. Tensor parallelism splits individual model tensors, such as the weight matrices in attention and feed-forward layers, across GPUs, allowing parallel computation of operations like matrix multiplications while reducing per-GPU memory usage. Pipeline parallelism divides the model into sequential stages, where each stage (comprising one or more layers) is assigned to a separate GPU or group of GPUs, enabling concurrent processing of micro-batches across stages to overlap forward and backward passes. DeepSpeed facilitates automatic hybrid 3D parallelism through configuration files, where users specify degrees for DP, TP, and PP without modifying model code; the system optimizes the combination for hardware constraints, such as interconnect bandwidth.

For pipeline parallelism, DeepSpeed employs 1F1B (one forward, one backward) scheduling, which begins the backward pass for a micro-batch as soon as it has completed its forward pass through all stages rather than waiting for every micro-batch to finish, minimizing pipeline bubbles (periods of GPU idleness) by tightly coupling computation with communication via blocking send-receive pairs. This scheduling supports dynamic buffer sizes based on the number of in-flight micro-batches, ensuring efficient resource utilization across stages. Throughput in 3D parallelism scales with the effective global batch size, given by $\text{global batch size} = \text{train micro-batch size per GPU} \times \text{gradient accumulation steps} \times \text{data-parallel degree}$, where the TP and PP degrees influence model partitioning but preserve the overall batch scaling from DP; communication overhead arises from inter-stage activations and gradients in PP, modeled as proportional to micro-batch size and stage boundaries, though exact costs depend on network topology.

DeepSpeed further extends parallelism to Mixture-of-Experts (MoE) models through DeepSpeed-MoE, which supports sparse activation by routing each input token to a subset of experts (typically the top 2 selected by a gating network), activating only those experts to reduce compute demands sublinearly relative to total parameters; for instance, a 1.6 trillion-parameter MoE can match the compute of a 10 billion-parameter dense model. Experts are distributed across GPUs using expert parallelism, integrated with DP and ZeRO for configurations like expert+data or expert+ZeRO, enabling scalable training of sparse architectures.

DeepSpeed supports sequence parallelism through DeepSpeed-Ulysses, introduced in 2023, to enable efficient training of transformer models with extremely long sequences. DeepSpeed-Ulysses partitions input tensors along the sequence dimension across GPUs and employs efficient all-to-all collective communication for distributed attention computations. This approach maintains constant communication volume as sequence length and the number of devices increase proportionally, overcoming memory and communication inefficiencies of prior sequence parallelism methods.
It enables training with sequence lengths up to 1 million tokens, achieves up to 2.5× faster training with 4× longer sequences compared to state-of-the-art baselines, and integrates with the ZeRO optimizer for enhanced memory efficiency. These strategies were integrated into DeepSpeed starting with its 2020 updates, initially powering models like the 17-billion-parameter Turing-NLG and later enabling trillion-parameter training on thousands of GPUs through flexible 3D combinations.
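
As a brief illustration of the global-batch-size relationship above (a hedged sketch: the batch-size keys are standard DeepSpeed config fields, but the cluster size and parallelism degrees are hypothetical example values):

python

# Illustration of the global-batch-size relationship in 3D parallelism:
# train_batch_size = micro-batch per GPU * gradient accumulation steps * DP degree.
# All values below are hypothetical.
num_gpus = 64
tensor_parallel_degree = 4           # set by the model-parallel implementation
pipeline_parallel_degree = 4
data_parallel_degree = num_gpus // (tensor_parallel_degree * pipeline_parallel_degree)  # 4

micro_batch_per_gpu = 2
gradient_accumulation_steps = 32
global_batch_size = micro_batch_per_gpu * gradient_accumulation_steps * data_parallel_degree
print(global_batch_size)  # 256

# The corresponding batch-size fields of a DeepSpeed config:
ds_config = {
    "train_micro_batch_size_per_gpu": micro_batch_per_gpu,
    "gradient_accumulation_steps": gradient_accumulation_steps,
    "train_batch_size": global_batch_size,
}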

Training Optimizations

Memory Management Techniques

DeepSpeed employs several techniques to optimize memory usage during model training, focusing on offloading, recomputation, and precision reduction to enable larger models on constrained hardware. One key approach is offloading, which moves memory-intensive components away from GPUs to slower but larger storage. In ZeRO-Offload, optimizer states and gradients are offloaded to CPU memory, allowing the training of models with up to 13 billion parameters on a single GPU with 32 GB of memory, such as an NVIDIA V100, by leveraging a high-performance CPU implementation of the Adam optimizer that is 5x to 7x faster than PyTorch's version. This technique integrates with ZeRO stage 2 to partition optimizer states and gradients while shifting the computation of optimizer steps to the CPU, freeing GPU resources without requiring changes beyond configuration updates. Extending this capability, ZeRO-Infinity introduces NVMe offloading to handle even larger tensors by spilling model states (including weights, gradients, and optimizer states) to both CPU and NVMe storage, enabling trillion-parameter models through memory-centric tiling that breaks operations into smaller, sequentially executed tiles to minimize peak memory usage. ZeRO-Infinity achieves linear scaling with the number of devices and overlaps computation with communication for better bandwidth utilization, as demonstrated in setups exceeding single-GPU limits.

Another prominent method is activation checkpointing, which trades additional computation for reduced memory by recomputing intermediate activations during the backward pass instead of storing them all. DeepSpeed's implementation optimizes this with features like activation partitioning across model-parallel GPUs, CPU offloading of checkpoints, and contiguous memory buffers to minimize fragmentation. A specialized enhancement is micro-checkpointing, which applies fine-grained checkpointing within layers (e.g., partitioning activations and using contiguous allocations specified by the number of checkpoints), providing more granular control over memory usage compared to standard checkpointing. This approach significantly lowers activation memory: without checkpointing it scales as $O(L \times H)$, where $L$ is the number of layers and $H$ is the hidden size, but with optimal checkpoint placement it reduces to approximately $O(\sqrt{L \times H})$.
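
A hedged configuration sketch tying these techniques together follows; the offload and activation-checkpointing keys are DeepSpeed config options, but the specific values are hypothetical and would need tuning for a real workload. Note also that the activation_checkpointing settings only take effect when the model's forward pass uses DeepSpeed's checkpointing API.

python

# Hypothetical DeepSpeed config combining CPU/NVMe offloading with
# activation checkpointing; values are illustrative, not recommendations.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "nvme", "nvme_path": "/local_nvme"},
        "offload_param": {"device": "cpu"},
    },
    "activation_checkpointing": {
        "partition_activations": True,        # shard activations across model-parallel GPUs
        "cpu_checkpointing": True,            # offload checkpointed activations to CPU memory
        "contiguous_memory_optimization": True,
        "number_checkpoints": 4,              # granularity of checkpoint placement
    },
    "fp16": {"enabled": True},
}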