DeepSpeed
from Wikipedia
DeepSpeed
Original author: Microsoft Research
Developer: Microsoft
Initial release: May 18, 2020
Stable release: v0.16.5 / March 27, 2025
Repository: github.com/microsoft/DeepSpeed
Written in: Python, CUDA, C++
Type: Software library
License: Apache License 2.0
Website: deepspeed.ai

DeepSpeed is an open source deep learning optimization library for PyTorch.[1]

Library

The library is designed to reduce computing power and memory use and to train large distributed models with better parallelism on existing computer hardware.[2][3] DeepSpeed is optimized for low-latency, high-throughput training. It includes the Zero Redundancy Optimizer (ZeRO) for training models with 1 trillion or more parameters.[4] Features include mixed-precision training; single-GPU, multi-GPU, and multi-node training; and custom model parallelism. The DeepSpeed source code is licensed under the Apache License 2.0 and available on GitHub.[5]
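
As a brief illustration of how the library is typically used (a minimal sketch rather than an excerpt from the cited sources; the toy model and configuration values are hypothetical), a PyTorch model is wrapped with deepspeed.initialize together with a config that enables features such as mixed precision and ZeRO:

python

# Minimal sketch: wrapping a PyTorch model with DeepSpeed (toy model, illustrative config).
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)  # placeholder model for illustration

ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "fp16": {"enabled": True},            # mixed-precision training
    "zero_optimization": {"stage": 1},    # ZeRO stage 1 as an example
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

# deepspeed.initialize returns an engine that handles the distributed setup,
# optimizer construction, and mixed precision according to the config;
# training then uses model_engine.backward(loss) and model_engine.step().
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)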

The team claimed to achieve up to a 6.2x throughput improvement, 2.8x faster convergence, and 4.6x less communication.[6]

from Grokipedia
DeepSpeed is an open-source optimization library developed by Microsoft that enables efficient, scalable distributed training and inference for large-scale models, incorporating innovations such as memory optimization and parallelism to achieve unprecedented speed and resource efficiency. As part of Microsoft's AI at Scale initiative, DeepSpeed provides a suite of tools including DeepSpeed-Training, which features technologies like ZeRO (the Zero Redundancy Optimizer) for partitioning model states across GPUs to train trillion-parameter models without memory bottlenecks, and 3D parallelism combining data, tensor, and pipeline parallelism for extreme scalability. Additional training advancements include DeepSpeed-MoE for sparse mixture-of-experts architectures that reduce computational costs while maintaining performance, and ZeRO-Infinity, which extends training to massive supercomputers with thousands of GPUs by offloading parameters to CPU and NVMe storage. For inference, DeepSpeed-Inference optimizes transformer-based models through kernel fusions, parallelism strategies (including tensor, pipeline, and expert parallelism), and memory optimizations, delivering up to 6.9 times higher throughput for large models compared to standard frameworks. Complementing these, DeepSpeed-Compression employs methods like ZeroQuant for quantization and XTC for extreme compression, enabling faster inference and reduced model sizes for pre-trained transformers without significant accuracy loss. DeepSpeed integrates with popular frameworks such as Hugging Face Transformers, PyTorch Lightning, and MosaicML, supports models like MT-530B and BLOOM, and has powered breakthroughs in training the world's largest language models while democratizing access to high-performance AI for researchers and practitioners. Extensions include DeepSpeed-Ulysses for sequence parallelism, DeepSpeed-Chat for reinforcement learning from human feedback (RLHF), and DeepSpeed4Science for accelerating scientific simulations (all 2023), as well as Arctic for long sequence training (2025).

Overview

Purpose and Capabilities

DeepSpeed is an open-source optimization library developed by Microsoft, designed to integrate seamlessly with PyTorch to enable distributed training and inference of large-scale models. Its core purpose is to reduce the memory requirements of training, accelerate computation, and scale model training to trillion-parameter architectures across thousands of GPUs, thereby democratizing access to extreme-scale deep learning. At a high level, DeepSpeed supports models up to 10 times larger and trains them 3 to 5 times faster than standard implementations, with innovations like the ZeRO optimizer serving as a key enabler for memory efficiency. The library encompasses a suite of components tailored to different phases of the deep learning pipeline: DeepSpeed-Training for optimizing model training at massive scales, DeepSpeed-Inference for high-throughput deployment, and DeepSpeed-Compression for reducing model sizes through techniques like quantization.

Development History

DeepSpeed was developed by Microsoft Research as part of the AI at Scale initiative to overcome challenges in training large-scale models, particularly addressing memory and efficiency limitations encountered in projects like Turing-NLG, a 17-billion-parameter language model released in 2020. The project aimed to provide distributed training optimizations for PyTorch, focusing on scaling beyond existing frameworks. Led by researchers including Samyam Rajbhandari, the team included key contributors such as Jeff Rasley, Olatunji Ruwase, and Yuxiong He. The library was initially released in February 2020 as an open-source framework on GitHub, introducing core features like the ZeRO optimizer for efficient distributed training of models exceeding 100 billion parameters. In May 2020, an update added ZeRO-2, extending memory optimizations to support training of models with over 100 billion parameters while improving speed and scalability. This marked an early milestone in enabling larger model training with reduced hardware demands.

By October 2021, DeepSpeed powered the training of the Megatron-Turing NLG 530B model in collaboration with NVIDIA, demonstrating its capability for monolithic transformers at unprecedented scales using 3D parallelism strategies. In July 2022, it supported the training of the 176-billion-parameter BLOOM model through integration with Megatron-DeepSpeed, combining ZeRO sharding, pipeline parallelism, and tensor parallelism across international research efforts.

DeepSpeed's evolution expanded beyond training optimizations in 2022, incorporating inference capabilities with the launch of ZeRO-Inference, which enabled deployment of massive models on single GPUs by offloading weights to CPU or NVMe storage. That year, the DeepSpeed Data Efficiency library was released, introducing techniques for curriculum learning and data routing to reduce training data needs by 1.5-2x while enhancing model quality for tasks like GPT-3 pretraining and BERT fine-tuning. Compression methods were also integrated around this time, broadening applicability to resource-constrained environments.

In 2023, DeepSpeed introduced DeepSpeed-Chat, a framework for training chat models using reinforcement learning from human feedback (RLHF), and launched the DeepSpeed4Science initiative in September to develop AI system technologies for accelerating scientific discoveries in areas like structural biology and climate modeling. In 2024, advancements included FP6-LLM for efficient serving of large language models using FP6 quantization. By February 2025, DeepSpeed had joined the LF AI & Data Foundation to further advance open-source optimization efforts.

Throughout its development, DeepSpeed has been open-sourced on GitHub, fostering community contributions and continuous updates; by 2022 it had also integrated with Azure infrastructure to scale training workloads across thousands of GPUs, supporting models with up to 2 trillion parameters and achieving near-linear performance gains.

Core Technologies

ZeRO Optimizer

The Zero Redundancy Optimizer (ZeRO) is a core memory optimization technology in DeepSpeed that partitions the optimizer states, gradients, and model parameters across multiple GPUs during distributed data-parallel training, eliminating the redundant copies of these states that standard data parallelism replicates on every device. This partitioning allows each GPU to hold only a fraction of the total model states, significantly reducing per-GPU memory consumption and enabling the training of much larger models on hardware with limited memory, scaling to billion- or trillion-parameter language models without requiring full model replication on every device.

ZeRO operates in progressive stages to balance memory savings with communication overhead. In ZeRO-1, only the optimizer states (such as the momentum and variance terms in Adam) are partitioned across $N_d$ GPUs, where $N_d$ is the number of data-parallel devices; this achieves a 4x memory reduction compared to the baseline by storing optimizer states as $12\Psi/N_d$ per GPU, where $\Psi$ is the model size in parameters, while keeping parameters and gradients fully replicated. ZeRO-2 extends this by also partitioning gradients, yielding an 8x memory reduction with per-GPU usage of $2\Psi + 14\Psi/N_d$, and introduces reduce-scatter operations to aggregate gradients without increasing communication volume beyond standard data parallelism. ZeRO-3 further partitions the model parameters themselves using dynamic sharding, resulting in linear memory scaling where per-GPU memory approximates $16\Psi/N_d$ (accounting for parameters, gradients, and optimizer states, plus fixed overheads like activations), enabling up to an $N_d$-fold reduction (for instance, 64x on 64 GPUs), though it incurs a modest 50% increase in communication via all-gather for forward passes and reduce-scatter for backward passes.

These stages are complemented by extensions for even greater memory efficiency on resource-constrained systems. ZeRO-Offload builds on ZeRO-2 by offloading partitioned optimizer states and gradients to CPU memory while retaining parameters and core computations on the GPU, minimizing data movement to preserve speed and achieving up to 8x GPU memory savings; for example, it enables a 13 billion-parameter model on a single V100 GPU, compared to just 1.4 billion parameters without offloading. ZeRO-Infinity advances this further by integrating NVMe storage alongside CPU offloading for all model states in ZeRO-3, employing an infinity offload engine and memory-centric tiling to handle operators that exceed GPU limits, thus supporting "infinite" effective model sizes limited only by aggregate system memory (CPU and NVMe) rather than individual GPU capacity.

The primary benefits of ZeRO include up to 10x reductions in GPU memory usage across its stages and extensions, allowing efficient training of models exceeding 1 trillion parameters on clusters with hundreds of GPUs, such as fine-tuning a 1 trillion-parameter model on a single node or scaling to 32 trillion parameters across 512 GPUs with sustained throughput. When combined with other parallelism strategies, ZeRO forms hybrid approaches like 3D parallelism for further scalability. Implementation in DeepSpeed is seamless and requires no changes to the model code, integrating automatically through a configuration file that specifies the stage and options like offloading devices. For instance, a basic ZeRO-3 configuration with CPU offloading might look like:

json

{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu" },
    "offload_param": { "device": "cpu" },
    "reduce_bucket_size": 5e8,
    "stage3_max_live_parameters": 1e9
  }
}

This config partitions states across GPUs and offloads as specified, with DeepSpeed handling the underlying communication primitives. The partitioning logic can be outlined in pseudocode as follows, illustrating the core forward-backward step with all-gather and reduce-scatter:

# Pseudocode for ZeRO-3 partitioning (simplified)
for each training step:
    # Forward pass: all-gather the full parameters from their shards
    # (in practice, parameters are gathered and released layer by layer)
    full_parameters = all_gather(parameters_shard)
    output = model_forward(full_parameters, input)

    # Backward pass: compute gradients, then reduce-scatter them into shards
    gradients = model_backward(output, input)
    gradients_shard = reduce_scatter(gradients)  # aggregate and shard gradients

    # Optimizer step: each rank updates only its own parameter shard using its
    # local gradient and optimizer-state shards; the next forward pass
    # re-gathers the updated parameters
    parameters_shard = update_optimizer(parameters_shard, gradients_shard, optimizer_shard)

This keeps per-GPU memory overhead minimal while overlapping communication with computation for efficiency.
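
To make the memory arithmetic above concrete, the following sketch (an illustrative calculation based on the figures quoted above, not DeepSpeed code; the 7.5-billion-parameter model and 64-GPU count are example values) estimates per-GPU model-state memory for mixed-precision Adam training under each ZeRO stage:

python

# Back-of-the-envelope estimate of per-GPU model-state memory for
# mixed-precision Adam training, using the 2/2/12 bytes-per-parameter
# breakdown quoted above (fp16 params, fp16 grads, fp32 optimizer states).
def per_gpu_model_state_gb(num_params, num_gpus, stage):
    p, g, o = 2 * num_params, 2 * num_params, 12 * num_params  # bytes
    if stage == 0:      # plain data parallelism: everything replicated
        total = p + g + o
    elif stage == 1:    # ZeRO-1: optimizer states partitioned
        total = p + g + o / num_gpus
    elif stage == 2:    # ZeRO-2: optimizer states + gradients partitioned
        total = p + (g + o) / num_gpus
    else:               # ZeRO-3: parameters partitioned as well
        total = (p + g + o) / num_gpus
    return total / 1e9  # GB; activations and temporary buffers not included

# Example: a hypothetical 7.5B-parameter model on 64 GPUs.
for stage in range(4):
    print(f"stage {stage}: {per_gpu_model_state_gb(7.5e9, 64, stage):.1f} GB per GPU")

The resulting figures (roughly 120 GB, 31 GB, 17 GB, and 1.9 GB per GPU) reproduce the 4x, 8x, and $N_d$-fold reductions described above.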

Parallelism Strategies

DeepSpeed implements parallelism strategies to distribute model and data across multiple GPUs, enabling efficient scaling of large-scale training. A key approach is 3D parallelism, which combines data parallelism (DP), tensor parallelism (TP), and pipeline parallelism (PP) to address memory and compute limitations in training massive models. In data parallelism, the model is replicated across GPUs, with input data partitioned into mini-batches processed independently; gradients are then synchronized via all-reduce operations. Tensor parallelism splits individual model tensors, such as the weight matrices in attention and feed-forward layers, across GPUs, allowing parallel computation of operations like matrix multiplications while reducing per-GPU memory usage. Pipeline parallelism divides the model into sequential stages, where each stage (comprising one or more layers) is assigned to a separate GPU or group of GPUs, enabling concurrent processing of micro-batches across stages to overlap forward and backward passes. DeepSpeed facilitates automatic hybrid 3D parallelism through configuration files, where users specify degrees for DP, TP, and PP without modifying model code; the system optimizes the combination for hardware constraints, such as interconnect bandwidth.

For pipeline parallelism, DeepSpeed employs 1F1B (one forward, one backward) scheduling, which begins the backward pass for a micro-batch as soon as it has completed its forward pass through all stages rather than waiting for every micro-batch to finish, minimizing pipeline bubbles (periods of GPU idleness) by tightly coupling computation with communication via blocking send-receive pairs. This scheduling supports dynamic buffer sizes based on the number of in-flight micro-batches, ensuring efficient resource utilization across stages. Throughput in 3D parallelism scales with the effective global batch size, given by $\text{global batch size} = \text{train micro-batch size per GPU} \times \text{gradient accumulation steps} \times \text{data-parallel degree}$, where the TP and PP degrees influence model partitioning but preserve the overall batch scaling from DP; communication overhead arises from inter-stage activations and gradients in PP, modeled as proportional to micro-batch size and stage boundaries, though exact costs depend on network topology.

DeepSpeed further extends parallelism to Mixture-of-Experts (MoE) models through DeepSpeed-MoE, which supports sparse activation by routing each input token to a subset of experts (typically the top 2 selected by a gating network), activating only those experts to reduce compute demands sublinearly relative to total parameters; for instance, a 1.6 trillion-parameter MoE can match the compute of a 10 billion-parameter dense model. Experts are distributed across GPUs using expert parallelism, integrated with DP and ZeRO for configurations like expert+data or expert+ZeRO, enabling scalable training of sparse architectures.

DeepSpeed supports sequence parallelism through DeepSpeed-Ulysses, introduced in 2023, to enable efficient training of transformer models with extremely long sequences. DeepSpeed-Ulysses partitions input tensors along the sequence dimension across GPUs and employs efficient all-to-all collective communication for distributed attention computations. This approach maintains constant communication volume as sequence length and the number of devices increase proportionally, overcoming memory and communication inefficiencies of prior sequence parallelism methods.
It enables training with sequence lengths up to 1 million tokens, achieves up to 2.5× faster training with 4× longer sequences compared to state-of-the-art baselines, and integrates with the ZeRO optimizer for enhanced memory efficiency. These strategies were integrated into DeepSpeed starting with its 2020 updates, initially powering models like the 17-billion-parameter Turing-NLG and later enabling trillion-parameter training on thousands of GPUs through flexible 3D combinations.
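
As a brief illustration of the global-batch-size relationship above (a hedged sketch: the batch-size keys are standard DeepSpeed config fields, but the cluster size and parallelism degrees are hypothetical example values):

python

# Illustration of the global-batch-size relationship in 3D parallelism:
# train_batch_size = micro-batch per GPU * gradient accumulation steps * DP degree.
# All values below are hypothetical.
num_gpus = 64
tensor_parallel_degree = 4           # set by the model-parallel implementation
pipeline_parallel_degree = 4
data_parallel_degree = num_gpus // (tensor_parallel_degree * pipeline_parallel_degree)  # 4

micro_batch_per_gpu = 2
gradient_accumulation_steps = 32
global_batch_size = micro_batch_per_gpu * gradient_accumulation_steps * data_parallel_degree
print(global_batch_size)  # 256

# The corresponding batch-size fields of a DeepSpeed config:
ds_config = {
    "train_micro_batch_size_per_gpu": micro_batch_per_gpu,
    "gradient_accumulation_steps": gradient_accumulation_steps,
    "train_batch_size": global_batch_size,
}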

Training Optimizations

Memory Management Techniques

DeepSpeed employs several techniques to optimize memory usage during model training, focusing on offloading, recomputation, and precision reduction to enable larger models on constrained hardware. One key approach is offloading, which moves memory-intensive components away from GPUs to slower but larger storage. In ZeRO-Offload, optimizer states and gradients are offloaded to CPU memory, allowing the training of models with up to 13 billion parameters on a single GPU with 32 GB of memory, such as an NVIDIA V100, by leveraging a high-performance CPU implementation of the Adam optimizer that is 5x to 7x faster than PyTorch's version. This technique integrates with ZeRO stage 2 to partition optimizer states and gradients while shifting the computation of optimizer steps to the CPU, freeing GPU resources without requiring changes beyond configuration updates. Extending this capability, ZeRO-Infinity introduces NVMe offloading to handle even larger tensors by spilling model states (including weights, gradients, and optimizer states) to both CPU and NVMe storage, enabling trillion-parameter models through memory-centric tiling that breaks operations into smaller, sequentially executed tiles to minimize peak memory usage. ZeRO-Infinity achieves linear scaling with the number of devices and overlaps computation with communication for better bandwidth utilization, as demonstrated in setups exceeding single-GPU limits.

Another prominent method is activation checkpointing, which trades additional computation for reduced memory by recomputing intermediate activations during the backward pass instead of storing them all. DeepSpeed's implementation optimizes this with features like activation partitioning across model-parallel GPUs, CPU offloading of checkpoints, and contiguous memory buffers to minimize fragmentation. A specialized enhancement is micro-checkpointing, which applies fine-grained checkpointing within layers (e.g., partitioning activations and using contiguous allocations specified by the number of checkpoints), providing more granular control over memory usage compared to standard checkpointing. This approach significantly lowers activation memory: without checkpointing it scales as $O(L \times H)$, where $L$ is the number of layers and $H$ is the hidden size, but with optimal checkpoint placement it reduces to approximately $O(\sqrt{L \times H})$.
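
A hedged configuration sketch tying these techniques together follows; the offload and activation-checkpointing keys are DeepSpeed config options, but the specific values are hypothetical and would need tuning for a real workload. Note also that the activation_checkpointing settings only take effect when the model's forward pass uses DeepSpeed's checkpointing API.

python

# Hypothetical DeepSpeed config combining CPU/NVMe offloading with
# activation checkpointing; values are illustrative, not recommendations.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "nvme", "nvme_path": "/local_nvme"},
        "offload_param": {"device": "cpu"},
    },
    "activation_checkpointing": {
        "partition_activations": True,        # shard activations across model-parallel GPUs
        "cpu_checkpointing": True,            # offload checkpointed activations to CPU memory
        "contiguous_memory_optimization": True,
        "number_checkpoints": 4,              # granularity of checkpoint placement
    },
    "fp16": {"enabled": True},
}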