Hopper (microarchitecture)
| Launched | September 20, 2022 |
|---|---|
| Designed by | Nvidia |
| Manufactured by | TSMC |
| Fabrication process | TSMC N4 |
| Product Series | Server/datacenter |
| Specifications | |
| L1 cache | 256 KB (per SM) |
| L2 cache | 50 MB |
| Memory support | HBM3 |
| PCIe support | PCI Express 5.0 |
| Media Engine | |
| Encoder supported | NVENC |
| History | |
| Predecessor | Ampere |
| Variant | Ada Lovelace (consumer and professional) |
| Successor | Blackwell |

Hopper is a graphics processing unit (GPU) microarchitecture developed by Nvidia. It is designed for datacenters and is used alongside the Lovelace microarchitecture.
Named for computer scientist and United States Navy rear admiral Grace Hopper, the Hopper architecture was leaked in November 2019 and officially revealed in March 2022. It improves upon its predecessors, the Turing and Ampere microarchitectures, featuring a new streaming multiprocessor, a faster memory subsystem, and a transformer acceleration engine.
Architecture
The Nvidia Hopper H100 GPU is implemented using the TSMC N4 process with 80 billion transistors. It consists of up to 144 streaming multiprocessors.[1] Due to the increased memory bandwidth provided by the SXM5 socket, the Nvidia Hopper H100 offers better performance when used in an SXM5 configuration than in the typical PCIe socket.[2]
Streaming multiprocessor
The streaming multiprocessors for Hopper improve upon the Turing and Ampere microarchitectures, although the maximum number of concurrent warps per streaming multiprocessor (SM) remains the same as in Ampere: 64.[3] The Hopper architecture provides a Tensor Memory Accelerator (TMA), which supports bidirectional asynchronous memory transfer between shared memory and global memory.[4] Under TMA, applications may transfer tensors of up to five dimensions. When writing from shared memory to global memory, elementwise reduction and bitwise operators may be used, avoiding registers and SM instructions while enabling users to write warp-specialized code. TMA is exposed through cuda::memcpy_async.[5]
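The descriptor-driven transfer model behind TMA can be illustrated with a small host-side C++ sketch. This is illustrative only, and every name in it is invented for the sketch; the real feature is a hardware copy engine reached through cuda::memcpy_async and CUDA's TMA descriptor APIs, not host code. The point it demonstrates is that the descriptor is built once (base pointer, tensor shape, tile shape), after which tile loads and reduction stores need no per-element address arithmetic from the caller.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Host-side model of a TMA-style descriptor (all names invented for this
// sketch): built once from base pointer, tensor shape, and tile shape, so
// tile transfers need no per-element address arithmetic from the caller,
// mirroring how TMA frees SM registers and instructions.
struct TileDescriptor {
    const float* base;   // source tensor in "global" memory
    int rows, cols;      // full tensor shape (2D here; TMA handles up to 5D)
    int tile_r, tile_c;  // tile shape
};

// Load one (tr, tc) tile into a dense "shared memory" buffer.
std::vector<float> load_tile(const TileDescriptor& d, int tr, int tc) {
    std::vector<float> tile(d.tile_r * d.tile_c, 0.0f);
    for (int r = 0; r < d.tile_r; ++r)
        for (int c = 0; c < d.tile_c; ++c) {
            int gr = tr * d.tile_r + r, gc = tc * d.tile_c + c;
            if (gr < d.rows && gc < d.cols)
                tile[r * d.tile_c + c] = d.base[gr * d.cols + gc];
        }
    return tile;
}

// Store a tile back with an elementwise max reduction, mirroring TMA's
// reduction stores that avoid an explicit read-modify-write in registers.
void store_tile_max(float* global, const TileDescriptor& d, int tr, int tc,
                    const std::vector<float>& tile) {
    for (int r = 0; r < d.tile_r; ++r)
        for (int c = 0; c < d.tile_c; ++c) {
            int gr = tr * d.tile_r + r, gc = tc * d.tile_c + c;
            if (gr < d.rows && gc < d.cols) {
                float& dst = global[gr * d.cols + gc];
                dst = std::max(dst, tile[r * d.tile_c + c]);
            }
        }
}
```

On the GPU, the analogous load and store are issued asynchronously so that compute overlaps the transfer; the sketch omits that overlap and models only the addressing and reduction behavior.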
When parallelizing applications, developers can use thread block clusters. Thread blocks may perform atomics in the shared memory of other thread blocks within their cluster, otherwise known as distributed shared memory. Distributed shared memory may be used by an SM simultaneously with the L2 cache; when used to communicate data between SMs, this can utilize the combined bandwidth of distributed shared memory and L2. The maximum portable cluster size is 8, although the Nvidia Hopper H100 can support a cluster size of 16 via the cudaFuncAttributeNonPortableClusterSizeAllowed attribute, potentially at the cost of a reduced number of active blocks.[6] With L2 multicasting and distributed shared memory, the bandwidth required for dynamic random-access memory reads and writes is reduced.[7]
Hopper features improved single-precision floating-point (FP32) throughput, with twice as many FP32 operations per cycle per SM as its predecessor. Additionally, the Hopper architecture adds support for new instructions, including DPX instructions that accelerate dynamic programming algorithms such as Smith–Waterman.[6] Like Ampere, Hopper supports TensorFloat-32 (TF32) arithmetic; the mapping pattern for both architectures is identical.[8]
Memory
The Nvidia Hopper H100 supports HBM3 and HBM2e memory up to 80 GB; the HBM3 memory system provides 3 TB/s of bandwidth, an increase of 50% over the Nvidia Ampere A100's 2 TB/s. Across the architecture, L2 cache capacity and bandwidth were increased.[9]
Hopper allows CUDA compute kernels to use automatic inline compression, including on individual memory allocations, which allows memory to be accessed at higher bandwidth. This feature does not increase the amount of memory available to the application, because the data (and thus its compressibility) may change at any time. The compressor automatically chooses between several compression algorithms.[9]
The Nvidia Hopper H100 increases the capacity of the combined L1 cache, texture cache, and shared memory to 256 KB. Like its predecessors, it combines L1 and texture caches into a unified cache designed to be a coalescing buffer. The attribute cudaFuncAttributePreferredSharedMemoryCarveout may be used to define the carveout of the L1 cache. Hopper introduces enhancements to NVLink through a new generation with faster overall communication bandwidth.[10]
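The carveout arithmetic can be sketched in a few lines of C++. This is illustrative only, not the CUDA API: the function names are invented, and the real driver rounds the requested carveout to a small set of supported configurations rather than splitting linearly. The 256 KB combined capacity and the 228 KB shared-memory ceiling are Hopper's documented figures.

```cpp
#include <algorithm>
#include <cassert>

// Illustrative split of Hopper's combined per-SM L1/texture/shared capacity
// for a given carveout percentage (as requested through
// cudaFuncAttributePreferredSharedMemoryCarveout). The linear split below is
// a simplification: the real driver rounds to discrete supported sizes.
constexpr int kCombinedKB = 256;   // combined L1 + shared capacity per SM
constexpr int kMaxSharedKB = 228;  // largest shared-memory carveout on Hopper

int shared_kb_for_carveout(int percent) {
    int requested = kCombinedKB * percent / 100;
    return std::min(requested, kMaxSharedKB);
}

int l1_kb_for_carveout(int percent) {
    return kCombinedKB - shared_kb_for_carveout(percent);
}
```

For example, requesting a 100% carveout still leaves 28 KB behaving as L1, since shared memory tops out at 228 KB of the 256 KB budget.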
Memory synchronization domains
Some CUDA applications may experience interference when performing fence or flush operations due to memory ordering. Because the GPU cannot know which writes are guaranteed to be visible and which are visible only by chance timing, it may wait on unnecessary memory operations, slowing down fence and flush operations. For example, when one kernel performs computations in GPU memory and a parallel kernel communicates with a peer, the local kernel will flush its writes, resulting in slower NVLink or PCIe writes. In the Hopper architecture, memory synchronization domains let the GPU narrow the set of writes that a fence operation must wait on.[11]
DPX instructions
The Hopper architecture math application programming interface (API) exposes functions in the SM such as __viaddmin_s16x2_relu, which performs a per-halfword max(min(a + b, c), 0). In the Smith–Waterman algorithm, __vimax3_s16x2_relu can be used, a three-way max followed by a clamp to zero.[12] Similarly, Hopper speeds up implementations of the Needleman–Wunsch algorithm.[13]
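A host-side C++ emulation (not the CUDA intrinsics themselves) may make the fused per-halfword semantics concrete: __viaddmin_s16x2_relu computes max(min(a + b, c), 0) on each signed 16-bit half of a 32-bit operand, and __vimax3_s16x2_relu computes a three-way max clamped to zero. The packing helpers are invented for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Helpers to view a 32-bit operand as two signed 16-bit halfwords
// (invented for this sketch; DPX operates on such packed values in hardware).
static int16_t lo(uint32_t v) { return (int16_t)(v & 0xFFFF); }
static int16_t hi(uint32_t v) { return (int16_t)(v >> 16); }
static uint32_t pack(int16_t h, int16_t l) {
    return ((uint32_t)(uint16_t)h << 16) | (uint16_t)l;
}

// Emulates __viaddmin_s16x2_relu: per-halfword max(min(a + b, c), 0).
uint32_t viaddmin_s16x2_relu(uint32_t a, uint32_t b, uint32_t c) {
    int16_t l = (int16_t)std::max(std::min(lo(a) + lo(b), (int)lo(c)), 0);
    int16_t h = (int16_t)std::max(std::min(hi(a) + hi(b), (int)hi(c)), 0);
    return pack(h, l);
}

// Emulates __vimax3_s16x2_relu: per-halfword max(a, b, c) clamped to zero.
uint32_t vimax3_s16x2_relu(uint32_t a, uint32_t b, uint32_t c) {
    int16_t l = (int16_t)std::max({(int)lo(a), (int)lo(b), (int)lo(c), 0});
    int16_t h = (int16_t)std::max({(int)hi(a), (int)hi(b), (int)hi(c), 0});
    return pack(h, l);
}
```

In a Smith–Waterman inner loop these fused operations replace several separate add, min/max, and clamp instructions per cell, which is where the reported DPX speedups come from.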
Transformer engine
The Hopper architecture was the first Nvidia architecture to implement the transformer engine.[14] The transformer engine accelerates computations by dynamically reducing them from higher numerical precisions (e.g., FP16) to lower precisions that are faster to perform (e.g., FP8) when the loss in precision is deemed acceptable.[14] The transformer engine can also dynamically allocate bits in the chosen precision to either the mantissa or the exponent at runtime to maximize precision.[5]
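A minimal C++ sketch of the scaling idea behind FP8 execution, assuming a per-tensor history of absolute-maximum values and the FP8 E4M3 format's maximum representable magnitude of 448: a scale factor is chosen so the data fits the FP8 range before casting, and the quantize-dequantize round trip shows the precision loss being traded away. The function names and the mantissa-rounding emulation are illustrative assumptions, not Nvidia's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// E4M3's largest representable magnitude.
constexpr float kE4M3Max = 448.0f;

// Pick a per-tensor scale from a history of observed absolute maxima so that
// amax * scale lands at the top of the FP8 range (a common delayed-scaling
// policy in FP8 training software; the window handling here is simplified).
float choose_scale(const std::vector<float>& amax_history) {
    float amax = *std::max_element(amax_history.begin(), amax_history.end());
    if (amax <= 0.0f) return 1.0f;
    return kE4M3Max / amax;  // multiply by this before casting to FP8
}

// Quantize-dequantize through a coarse FP8-like grid to show precision loss.
float fake_fp8_roundtrip(float x, float scale) {
    float scaled = std::clamp(x * scale, -kE4M3Max, kE4M3Max);
    // E4M3 stores 3 mantissa bits; emulate by rounding the significand to
    // 4 significant binary digits (1 implicit + 3 stored).
    int exp;
    float m = std::frexp(scaled, &exp);       // scaled = m * 2^exp, 0.5 <= |m| < 1
    float q = std::round(m * 16.0f) / 16.0f;  // 4-bit significand grid
    return std::ldexp(q, exp) / scale;
}
```

For instance, a value of 1.1 round-trips to 1.125 on this grid, a roughly 2% error, which is the kind of loss the engine accepts only for layers where it does not harm model accuracy.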
Power efficiency
The SXM5 form factor H100 has a thermal design power (TDP) of 700 watts. Because of its greater asynchrony, the Hopper architecture can attain high degrees of utilization and may therefore offer better performance per watt.[15]
Grace Hopper
| Designed by | Nvidia |
|---|---|
| Manufactured by | TSMC |
| Fabrication process | TSMC 4N |
| Codename | Grace Hopper |
| Specifications | |
| Compute | GPU: 132 Hopper SMs CPU: 72 Neoverse V2 cores |
| Shader clock rate | 1980 MHz |
| Memory support | GPU: 96 GB HBM3 or 144 GB HBM3e CPU: 480 GB LPDDR5X |
The GH200 combines a Hopper-based H100 GPU with a Grace-based 72-core CPU on a single module. The total power draw of the module is up to 1000 W. CPU and GPU are connected via NVLink, which provides memory coherence between CPU and GPU memory.[16]
History
In November 2019, a well-known Twitter account posted a tweet revealing that the next architecture after Ampere would be called Hopper, named after computer scientist and United States Navy rear admiral Grace Hopper, one of the first programmers of the Harvard Mark I. The account stated that Hopper would be based on a multi-chip module design, which would result in a yield gain with lower wastage.[17]
During the March 2022 Nvidia GTC, Nvidia announced Hopper.[18]
In late 2022, due to US regulations limiting the export of chips to the People's Republic of China, Nvidia adapted the H100 chip for the Chinese market as the H800. This model has lower bandwidth than the original H100.[19][20] In late 2023, the US government announced new restrictions on the export of AI chips to China, covering the A800 and H800 models.[21] This led Nvidia to create another chip based on the Hopper microarchitecture, the H20, a modified version of the H100. The H20 had become the most prominent chip in the Chinese market as of 2025.[22]
By 2023, during the AI boom, H100s were in great demand. Larry Ellison of Oracle Corporation said that year that at a dinner with Nvidia CEO Jensen Huang, he and Elon Musk of Tesla, Inc. and xAI "were begging" for H100s, "I guess is the best way to describe it. An hour of sushi and begging".[23]
In January 2024, Raymond James Financial analysts estimated that Nvidia was selling the H100 GPU in the price range of $25,000 to $30,000 each, while on eBay, individual H100s cost over $40,000.[24] As of February 2024, Nvidia was reportedly shipping H100 GPUs to data centers in armored cars.[25]
H100 accelerator and DGX H100
Comparison of accelerators used in DGX:[26][27][28]
| Model | Architecture | Socket | FP32 CUDA cores | FP64 cores (excl. tensor) | Mixed INT32/FP32 cores | INT32 cores | Boost clock | Memory clock | Memory bus width | Memory bandwidth | VRAM | Single precision (FP32) | Double precision (FP64) | INT8 (non-tensor) | INT8 dense tensor | INT32 | FP4 dense tensor | FP16 | FP16 dense tensor | bfloat16 dense tensor | TensorFloat-32 (TF32) dense tensor | FP64 dense tensor | Interconnect (NVLink) | GPU | L1 Cache | L2 Cache | TDP | Die size | Transistor count | Process | Launched |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| P100 | Pascal | SXM/SXM2 | 3584 | 1792 | N/A | N/A | 1480 MHz | 1.4 Gbit/s HBM2 | 4096-bit | 720 GB/sec | 16 GB HBM2 | 10.6 TFLOPS | 5.3 TFLOPS | N/A | N/A | N/A | N/A | 21.2 TFLOPS | N/A | N/A | N/A | N/A | 160 GB/sec | GP100 | 1344 KB (24 KB × 56) | 4096 KB | 300 W | 610 mm2 | 15.3 B | TSMC 16FF+ | Q2 2016 |
| V100 16GB | Volta | SXM2 | 5120 | 2560 | N/A | 5120 | 1530 MHz | 1.75 Gbit/s HBM2 | 4096-bit | 900 GB/sec | 16 GB HBM2 | 15.7 TFLOPS | 7.8 TFLOPS | 62 TOPS | N/A | 15.7 TOPS | N/A | 31.4 TFLOPS | 125 TFLOPS | N/A | N/A | N/A | 300 GB/sec | GV100 | 10240 KB (128 KB × 80) | 6144 KB | 300 W | 815 mm2 | 21.1 B | TSMC 12FFN | Q3 2017 |
| V100 32GB | Volta | SXM3 | 5120 | 2560 | N/A | 5120 | 1530 MHz | 1.75 Gbit/s HBM2 | 4096-bit | 900 GB/sec | 32 GB HBM2 | 15.7 TFLOPS | 7.8 TFLOPS | 62 TOPS | N/A | 15.7 TOPS | N/A | 31.4 TFLOPS | 125 TFLOPS | N/A | N/A | N/A | 300 GB/sec | GV100 | 10240 KB (128 KB × 80) | 6144 KB | 350 W | 815 mm2 | 21.1 B | TSMC 12FFN | |
| A100 40GB | Ampere | SXM4 | 6912 | 3456 | 6912 | N/A | 1410 MHz | 2.4 Gbit/s HBM2 | 5120-bit | 1.52 TB/sec | 40 GB HBM2 | 19.5 TFLOPS | 9.7 TFLOPS | N/A | 624 TOPS | 19.5 TOPS | N/A | 78 TFLOPS | 312 TFLOPS | 312 TFLOPS | 156 TFLOPS | 19.5 TFLOPS | 600 GB/sec | GA100 | 20736 KB (192 KB × 108) | 40960 KB | 400 W | 826 mm2 | 54.2 B | TSMC N7 | Q1 2020 |
| A100 80GB | Ampere | SXM4 | 6912 | 3456 | 6912 | N/A | 1410 MHz | 3.2 Gbit/s HBM2e | 5120-bit | 1.52 TB/sec | 80 GB HBM2e | 19.5 TFLOPS | 9.7 TFLOPS | N/A | 624 TOPS | 19.5 TOPS | N/A | 78 TFLOPS | 312 TFLOPS | 312 TFLOPS | 156 TFLOPS | 19.5 TFLOPS | 600 GB/sec | GA100 | 20736 KB (192 KB × 108) | 40960 KB | 400 W | 826 mm2 | 54.2 B | TSMC N7 | |
| H100 | Hopper | SXM5 | 16896 | 4608 | 16896 | N/A | 1980 MHz | 5.2 Gbit/s HBM3 | 5120-bit | 3.35 TB/sec | 80 GB HBM3 | 67 TFLOPS | 34 TFLOPS | N/A | 1.98 POPS | N/A | N/A | N/A | 990 TFLOPS | 990 TFLOPS | 495 TFLOPS | 67 TFLOPS | 900 GB/sec | GH100 | 25344 KB (192 KB × 132) | 51200 KB | 700 W | 814 mm2 | 80 B | TSMC 4N | Q3 2022 |
| H200 | Hopper | SXM5 | 16896 | 4608 | 16896 | N/A | 1980 MHz | 6.3 Gbit/s HBM3e | 6144-bit | 4.8 TB/sec | 141 GB HBM3e | 67 TFLOPS | 34 TFLOPS | N/A | 1.98 POPS | N/A | N/A | N/A | 990 TFLOPS | 990 TFLOPS | 495 TFLOPS | 67 TFLOPS | 900 GB/sec | GH100 | 25344 KB (192 KB × 132) | 51200 KB | 1000 W | 814 mm2 | 80 B | TSMC 4N | Q3 2023 |
| B100 | Blackwell | SXM6 | N/A | N/A | N/A | N/A | N/A | 8 Gbit/s HBM3e | 8192-bit | 8 TB/sec | 192 GB HBM3e | N/A | N/A | N/A | 3.5 POPS | N/A | 7 PFLOPS | N/A | 1.98 PFLOPS | 1.98 PFLOPS | 989 TFLOPS | 30 TFLOPS | 1.8 TB/sec | GB100 | N/A | N/A | 700 W | N/A | 208 B | TSMC 4NP | Q4 2024 |
| B200 | Blackwell | SXM6 | N/A | N/A | N/A | N/A | N/A | 8 Gbit/s HBM3e | 8192-bit | 8 TB/sec | 192 GB HBM3e | N/A | N/A | N/A | 4.5 POPS | N/A | 9 PFLOPS | N/A | 2.25 PFLOPS | 2.25 PFLOPS | 1.2 PFLOPS | 40 TFLOPS | 1.8 TB/sec | GB100 | N/A | N/A | 1000 W | N/A | 208 B | TSMC 4NP |
Export controls and international trade issues
[edit]In early 2026, Nvidia’s Hopper-based H200 AI accelerator became a focal point in international trade disputes involving U.S. export policy and Chinese import controls. Although the U.S. government approved the limited export of H200 chips to China under specific security conditions, reports indicated that Chinese customs officials prevented shipments of the processors from entering the country despite the U.S. clearance, leading suppliers to pause production of H200 components amid uncertainty over the import block. Chinese authorities reportedly instructed domestic firms against purchasing the chips unless necessary, though no formal ban was publicly announced and the long-term status of the restrictions remained unclear. The situation highlighted the geopolitical sensitivities surrounding advanced AI hardware exports and the complex interplay between U.S. export regulations and Chinese import policies.[29]
References
Citations
- ^ Elster & Haugdahl 2022, p. 4.
- ^ Nvidia 2023c, p. 20.
- ^ Nvidia 2023b, p. 9.
- ^ Fujita et al. 2023, p. 6.
- ^ a b "Nvidia's Next GPU Shows That Transformers Are Transforming AI - IEEE Spectrum". spectrum.ieee.org. Retrieved October 23, 2024.
- ^ a b Nvidia 2023b, p. 10.
- ^ Vishal Mehta (September 2022). CUDA Programming Model for Hopper Architecture. Santa Clara: Nvidia. Retrieved May 29, 2023.
- ^ Fujita et al. 2023, p. 4.
- ^ a b Nvidia 2023b, p. 11.
- ^ Nvidia 2023b, p. 12.
- ^ Nvidia 2023a, p. 44.
- ^ Tirumala, Ajay; Eaton, Joe; Tyrlik, Matt (December 8, 2022). "Boosting Dynamic Programming Performance Using NVIDIA Hopper GPU DPX Instructions". Nvidia. Retrieved May 29, 2023.
- ^ Harris, Dion (March 22, 2022). "NVIDIA Hopper GPU Architecture Accelerates Dynamic Programming Up to 40x Using New DPX Instructions". Nvidia. Retrieved May 29, 2023.
- ^ a b Salvator, Dave (March 22, 2022). "H100 Transformer Engine Supercharges AI Training, Delivering Up to 6x Higher Performance Without Losing Accuracy". Nvidia. Retrieved May 29, 2023.
- ^ Elster & Haugdahl 2022, p. 8.
- ^ "NVIDIA: Grace Hopper Has Entered Full Production & Announcing DGX GH200 AI Supercomputer". Anandtech. May 29, 2023. Archived from the original on May 29, 2023.
- ^ Pirzada, Usman (November 16, 2019). "NVIDIA Next Generation Hopper GPU Leaked – Based On MCM Design, Launching After Ampere". Wccftech. Retrieved May 29, 2023.
- ^ Vincent, James (March 22, 2022). "Nvidia reveals H100 GPU for AI and teases 'world's fastest AI supercomputer'". The Verge. Retrieved May 29, 2023.
- ^ "Nvidia tweaks flagship H100 chip for export to China as H800". Reuters. Archived from the original on November 22, 2023. Retrieved January 28, 2025.
- ^ "NVIDIA Prepares H800 Adaptation of H100 GPU for the Chinese Market". TechPowerUp. Archived from the original on September 2, 2023. Retrieved January 28, 2025.
- ^ Leswing, Kif (October 17, 2023). "U.S. curbs export of more AI chips, including Nvidia H800, to China". CNBC. Retrieved January 28, 2025.
- ^ "Here are the chips that Nvidia can sell to China". qz.com.
- ^ Fitch, Asa (February 26, 2024). "Nvidia's Stunning Ascent Has Also Made It a Giant Target". The Wall Street Journal. Retrieved February 27, 2024.
- ^ Vanian, Jonathan (January 18, 2024). "Mark Zuckerberg indicates Meta is spending billions of dollars on Nvidia AI chips". CNBC. Retrieved June 6, 2024.
- ^ Bousquette, Isabelle; Lin, Belle (February 14, 2024). "Armored Cars and Trillion Dollar Price Tags: How Some Tech Leaders Want to Solve the Chip Shortage". The Wall Street Journal. Retrieved May 30, 2024.
- ^ Smith, Ryan (March 22, 2022). "NVIDIA Hopper GPU Architecture and H100 Accelerator Announced: Working Smarter and Harder". AnandTech. Archived from the original on September 23, 2023.
- ^ Smith, Ryan (May 14, 2020). "NVIDIA Ampere Unleashed: NVIDIA Announces New GPU Architecture, A100 GPU, and Accelerator". AnandTech. Archived from the original on July 29, 2024.
- ^ Garreffa, Anthony (September 17, 2017). "NVIDIA Tesla V100 Tested: Near Unbelievable GPU Power". TweakTown.com. Retrieved December 30, 2025.
- ^ "China blocks Nvidia H200 AI chips that US government cleared for export – report". The Guardian. January 17, 2026. Retrieved January 21, 2026.
Works cited
- Elster, Anne; Haugdahl, Tor (March 2022). "Nvidia Hopper GPU and Grace CPU Highlights". Computing in Science & Engineering. 24 (2): 95–100. Bibcode:2022CSE....24b..95E. doi:10.1109/MCSE.2022.3163817. hdl:11250/3051840. S2CID 249474974. Retrieved May 29, 2023.
- Fujita, Kohei; Yamaguchi, Takuma; Kikuchi, Yuma; Ichimura, Tsuyoshi; Hori, Muneo; Maddegedara, Lalith (April 2023). "Calculation of cross-correlation function accelerated by TensorFloat-32 Tensor Core operations on NVIDIA's Ampere and Hopper GPUs". Journal of Computational Science. 68. doi:10.1016/j.jocs.2023.101986.
- CUDA C++ Programming Guide (PDF). Nvidia. April 17, 2023.
- Hopper Tuning Guide (PDF). Nvidia. April 13, 2023.
- NVIDIA H100 Tensor Core GPU Architecture (PDF). Nvidia. 2022.[permanent dead link]
Further reading
- Choquette, Jack (May 2023). "NVIDIA Hopper H100 GPU: Scaling Performance". IEEE Micro. 43 (3): 9–17. doi:10.1109/MM.2023.3256796. S2CID 257544490. Retrieved May 29, 2023.
- Moore, Samuel (April 8, 2022). "Nvidia's Next GPU Shows That Transformers Are Transforming AI". IEEE Spectrum. Retrieved May 29, 2023.
- Morgan, Timothy (March 31, 2022). "Deep Dive Into Nvidia's "Hopper" GPU Architecture". The Next Platform. Retrieved May 29, 2023.
Hopper (microarchitecture)
View on GrokipediaHopper is a graphics processing unit (GPU) microarchitecture developed by Nvidia for datacenter computing, succeeding the Ampere architecture and debuting in the H100 Tensor Core GPU.[1][2] Named after pioneering computer scientist Grace Hopper, the architecture was announced on March 22, 2022, and emphasizes accelerated computing for artificial intelligence, high-performance computing, and data analytics rather than consumer graphics.[1][3] The Hopper microarchitecture introduces the Transformer Engine, which combines fourth-generation Tensor Cores with support for FP8 precision to deliver up to 9x faster AI training compared to prior generations, alongside dynamic precision scaling for mixed-precision workloads.[2][3] Fabricated using TSMC's 4N process with over 80 billion transistors, Hopper GPUs like the H100 enable terabyte-scale accelerated computing through innovations such as confidential computing for secure AI processing and enhanced NVLink interconnects for multi-GPU scalability.[2][3] These advancements position Hopper as a foundational technology for large-scale language models and scientific simulations, powering systems like the NVIDIA Grace Hopper Superchip.[4][1]
Overview
Design Goals and Innovations
The NVIDIA Hopper microarchitecture was designed to deliver transformative performance for large-scale AI training and inference, particularly for trillion-parameter models, while enabling exascale high-performance computing (HPC) workloads with enhanced security and scalability.[2] Key objectives included achieving up to 30x faster inference on large language models compared to the prior Ampere architecture (A100), through optimizations for transformer-based neural networks, and providing order-of-magnitude improvements in compute throughput—up to 6x overall—via advanced precision formats and interconnects.[3] These goals addressed the escalating demands of AI supercomputing, targeting secure scaling from enterprise deployments to massive clusters supporting 1 exaFLOP of FP8 sparse AI compute across up to 256 GPUs.[3][2] Central to Hopper's innovations is the Transformer Engine, an extension of Tensor Core technology that dynamically mixes FP8 and FP16 precisions to accelerate AI model training by up to 9x and inference by up to 30x over A100, while tripling FLOPS performance in formats like TF32, FP64, FP16, and INT8.[3][2] FP8 precision halves data storage requirements and doubles throughput per streaming multiprocessor (SM) compared to FP16, enabling efficient handling of the precision needs in modern deep learning without accuracy loss.[3] Complementing this, DPX instructions optimize dynamic programming algorithms—such as those in bioinformatics (e.g., Smith-Waterman)—delivering up to 7x speedup over Ampere GPUs and 40x over dual-socket CPU servers.[2] The architecture also introduces confidential computing via hardware-enforced memory encryption and secure Tensor Memory Access (TMA), marking the first GPU platform to protect data and models during computation against insider threats or compromised software.[2] Hopper incorporates TSMC's 4N process node, packing over 80 billion transistors into an 814 mm² die, paired with HBM3 memory offering 3 TB/s 
bandwidth—double that of A100—and a 50 MB L2 cache for improved data locality.[3] NVLink 4 provides 900 GB/s bidirectional GPU-to-GPU bandwidth (7x PCIe Gen5), with NVSwitch enabling 57.6 TB/s all-to-all communication in large-scale systems like DGX GH200, facilitating strong scaling and reduced latencies for distributed training.[2] Second-generation Multi-Instance GPU (MIG) supports up to 7 isolated instances per GPU with dedicated confidential computing and hardware acceleration, enhancing workload isolation and efficiency in multi-tenant environments.[2] These features collectively prioritize architectural efficiency, simplifying programming while minimizing overheads for mainstream AI and HPC applications.[3]Position in NVIDIA's Architecture Lineage
Hopper represents the successor to NVIDIA's Ampere microarchitecture in the company's datacenter GPU lineage, building directly on the compute-oriented advancements introduced with the A100 in May 2020. Whereas Ampere emphasized sparse tensor operations and third-generation Tensor Cores for mixed-precision AI workloads, Hopper refines these elements with fourth-generation Tensor Cores and the introduction of the Transformer Engine, which dynamically scales precision from FP8 to FP16 to optimize transformer model performance without accuracy loss. This progression underscores NVIDIA's shift from graphics-centric designs in earlier architectures like Fermi (2010) and Kepler (2012) toward specialized accelerators for high-performance computing (HPC) and artificial intelligence, as evidenced by Hopper's deployment in systems targeting exascale supercomputing.[3][2] Positioned as a datacenter-exclusive architecture—unlike the hybrid consumer/datacenter Turing (2018) and Ampere—Hopper powers the H100 Tensor Core GPU, announced on March 22, 2022, and focuses on multi-instance GPU (MIG) partitioning for secure workload isolation, NVLink 4.0 interconnects for enhanced multi-GPU scaling, and confidential computing features via hardware-rooted trust. It bridges Ampere's bandwidth improvements (with HBM3 memory) and the subsequent Blackwell architecture, unveiled in March 2024, which further scales to support trillion-parameter models with dual-die designs and fifth-generation Tensor Cores. Hopper's emphasis on AI training efficiency, delivering up to 9x performance over A100 on large language models, solidified NVIDIA's lead in accelerated computing amid surging demand for generative AI infrastructure.[3][2][5] In the broader evolutionary context, Hopper continues the post-Volta (2017) trend of privileging tensor-accelerated matrix math over traditional rasterization, with architectures named after computing pioneers: Volta (V), Ampere (A), Hopper (H), and Blackwell (B). 
This lineage prioritizes causal factors like Moore's Law scaling limits and workload-specific bottlenecks, such as memory bandwidth and precision trade-offs, over general-purpose versatility seen in Maxwell (2014) or Pascal (2016). Empirical benchmarks confirm Hopper's causal advancements, with H100 achieving 4 petaFLOPS FP8 throughput per GPU, enabling systems like Frontier to reach exascale performance in 2022.[3][2]Development History
Origins and Announcement
The Hopper microarchitecture derives its name from Grace Hopper, a pioneering U.S. computer scientist and rear admiral known for her contributions to programming languages and early computing systems.[1] NVIDIA developed Hopper as the successor to its Ampere architecture to advance capabilities in accelerated computing, particularly for artificial intelligence and high-performance computing applications requiring enhanced tensor processing and memory efficiency.[3] NVIDIA formally announced the Hopper microarchitecture on March 22, 2022, during its GPU Technology Conference (GTC), positioning it as the foundation for next-generation data center GPUs.[1][3] The announcement highlighted the H100 Tensor Core GPU, fabricated on TSMC's 4NP process with over 80 billion transistors, as the inaugural product embodying Hopper's innovations for AI training and inference.[1][6] This reveal came two years after Ampere's launch, underscoring NVIDIA's accelerated cadence in architecture iterations driven by demand for scalable AI infrastructure.[1]Engineering Milestones and Production
NVIDIA revealed the Hopper microarchitecture on March 22, 2022, during its GTC keynote, highlighting its design for accelerating large-scale AI and high-performance computing workloads.[3] The architecture powers the GH100 GPU die, fabricated on TSMC's custom 4N process node with a die area of 814 mm² and 80 billion transistors, achieving unprecedented density for datacenter GPUs.[3][2] A pivotal engineering milestone was the successful implementation of fourth-generation Tensor Cores integrated with the Transformer Engine, enabling dynamic precision management that delivers up to 6x faster AI training compared to prior architectures.[3] This was complemented by the introduction of HBM3 memory support, providing 3 TB/s bandwidth in the H100 SXM variant, marking the first GPU to utilize this high-speed memory standard.[3] The H100 Tensor Core GPU entered full production on September 20, 2022, with initial shipments commencing in October 2022 to enable partner systems and services.[7] Production ramped amid high demand, with NVIDIA anticipating shipment of around 550,000 units in 2023 to meet AI infrastructure needs.[8] Variants include the SXM5 module with 132 streaming multiprocessors and the PCIe card with 114, supporting scalable deployments in DGX systems starting in Q3 2022.[9][3]Architectural Components
Streaming Multiprocessor
The Streaming Multiprocessor (SM) in the NVIDIA Hopper microarchitecture serves as the fundamental processing unit, executing parallel thread workloads through arrays of CUDA cores, Tensor Cores, and associated scheduling hardware. Each SM contains 128 FP32 CUDA cores and 4 fourth-generation Tensor Cores, enabling high-throughput scalar and matrix operations.[3] This configuration delivers 2x the clock-for-clock FP64 and FP32 performance per SM compared to the Ampere architecture's SMs, achieved through architectural refinements in execution pipelines and instruction throughput.[3] Hopper SMs introduce independent thread scheduling via thread block clusters, allowing concurrent execution across multiple SMs with hardware-accelerated synchronization barriers that reduce software overhead for cooperative workloads. Distributed shared memory supports direct inter-SM communication within graphics processing clusters (GPCs), minimizing cache round-trips and enhancing multi-instance GPU (MIG) partitioning efficiency by up to 3x in compute density over Ampere. Each SM allocates 256 KB of combined L1 cache and shared memory—1.33x larger than Ampere's 192 KB—configurable in increments up to 228 KB for flexible workload optimization.[3][10] Tensor Cores in Hopper SMs support FP8 precision with E4M3 and E5M2 formats, doubling matrix-multiply-accumulate (MMA) throughput relative to FP16/BF16 while halving memory footprint, yielding up to 4x overall rates versus Ampere when sparsity acceleration is applied. The Tensor Memory Accelerator (TMA) integrates into SMs for asynchronous, descriptor-driven data transfers between global memory and shared memory, overlapping compute with memory operations to boost efficiency in large-model training. 
Additionally, DPX instructions accelerate dynamic programming algorithms, such as Smith-Waterman for sequence alignment, providing up to 7x speedup over Ampere implementations by leveraging dedicated SM hardware paths.[3][10]Tensor Cores and Transformer Engine
The fourth-generation Tensor Cores in the Hopper microarchitecture represent an evolution from those in the Ampere architecture, delivering double the raw dense and sparse matrix multiply-accumulate (MMA) throughput per streaming multiprocessor (SM) at equivalent clock speeds.[3] These cores support a range of precisions, including FP64 for high-precision scientific computing, TF32 and FP16/BF16 for deep learning, INT8 for inference, and the newly introduced FP8 formats (E4M3 and E5M2) to reduce memory footprint by half while doubling computational throughput relative to FP16.[3][2] Hopper Tensor Cores also incorporate sparsity acceleration, which exploits structured sparsity in neural networks to achieve up to double the effective performance on compatible workloads.[3] In the H100 GPU, this enables peak FP8 performance of 2000 TFLOPS (scaling to 4000 TFLOPS with sparsity) in the SXM5 variant.[3] The Transformer Engine integrates these Tensor Cores with specialized software libraries to optimize transformer-based models, which dominate large-scale AI training and inference.[2] Introduced as part of Hopper at GTC 2022, it enables dynamic per-layer precision selection—switching between FP8 for compute-intensive operations and higher-precision formats like FP16 to preserve model accuracy—via automated scaling and statistical analysis during forward and backward passes.[3][11] This hardware-software approach leverages FP8's efficiency without requiring format conversions in inference, reducing memory usage and enabling faster processing of trillion-parameter models.[11] For instance, it supports FP8 on Hopper GPUs to accelerate workloads in libraries like NVIDIA's Transformer Engine API, which handles mixed-precision kernels for both training and inference.[12] Performance benchmarks demonstrate substantial gains: Hopper with the Transformer Engine achieves up to 6x higher AI training throughput without accuracy loss compared to Ampere-based systems, reducing training 
times for models like a 395-billion-parameter Mixture of Experts from 7 days to 20 hours on equivalent hardware.[11] Inference throughput improves by up to 30x for large language models such as Megatron 530B versus the A100, while maintaining low latency (e.g., 1 second).[11] These advancements triple overall FLOPS rates for TF32, FP16, and related formats relative to the prior generation, positioning Hopper for exascale AI and HPC applications.[2]Memory System and Bandwidth
The Hopper microarchitecture integrates a high-bandwidth memory subsystem centered on stacked High Bandwidth Memory (HBM), with HBM3 employed in premium configurations for superior throughput and HBM2e in cost-optimized variants.[3] In the H100 SXM5 implementation, this comprises 80 GB of HBM3 across five memory stacks, achieving 3 TB/s peak bandwidth—twice the 1.5 TB/s of the Ampere A100's HBM2e—via a widened interface and higher clock rates.[3] The H100 PCIe variant, by contrast, utilizes 80 GB HBM2e with five stacks and 2 TB/s bandwidth to balance performance with PCIe form factor constraints.[3] To mitigate latency from off-chip HBM accesses, Hopper features a 50 MB L2 cache, partitioned across memory partitions for concurrent read/write operations and a 25% capacity increase over Ampere's 40 MB design, thereby caching larger datasets and reducing main memory traffic.[3] At the SM level, each multiprocessor allocates 256 KB for unified L1 data cache and configurable shared memory—33% more than Ampere's 192 KB—supporting finer-grained data reuse in compute-intensive workloads.[3] Reliability enhancements include ECC protection via sideband ECC mechanisms and dynamic memory row remapping to isolate faulty cells without capacity loss.[3] The Tensor Memory Accelerator (TMA) further optimizes bandwidth utilization by enabling asynchronous, descriptor-driven data transfers between HBM and SMs, minimizing CPU intervention and overlapping memory ops with computation.[3] These elements collectively prioritize sustained high-bandwidth delivery for AI training and inference, where memory bottlenecks often constrain scaling.[3]Specialized Instructions and Features
The Hopper microarchitecture introduces DPX instructions optimized for dynamic programming algorithms, which perform fused add-min/max operations to accelerate tasks such as DNA sequence alignment via Smith-Waterman and robot path planning.[3] These instructions deliver up to 7x speedup over Ampere-based GPUs and 40x over dual-socket CPU servers for such workloads.[2][13] Hopper's fourth-generation Tensor Cores support FP8 precision with E4M3 and E5M2 formats, enabling 4x higher matrix-multiply-accumulate throughput compared to 16-bit formats in prior architectures while halving storage requirements.[3] This is integrated into the Transformer Engine, a hardware-software system that dynamically scales precision between FP8 and FP16 for transformer-based models, yielding up to 9x faster AI training and 30x faster inference relative to A100 GPUs on large language models.[3][2] The Tensor Memory Accelerator (TMA) provides asynchronous, descriptor-driven transfers of 1D to 5D tensors between global and shared memory, minimizing thread launch overhead and supporting diverse layouts like interleaved or planar formats.[3] TMA enhances efficiency in tensor-heavy operations by decoupling data movement from compute kernels, with bidirectional capabilities and integration via CUDA async APIs.[14] Additional features include thread block clusters, which extend CUDA cooperation to multiple streaming multiprocessors for finer-grained synchronization and data sharing across up to 8 SMs.[3] Hopper also incorporates hardware support for confidential computing through a secure root of trust and memory encryption, enabling isolated GPU partitions via second-generation Multi-Instance GPU (MIG).[2]

Products and Implementations
H100 Tensor Core GPU Variants
The NVIDIA H100 Tensor Core GPU is produced in multiple variants optimized for distinct data center environments, differing primarily in form factor, memory configuration, power envelope, interconnect capabilities, and performance tuning. The SXM5 variant targets high-performance multi-GPU configurations in specialized systems like NVIDIA's HGX and DGX platforms, emphasizing maximum compute density and NVLink scaling for training workloads. In contrast, the PCIe variant provides broader compatibility with standard server architectures via PCI Express interfaces, while the NVL variant, also PCIe-based, prioritizes inference tasks with enhanced memory capacity and bandwidth for handling large language models.[4][15]

The H100 SXM5 employs an SXM socketed module form factor, typically integrated into liquid-cooled or high-density air-cooled chassis, with 80 GB of HBM3 memory delivering 3.35 TB/s bandwidth. It supports a TDP of up to 700 W, enabling peak performance metrics such as 67 TFLOPS in FP64 Tensor Core operations and 989 TFLOPS in TF32 Tensor Core operations, facilitated by 900 GB/s bidirectional NVLink interconnects for eight-way GPU scaling. This variant is designed for exascale HPC and AI training, where sustained high power allows elevated clock speeds and transistor utilization across its 80 billion transistors.[4][16]

The H100 PCIe variant uses a standard dual-slot PCI Express Gen5 x16 card, suitable for off-the-shelf servers, with 80 GB of HBM2e memory at 2.0 TB/s bandwidth and a 350 W TDP. It achieves slightly lower peak throughput, such as 60 TFLOPS FP64 Tensor Core and around 835 TFLOPS TF32 Tensor Core, with 600 GB/s NVLink support for moderate multi-GPU connectivity. This configuration trades some performance for easier deployment and lower power requirements, making it viable for hybrid AI/HPC setups without custom interconnects.[4][16][17]

The H100 NVL variant, a dual-slot PCIe Gen5 x16 card with passive cooling, features 94 GB of HBM3 memory per GPU at 3.9 TB/s bandwidth, addressing memory-intensive inference for models like Llama 2 70B. Its configurable TDP ranges from 200 W to more than 400 W (with 310 W and higher operating modes), yielding 60 TFLOPS FP64 Tensor Core and 835 TFLOPS TF32 Tensor Core performance, with 600 GB/s NVLink bridges for intra-card or elastic scaling. Optimized for data center inference at scale, it incorporates higher-density HBM3 stacks and supports PCI Express fallback to Gen4 x16 or Gen5 x8, differentiating it from the standard PCIe card by prioritizing bandwidth and capacity over training-oriented power scaling.[15][4]

| Variant | Form Factor | Memory | Bandwidth | TDP | NVLink Bandwidth | Key Use Case |
|---|---|---|---|---|---|---|
| H100 SXM5 | SXM module | 80 GB HBM3 | 3.35 TB/s | Up to 700 W | 900 GB/s | AI training, HPC scaling |
| H100 PCIe | PCIe x16 | 80 GB HBM2e | 2.0 TB/s | 350 W | 600 GB/s | General servers |
| H100 NVL | PCIe x16 | 94 GB HBM3 | 3.9 TB/s | 200–400+ W | 600 GB/s | LLM inference |
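As a rough illustration of the trade-offs in the table above, the sketch below computes memory bandwidth delivered per watt of TDP from the quoted figures. TDP is a design limit rather than measured draw, and the NVL row uses an assumed 400 W upper bound, so these ratios are indicative only:

```python
# Bandwidth-per-watt arithmetic from the variant table's quoted specs.
# TDP is the board's design limit, not measured power; the NVL figure
# assumes the 400 W upper end of its configurable range.

variants = {
    "H100 SXM5": {"bw_tb_s": 3.35, "tdp_w": 700},
    "H100 PCIe": {"bw_tb_s": 2.0, "tdp_w": 350},
    "H100 NVL": {"bw_tb_s": 3.9, "tdp_w": 400},
}

def gb_per_s_per_watt(spec: dict) -> float:
    """Peak memory bandwidth (GB/s) available per watt of TDP."""
    return spec["bw_tb_s"] * 1000 / spec["tdp_w"]

ratios = {name: round(gb_per_s_per_watt(s), 2) for name, s in variants.items()}
```

On these numbers the inference-oriented NVL variant offers roughly twice the bandwidth per watt of the SXM5 module, consistent with its focus on memory-bound LLM serving rather than power-intensive training.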
Grace Hopper Superchip
The NVIDIA GH200 Grace Hopper Superchip integrates the Arm-based NVIDIA Grace CPU with an NVIDIA Hopper GPU via a high-bandwidth NVLink chip-to-chip (C2C) interconnect, enabling unified memory access and low-latency data transfer for AI and high-performance computing workloads.[18][19] This design provides up to 900 GB/s of bidirectional bandwidth between the CPU and GPU, surpassing traditional PCIe connections and reducing data movement overhead by allowing direct GPU access to CPU memory.[20][18] The Grace CPU component features 72 Arm Neoverse V2 cores based on the Armv9 architecture, supporting up to 480 GB of LPDDR5X memory with error-correcting code (ECC) for reliability in data-intensive tasks.[18][21] The integrated Hopper GPU includes variants with 96 GB of HBM3 memory or up to 144 GB of HBM3e memory, delivering enhanced bandwidth of up to 10 TB/s in next-generation configurations for accelerating large-scale generative AI models and scientific simulations.[18][22]

Announced as part of NVIDIA's accelerated computing roadmap in 2022, the Superchip entered full production on May 28, 2023, powering systems for complex AI training and HPC applications with coherent memory sharing that eliminates the need for explicit data copies between CPU and GPU address spaces.[23][19] This integration supports hardware-level unified addressing and page tables, facilitating seamless workload orchestration in environments requiring massive scalability, such as trillion-parameter AI models.[20][18]

Integrated Systems and Platforms
The NVIDIA HGX H100 platform integrates up to eight H100 Tensor Core GPUs interconnected via fourth-generation NVLink, providing a unified memory domain with aggregate bandwidth exceeding 3 terabytes per second for AI and high-performance computing workloads.[24] Announced on April 21, 2022, HGX H100 serves as a modular building block for server manufacturers, enabling scalable GPU clusters optimized for large language model training and inference.[25]

The NVIDIA DGX H100 system builds on HGX by incorporating eight H100 GPUs with dual Intel Xeon Platinum processors, up to 2 terabytes of system memory, and NVIDIA BlueField-3 data processing units for enhanced networking and security. Designed as a turnkey AI factory, DGX H100 delivers 32 petaFLOPS of FP8 AI performance and supports NVIDIA AI Enterprise software for end-to-end workflows from data preparation to deployment.[26] It forms the core of DGX SuperPOD configurations, which scale to hundreds of GPUs via NVSwitch fabrics for exascale AI training.[27]

Platforms leveraging the Grace Hopper Superchip, such as NVIDIA MGX, combine the GH200's Grace CPU and Hopper GPU via NVLink-C2C for up to 900 gigabytes per second of interconnect bandwidth, targeting HPC simulations and trillion-parameter AI models.[18] Deployments include the Venado supercomputer at Los Alamos National Laboratory, featuring GH200 nodes for AI research and ranking as the 19th-fastest system globally as of August 2025.[28] Large-scale implementations extend to custom clusters, exemplified by xAI's Colossus supercomputer, which interconnects 100,000 Hopper GPUs using NVIDIA Ethernet networking to achieve unprecedented AI training scale, operational as of October 2024.[29] These systems emphasize Hopper's multi-instance GPU partitioning and confidential computing features for secure, efficient resource utilization across enterprise and research environments.[2]

Performance and Efficiency
Compute Throughput and Benchmarks
The Hopper microarchitecture in the H100 SXM GPU delivers peak FP64 Tensor Core performance of 67 teraFLOPS and FP32 performance of 67 teraFLOPS, representing a tripling of double-precision Tensor Core throughput compared to the prior Ampere architecture.[4] Fourth-generation Tensor Cores enable significantly higher rates in reduced-precision formats optimized for AI workloads, including up to 989 teraFLOPS in TF32, 1,979 teraFLOPS in FP16 and BF16, and 3,958 teraFLOPS in FP8, with these figures incorporating structured sparsity acceleration for compatible sparse matrix operations.[4] Integer operations reach 3,958 TOPS in INT8 via Tensor Cores.[4]

| Precision | Peak Performance (H100 SXM) |
|---|---|
| FP64 Tensor Core | 67 TFLOPS |
| FP32 | 67 TFLOPS |
| TF32 Tensor Core | 989 TFLOPS (sparse) |
| FP16/BF16 Tensor Core | 1,979 TFLOPS (sparse) |
| FP8 Tensor Core | 3,958 TFLOPS (sparse) |
| INT8 Tensor Core | 3,958 TOPS (sparse) |
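The Tensor Core figures in the table follow a simple ladder: each halving of operand width roughly doubles peak throughput. A quick arithmetic check of that pattern using the quoted sparse peaks:

```python
# Peak Tensor Core throughput roughly doubles at each precision step
# down (TF32 -> FP16/BF16 -> FP8), using the sparse H100 SXM figures
# quoted in the table above. Illustrative arithmetic only.

peaks_tflops = {"TF32": 989, "FP16/BF16": 1979, "FP8": 3958}

def step_ratio(narrower: str, wider: str) -> float:
    """Throughput gain from moving to the narrower format."""
    return peaks_tflops[narrower] / peaks_tflops[wider]
```

The FP8 peak is exactly twice the FP16/BF16 peak, and the FP16/BF16 peak is almost exactly twice the TF32 peak, reflecting the halved operand width at each step.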
Power Consumption and Optimization
The H100 GPU, implementing the Hopper microarchitecture, has a thermal design power (TDP) of 700 W for the SXM5 variant and 350 W for the PCIe variant, compared to 400 W for the Ampere-based A100.[3] TDP is configurable in certain models, such as the H100 NVL at 350–400 W for power-constrained deployments.[4]

Hopper improves power efficiency through process and architectural advancements. Built on TSMC's 4N process node, it delivers superior performance per watt relative to Ampere's 7 nm process. Fourth-generation Tensor Cores reduce operand delivery power by up to 30%, while HBM3 memory at 3 TB/s bandwidth and a 50 MB L2 cache (25% larger than the A100's) minimize energy-intensive memory accesses by cutting trips to off-chip storage.[3] The Transformer Engine optimizes transformer-based AI models via FP8 precision support, achieving up to 4x faster training on benchmarks like GPT-3 175B over prior generations and up to 9x training or 30x inference speedups versus the A100, yielding higher effective FLOPS per watt in low-precision formats. These features enable Hopper to provide up to 6x overall compute performance gains over Ampere, balancing increased capability with targeted efficiency improvements for AI and HPC workloads.[4][3]

Comparisons to Ampere and Blackwell
Hopper introduced substantial enhancements over Ampere in tensor core capabilities and precision support, enabling up to 6x higher performance in MLPerf training benchmarks for deep learning workloads on H100 GPUs compared to A100 GPUs.[35] The architecture features fourth-generation Tensor Cores with native FP8 precision and dynamic scaling for the Transformer Engine, which accelerates transformer model training by handling mixed-precision computations more efficiently than Ampere's third-generation Tensor Cores, which lacked FP8 support and relied on FP16/bfloat16 with sparsity acceleration.[3] Hopper also improves multi-instance GPU (MIG) partitioning with second-generation support, allowing finer-grained resource allocation than Ampere's first-generation MIG, and enhances the cache hierarchy with larger L1 and L2 caches to reduce memory latency in data-intensive tasks.[36] In terms of compute throughput, Hopper's streaming multiprocessors deliver approximately 3.2x the dense tensor performance per core over Ampere despite only a 22% increase in core count, driven by architectural optimizations such as improved asynchronous execution and better overlap of compute with data movement via the Tensor Memory Accelerator.
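The FP8 support discussed above uses two encodings, E4M3 and E5M2, which trade precision for range: E4M3 keeps three mantissa bits for accuracy, while E5M2 spends an extra exponent bit on dynamic range. A short sketch deriving their largest finite values from the published bit layouts (E5M2 reserves its top exponent for infinities and NaN, IEEE-style; E4M3 reserves only the all-ones mantissa at the top exponent for NaN):

```python
# Largest finite values of the two FP8 encodings used by Hopper's
# Tensor Cores, derived from their bit layouts.

def e5m2_max() -> float:
    # 5 exponent bits (bias 15), 2 mantissa bits; the all-ones exponent
    # is reserved for inf/NaN, so the top finite exponent code is 30.
    return 2 ** (30 - 15) * (2 - 2 ** -2)      # 32768 * 1.75

def e4m3_max() -> float:
    # 4 exponent bits (bias 7), 3 mantissa bits; only mantissa 0b111 at
    # the top exponent encodes NaN, so the max uses mantissa 0b110.
    return 2 ** (15 - 7) * (1 + 6 / 8)         # 256 * 1.75
```

E5M2's range extends to 57,344 versus E4M3's 448, which is why the Transformer Engine typically keeps gradients (wide dynamic range) in E5M2 and activations/weights in E4M3.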
Power efficiency sees gains through these features, with the H100 achieving higher FLOPS per watt in AI training due to reduced precision overhead, though Ampere remains viable for legacy FP64-heavy HPC tasks where Hopper maintains parity but excels in hybrid workloads.[37]

Blackwell builds on Hopper with a dual-chiplet design connecting two GPU dies via the NV-HBI interface, scaling to 208 billion transistors per GPU in B200 and GB200 configurations and delivering up to 20 petaFLOPS in FP4/FP8 for AI training and inference, roughly 5x Hopper's peak in similar precisions on the H100/H200.[38] This enables 30% higher FP64 and FP32 fused multiply-add performance for scientific simulations compared to Hopper, alongside 25x energy efficiency improvements in large-scale inference due to fifth-generation Tensor Cores, decompression engines, and HBM3e memory supporting up to 192 GB per GPU with 8 TB/s bandwidth, double Hopper's HBM3 capacity and 1.5x the bandwidth of H200 variants.[39] However, Hopper retains advantages in balanced FP64 throughput for certain HPC applications without Blackwell's emphasis on ultra-low-precision inference, and its single-die monolithic design avoids potential inter-die latency penalties observed in early Blackwell prototypes.[40]

| Metric | Ampere (A100) | Hopper (H100/H200) | Blackwell (B200/GB200) |
|---|---|---|---|
| Peak FP8 TFLOPS | No native FP8 (~624 FP16 sparse) | ~1,979 dense (~3,958 sparse) | ~4,500 dense (~9,000 sparse) |
| FP64 TFLOPS | 9.7 | 34 | ~44 |
| Memory Capacity | 40-80 GB HBM2e | 80-141 GB HBM3/HBM3e | 192 GB HBM3e |
| Bandwidth (TB/s) | 2 | 3-4.8 | 8 |
| Transistors (billions) | 54.2 | 80 | 208 (dual-die) |
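The generational gaps in the table can be summarized as ratios; the following sketch computes them directly from the quoted figures (illustrative arithmetic only, using the table's representative values):

```python
# Generation-over-generation ratios from the comparison table's quoted
# peaks. Values are as cited in the table; ranges use single
# representative endpoints, so treat the ratios as approximate.

fp64_tflops = {"A100": 9.7, "H100": 34.0, "B200": 44.0}
transistors_billions = {"A100": 54.2, "H100": 80.0, "B200": 208.0}

hopper_vs_ampere_fp64 = round(fp64_tflops["H100"] / fp64_tflops["A100"], 1)
blackwell_vs_hopper_xtors = round(
    transistors_billions["B200"] / transistors_billions["H100"], 1
)
```

This recovers the roughly 3.5x FP64 step from Ampere to Hopper cited earlier, while Blackwell's 2.6x transistor jump reflects its dual-die construction rather than a single larger die.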