Ampere (microarchitecture)

from Wikipedia

Ampere
  • Launched: May 14, 2020
  • Designed by: Nvidia
  • Manufactured by: TSMC (professional), Samsung (consumer)
  • Fabrication process: TSMC N7 (professional); Samsung 8N (consumer)
  • Codename: GA10x
Product series
  • Desktop: GeForce 30 series
  • Professional/workstation: RTX A series
  • Server/datacenter: A100
Specifications
  • L1 cache: 192 KB per SM (professional); 128 KB per SM (consumer)
  • L2 cache: 2 MB to 6 MB
  • PCIe support: PCIe 4.0
Supported graphics APIs
  • DirectX: DirectX 12 Ultimate (Feature Level 12_2)
  • Direct3D: Direct3D 12.0
  • Shader Model: Shader Model 6.8
  • OpenGL: OpenGL 4.6
  • CUDA: Compute Capability 8.6
  • Vulkan: Vulkan 1.4[1]
Supported compute APIs
  • OpenCL: OpenCL 3.0
Media engine
  • Color bit-depth: 8-bit, 10-bit
  • Encoder: NVENC
History
  • Predecessor: Turing (consumer), Volta (professional)
  • Successor: Ada Lovelace (consumer), Hopper (datacenter)
Support status
  • Supported

Ampere is the codename for a graphics processing unit (GPU) microarchitecture developed by Nvidia as the successor to both the Volta and Turing architectures. It was officially announced on May 14, 2020, and is named after French mathematician and physicist André-Marie Ampère.[2][3]

Nvidia announced the Ampere architecture GeForce 30 series consumer GPUs at a GeForce Special Event on September 1, 2020.[4][5] Nvidia announced the A100 80 GB GPU at SC20 on November 16, 2020.[6] Mobile RTX graphics cards and the RTX 3060 based on the Ampere architecture were revealed on January 12, 2021.[7]

Nvidia announced Ampere's successor, Hopper, at GTC 2022; a further architecture codenamed "Ampere Next Next" (Blackwell) had already been slated for a 2024 release at GPU Technology Conference 2021.

Details


Architectural improvements of the Ampere architecture include the following:

  • CUDA Compute Capability 8.0 for A100 and 8.6 for the GeForce 30 series[8] (see the runtime query sketch after this list)
  • TSMC's 7 nm FinFET process for A100
  • Custom version of Samsung's 8 nm process (8N) for the GeForce 30 series[9]
  • Third-generation Tensor Cores with FP16, bfloat16, TensorFloat-32 (TF32) and FP64 support and sparsity acceleration.[10] At 256 FP16 FMA operations per clock, an individual Tensor Core has 4x the processing power of the previous generation on GA100 (2x on GA10x), while the Tensor Core count is reduced to one per SM partition (four per SM).
  • Second-generation ray tracing cores; concurrent ray tracing, shading, and compute for the GeForce 30 series
  • High Bandwidth Memory 2 (HBM2) on A100 40 GB & A100 80 GB
  • GDDR6X memory for GeForce RTX 3090, RTX 3080 Ti, RTX 3080, RTX 3070 Ti
  • Double FP32 cores per SM on GA10x GPUs
  • NVLink 3.0 with a 50 Gbit/s per pair throughput[10]
  • PCI Express 4.0 with SR-IOV support (SR-IOV is reserved only for A100)
  • Multi-instance GPU (MIG) virtualization and spatial GPU partitioning feature in A100 supporting up to seven instances
  • PureVideo feature set K hardware video decoding with AV1 hardware decoding[11] for the GeForce 30 series and feature set J for A100
  • 5 NVDEC for A100
  • A new hardware-based five-core JPEG decode engine (NVJPG) supporting YUV420, YUV422, YUV444, YUV400, and RGBA; not to be confused with NVJPEG, Nvidia's GPU-accelerated library for JPEG encoding and decoding
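Software typically gates these features on the compute capability a device reports at runtime. The following is a minimal sketch using only the standard CUDA runtime API; it distinguishes GA100 parts (compute capability 8.0) from GA10x parts (8.6):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("%s: sm_%d%d, %d SMs\n",
               prop.name, prop.major, prop.minor, prop.multiProcessorCount);
        // Ampere reports major == 8: minor 0 => GA100 (A100-class),
        // minor 6 => GA10x (GeForce 30 / RTX A series).
        if (prop.major == 8)
            printf("  Ampere-class device detected\n");
    }
    return 0;
}
```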

Chips

  • GA100[12]
  • GA102
  • GA103
  • GA104
  • GA106
  • GA107
  • GA10B

Comparison of Compute Capability: GP100 vs GV100 vs GA100[13]

| GPU features | Nvidia Tesla P100 | Nvidia Tesla V100 | Nvidia A100 |
|---|---|---|---|
| GPU codename | GP100 | GV100 | GA100 |
| GPU architecture | Pascal | Volta | Ampere |
| Compute capability | 6.0 | 7.0 | 8.0 |
| Threads / warp | 32 | 32 | 32 |
| Max warps / SM | 64 | 64 | 64 |
| Max threads / SM | 2048 | 2048 | 2048 |
| Max thread blocks / SM | 32 | 32 | 32 |
| Max 32-bit registers / SM | 65536 | 65536 | 65536 |
| Max registers / block | 65536 | 65536 | 65536 |
| Max registers / thread | 255 | 255 | 255 |
| Max thread block size | 1024 | 1024 | 1024 |
| FP32 cores / SM | 64 | 64 | 64 |
| Ratio of SM registers to FP32 cores | 1024 | 1024 | 1024 |
| Shared memory size / SM | 64 KB | Configurable up to 96 KB | Configurable up to 164 KB |
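To see how the per-SM limits in this table combine, take a kernel launched in 256-thread blocks: 2048 max threads per SM ÷ 256 threads per block = 8 resident blocks, well under the 32-block cap, and at full occupancy the register file allows 65536 registers ÷ 2048 threads = 32 registers per thread before occupancy starts to drop.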

Comparison of Precision Support Matrix[14][15]

CUDA core precisions:

| GPU | FP16 | FP32 | FP64 | INT1 | INT4 | INT8 | TF32 | BF16 |
|---|---|---|---|---|---|---|---|---|
| Nvidia Tesla P4 | No | Yes | Yes | No | No | Yes | No | No |
| Nvidia P100 | Yes | Yes | Yes | No | No | No | No | No |
| Nvidia Volta | Yes | Yes | Yes | No | No | Yes | No | No |
| Nvidia Turing | Yes | Yes | Yes | No | No | No | No | No |
| Nvidia A100 | Yes | Yes | Yes | No | No | Yes | No | Yes |

Tensor core precisions:

| GPU | FP16 | FP32 | FP64 | INT1 | INT4 | INT8 | TF32 | BF16 |
|---|---|---|---|---|---|---|---|---|
| Nvidia Tesla P4 | No | No | No | No | No | No | No | No |
| Nvidia P100 | No | No | No | No | No | No | No | No |
| Nvidia Volta | Yes | No | No | No | No | No | No | No |
| Nvidia Turing | Yes | No | No | Yes | Yes | Yes | No | No |
| Nvidia A100 | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes |

Legend:

  • FPnn: floating point with nn bits
  • INTn: integer with n bits
  • INT1: binary
  • TF32: TensorFloat32
  • BF16: bfloat16

Comparison of Decode Performance

| Concurrent streams | H.264 decode (1080p30) | H.265 (HEVC) decode (1080p30) | VP9 decode (1080p30) |
|---|---|---|---|
| V100 | 16 | 22 | 22 |
| A100 | 75 | 157 | 108 |

Ampere dies

| Die | GA100[16] | GA102[17] | GA103[18] | GA104[19] | GA106[20] | GA107[21] | GA10B[22] | GA10F |
|---|---|---|---|---|---|---|---|---|
| Die size | 826 mm² | 628 mm² | 496 mm² | 392 mm² | 276 mm² | 200 mm² | 448 mm² | ? |
| Transistors | 54.2B | 28.3B | 22B | 17.4B | 12B | 8.7B | 21B | ? |
| Transistor density | 65.6 MTr/mm² | 45.1 MTr/mm² | 44.4 MTr/mm² | 44.4 MTr/mm² | 43.5 MTr/mm² | 43.5 MTr/mm² | 46.9 MTr/mm² | ? |
| Graphics processing clusters | 8 | 7 | 6 | 6 | 3 | 2 | 2 | 1 |
| Streaming multiprocessors | 128 | 84 | 60 | 48 | 30 | 20 | 16 | 12 |
| CUDA cores | 8192 | 10752 | 7680 | 6144 | 3840 | 2560 | 2048 | 1536 |
| Texture mapping units | 512 | 336 | 240 | 192 | 120 | 80 | 64 | 48 |
| Render output units | 192 | 112 | 96 | 96 | 48 | 32 | 32 | 16 |
| Tensor cores | 512 | 336 | 240 | 192 | 120 | 80 | 64 | 48 |
| RT cores | N/A | 84 | 60 | 48 | 30 | 20 | 8 | 12 |
| L1 cache | 24 MB (192 KB per SM) | 10.5 MB (128 KB per SM) | 7.5 MB (128 KB per SM) | 6 MB (128 KB per SM) | 3 MB (128 KB per SM) | 2.5 MB (128 KB per SM) | 3 MB (192 KB per SM) | 1.5 MB (128 KB per SM) |
| L2 cache | 40 MB | 6 MB | 4 MB | 4 MB | 3 MB | 2 MB | 4 MB | 1 MB |

A100 accelerator and DGX A100


The Ampere-based A100 accelerator was announced and released on May 14, 2020.[10] The A100 features 19.5 teraflops of FP32 performance, 6912 FP32/INT32 CUDA cores, 3456 FP64 CUDA cores, 40 GB of graphics memory, and 1.6 TB/s of graphics memory bandwidth.[23] The A100 accelerator was initially available only in the third generation of the DGX server, which includes eight A100s.[10] The DGX A100 also includes 15 TB of PCIe Gen 4 NVMe storage,[23] two 64-core AMD Rome 7742 CPUs, 1 TB of RAM, and Mellanox-powered HDR InfiniBand interconnect. The initial price for the DGX A100 was $199,000.[10]
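The headline FP32 figure follows directly from the core count and boost clock: 6912 FP32 cores × 2 FLOPs per fused multiply-add per cycle × 1.41 GHz ≈ 19.5 teraflops.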

Comparison of accelerators used in DGX:[24][25][26]

| Model | Architecture | Socket | FP32 CUDA cores | FP64 cores (excl. tensor) | Mixed INT32/FP32 cores | INT32 cores | Boost clock | Memory clock | Memory bus width | Memory bandwidth | VRAM | Single precision (FP32) | Double precision (FP64) | INT8 (non-tensor) | INT8 dense tensor | INT32 | FP4 dense tensor | FP16 | FP16 dense tensor | bfloat16 dense tensor | TensorFloat-32 (TF32) dense tensor | FP64 dense tensor | Interconnect (NVLink) | GPU | L1 cache | L2 cache | TDP | Die size | Transistor count | Process | Launched |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| P100 | Pascal | SXM/SXM2 | 3584 | 1792 | N/A | N/A | 1480 MHz | 1.4 Gbit/s HBM2 | 4096-bit | 720 GB/s | 16 GB HBM2 | 10.6 TFLOPS | 5.3 TFLOPS | N/A | N/A | N/A | N/A | 21.2 TFLOPS | N/A | N/A | N/A | N/A | 160 GB/s | GP100 | 1344 KB (24 KB × 56) | 4096 KB | 300 W | 610 mm² | 15.3 B | TSMC 16FF+ | Q2 2016 |
| V100 16GB | Volta | SXM2 | 5120 | 2560 | N/A | 5120 | 1530 MHz | 1.75 Gbit/s HBM2 | 4096-bit | 900 GB/s | 16 GB HBM2 | 15.7 TFLOPS | 7.8 TFLOPS | 62 TOPS | N/A | 15.7 TOPS | N/A | 31.4 TFLOPS | 125 TFLOPS | N/A | N/A | N/A | 300 GB/s | GV100 | 10240 KB (128 KB × 80) | 6144 KB | 300 W | 815 mm² | 21.1 B | TSMC 12FFN | Q3 2017 |
| V100 32GB | Volta | SXM3 | 5120 | 2560 | N/A | 5120 | 1530 MHz | 1.75 Gbit/s HBM2 | 4096-bit | 900 GB/s | 32 GB HBM2 | 15.7 TFLOPS | 7.8 TFLOPS | 62 TOPS | N/A | 15.7 TOPS | N/A | 31.4 TFLOPS | 125 TFLOPS | N/A | N/A | N/A | 300 GB/s | GV100 | 10240 KB (128 KB × 80) | 6144 KB | 350 W | 815 mm² | 21.1 B | TSMC 12FFN | |
| A100 40GB | Ampere | SXM4 | 6912 | 3456 | 6912 | N/A | 1410 MHz | 2.4 Gbit/s HBM2 | 5120-bit | 1.52 TB/s | 40 GB HBM2 | 19.5 TFLOPS | 9.7 TFLOPS | N/A | 624 TOPS | 19.5 TOPS | N/A | 78 TFLOPS | 312 TFLOPS | 312 TFLOPS | 156 TFLOPS | 19.5 TFLOPS | 600 GB/s | GA100 | 20736 KB (192 KB × 108) | 40960 KB | 400 W | 826 mm² | 54.2 B | TSMC N7 | Q1 2020 |
| A100 80GB | Ampere | SXM4 | 6912 | 3456 | 6912 | N/A | 1410 MHz | 3.2 Gbit/s HBM2e | 5120-bit | 2.0 TB/s | 80 GB HBM2e | 19.5 TFLOPS | 9.7 TFLOPS | N/A | 624 TOPS | 19.5 TOPS | N/A | 78 TFLOPS | 312 TFLOPS | 312 TFLOPS | 156 TFLOPS | 19.5 TFLOPS | 600 GB/s | GA100 | 20736 KB (192 KB × 108) | 40960 KB | 400 W | 826 mm² | 54.2 B | TSMC N7 | |
| H100 | Hopper | SXM5 | 16896 | 4608 | 16896 | N/A | 1980 MHz | 5.2 Gbit/s HBM3 | 5120-bit | 3.35 TB/s | 80 GB HBM3 | 67 TFLOPS | 34 TFLOPS | N/A | 1.98 POPS | N/A | N/A | N/A | 990 TFLOPS | 990 TFLOPS | 495 TFLOPS | 67 TFLOPS | 900 GB/s | GH100 | 25344 KB (192 KB × 132) | 51200 KB | 700 W | 814 mm² | 80 B | TSMC 4N | Q3 2022 |
| H200 | Hopper | SXM5 | 16896 | 4608 | 16896 | N/A | 1980 MHz | 6.3 Gbit/s HBM3e | 6144-bit | 4.8 TB/s | 141 GB HBM3e | 67 TFLOPS | 34 TFLOPS | N/A | 1.98 POPS | N/A | N/A | N/A | 990 TFLOPS | 990 TFLOPS | 495 TFLOPS | 67 TFLOPS | 900 GB/s | GH100 | 25344 KB (192 KB × 132) | 51200 KB | 1000 W | 814 mm² | 80 B | TSMC 4N | Q3 2023 |
| B100 | Blackwell | SXM6 | N/A | N/A | N/A | N/A | N/A | 8 Gbit/s HBM3e | 8192-bit | 8 TB/s | 192 GB HBM3e | N/A | N/A | N/A | 3.5 POPS | N/A | 7 PFLOPS | N/A | 1.98 PFLOPS | 1.98 PFLOPS | 989 TFLOPS | 30 TFLOPS | 1.8 TB/s | GB100 | N/A | N/A | 700 W | N/A | 208 B | TSMC 4NP | Q4 2024 |
| B200 | Blackwell | SXM6 | N/A | N/A | N/A | N/A | N/A | 8 Gbit/s HBM3e | 8192-bit | 8 TB/s | 192 GB HBM3e | N/A | N/A | N/A | 4.5 POPS | N/A | 9 PFLOPS | N/A | 2.25 PFLOPS | 2.25 PFLOPS | 1.2 PFLOPS | 40 TFLOPS | 1.8 TB/s | GB100 | N/A | N/A | 1000 W | N/A | 208 B | TSMC 4NP | |

Products using Ampere

  • GeForce MX series
    • GeForce MX570 (mobile) (GA107)
  • GeForce 20 series
    • GeForce RTX 2050 (mobile) (GA107)
  • GeForce 30 series
    • GeForce RTX 3050 Laptop GPU (GA107)
    • GeForce RTX 3050 (GA106 or GA107)[27]
    • GeForce RTX 3050 Ti Laptop GPU (GA107)
    • GeForce RTX 3060 Laptop GPU (GA106)
    • GeForce RTX 3060 (GA106 or GA104)[28]
    • GeForce RTX 3060 Ti (GA104 or GA103)[29]
    • GeForce RTX 3070 Laptop GPU (GA104)
    • GeForce RTX 3070 (GA104)
    • GeForce RTX 3070 Ti Laptop GPU (GA104)
    • GeForce RTX 3070 Ti (GA104 or GA102)[30]
    • GeForce RTX 3080 Laptop GPU (GA104)
    • GeForce RTX 3080 (GA102)
    • GeForce RTX 3080 12 GB (GA102)
    • GeForce RTX 3080 Ti Laptop GPU (GA103)
    • GeForce RTX 3080 Ti (GA102)
    • GeForce RTX 3090 (GA102)
    • GeForce RTX 3090 Ti (GA102)
  • Nvidia Workstation GPUs (formerly Quadro)
    • RTX A1000 (mobile) (GA107)
    • RTX A2000 (mobile) (GA106)
    • RTX A2000 (GA106)
    • RTX A3000 (mobile) (GA104)
    • RTX A4000 (mobile) (GA104)
    • RTX A4000 (GA104)
    • RTX A5000 (mobile) (GA104)
    • RTX A5500 (mobile) (GA103)
    • RTX A4500 (GA102)
    • RTX A5000 (GA102)
    • RTX A5500 (GA102)
    • RTX A6000 (GA102)
    • A800 Active
  • Nvidia Data Center GPUs (formerly Tesla)
    • Nvidia A2 (GA107)
    • Nvidia A10 (GA102)
    • Nvidia A16 (4 × GA107)
    • Nvidia A30 (GA100)
    • Nvidia A40 (GA102)
    • Nvidia A100 (GA100)
    • Nvidia A100 80 GB (GA100)
    • Nvidia A100X
    • Nvidia A30X

Products using Ampere (per chip)

| Type | GA10B | GA107 | GA106 | GA104 | GA103 | GA102 | GA100 |
|---|---|---|---|---|---|---|---|
| GeForce MX series | N/A | GeForce MX570 (mobile) | N/A | N/A | N/A | N/A | N/A |
| GeForce 20 series | N/A | GeForce RTX 2050 (mobile) | N/A | N/A | N/A | N/A | N/A |
| GeForce 30 series | N/A | GeForce RTX 3050 Laptop, RTX 3050, RTX 3050 Ti Laptop | GeForce RTX 3050, RTX 3060 Laptop, RTX 3060 | GeForce RTX 3060, RTX 3060 Ti, RTX 3070 Laptop, RTX 3070, RTX 3070 Ti Laptop, RTX 3070 Ti, RTX 3080 Laptop | GeForce RTX 3060 Ti, RTX 3080 Ti Laptop | GeForce RTX 3070 Ti, RTX 3080, RTX 3080 Ti, RTX 3090, RTX 3090 Ti | N/A |
| Nvidia workstation GPUs | N/A | RTX A1000 (mobile) | RTX A2000 (mobile), RTX A2000 | RTX A3000 (mobile), RTX A4000 (mobile), RTX A4000, RTX A5000 (mobile) | RTX A5500 (mobile) | RTX A4500, RTX A5000, RTX A5500, RTX A6000 | N/A |
| Nvidia data center GPUs | N/A | Nvidia A2, Nvidia A16 | N/A | N/A | N/A | Nvidia A10, Nvidia A40 | Nvidia A30, Nvidia A100 |
| Tegra SoCs | AGX Orin, Orin NX, Orin Nano | N/A | N/A | N/A | N/A | N/A | N/A |

from Grokipedia
Ampere is a graphics processing unit (GPU) microarchitecture developed by Nvidia and introduced in May 2020 as the successor to the Turing microarchitecture for consumer and professional graphics as well as the Volta microarchitecture for data center and high-performance computing applications. Fabricated on TSMC's 7 nm process for compute variants and Samsung's 8 nm process for graphics variants, it emphasizes advancements in artificial intelligence acceleration, ray tracing, and scalable computing, with the flagship A100 GPU featuring over 54 billion transistors and 40 GB of HBM2 memory delivering roughly 1.6 TB/s of bandwidth. The architecture builds on prior generations by increasing the number of streaming multiprocessors (SMs) per GPU, enhancing parallelism for diverse workloads including deep learning training, inference, and scientific simulations.

A defining aspect of Ampere is its third-generation Tensor Cores, which introduce support for the TF32 format for AI training, offering single-precision performance at half-precision compute rates, alongside bfloat16, FP16, FP64, INT8, and sparse INT4/INT8 formats, enabling up to 312 teraFLOPS of TF32 (with sparsity) and 19.5 teraFLOPS of FP64 tensor performance in the A100. These cores also incorporate structured sparsity support, reducing computational overhead by up to 2x for compatible neural networks. Complementing this, the second-generation RT Cores deliver up to 2x the ray-triangle intersection throughput of Turing's first-generation cores, supporting real-time ray tracing in gaming and professional visualization with improved efficiency. Additionally, Ampere introduces Multi-Instance GPU (MIG) partitioning, allowing a single GPU to be securely divided into up to seven isolated instances for multi-user environments, enhancing resource utilization in cloud and enterprise settings.

Ampere powers a wide range of products, including the data center-focused A100 and A40 Tensor Core GPUs for AI and HPC, the RTX A series for workstations, and the consumer RTX 30 series graphics cards such as the RTX 3090 and RTX 3080, which provide gamers with enhanced 4K ray-traced performance and DLSS 2.0 upscaling. Subsequent variants like the A10 extend its applicability to edge inference and virtual desktops, and Ampere-based GPUs powered a significant portion of the world's top supercomputers as of 2023. Ampere has been succeeded by the Hopper architecture for data center applications and by Ada Lovelace for graphics, but remains widely deployed as of 2025.

Overview and History

Architectural Innovations

The Ampere microarchitecture introduces third-generation Tensor Cores, which accelerate matrix multiply-accumulate operations for deep learning workloads, supporting formats such as TF32 for training and BF16 for inference to enable higher precision without sacrificing performance. These cores incorporate structured sparsity, allowing the hardware to skip computations on zero values in matrices following a 2:4 pattern, thereby doubling effective math throughput and reducing data footprint and bandwidth requirements by up to 2x compared to dense operations. This innovation significantly boosts AI training and inference efficiency, with sparse INT8 operations on the A100 achieving up to 20x faster performance than equivalent INT8 on the prior-generation V100.

Ampere's second-generation RT Cores enhance ray tracing capabilities by delivering up to 2x the ray-triangle intersection throughput of the first-generation RT Cores in the Turing architecture, enabling more realistic lighting, shadows, and reflections in gaming and professional visualization applications. Integrated within each streaming multiprocessor, these cores process traversals and intersection tests more efficiently, supporting advanced features like real-time ray tracing in professional visualization and gaming. High-end Ampere GPUs, such as those based on the GA102 die, feature up to 84 RT Cores, contributing to substantial gains in ray-traced scene complexity and frame rates.

A key data center innovation in Ampere is Multi-Instance GPU (MIG) technology, which securely partitions a single GPU into up to seven isolated instances, each with dedicated compute, memory, and cache resources for multi-tenant environments. This hardware-enforced isolation prevents resource contention and data leakage, making it well suited to cloud and secure AI deployments, while maintaining near-native performance per instance. Complementing this, the third-generation NVLink interconnect provides 600 GB/s of bidirectional bandwidth per GPU via 12 links, facilitating seamless scaling across multi-GPU systems for large-scale simulations and training.

Ampere achieves improved power efficiency through power-management techniques including dynamic voltage scaling, which optimize energy use during varying workloads. In the GA100 GPU, these enhancements target 19.5 TFLOPS of FP32 performance, balancing high compute density with reduced power draw compared to predecessors, particularly in sustained HPC and AI tasks.
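As an illustration of how the TF32 tensor path is exposed to software, the sketch below opts an ordinary FP32 matrix multiply into TF32 Tensor Core execution via cuBLAS. It assumes cuBLAS 11 or later on an Ampere GPU; the wrapper name sgemm_tf32 and the device buffers dA, dB, dC are hypothetical names the caller is assumed to have set up:

```cuda
#include <cublas_v2.h>

// Run C = alpha*A*B + beta*C with FP32 storage, letting Ampere's Tensor
// Cores execute the math in TF32 (FP32 range, 10-bit mantissa).
// dA, dB, dC are assumed preallocated m*k, k*n, m*n column-major buffers.
void sgemm_tf32(cublasHandle_t handle, int m, int n, int k,
                const float* dA, const float* dB, float* dC) {
    const float alpha = 1.0f, beta = 0.0f;
    // Opt this handle into TF32 Tensor Core math (cuBLAS >= 11.0).
    cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k, &alpha, dA, m, dB, k, &beta, dC, m);
}
```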

Development Timeline

The microarchitecture was announced by Nvidia on May 14, 2020, during its GTC 2020 keynote, positioning it as the successor to the Turing architecture with an initial emphasis on data center applications. The first implementation, the GA100 GPU powering the A100 accelerator, entered full production in the second quarter of 2020, marking the architecture's transition from development to deployment. Consumer-oriented variants of Ampere, including the GA102 and GA104 GPUs for the RTX 30 series, were revealed on September 1, 2020, at a dedicated event. Shipments of these graphics cards began in September 2020, though global semiconductor supply chain shortages delayed widespread availability into 2021.

Key manufacturing partnerships shaped Ampere's production: compute-focused dies like the GA100 were fabricated on TSMC's 7 nm (N7) process, achieving over 54 billion transistors for enhanced scalability in AI and HPC workloads. In contrast, graphics-oriented dies such as the GA102 utilized Samsung's 8 nm process, reflecting Nvidia's strategy to optimize yields and costs across foundries. Following the launch, Nvidia matured the software ecosystem with CUDA 11.0, announced on May 14, 2020, to support Ampere-specific features including third-generation Tensor Core advancements for improved AI performance.

By 2025, Ampere had seen no major hardware revisions and had been succeeded in data center contexts by the Hopper architecture, announced in March 2022 with production ramp-up by 2023. As of 2025, Ampere-based GPUs continue to power a significant portion of AI, HPC, and graphics applications, with full software support ongoing.

Core Architecture

Streaming Multiprocessors

The Streaming Multiprocessor (SM) serves as the fundamental processing unit within the Ampere microarchitecture, responsible for executing parallel thread groups in a SIMD fashion to accelerate compute and graphics workloads. Unlike previous architectures, Ampere's SM design emphasizes scalability for both AI/HPC and real-time rendering, with optimizations for higher instruction throughput and latency hiding. Each SM is partitioned into four processing blocks, enabling concurrent execution of multiple warps while sharing resources like caches and texture units. This structure allows for up to 2048 resident threads per SM, distributed across 64 warps of 32 threads each, with four independent warp schedulers that select and dispatch instructions from ready warps to minimize stalls.

In graphics-focused variants of Ampere, such as the GA102 die, each SM incorporates 128 CUDA cores dedicated to FP32 and INT32 operations, complemented by four texture units for handling texture fetches and filtering in rendering pipelines. These cores are organized into dual FP32 datapaths per processing block, allowing the full SM to deliver up to 256 FP32 operations per clock cycle through fused multiply-add (FMA) instructions. Compute-focused variants, like the GA100 in the A100 GPU, adjust this to 64 FP32 cores per SM to allocate more resources to FP64 units, yielding 128 FP32 operations per clock cycle while maintaining compatibility with CUDA programming models. Instruction throughput further includes 64 INT32 ALUs in graphics variants for integer computations, with support for independent thread scheduling (ITS) that desynchronizes threads within a warp to hide memory latency and improve utilization on divergent code paths.

Ampere enhances memory access within the SM through an integrated L1 cache and shared memory subsystem, with shared memory configurable from 0 to 128 KB per SM in graphics variants (the remainder allocated to L1 cache) and up to 192 KB in compute variants like the GA100. This represents a doubling of capacity over Turing's maximum 64 KB per SM, paired with bandwidth improvements of up to 3x in certain access patterns thanks to wider datapaths and better bank conflict resolution. The design prioritizes low-latency data sharing among threads in a block, crucial for algorithms like matrix multiplications.

Clock speeds for SM execution vary by implementation and power constraints; for instance, the GA100 boosts up to 1.41 GHz, and thermal design power (TDP) limits directly affect sustained SM utilization under heavy loads. Ampere SMs also integrate third-generation Tensor Cores for mixed-precision acceleration, though detailed Tensor operations are covered separately below.
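The per-SM limits above determine how many blocks of a given shape can be resident at once, and the CUDA runtime can report this directly. A minimal sketch using the standard occupancy API; the kernel here is a hypothetical placeholder used only as a query target:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; only its resource usage matters here.
__global__ void scale(float* data, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= factor;
}

int main() {
    // How many 256-thread blocks can be resident per SM? The 2048-thread /
    // 64-warp ceilings cap this at 8; register and shared-memory use can
    // lower it further.
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, scale, 256, 0);
    printf("Resident 256-thread blocks per SM: %d\n", blocksPerSM);

    // Ampere kernels must opt in to dynamic shared memory beyond 48 KB per
    // block (the carve-out reaches ~100 KB on GA10x, ~164 KB on GA100).
    cudaFuncSetAttribute(scale, cudaFuncAttributeMaxDynamicSharedMemorySize,
                         64 * 1024);
    return 0;
}
```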

Specialized Cores

The Ampere microarchitecture incorporates dedicated hardware accelerators tailored for AI, ray tracing, and media processing workloads, enhancing efficiency in specialized computations beyond general-purpose parallel processing. These include third-generation Tensor Cores for matrix operations in deep learning, second-generation RT Cores for ray tracing, advanced media engines for video encode and decode, and DMA engines for memory data movement.

Third-generation Tensor Cores extend support to a broader set of precisions, including FP16, BF16, TF32, FP64, INT8, and INT4, allowing seamless acceleration of AI training and inference as well as HPC simulations without code modifications. A key innovation is sparsity acceleration through 2:4 structured pruning, where two out of every four elements in weight matrices can be zeroed out while preserving model accuracy, effectively doubling throughput for compatible sparse networks. In representative compute dies like the GA100 used in the A100 GPU, these cores achieve up to 156 TFLOPS in TF32 precision for dense matrix multiply-accumulate operations (312 TFLOPS sparse), providing up to 5x the performance of prior generations. This sparsity support, combined with improved operand sharing, reduces data movement overhead and boosts efficiency for large-scale neural networks.

Second-generation RT Cores focus on real-time ray tracing by accelerating bounding volume hierarchy (BVH) traversal and ray-triangle intersection tests, critical for photorealistic rendering in graphics and professional visualization applications. Compared to the first-generation RT Cores in the Turing architecture, Ampere's design doubles the ray-triangle intersection rate, enabling higher fidelity scenes with reduced latency. This improvement stems from enhanced traversal efficiency and parallel processing of intersection tests, supporting workloads like motion blur and complex lighting in professional visualization.

Media engine enhancements in Ampere include the seventh-generation NVENC for hardware-accelerated video encoding, capable of 8K60 output in formats like HEVC and H.264, which reduces CPU load for streaming and recording. Complementing this, the fifth-generation NVDEC provides multi-codec decoding acceleration, including native AV1 support up to 8K60, alongside H.264, HEVC, VP9, and others, enabling efficient playback of high-resolution video in multi-stream scenarios. These engines handle multiple simultaneous sessions with minimal overhead, optimizing for transcoding and consumer media applications.

DMA engines in Ampere facilitate high-bandwidth data transfers between HBM memory and the shared L2 cache, sustaining peak throughput in compute-intensive tasks. In compute-focused dies such as the GA100, this supports up to 1.555 TB/s of memory bandwidth, minimizing bottlenecks in AI and HPC pipelines by enabling rapid loading of large datasets into cache for Tensor Core and SM access. The enlarged 40 MB L2 cache further aids these transfers by caching frequently accessed HBM data, improving overall memory subsystem efficiency.
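At the programming level, these Tensor Core operations are reachable through CUDA's warp matrix (WMMA) intrinsics. The sketch below is a minimal illustration under simplifying assumptions: a 16×16 problem so that one warp computes the whole product with the TF32 m16n16k8 shape, with A stored row-major and B column-major, both as ordinary FP32:

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp multiplies a 16x16 A (row-major) by a 16x16 B (column-major)
// on the Tensor Cores using the TF32 m16n16k8 MMA shape.
// Launch as: tf32_tile<<<1, 32>>>(dA, dB, dC);
__global__ void tf32_tile(const float* A, const float* B, float* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 8, wmma::precision::tf32,
                   wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 8, wmma::precision::tf32,
                   wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 8, float> acc;
    wmma::fill_fragment(acc, 0.0f);

    for (int k = 0; k < 16; k += 8) {           // two k-slices of width 8
        wmma::load_matrix_sync(a, A + k, 16);   // tile (0, k) of row-major A
        wmma::load_matrix_sync(b, B + k, 16);   // tile (k, 0) of col-major B
        // Round the FP32 inputs to TF32 (10-bit mantissa) before the MMA.
        for (int i = 0; i < a.num_elements; ++i)
            a.x[i] = wmma::__float_to_tf32(a.x[i]);
        for (int i = 0; i < b.num_elements; ++i)
            b.x[i] = wmma::__float_to_tf32(b.x[i]);
        wmma::mma_sync(acc, a, b, acc);
    }
    wmma::store_matrix_sync(C, acc, 16, wmma::mem_row_major);
}
```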

Chip Designs

Compute-Focused Dies

The GA100 die represents the primary compute-focused design in Nvidia's Ampere microarchitecture, optimized for high-performance computing (HPC) and artificial intelligence (AI) applications. Fabricated on TSMC's N7 (7 nm) node, it measures 826 mm² and integrates 54.2 billion transistors. In the A100 configuration, 108 streaming multiprocessors (SMs) are enabled, delivering 6,912 CUDA cores for general-purpose parallel processing and 432 third-generation Tensor Cores for accelerated matrix operations in AI training and inference.

The GA100 supports high-bandwidth memory configurations tailored for data-intensive workloads, featuring up to 80 GB of HBM2e memory with error-correcting code (ECC) support for enhanced data integrity in mission-critical environments. The 40 GB variant provides 1.6 TB/s of bandwidth, while the 80 GB version achieves 2 TB/s, enabling efficient handling of large datasets in deep learning and scientific simulations. Additionally, the die includes a 40 MB L2 cache to reduce memory access latency and improve overall throughput.

Power consumption and interconnect capabilities are scaled for data center deployment, with a thermal design power (TDP) of 400 W in the SXM form factor for dense server integration and 300 W in the PCIe variant for broader compatibility. The GA100 supports NVLink 3.0 with up to 12 bidirectional links, offering aggregate bandwidth of 600 GB/s for multi-GPU scaling in clustered environments. The GA100 entered production in 2020 without significant respins, though early manufacturing ramps faced yield challenges typical of advanced nodes, limiting full disclosure on production details.
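These bandwidth figures follow from the 5120-bit HBM interface width and the per-pin data rate: 5120 bits × 2.4 Gbit/s ÷ 8 ≈ 1,536 GB/s for the 40 GB HBM2 variant (marketed as 1.6 TB/s), and 5120 bits × 3.2 Gbit/s ÷ 8 ≈ 2,048 GB/s for the 80 GB HBM2e variant.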

Graphics-Focused Dies

The graphics-focused dies in the Ampere microarchitecture are designed primarily for gaming, content creation, and professional visualization workloads, emphasizing rasterization, ray tracing, and AI-accelerated rendering while prioritizing cost-effective production for high-volume markets. These variants differ from compute-oriented dies by incorporating dedicated ray-tracing (RT) and tensor cores optimized for real-time graphics effects and upscaling technologies like DLSS, without the high-bandwidth memory interfaces suited for data center compute tasks.

The flagship GA102 die serves as the foundation for high-end graphics products, featuring a die area of 628.4 mm² fabricated on Samsung's 8N process node, which integrates 28.3 billion transistors. In its fullest configuration, GA102 supports up to 84 streaming multiprocessors (SMs), delivering 10,752 CUDA cores for parallel processing, alongside 336 third-generation Tensor Cores for matrix operations and 84 second-generation RT Cores for hardware-accelerated ray tracing. Shipping GeForce configurations typically enable 68 to 82 SMs, balancing performance and power within the limits of desktop and workstation systems.

Smaller variants like GA104 and GA106 target mid-range and entry-level graphics segments, offering scaled-down capabilities for broader accessibility. The GA104 die measures 392 mm² on the same 8N process, with up to 48 SMs providing 6,144 CUDA cores, while GA106 is more compact at 276 mm², supporting up to 30 SMs and 3,840 CUDA cores. These dies pair with GDDR6 or GDDR6X memory interfaces, such as 8 GB to 16 GB configurations on a 256-bit bus achieving bandwidths up to 608 GB/s, enabling efficient handling of high-resolution textures and frame buffers without the complexity of HBM. For instance, GA104-based designs like the RTX 3070 Ti use 8 GB of GDDR6X at 19 Gbps to support gaming at high frame rates.

Samsung's 8N process was selected for these graphics dies to achieve cost-effective capacity in high-volume consumer and professional production, allowing Nvidia to scale output for the RTX 30 series and RTX A series while maintaining competitive yields compared to the more advanced node used for the compute die. Unlike the compute-focused design, these dies eschew HBM in favor of GDDR6/GDDR6X to reduce per-unit cost and simplify integration into cards with standard PCB designs. Post-2021 global disruptions, including semiconductor shortages, led to variable die yields for GA102 and its variants, impacting availability of overclocking-optimized SKUs, though Nvidia mitigated this through binning strategies.

Products and Implementations

Data Center Accelerators

The Nvidia A100 Tensor Core GPU, released in June 2020, serves as the flagship data center accelerator in the Ampere lineup, designed for high-performance computing (HPC), artificial intelligence (AI) training, and inference workloads. Available in a PCIe form factor with a 300 W thermal design power (TDP) and an SXM4 module with up to 400 W TDP, the A100 delivers 9.7 TFLOPS of FP64 performance and 19.5 TFLOPS of FP64 Tensor Core performance, enabling efficient handling of double-precision computations critical for scientific simulations. Its architecture supports Multi-Instance GPU (MIG) partitioning into up to seven isolated instances, allowing secure multi-tenancy in cloud environments.

Professional variants like the A40, introduced in late 2020 with a product brief in January 2021, target visual computing and AI tasks in data centers, featuring 48 GB of GDDR6 memory with error-correcting code (ECC) and a 300 W TDP in a dual-slot PCIe Gen 4 configuration. The A40 provides 696 GB/s of memory bandwidth, making it suitable for rendering, virtualization, and AI workloads at scale. Complementing this, the A30, announced in 2021, is optimized for AI inference with a lower 165 W TDP and 24 GB of HBM2 memory delivering 933 GB/s of bandwidth, enabling efficient deployment in mainstream servers for precisions from FP64 to INT4. The A10 Tensor Core GPU, announced in April 2021, provides a cost-effective option for AI inference, graphics virtualization, and cloud gaming in mainstream servers, with 24 GB of GDDR6 memory, a 150 W TDP, 600 GB/s of memory bandwidth, and a single-slot PCIe Gen 4 form factor.

To address U.S. export compliance restrictions, Nvidia introduced the A800 in November 2022 specifically for the Chinese market, offering variants like the 40 GB and 80 GB PCIe models based on the GA100 die but with NVLink interconnect bandwidth reduced to 400 GB/s (from 600 GB/s on the A100) to limit inter-GPU communication performance. These adjustments ensure compliance while maintaining core compute capabilities for AI and HPC applications in restricted regions.

Ampere-based accelerators power key applications in AI and scientific computing, such as training large language models; for instance, a 140-node DGX SuperPOD configuration with A100 GPUs can train a 175-billion-parameter GPT-3 model in approximately 34 days, a significant reduction compared to prior generations. In HPC, systems like the Perlmutter supercomputer at Lawrence Berkeley National Laboratory, deployed in 2021 with thousands of A100 GPUs, accelerate simulations in climate modeling and materials science, achieving up to 5x energy efficiency gains in accelerated applications through optimized Tensor Core operations.

By 2025, Ampere GPUs have transitioned to legacy status in many data centers, with migrations to the successor Hopper-based H100 architecture underway to leverage up to 3x performance improvements and enhanced energy efficiency, such as 4x better throughput per watt in MLPerf benchmarks, amid growing demands for sustainable AI scaling. This shift reflects Ampere's foundational role in establishing GPU-accelerated AI while highlighting the rapid evolution toward more efficient architectures.
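From the CUDA side, a MIG slice looks like an ordinary device. A minimal sketch using the standard runtime API, assuming the process has been pointed at an instance (for example via CUDA_VISIBLE_DEVICES with a MIG UUID), which reports the SM count and memory of whatever slice it was given:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        // On a MIG instance these values reflect the slice (a fraction of
        // the A100's 108 SMs and 40/80 GB of memory), not the whole GPU.
        printf("device %d: %s, %d SMs, %.1f GiB\n", d, prop.name,
               prop.multiProcessorCount,
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}
```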

Consumer and Professional GPUs

The RTX 30 series, based on the Ampere microarchitecture, marked Nvidia's entry into the consumer graphics market with high-performance cards optimized for gaming and content creation. The flagship RTX 3090, utilizing the GA102 GPU die, featured 24 GB of GDDR6X memory and a 350 W TDP, and was released in September 2020, enabling enthusiasts to handle demanding workloads like 8K gaming and rendering. Similarly, the RTX 3080 offered 10 GB of GDDR6X and a 320 W TDP, providing strong rasterization and ray tracing capabilities for 4K gaming at high frame rates. These cards introduced second-generation RT cores and third-generation Tensor cores, which powered real-time ray tracing effects and AI-accelerated upscaling via DLSS 2.0, significantly enhancing visual fidelity in supported titles, where path-traced lighting and reflections could be rendered at playable frame rates with DLSS enabled.

In the professional segment, the RTX A6000 served as a direct successor to the Quadro lineup, targeting workflows in computer-aided design (CAD), visual effects (VFX), and other professional applications. Launched in October 2020, it incorporated 48 GB of GDDR6 with error-correcting code (ECC) support for data integrity in mission-critical applications, alongside a 300 W TDP and compatibility with NVLink for multi-GPU memory scaling up to 96 GB. This configuration proved particularly effective for complex VFX pipelines and simulations, where large datasets and precise computations are essential, outperforming prior generations in memory-bound tasks like viewport rendering in 3D content-creation software.

Mobile implementations extended Ampere's reach to laptops, with the RTX 3080 Mobile variant based on the GA104 GPU die debuting in early 2021. It supported up to 16 GB of GDDR6 memory and leveraged Dynamic Boost 2.0 to dynamically allocate power, reaching up to 165 W in high-end configurations for sustained performance in thin-and-light gaming notebooks. This allowed creators and gamers to achieve near-desktop levels of ray-traced graphics and AI features on the go, though thermal constraints often required careful power tuning.

The RTX 30 series dominated the consumer GPU market from 2021 to 2022, driven by surging demand from cryptocurrency mining, which accounted for a significant portion of sales and strained global supply chains. However, widespread shortages from 2020 to 2022, exacerbated by the COVID-19 pandemic and chip fabrication bottlenecks, delayed adoption among gamers and professionals, leading many to retain older hardware or turn to scalped markets at inflated prices. By 2025, the series had been succeeded by the Ada Lovelace architecture in Nvidia's RTX 40 lineup, yet mid-range models like the RTX 3060 and 3070 remained viable in budget PC builds for gaming, offering cost-effective performance amid stabilizing prices.
