Graphics Core Next
Graphics Core Next (GCN)[1] is the codename for a series of microarchitectures and an instruction set architecture that were developed by AMD for its GPUs as the successor to its TeraScale microarchitecture. The first product featuring GCN was launched on January 9, 2012.[2]
GCN is a reduced instruction set SIMD microarchitecture, contrasting with the very long instruction word SIMD architecture of TeraScale.[3] GCN requires considerably more transistors than TeraScale but offers advantages for general-purpose GPU (GPGPU) computation due to a simpler compiler.
GCN graphics chips were fabricated with CMOS at 28 nm, and with FinFET at 14 nm (by Samsung Electronics and GlobalFoundries) and 7 nm (by TSMC), available on selected models in AMD's Radeon HD 7000, HD 8000, 200, 300, 400, 500 and Vega series of graphics cards, including the separately released Radeon VII. GCN was also used in the graphics portion of Accelerated Processing Units (APUs), including those in the PlayStation 4 and Xbox One.
GCN was succeeded by the RDNA microarchitecture and instruction set architecture in 2019.
Instruction set
The GCN instruction set is owned by AMD and was developed specifically for GPUs. It has no micro-operation for division.
Documentation is available for:
- the Graphics Core Next 1 instruction set,
- the Graphics Core Next 2 instruction set,
- the Graphics Core Next 3 and 4 instruction sets,[4]
- the Graphics Core Next 5 instruction set, and
- the "Vega" 7nm instruction set architecture (also referred to as Graphics Core Next 5.1).
An LLVM compiler back end is available for the GCN instruction set.[5] It is used by Mesa 3D.
GNU Compiler Collection 9 has supported GCN 3 and GCN 5 since 2019[6] for single-threaded, stand-alone programs, with GCC 10 also offloading via OpenMP and OpenACC.[7]
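As an illustration of the offloading path, a minimal OpenMP target-offload sketch is shown below. The kernel body is a generic SAXPY, and the compile line in the comment assumes a GCC build configured with the amdgcn offload target; actual flags vary per distribution.

```cpp
// Minimal sketch of OpenMP target offloading as supported by GCC 10+.
// Illustrative compile line (depends on how the toolchain was built):
//   g++ -fopenmp -foffload=amdgcn-amdhsa saxpy.cpp -o saxpy
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    const float a = 3.0f;
    float* xp = x.data();
    float* yp = y.data();

    // Each iteration maps to a GPU thread; on GCN, the hardware groups
    // threads into 64-wide wavefronts.
    #pragma omp target teams distribute parallel for \
        map(to: xp[0:n]) map(tofrom: yp[0:n])
    for (int i = 0; i < n; ++i)
        yp[i] = a * xp[i] + yp[i];

    std::printf("y[0] = %f\n", yp[0]);  // expect 5.0
    return 0;
}
```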
MIAOW is an open-source RTL implementation of the AMD Southern Islands GPGPU microarchitecture.
In November 2015, AMD announced its Boltzmann Initiative, which aims to enable the porting of CUDA-based applications to a common C++ programming model.[8]
At the Super Computing 15 event, AMD displayed a Heterogeneous Compute Compiler (HCC), a headless Linux driver and HSA runtime infrastructure for cluster-class high-performance computing, and a Heterogeneous-compute Interface for Portability (HIP) tool for porting CUDA applications to the aforementioned common C++ model.
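To illustrate the porting model, here is a minimal HIP vector-addition sketch (not taken from AMD's materials): the kernel and launch mirror their CUDA equivalents almost line for line, which is what makes HIP porting largely mechanical.

```cpp
// Minimal HIP sketch of a CUDA-style kernel. HIP mirrors the CUDA runtime
// API (hipMalloc/hipMemcpy/... instead of cudaMalloc/...), and AMD's
// "hipify" tooling automates most of the source translation.
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1024;
    const size_t bytes = n * sizeof(float);
    float ha[n], hb[n], hc[n];
    for (int i = 0; i < n; ++i) { ha[i] = float(i); hb[i] = 2.0f * i; }

    float *da, *db, *dc;
    hipMalloc(&da, bytes); hipMalloc(&db, bytes); hipMalloc(&dc, bytes);
    hipMemcpy(da, ha, bytes, hipMemcpyHostToDevice);
    hipMemcpy(db, hb, bytes, hipMemcpyHostToDevice);

    // 256 threads per block = 4 GCN wavefronts of 64 threads each.
    hipLaunchKernelGGL(vec_add, dim3(n / 256), dim3(256), 0, 0,
                       da, db, dc, n);
    hipMemcpy(hc, dc, bytes, hipMemcpyDeviceToHost);

    std::printf("hc[10] = %f\n", hc[10]);  // expect 30.0
    hipFree(da); hipFree(db); hipFree(dc);
    return 0;
}
```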
Microarchitectures
As of July 2017, the Graphics Core Next instruction set has seen five iterations. The differences between the first four generations are rather minimal, but the fifth-generation GCN architecture features heavily modified stream processors to improve performance and support the simultaneous processing of two lower-precision numbers in place of a single higher-precision number.[9]
Command processing
Graphics Command Processor
The Graphics Command Processor (GCP) is a functional unit of the GCN microarchitecture. Among other tasks, it is responsible for the handling of asynchronous shaders.[10]
Asynchronous Compute Engine
The Asynchronous Compute Engine (ACE) is a distinct functional block serving compute workloads; its role for compute commands is analogous to that of the Graphics Command Processor for graphics commands.
Schedulers
Since the third iteration of GCN, the hardware contains two schedulers: one to schedule "wavefronts" during shader execution (the CU Scheduler, or Compute Unit Scheduler) and the other to schedule execution of draw and compute queues. The latter helps performance by executing compute operations when the compute units (CUs) are underutilized due to graphics commands limited by fixed function pipeline speed or bandwidth. This functionality is known as Async Compute.
For a given shader, the GPU drivers may also schedule instructions on the CPU to minimize latency.
Geometric processor
The geometry processor contains a Geometry Assembler, a Tessellator, and a Vertex Assembler.
The Tessellator is capable of performing tessellation in hardware as defined by Direct3D 11 and OpenGL 4.5,[11] and succeeded ATI TruForm and the hardware tessellation of TeraScale as AMD's then-latest semiconductor intellectual property core.
Compute units
One compute unit (CU) combines 64 shader processors with 4 texture mapping units (TMUs).[12][13] The compute units are separate from, but feed into, the render output units (ROPs).[13] Each compute unit consists of the following:
- a CU scheduler
- a Branch & Message Unit
- 4 16-lane-wide SIMD Vector Units (SIMD-VUs)
- 4 64 KiB vector general-purpose register (VGPR) files
- 1 scalar unit (SU)
- an 8 KiB scalar GPR file[14]
- a local data share of 64 KiB
- 4 Texture Filter Units
- 16 Texture Fetch Load/Store Units
- a 16 KiB level 1 (L1) cache
Four compute units are wired to share a 32 KiB L1 instruction cache and a 16 KiB scalar data cache, both of which are read-only. A SIMD-VU operates on 16 elements at a time (per cycle), while the SU operates on one element at a time (one per cycle). In addition, the SU handles some other operations, such as branching.[15]
Every SIMD-VU has some private memory where it stores its registers. There are two types of registers: scalar registers (S0, S1, etc.), which each hold one 4-byte number, and vector registers (V0, V1, etc.), which each represent a set of 64 4-byte numbers. Operations on vector registers are performed in parallel on all 64 numbers, which correspond to 64 inputs. For example, a SIMD-VU may work on 64 different pixels at a time (for each of them the inputs are slightly different, yielding a slightly different color at the end).
Every SIMD-VU has room for 512 scalar registers and 256 vector registers.
AMD has claimed that each GCN compute unit (CU) has 64 KiB Local Data Share (LDS).[16]
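The sizes quoted above follow from a few architectural constants; the short C++ check below (purely illustrative) makes the arithmetic explicit.

```cpp
// Sanity check of the per-CU storage figures quoted in this section,
// derived from the architectural constants given above.
#include <cstdio>

int main() {
    const int lanes_per_simd = 16;   // SIMD-VU width
    const int simds_per_cu   = 4;
    const int wavefront_size = 64;   // threads per wavefront
    const int vgprs_per_simd = 256;  // vector registers per SIMD-VU
    const int sgprs_per_simd = 512;  // scalar registers per SIMD-VU

    // Each vector register holds one 4-byte value for each of the 64
    // threads of a wavefront: 256 x 64 x 4 B = 64 KiB per SIMD-VU.
    std::printf("VGPR file per SIMD-VU: %d KiB\n",
                vgprs_per_simd * wavefront_size * 4 / 1024);      // 64

    // Scalar registers hold one 4-byte value per wavefront:
    // 512 x 4 B x 4 SIMD-VUs = 8 KiB of scalar GPRs per CU.
    std::printf("Scalar GPR file per CU: %d KiB\n",
                sgprs_per_simd * 4 * simds_per_cu / 1024);        // 8

    // Shader processors per CU: 4 SIMD-VUs x 16 lanes = 64.
    std::printf("Shader processors per CU: %d\n",
                simds_per_cu * lanes_per_simd);                   // 64
    return 0;
}
```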
CU scheduler
The CU scheduler is the hardware functional block that chooses which wavefronts the SIMD-VUs execute. It picks one SIMD-VU per cycle for scheduling. It is not to be confused with other hardware or software schedulers.
Wavefront
A shader is a small program written in GLSL that performs graphics processing, and a kernel is a small program written in OpenCL that performs GPGPU processing. These programs don't need many registers, but they do need to load data from system or graphics memory, an operation that comes with significant latency. AMD and Nvidia chose similar approaches to hide this unavoidable latency: the grouping of multiple threads. AMD calls such a group a "wavefront", whereas Nvidia calls it a "warp". A group of threads is the most basic unit of scheduling of GPUs that implement this approach to hiding latency. It is the minimum size of the data processed in SIMD fashion, the smallest executable unit of code, and the way to process a single instruction over all of its threads at the same time.
In all GCN GPUs, a "wavefront" consists of 64 threads, and in all Nvidia GPUs, a "warp" consists of 32 threads.
AMD's solution is to assign multiple wavefronts to each SIMD-VU. The hardware distributes the registers among the different wavefronts, and when one wavefront is waiting for a result from memory, the CU scheduler assigns another wavefront to the SIMD-VU. Wavefronts are assigned per SIMD-VU; SIMD-VUs do not exchange wavefronts. A maximum of 10 wavefronts can be assigned per SIMD-VU (thus 40 per CU).
AMD CodeXL shows tables relating the number of SGPRs and VGPRs used by a kernel to the number of wavefronts that can be in flight: essentially, between 104 and 512 SGPRs and 256 VGPRs are available, divided among however many wavefronts are resident.
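A rough model of this relationship, considering VGPR pressure only (real tools such as CodeXL also account for SGPR and LDS usage), might look like the following:

```cpp
// Illustrative occupancy estimate: the number of wavefronts a SIMD-VU can
// hold is capped at 10 and further limited by how many of its 256 VGPRs
// each wavefront's kernel consumes. VGPR-only sketch; not a real tool.
#include <algorithm>
#include <cstdio>

int wavefronts_per_simd(int vgprs_per_thread) {
    const int vgpr_budget = 256;  // vector registers per SIMD-VU
    const int hw_limit    = 10;   // instruction-buffer slots per SIMD-VU
    if (vgprs_per_thread <= 0) return hw_limit;
    return std::min(hw_limit, vgpr_budget / vgprs_per_thread);
}

int main() {
    // A kernel using 25 VGPRs per thread fits 10 wavefronts, while one
    // using 84 VGPRs fits only 3, reducing latency hiding.
    std::printf("%d %d\n", wavefronts_per_simd(25), wavefronts_per_simd(84));
    return 0;
}
```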
Note that, as with SSE instructions, this most basic level of parallelism is often called the "vector width". The vector width is characterized by the total number of bits in it.
SIMD Vector Unit
Each SIMD Vector Unit has:
- a 16-lane integer and floating point vector Arithmetic Logic Unit (ALU)
- 64 KiB Vector General Purpose Register (VGPR) file
- 10× 48-bit Program Counters
- Instruction buffer for 10 wavefronts (each wavefront is a group of 64 threads, or the size of one logical VGPR)
- A 64-thread wavefront issues to a 16-lane SIMD Unit over four cycles
Each SIMD-VU has 10 wavefront instruction buffers, and it takes 4 cycles to execute one wavefront.
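The issue cadence and resulting peak arithmetic rate can be worked out directly from these figures; the sketch below uses the Radeon HD 7970's published configuration (32 CUs at 925 MHz) as a worked example.

```cpp
#include <cstdio>

int main() {
    // A 64-thread wavefront on a 16-lane SIMD-VU needs 64 / 16 = 4 cycles
    // per vector instruction, matching the figure quoted above.
    const int wavefront_size = 64, simd_lanes = 16;
    std::printf("cycles per vector instruction: %d\n",
                wavefront_size / simd_lanes);                    // 4

    // Peak FP32 rate = shader processors x 2 ops (fused multiply-add)
    // x clock. Radeon HD 7970: 32 CUs x 64 SPs x 2 x 0.925 GHz.
    const int cus = 32, sp_per_cu = 64;
    const double clock_ghz = 0.925;
    std::printf("peak FP32: %.2f TFLOPS\n",
                cus * sp_per_cu * 2 * clock_ghz / 1000.0);       // ~3.79
    return 0;
}
```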
Audio and video acceleration blocks
Implementations of GCN are typically accompanied by several of AMD's other ASIC blocks, including but not limited to the Unified Video Decoder, Video Coding Engine, and AMD TrueAudio.
Video Coding Engine
The Video Coding Engine is a video encoding ASIC, first introduced with the Radeon HD 7000 series.[17]
The initial version of the VCE added support for encoding H.264 I- and P-frames in the YUV420 pixel format, along with SVC temporal encode and a Display Encode Mode, while the second version added B-frame support for YUV420 and support for YUV444 I-frames.
VCE 3.0 formed a part of the third generation of GCN, adding high-quality video scaling and the HEVC (H.265) codec.
VCE 4.0 was part of the Vega architecture, and was subsequently succeeded by Video Core Next.
TrueAudio
Unified virtual memory
In a 2011 preview, AnandTech wrote about the unified virtual memory supported by Graphics Core Next.[18]
- Classical desktop computer architecture with a distinct graphics card over PCI Express: CPU and GPU have their own physical memory, with different address spaces; all data needs to be copied over the PCIe bus. (The original diagram shows bandwidths, but not memory latency.)
- Integrated graphics solutions (and AMD APUs with TeraScale graphics) suffer from partitioned main memory: a part of the system memory is allocated to the GPU exclusively; zero-copy is not possible, and data has to be copied (over the system memory bus) from one partition to the other.
Heterogeneous System Architecture (HSA)
Some of the specific HSA features implemented in the hardware need support from the operating system's kernel (its subsystems) and/or from specific device drivers. For example, in July 2014, AMD published a set of 83 patches to be merged into Linux kernel mainline 3.17 for supporting their Graphics Core Next-based Radeon graphics cards. The so-called HSA kernel driver resides in the directory /drivers/gpu/hsa, while the DRM graphics device drivers reside in /drivers/gpu/drm[21] and augment the already existing DRM drivers for Radeon cards.[22] This very first implementation focuses on a single "Kaveri" APU and works alongside the existing Radeon kernel graphics driver (kgd).
Lossless Delta Color Compression
Hardware schedulers
Hardware schedulers are used to perform scheduling[23] and to offload the assignment of compute queues to the ACEs from the driver to hardware, by buffering these queues until there is at least one empty queue in at least one ACE; the HWS then immediately assigns buffered queues to the ACEs until all queues are full or there are no more queues to safely assign.[24]
Part of the scheduling work includes prioritized queues, which allow critical tasks to run at a higher priority than other tasks without requiring that the lower-priority tasks be preempted. Tasks can thus run concurrently, with the high-priority tasks scheduled to occupy the GPU as much as possible while other tasks use whatever resources the high-priority tasks leave free.[23] HWS units are essentially Asynchronous Compute Engines that lack dispatch controllers.[23] They were first introduced in the fourth-generation GCN microarchitecture,[23] but were already present in the third generation for internal testing purposes.[25] A driver update enabled the hardware schedulers in third-generation GCN parts for production use.[23]
Primitive Discard Accelerator
This unit discards degenerate triangles before they enter the vertex shader and triangles that do not cover any fragments before they enter the fragment shader.[26] This unit was introduced with the fourth generation GCN microarchitecture.[26]
Generations
Graphics Core Next 1
| Release date | January 2012 |
|---|---|
| History | |
| Predecessor | TeraScale 3 |
| Successor | Graphics Core Next 2 |
| Support status | |
| Unsupported since mid-2022 (final Windows driver version 22.6.1 for Windows 7 and 10) | |
The GCN 1 microarchitecture was used in several Radeon HD 7000 series graphics cards.

- support for 64-bit addressing (x86-64 address space) with unified address space for CPU and GPU[18]
- support for PCIe 3.0[27]
- GPU sends interrupt requests to CPU on various events (such as page faults)
- support for Partially Resident Textures,[28] which enable virtual memory support through DirectX and OpenGL extensions
- AMD PowerTune support, which dynamically adjusts performance to stay within a specific TDP[29]
- support for Mantle (API)
There are Asynchronous Compute Engines controlling computation and dispatching.[15][30]
ZeroCore Power
ZeroCore Power is a long-idle power-saving technology that shuts off functional units of the GPU when not in use.[31] AMD ZeroCore Power technology supplements AMD PowerTune.
Chips
Discrete GPUs (Southern Islands family):
- Hainan
- Oland
- Cape Verde
- Pitcairn
- Tahiti
Graphics Core Next 2
| Release date | September 2013 |
|---|---|
| History | |
| Predecessor | Graphics Core Next 1 |
| Successor | Graphics Core Next 3 |
| Support status | |
| Unsupported since mid-2022 (final Windows driver version 22.6.1 for Windows 7 and 10) | |


The 2nd generation of GCN was introduced with the Radeon HD 7790 and is also found in the Radeon HD 8770, R7 260/260X, R9 290/290X, R9 295X2, R7 360, and R9 390/390X, as well as in Steamroller-based desktop and mobile "Kaveri" APUs and in the Puma-based "Beema" and "Mullins" APUs. It has multiple advantages over the original GCN, including FreeSync support, AMD TrueAudio, and a revised version of AMD PowerTune technology.
GCN's 2nd generation introduced an entity called a "Shader Engine" (SE). A Shader Engine comprises one geometry processor, up to 11 CUs per engine (the Hawaii chip has four engines, for 44 CUs in total), rasterizers, ROPs, and L1 cache. Not part of a Shader Engine are the Graphics Command Processor, the 8 ACEs, the L2 cache and memory controllers, the audio and video accelerators, the display controllers, the 2 DMA controllers, and the PCIe interface.
The A10-7850K "Kaveri" contains 8 CUs (compute units) and 8 Asynchronous Compute Engines for independent scheduling and work item dispatching.[32]
At the AMD Developer Summit (APU13) in November 2013, Michael Mantor presented the Radeon R9 290X.[33]
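These per-CU figures can be cross-checked against Hawaii's published configuration (four Shader Engines of 11 CUs each); the snippet below is a simple illustration of that arithmetic.

```cpp
// Cross-check of the Hawaii (R9 290X) configuration described above:
// 4 Shader Engines x 11 CUs = 44 CUs, with each CU contributing 64
// shader processors and 4 texture mapping units.
#include <cstdio>

int main() {
    const int shader_engines = 4, cus_per_se = 11;
    const int cus = shader_engines * cus_per_se;       // 44
    std::printf("stream processors: %d\n", cus * 64);  // 2816
    std::printf("texture units:     %d\n", cus * 4);   // 176
    return 0;
}
```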
Chips
Discrete GPUs (Sea Islands family):
- Bonaire
- Hawaii
integrated into APUs:
- Temash
- Kabini
- Liverpool (i.e. the APU found in the PlayStation 4)
- Durango (i.e. the APU found in the Xbox One and Xbox One S)
- Kaveri
- Godavari
- Mullins
- Beema
- Carrizo-L
Graphics Core Next 3
| Release date | June 2015 |
|---|---|
| History | |
| Predecessor | Graphics Core Next 2 |
| Successor | Graphics Core Next 4 |
| Support status | |
| Supported, with less regular Windows driver update schedule | |

GCN 3rd generation[34] was introduced in 2014 with the Radeon R9 285 and R9 M295X, which have the "Tonga" GPU. It features improved tessellation performance, lossless delta color compression to reduce memory bandwidth usage, an updated and more efficient instruction set, a new high-quality scaler for video, HEVC encoding (VCE 3.0) and HEVC decoding (UVD 6.0), and a new multimedia engine (video encoder/decoder). Delta color compression is supported in Mesa.[35] However, its double-precision performance is worse than that of the previous generation.[36]
Chips
discrete GPUs:
- Tonga (Volcanic Islands family), comes with UVD 5.0 (Unified Video Decoder)
- Fiji (Pirate Islands family), comes with UVD 6.0 and High Bandwidth Memory (HBM 1)
integrated into APUs:
- Carrizo
Graphics Core Next 4
| Release date | June 2016 |
|---|---|
| History | |
| Predecessor | Graphics Core Next 3 |
| Successor | Graphics Core Next 5 |
| Support status | |
| Supported, with less regular Windows driver update schedule | |


GPUs of the Arctic Islands family were introduced in Q2 2016 with the AMD Radeon 400 series. The 3D engine (i.e. the GCA (Graphics and Compute Array), or GFX) is identical to that found in the Tonga chips.[38] However, Polaris features a newer display controller engine, UVD version 6.3, and other updates.
All Polaris-based chips other than Polaris 30 are produced on the 14 nm FinFET process, developed by Samsung Electronics and licensed to GlobalFoundries.[39] The slightly newer, refreshed Polaris 30 is built on the 12 nm LP FinFET process node, developed by Samsung and GlobalFoundries. The fourth-generation GCN instruction set architecture is compatible with the third generation; it is an optimization for the 14 nm FinFET process, enabling higher GPU clock speeds than the 3rd GCN generation.[40] Architectural improvements include new hardware schedulers, a new primitive discard accelerator, a new display controller, and an updated UVD that can decode HEVC at 4K resolutions at 60 frames per second with 10 bits per color channel.
Chips
discrete GPUs:[41]
- Polaris 10 (also codenamed Ellesmere) found on "Radeon RX 470" and "Radeon RX 480"-branded graphics cards
- Polaris 11 (also codenamed Baffin) found on "Radeon RX 460"-branded graphics cards (also Radeon RX 560D)
- Polaris 12 (also codenamed Lexa) found on "Radeon RX 550" and "Radeon RX 540"-branded graphics cards
- Polaris 20, which is a refreshed (14 nm LPP Samsung/GloFo FinFET process) Polaris 10 with higher clocks, used for "Radeon RX 570" and "Radeon RX 580"-branded graphics cards[42]
- Polaris 21, which is a refreshed (14 nm LPP Samsung/GloFo FinFET process) Polaris 11, used for "Radeon RX 560"-branded graphics cards
- Polaris 22, found on "Radeon RX Vega M GH" and "Radeon RX Vega M GL"-branded graphics cards (as part of Kaby Lake-G)
- Polaris 23, which is a refreshed (14 nm LPP Samsung/GloFo FinFET process) Polaris 12, used for "Radeon Pro WX 3200" and "Radeon RX 540X"-branded graphics cards (also Radeon RX 640)[43]
- Polaris 30, which is a refreshed (12 nm LP GloFo FinFET process) Polaris 20 with higher clocks, used for "Radeon RX 590"-branded graphics cards[44]
In addition to dedicated GPUs, Polaris is used in the APUs of the PlayStation 4 Pro and Xbox One X, codenamed "Neo" and "Scorpio", respectively.
Precision Performance
FP64 performance of all GCN 4th generation GPUs is 1/16 of FP32 performance.
Graphics Core Next 5
| Release date | June 2017 |
|---|---|
| History | |
| Predecessor | Graphics Core Next 4 |
| Successor | CDNA 1, RDNA 1 |
| Support status | |
| Supported, with less regular Windows driver update schedule | |

AMD began releasing details of its next generation of the GCN architecture, termed the "Next-Generation Compute Unit", in January 2017.[40][45][46] The new design was expected to increase instructions per clock and clock speeds, and to add support for HBM2 and a larger memory address space. The discrete graphics chipsets also include "HBCC (High Bandwidth Cache Controller)", but not when integrated into APUs.[47] Additionally, the new chips were expected to include improvements in the rasterisation and render output units. The stream processors are heavily modified from the previous generations to support packed math ("Rapid Packed Math") for 8-bit, 16-bit, and 32-bit numbers. This yields a significant performance advantage when lower precision is acceptable (for example, processing two half-precision numbers at the same rate as one single-precision number).
Nvidia introduced tile-based rasterization and binning with Maxwell,[48] and this was a big reason for Maxwell's efficiency increase. In January 2017, AnandTech speculated that Vega would finally catch up with Nvidia regarding energy-efficiency optimizations due to the new "DSBR (Draw Stream Binning Rasterizer)" to be introduced with Vega.[49]
Vega also added support for a new shader stage, Primitive Shaders.[50][51] Primitive shaders provide more flexible geometry processing and replace the vertex and geometry shaders in a rendering pipeline. As of December 2018, the primitive shaders could not be used because the required API changes had yet to be made.[52]
Vega 10 and Vega 12 use the 14 nm FinFET process, developed by Samsung Electronics and licensed to GlobalFoundries. Vega 20 uses the 7 nm FinFET process developed by TSMC.
Chips
discrete GPUs:
- Vega 10 (14 nm Samsung/GloFo FinFET process) (also codenamed Greenland[53]) found on "Radeon RX Vega 64", "Radeon RX Vega 56", "Radeon Vega Frontier Edition", "Radeon Pro V340", Radeon Pro WX 9100, and Radeon Pro WX 8200 graphics cards[54]
- Vega 12 (14 nm Samsung/GloFo FinFET process) found on "Radeon Pro Vega 20" and "Radeon Pro Vega 16"-branded mobile graphics cards[55]
- Vega 20 (7 nm TSMC FinFET process) found on "Radeon Instinct MI50" and "Radeon Instinct MI60"-branded accelerator cards,[56] "Radeon Pro Vega II", and "Radeon VII"-branded graphics cards.[57]
integrated into APUs:
- Raven Ridge[58] came with VCN 1 which supersedes VCE and UVD and allows full fixed-function VP9 decode.
- Picasso
- Renoir
- Cezanne
Precision performance
Double-precision floating-point (FP64) performance of all GCN 5th-generation GPUs, except for Vega 20, is one-sixteenth of FP32 performance. For Vega 20 with Radeon Instinct, this is half of FP32 performance. For Vega 20 with Radeon VII, this is a quarter of FP32 performance.[59] All GCN 5th-generation GPUs support half-precision floating-point (FP16) calculations at twice the FP32 performance.
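These ratios translate into theoretical throughput as simple multipliers of the FP32 rate; the sketch below tabulates them (the 10 TFLOPS FP32 figure is a placeholder, not a product specification).

```cpp
#include <cstdio>

int main() {
    struct Ratio { const char* config; double fp64; double fp16; };
    const Ratio ratios[] = {
        {"GCN 5 (general)",           1.0 / 16.0, 2.0},
        {"Vega 20 (Radeon Instinct)", 1.0 /  2.0, 2.0},
        {"Vega 20 (Radeon VII)",      1.0 /  4.0, 2.0},
    };
    // Placeholder FP32 throughput; substitute a real product figure.
    const double fp32_tflops = 10.0;
    for (const Ratio& r : ratios)
        std::printf("%-27s FP64 %5.3f TFLOPS, FP16 %4.1f TFLOPS\n",
                    r.config, fp32_tflops * r.fp64, fp32_tflops * r.fp16);
    return 0;
}
```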
Comparison of GCN GPUs
- Table contains only discrete GPUs (including mobile). APU (IGP) and console SoCs are not listed.
| Microarchitecture[60] | GCN 1 | GCN 2 | GCN 3 | GCN 4 | GCN 5 | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Die | Tahiti[61] | Pitcairn[62] | Cape Verde[63] | Oland[64] | Hainan[65] | Bonaire[66] | Hawaii[67] | Topaz[68] | Tonga[69] | Fiji[70] | Ellesmere[71] | Baffin[72] | Lexa[73] | Vega 10[74] | Vega 12[75] | Vega 20[76] | |
| Code name1 | ? | ? | ? | Tiran | ? | ? | Ibiza | Iceland | ? | ? | Polaris 10 | Polaris 11 | Polaris 12 | Greenland | Treasure Refresh | Moonshot | |
| Variant(s) | New Zealand, Malta | Wimbledon, Curaçao, Neptune, Trinidad | Chelsea, Heathrow, Venus, Tropo | Mars, Opal, Litho | Sun, Jet, Exo, Banks | Saturn, Tobago, Strato, Emerald | Vesuvius, Grenada | Meso, Weston, Polaris 24 | Amethyst, Antigua | Capsaicin | Polaris 20, Polaris 30 | Polaris 21 | Polaris 23 | — | — | — | |
| Fab | TSMC 28 nm | GlobalFoundries 14 nm / 12 nm (Polaris 30) | TSMC 7 nm | ||||||||||||||
| Die size (mm2) | 352 / 365 (Malta) | 212 | 123 | 77 | 56 | 160 | 438 | 125 | 366 | 596 | 232 | 123 | 103 | 495 | Unknown | 331 | |
| Transistors (million) | 4,313 | 2,800 | 1,500 | 950 | 690 | 2,080 | 6,200 | 1,550 | 5,000 | 8,900 | 5,700 | 3,000 | 2,200 | 12,500 | Unknown | 13,230 | |
| Transistor density (MTr/mm2) | 12.3 / 12.8 (Malta) | 13.2 | 12.2 | 12.3 | 13.0 | 14.2 | 12.4 | 13.7 | 14.9 | 24.6 | 24.4 | 21.4 | 25.3 | Unknown | 40.0 | ||
| Asynchronous compute engines | 2 | 8 | ? | 8 | 4 | ? | 4 | ||||||||||
| Geometry engines | 2 | 1 | 2 | — | ? | — | 4 | ? | 4 | ||||||||
| Shader engines | — | 4 | ? | 4 | 2 | — | |||||||||||
| Hardware schedulers | — | 2 | ? | 2 | |||||||||||||
| Compute units | 32 | 20 | 10 / 8 (Chelsea) | 6 | 5 / 6 (Jet) | 14 | 44 | 6 | 32 | 64 | 36 | 16 | 10 | 64 | 20 | 64 | |
| Stream processors | 2048 | 1280 | 640 / 512 (Chelsea) | 384 | 320 / 384 (Jet) | 896 | 2816 | 384 | 2048 | 4096 | 2304 | 1024 | 640 | 4096 | 1280 | 4096 | |
| Texture mapping units | 128 | 80 | 40 / 32 (Chelsea) | 24 | 20 / 24 (Jet) | 56 | 176 | 24 | 128 | 256 | 144 | 64 | 40 | 256 | 80 | 256 | |
| Render output units | 32 | 16 | 8 | 16 | 64 | 8 | 32 | 64 | 32 | 16 | 64 | 32 | 64 | ||||
| Z/Stencil OPS | 128 | 64 | 16 | 64 | 256 | 16 | 128 | 256 | — | ||||||||
| L1 cache (KB) | 16 per Compute unit (CU) | ||||||||||||||||
| L2 cache (KB) | 768 | 512 | 256 | 128 / 256 (Jet) | 256 | 1024 | 256 | 768 | 2048 | 1024 | 512 | 4096 | 1024 | 4096 | |||
| Display Core Engine | 6.0 | 6.4 | — | 8.2 | 8.5 | — | 10.0 | 11.2 | 12.0 | 12.1 | |||||||
| Unified Video Decoder | 3.2 | 4.0 | — | 4.2 | — | 5.0 | 6.0 | 6.3 | 7.0 | 7.2 | |||||||
| Video Coding Engine | 1.0 | — | 2.0 | — | 3.0 | 3.4 | 4.0 | 4.1 | |||||||||
| Launch2 | Dec 2011 | Mar 2012 | Feb 2012 | Jan 2013 | May 2015 | Mar 2013 | Oct 2013 | 2014 | Aug 2014 | Jun 2015 | Jun 2016 | Aug 2016 | Apr 2017 | Jun 2017 | Nov 2018 | Nov 2018 | |
| Series (Family) | Southern Islands | Sea Islands | Volcanic Islands | Pirate Islands | Arctic Islands | Vega | Vega II | ||||||||||
| Notes | mobile/OEM | mobile/OEM | mobile | ||||||||||||||
1 Old code names such as Treasure (Lexa) or Hawaii Refresh (Ellesmere) are not listed.
2 Initial launch date. Launch dates of variant chips such as Polaris 20 (April 2017) are not listed.
References
[edit]- ^ AMD Developer Central (January 31, 2014). "GS-4106 The AMD GCN Architecture – A Crash Course, by Layla Mah". Slideshare.net.
- ^ "AMD Launches World's Fastest Single-GPU Graphics Card – the AMD Radeon HD 7970" (Press release). AMD. December 22, 2011. Archived from the original on January 20, 2015. Retrieved January 20, 2015.
- ^ Gulati, Abheek (November 11, 2019). "An Architectural Deep-Dive into AMD's TeraScale, GCN & RDNA GPU Architectures". Medium. Retrieved December 12, 2021.
- ^ "AMD community forums". Community.amd.com. July 15, 2016.
- ^ "LLVM back-end amdgpu". Llvm.org.
- ^ "GCC 9 Release Series Changes, New Features, and Fixes". Retrieved November 13, 2019.
- ^ "AMD GCN Offloading Support". Retrieved November 13, 2019.
- ^ "AMD Boltzmann Initiative – Heterogeneous-compute Interface for Portability (HIP)". November 16, 2015. Archived from the original on January 26, 2016. Retrieved December 8, 2019.
- ^ Smith, Ryan (January 5, 2017). "The AMD Vega GPU Architecture Preview". Anandtech.com. Archived from the original on January 6, 2017. Retrieved July 11, 2017.
- ^ Smith, Ryan. "AMD Dives Deep On Asynchronous Shading". Anandtech.com. Archived from the original on April 2, 2015.
- ^ "Conformant Products". Khronos.org. October 26, 2017.
- ^ Compute Cores Whitepaper (PDF). AMD. 2014. p. 5.
- ^ a b Smith, Ryan (December 21, 2011). "AMD's Graphics Core Next Preview". Anandtech.com. Archived from the original on May 18, 2014. Retrieved April 18, 2017.
- ^ "AMD's Graphics Core Next (GCN) Architecture" (PDF). TechPowerUp. Retrieved February 26, 2024.
- ^ a b Mantor, Michael; Houston, Mike (June 15, 2011). "AMD Graphics Core Next" (PDF). AMD. p. 40. Retrieved July 15, 2014. Quote: "Asynchronous Compute Engine (ACE)".
- ^ "Optimizing GPU occupancy and resource usage with large thread groups". AMD GPUOpen. Retrieved January 1, 2024.
- ^ "White Paper AMD UnifiedVideoDecoder (UVD)" (PDF). June 15, 2012. Retrieved May 20, 2017.
- ^ a b "Not Just A New Architecture, But New Features Too". AnandTech. December 21, 2011. Archived from the original on June 19, 2011. Retrieved July 11, 2014.
- ^ "Kaveri microarchitecture". SemiAccurate. January 15, 2014.
- ^ Airlie, Dave (November 26, 2014). "Merge AMDKFD". freedesktop.org. Retrieved January 21, 2015.
- ^ "/drivers/gpu/drm". Kernel.org.
- ^ "[PATCH 00/83] AMD HSA kernel driver". LKML. July 10, 2014. Retrieved July 11, 2014.
- ^ a b c d e Angelini, Chris (June 29, 2016). "AMD Radeon RX 480 8GB Review". Tom's Hardware. p. 1. Retrieved August 11, 2016.
- ^ "Dissecting the Polaris Architecture" (PDF). 2016. Archived from the original (PDF) on September 20, 2016. Retrieved August 12, 2016.
- ^ Shrout, Ryan (June 29, 2016). "The AMD Radeon RX 480 Review – The Polaris Promise". PC Perspective. p. 2. Archived from the original on October 10, 2016. Retrieved August 12, 2016.
- ^ a b Smith, Ryan (June 29, 2016). "The AMD Radeon RX 480 Preview: Polaris Makes Its Mainstream Mark". AnandTech. p. 3. Archived from the original on July 2, 2016. Retrieved August 11, 2016.
- ^ "AMD Radeon HD 7000 Series to be PCI-Express 3.0 Compliant". TechPowerUp. Retrieved July 21, 2011.
- ^ "AMD Details Next Gen. GPU Architecture". Archived from the original on March 28, 2012. Retrieved August 3, 2011.
- ^ Tony Chen; Jason Greaves, "AMD's Graphics Core Next (GCN) Architecture" (PDF), AMD, archived from the original (PDF) on January 18, 2023, retrieved August 13, 2016
- ^ "AMD's Graphics Core Next Preview: AMD's New GPU, Architected For Compute". AnandTech. December 21, 2011. Archived from the original on May 18, 2014. Retrieved July 15, 2014.
AMD's new Asynchronous Compute Engines serve as the command processors for compute operations on GCN. The principal purpose of ACEs will be to accept work and to dispatch it off to the CUs for processing.
- ^ "Managing Idle Power: Introducing ZeroCore Power". AnandTech.com. December 22, 2011. Archived from the original on January 7, 2012. Retrieved April 29, 2015.
- ^ "AMD's Kaveri A10-7850K tested". AnandTech. January 14, 2014. Archived from the original on January 16, 2014. Retrieved July 7, 2014.
- ^ "AMD Radeon R9-290X". November 21, 2013.
- ^ "Carrizo Overview". Images.anandtech.com. Archived from the original (PNG) on March 4, 2016. Retrieved July 20, 2018.
- ^ "Add DCC Support". Freedesktop.org. October 11, 2015.
- ^ Smith, Ryan (September 10, 2014). "AMD Radeon R9 285 Review". Anandtech.com. Archived from the original on September 12, 2014. Retrieved March 13, 2017.
- ^ a b Cutress, Ian (June 1, 2016). "AMD Announces 7th Generation APU". Anandtech.com. Archived from the original on June 2, 2016. Retrieved June 1, 2016.
- ^ "RadeonFeature". www.x.org.
- ^ "Radeon Technologies Group – January 2016 – AMD Polaris Architecture". Guru3d.com.
- ^ a b Smith, Ryan (January 5, 2017). "The AMD Vega Architecture Teaser: Higher IPC, Tiling, & More, coming in H1'2017". Anandtech.com. Archived from the original on January 6, 2017. Retrieved January 10, 2017.
- ^ WhyCry (March 24, 2016). "AMD confirms Polaris 10 is Ellesmere and Polaris 11 is Baffin". VideoCardz. Retrieved April 8, 2016.
- ^ "Fast vollständige Hardware-Daten zu AMDs Radeon RX 500 Serie geleakt". www.3dcenter.org.
- ^ "AMD Polaris 23". TechPowerUp. Retrieved May 12, 2022.
- ^ Oh, Nate (November 15, 2018). "The AMD Radeon RX 590 Review, feat. XFX & PowerColor: Polaris Returns (Again)". anandtech.com. Archived from the original on November 15, 2018. Retrieved November 24, 2018.
- ^ Kampman, Jeff (January 5, 2017). "The curtain comes up on AMD's Vega architecture". TechReport.com. Retrieved January 10, 2017.
- ^ Shrout, Ryan (January 5, 2017). "AMD Vega GPU Architecture Preview: Redesigned Memory Architecture". PC Perspective. Retrieved January 10, 2017.
- ^ Kampman, Jeff (October 26, 2017). "AMD's Ryzen 7 2700U and Ryzen 5 2500U APUs revealed". Techreport.com. Retrieved October 26, 2017.
- ^ Raevenlord (March 1, 2017). "On NVIDIA's Tile-Based Rendering". techPowerUp.
- ^ "Vega Teaser: Draw Stream Binning Rasterizer". Anandtech.com. January 5, 2017. Archived from the original on January 7, 2017.
- ^ "Radeon RX Vega Revealed: AMD promises 4K gaming performance for $499 – Trusted Reviews". Trustedreviews.com. July 31, 2017. Archived from the original on July 14, 2017. Retrieved March 20, 2017.
- ^ "The curtain comes up on AMD's Vega architecture". Techreport.com. Archived from the original on September 1, 2017. Retrieved March 20, 2017.
- ^ Kampman, Jeff (January 23, 2018). "Radeon RX Vega primitive shaders will need API support". Techreport.com. Retrieved December 29, 2018.
- ^ "ROCm-OpenCL-Runtime/libUtils.cpp at master · RadeonOpenCompute/ROCm-OpenCL-Runtime". github.com. May 3, 2017. Retrieved November 10, 2018.
- ^ "The AMD Radeon RX Vega 64 & RX Vega 56 Review: Vega Burning Bright". Anandtech.com. August 14, 2017. Archived from the original on August 14, 2017. Retrieved November 16, 2017.
- ^ "AMD's Vega Mobile Lives: Vega Pro 20 & 16 in Updated MacBook Pros In November". Anandtech.com. October 30, 2018. Archived from the original on October 31, 2018. Retrieved November 10, 2018.
- ^ "AMD Announces Radeon Instinct MI60 & MI50 Accelerators: Powered By 7nm Vega". Anandtech.com. November 6, 2018. Archived from the original on November 7, 2018. Retrieved November 10, 2018.
- ^ "AMD Unveils World's First 7nm Gaming GPU – Delivering Exceptional Performance and Incredible Experiences for Gamers, Creators and Enthusiasts" (Press release). Las Vegas, Nevada: AMD. January 9, 2019. Retrieved January 12, 2019.
- ^ Ferreira, Bruno (May 16, 2017). "Ryzen Mobile APUs are coming to a laptop near you". Tech Report. Retrieved May 16, 2017.
- ^ "AMD Unveils World's First 7nm Datacenter GPUs – Powering the Next Era of Artificial Intelligence, Cloud Computing and High Performance Computing (HPC) | AMD". AMD.com (Press release). November 6, 2018. Retrieved November 10, 2018.
- ^ "RadeonFeature". x.Org. Retrieved November 21, 2022.
- ^ "AMD Tahiti GPU Specs". TechPowerUp. Retrieved November 20, 2022.
- ^ "AMD Pitcairn GPU Specs". TechPowerUp. Retrieved November 20, 2022.
- ^ "AMD Cape Verde GPU Specs". TechPowerUp. Retrieved November 20, 2022.
- ^ "AMD Oland GPU Specs". TechPowerUp. Retrieved November 20, 2022.
- ^ "AMD Hainan GPU Specs". TechPowerUp. Retrieved November 20, 2022.
- ^ "AMD Bonaire GPU Specs". TechPowerUp. Retrieved November 21, 2022.
- ^ "AMD Hawaii GPU Specs". TechPowerUp. Retrieved November 21, 2022.
- ^ "AMD Topaz GPU Specs". TechPowerUp. Retrieved November 21, 2022.
- ^ "AMD Tonga GPU Specs". TechPowerUp. Retrieved November 21, 2022.
- ^ "AMD Fiji GPU Specs". TechPowerUp. Retrieved November 21, 2022.
- ^ "AMD Ellesmere GPU Specs". TechPowerUp. Retrieved November 21, 2022.
- ^ "AMD Baffin GPU Specs". TechPowerUp. Retrieved November 21, 2022.
- ^ "AMD Lexa GPU Specs". TechPowerUp. Retrieved November 21, 2022.
- ^ "AMD Vega 10 GPU Specs". TechPowerUp. Retrieved November 21, 2022.
- ^ "AMD Vega 12 GPU Specs". TechPowerUp. Retrieved November 21, 2022.
- ^ "AMD Vega 20 GPU Specs". TechPowerUp. Retrieved November 21, 2022.
Overview
Development and history
AMD's development of the Graphics Core Next (GCN) architecture was rooted in its 2006 acquisition of ATI Technologies, which expanded its graphics expertise and spurred internal research and development focused on parallel computing for GPUs.[6] The acquisition, completed on October 25, 2006, for approximately $5.4 billion, integrated ATI's graphics IP with AMD's CPU technology, enabling advancements in heterogeneous computing and setting the stage for a unified approach to CPU-GPU integration.[6] Following this, AMD shifted its GPU design from the VLIW-based TeraScale architecture to a SIMD-based model with GCN, aiming to improve programmability, power efficiency, and performance consistency for both graphics and general-purpose compute workloads.[1]

In 2011, AMD demonstrated its next-generation 28 nm graphics processor, previewing the GCN architecture as a successor to TeraScale to deliver enhanced compute performance and full DirectX 11 support.[7] The architecture was formally detailed in December 2011, emphasizing its design for scalable compute capabilities in discrete GPUs and integrated solutions.[8] Initial silicon for the Radeon HD 7000 series (Southern Islands family) was taped out in 2011, with the first product, the Radeon HD 7970, launching on December 22, 2011, as AMD's flagship single-GPU card built on GCN 1.0.[9]

GCN evolved through several iterations, starting with Southern Islands (GCN 1.0) in 2011-2012, followed by Sea Islands (GCN 2.0) in 2013 with products like the Radeon R9 290 series, and Volcanic Islands (GCN 3.0) in 2015 via the Radeon R9 300 series. Later generations included Vega (GCN 5.0), launched in August 2017 with the Radeon RX Vega series, and Vega 20 (GCN 5.1) in November 2018, marking the final major update before the transition to the RDNA architecture in 2019.[10] These milestones reflected incremental improvements in efficiency, feature support, and process nodes while maintaining the core SIMD design.[1]

GCN played a pivotal role in AMD's strategy for integrated accelerated processing units (APUs) and data center GPUs, enabling seamless CPU-GPU collaboration through features like unified virtual memory. First integrated into APUs with the Kabini/Temash series in 2013, GCN powered subsequent designs like Kaveri (2014) and later Ryzen APUs, enhancing everyday computing and thin-client applications. In the data center, GCN underpinned professional GPUs such as FirePro and the Instinct series, with the MI25 (Vega-based) launching in June 2017 to target high-performance computing and deep learning workloads. This versatility solidified GCN's importance in AMD's push toward heterogeneous systems and expanded market presence beyond consumer graphics.[1]

Key innovations and design goals
Graphics Core Next (GCN) represented a fundamental shift in AMD's GPU design philosophy, moving away from the Very Long Instruction Word (VLIW) architecture of the preceding TeraScale generation to a single-instruction multiple-data (SIMD) model. This transition aimed to enhance efficiency across diverse workloads by enabling better utilization of hardware resources through wavefront-based execution, where groups of 64 threads (a wavefront) are processed in a more predictable manner. The SIMD approach allowed for issuing up to five instructions per clock cycle across vector and scalar pipelines, improving instruction throughput and reducing the complexity associated with VLIW's multi-issue dependencies.[1][11][2]

A core design goal of GCN was to elevate general-purpose GPU (GPGPU) computing, with full support for OpenCL 1.2 and later standards, alongside DirectCompute 11.1 and C++ AMP, to facilitate heterogeneous computing applications. This emphasis targeted at least 2x the shader performance of TeraScale architectures, achieved through optimized compute units that balanced graphics and parallel processing demands. The architecture integrated graphics and compute pipelines into a unified framework, supporting DirectX 11 and preparing for DirectX 12 feature levels, while enabling compatibility with AMD's Heterogeneous System Architecture (HSA) for seamless CPU-GPU collaboration via shared virtual memory.[1][11][2]

Power efficiency was another paramount objective, addressed through innovations like ZeroCore Power, which powers down idle GPU components to under 3 W during long idle periods, a feature first implemented in GCN 1.0. Complementary technologies such as fine-grained clock gating and PowerTune for dynamic voltage and frequency scaling further optimized energy use, enabling configurations from low-power APUs consuming 2-3 W to high-end discrete GPUs delivering over 3 TFLOPS at 250 W. This scalability was inherent in GCN's modular compute unit design, allowing flexible integration across market segments while maintaining consistent architectural principles.[1][11]

Core Microarchitecture
Instruction set
The Graphics Core Next (GCN) instruction set architecture (ISA) is a 32-bit RISC-like design optimized for both graphics and general-purpose computing workloads, featuring distinct scalar (S) and vector (V) instruction types that enable efficient ALU operations across wavefronts.[3] Scalar instructions operate on a single value per wavefront for control flow and address calculations, while vector instructions process one value per thread, supporting up to three operands in formats like VOP2 (two inputs) and VOP3 (up to three inputs, including 64-bit operations).[3] This separation allows scalar units to handle program control independently from vector units focused on data-parallel computation.[1]

Key instruction categories encompass arithmetic operations such as S_ADD_I32 for scalar integer addition and V_ADD_F32 or V_ADD_F64 for vector floating-point addition; bitwise operations including S_AND_B32 and V_AND_B32; and transcendental functions like V_SIN_F32 or V_LOG_F32 for approximations of sine, cosine, and logarithms in the vector ALU (VALU).[3] Control flow is managed primarily through scalar instructions such as S_BRANCH for unconditional jumps and S_CBRANCH for conditional branches based on wavefront execution masks, alongside barriers and synchronization primitives to coordinate thread groups.[3] These categories support a wavefront-based execution model where each wavefront comprises 64 threads (organized as 16 work-items across 4 components for vector4 operations), enabling SIMD processing of instructions across the group.[3][1]
From GCN 1.0 onward, the ISA includes native support for 64-bit integers (e.g., via V_ADD_U64) and double-precision floating-point operations (e.g., V_FMA_F64 for fused multiply-add), ensuring IEEE-754 compliance for compute-intensive tasks.[3][1] Starting with GCN 3.0, the ISA includes half-precision floating-point (FP16) instructions like V_ADD_F16 and V_FMA_F16 for improved efficiency in machine learning workloads, alongside packed math features in GCN 4.0 such as V_CVT_PK_U8_F32 for converting multiple low-precision values in a single operation.[3] The ISA maintains broad compatibility across GCN generations (1.0 through 5.0), with new capabilities added via minor opcode extensions rather than breaking changes, facilitating binary portability for shaders and kernels.[3]
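The scalar/vector split described above can be pictured with a small software model: one set of scalar registers and one execution mask per 64-thread wavefront, and one vector-register slot per lane. The sketch below is a conceptual illustration only (register counts are simplified, and no actual ISA encoding is modeled):

```cpp
// Conceptual model of GCN's scalar/vector split: scalar state is shared
// by a wavefront, vector state is per lane, and the EXEC mask gates which
// lanes a vector operation touches. Illustrative only.
#include <cstdint>
#include <cstdio>

constexpr int kWavefront = 64;

struct Wavefront {
    uint64_t exec = ~0ull;              // EXEC mask: one bit per thread
    int32_t  sgpr[16] = {};             // scalar values, one per wavefront
    float    vgpr[4][kWavefront] = {};  // vector registers, one slot per lane
};

// V_ADD_F32-style operation: applied per lane, honoring the EXEC mask.
void v_add_f32(Wavefront& w, int dst, int src0, int src1) {
    for (int lane = 0; lane < kWavefront; ++lane)
        if (w.exec >> lane & 1)
            w.vgpr[dst][lane] = w.vgpr[src0][lane] + w.vgpr[src1][lane];
}

int main() {
    Wavefront w;
    for (int lane = 0; lane < kWavefront; ++lane) {
        w.vgpr[1][lane] = float(lane);
        w.vgpr[2][lane] = 1.0f;
    }
    w.exec = 0xFFFFFFFFull;  // diverged branch: only lanes 0-31 active
    v_add_f32(w, 0, 1, 2);
    std::printf("%f %f\n", w.vgpr[0][10], w.vgpr[0][40]);  // 11.0, 0.0
    return 0;
}
```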
Command processing and schedulers
The Graphics Command Processor (GCP) serves as the front-end unit in the Graphics Core Next (GCN) architecture responsible for parsing high-level API commands from the driver, such as draw calls and state changes, and mapping them to the appropriate processing elements in the graphics pipeline.[1] It coordinates the traditional rendering workflow by distributing workloads across shader stages and fixed-function hardware units, enabling efficient handling of graphics-specific tasks like vertex processing and rasterization setup.[1] The GCP processes separate command streams for different shader types, which facilitates multitasking and improves overall pipeline utilization by allowing concurrent execution of graphics operations.[1]

Complementing the GCP, Asynchronous Compute Engines (ACEs) manage independent compute queues, allowing compute workloads to execute in parallel with graphics tasks for better resource overlap.[12] Each ACE fetches commands from dedicated queues, forms prioritized task lists ranging from background to real-time levels, and dispatches workgroups to compute units (CUs) while checking for resource availability.[1] GCN supports up to eight ACEs in later generations, enabling multiple independent queues that share hardware with the graphics pipeline but operate asynchronously, with graphics typically holding priority during contention.[1] This design reduces idle time on CUs by interleaving compute shaders with graphics rendering, though it incurs a small overhead known as the "async tax" due to synchronization and context switching.[12]

The scheduler hierarchy in GCN begins with a global command processor that dispatches work packets from user-visible queues in DRAM to workload managers, which then distribute tasks across shader engines and CUs.[13] These managers route commands to per-SIMD schedulers within each CU, where four SIMD units per CU each maintain a scheduler partition buffering up to 10 wavefronts for round-robin execution.[13] This tiered structure supports dispatching one wavefront per cycle per ACE or GCP, with up to five instructions issued per CU cycle across multiple wavefronts to maximize throughput.[2]

Hardware schedulers within the ACEs and per-SIMD units handle thread management by prioritizing queues and enabling preemption for efficient workload balancing.[1] Priority queuing allows higher-priority tasks to preempt lower ones by flushing active workgroups and switching contexts via a dedicated cache, supporting out-of-order completion while ensuring synchronization through fences or shared memory.[1] This mechanism accommodates up to 81,920 in-flight work items across 32 CUs, promoting high occupancy and reducing latency in heterogeneous workloads.[1]

Introduced in the fourth generation of GCN (GCN 4.0), the Primitive Discard Accelerator (PDA) enhances command processing by early rejection of degenerate or small primitives before they reach the vertex shader or rasterizer.[14] It filters triangles with zero area or no sample coverage during input assembly, reducing unnecessary vertex fetches and geometry workload by up to 3.5 times in high-density tessellation scenarios.[15] The PDA integrates into the front-end pipeline to cull non-contributing primitives efficiently, improving energy efficiency and performance in graphics-heavy applications without impacting valid geometry.[15]

Compute units and wavefront execution
The compute unit (CU) serves as the fundamental processing element in the Graphics Core Next (GCN) architecture, comprising 64 shader processors organized into four 16-wide SIMD units.[1] Each SIMD unit handles 16 work-items simultaneously, enabling the CU to process a full wavefront of 64 threads by executing it across four clock cycles in a lockstep manner.[1] This structure emphasizes massive parallelism while maintaining scalar control for divergence handling.

At the heart of execution is the wavefront, the basic scheduling unit consisting of 64 threads that operate in lockstep across the SIMD units.[16] These threads execute vector instructions synchronously, with the hardware decomposing each wavefront into four groups of 16 lanes processed sequentially over four cycles to accommodate the 16-wide SIMD width.[16] GCN supports dual-issue capability, allowing the scheduler to dispatch one scalar instruction alongside a vector instruction in the same cycle, which enhances throughput for mixed workloads involving uniform operations and per-thread computations.[16] The CU scheduler oversees wavefront dispatch using round-robin arbitration across up to six execution pipelines, managing instruction buffers and ensuring balanced utilization while tracking outstanding operations like vector ALU counts.[1][3]

The SIMD vector arithmetic logic unit (VALU) within each CU performs core floating-point and integer operations, supporting full IEEE-754 compliance for FP32 and INT32 at a rate of one operation per lane per cycle, yielding 64 FP32 operations per CU clock in the base configuration.[1] Export units integrated into the CU handle output from wavefronts, facilitating memory stores to global buffers via vector memory instructions and raster operations such as exporting pixel colors or positions to render targets.[3] These units support compression for efficiency and are shared across wavefronts to synchronize data flow with downstream graphics or compute pipelines.[3]

Double-precision floating-point performance evolved significantly across GCN generations to better support scientific computing. In GCN 1.0, double-precision operations ran at 1/16 the rate of single-precision due to shared hardware resources prioritizing FP32 workloads.[1] Subsequent iterations, starting with GCN 2.0, improved this to 1/4 the single-precision rate through dedicated ALU enhancements and optimized instructions like V_FMA_F64, enabling higher throughput for applications requiring FP64 arithmetic without compromising the core scalar-vector balance.[3][1]

Graphics and Compute Pipeline
Geometric processing
In the Graphics Core Next (GCN) architecture, geometric processing encompasses the initial stages of the graphics pipeline, handling vertex data ingestion, programmable shading for transformations, and fixed-function optimization to prepare primitives for rasterization. This pipeline begins with vertex fetch, where vertex attributes are retrieved from vertex buffers stored in system memory using buffer load instructions such as TBUFFER_LOAD_FORMAT, which access data through a unified read/write cache hierarchy including a 16 KB L1 cache per compute unit (CU) and a shared 768 KB L2 cache.[17][11] Primitive assembly follows, where fetched vertices are grouped into primitives (such as triangles, lines, or points) by dual geometry engines capable of processing up to two primitives per clock cycle, enabling high throughput (for instance, 1.85 billion primitives per second on the Radeon HD 7970 at 925 MHz).[1][11]

The programmable vertex shader stage transforms these vertices using shaders executed on the scalable array of CUs, where each CU contains four 16-wide SIMD units that process 64-element wavefronts in parallel via a non-VLIW instruction set architecture (ISA) with vector ALU (VALU) operations for tasks like position calculations and attribute interpolation. This design allows flexible control flow and IEEE-754 compliant floating-point arithmetic, distributing workloads across up to 32 CUs for efficient parallel execution without the rigid bundling of prior VLIW architectures.[17][1] Tessellation and geometry shaders extend this programmability, with a dedicated hardware tessellator performing efficient domain subdivision, generating 2 to 64 patches per invocation, up to four times faster than previous generations through improved parameter caching and vertex reuse that spills to the coherent L2 cache when needed.[1][11] Geometry shaders, also run on CUs, enable primitive amplification and manipulation using instructions like S_SENDMSG for task signaling, supporting advanced effects such as fur or grass generation.[17]

Fixed-function clipping and culling stages then optimize the pipeline by rejecting unnecessary geometry, including backface culling to discard primitives facing away from the viewer and view-frustum culling to eliminate those outside the camera's field of view, reducing downstream computational load.[1][11] The setup engine concludes pre-raster processing by converting assembled primitives into a standardized topology (typically triangles, but also points or lines) for handover to the rasterizer, which generates up to 16 pixels per cycle per primitive while integrating hierarchical Z-testing for early occlusion detection.[1] These stages collectively leverage GCN's unified virtual addressing and scalable design, supporting up to 1 terabyte of addressable memory to handle complex scenes efficiently across generations.[1]

Rasterization and pixel processing
In the Graphics Core Next (GCN) architecture, the rasterization stage converts primitives into fragments by scanning screen space tiles, with each rasterizer unit processing one triangle per clock cycle and generating up to 16 pixels per cycle.[1] This target-independent rasterization offloads anti-aliasing computations to fixed-function hardware, reducing overhead on programmable shaders.[1] Hierarchical Z-testing is integrated early in the pipeline, performing coarse depth comparisons on tile-level buffers to cull occluded fragments before they reach the shading stage, thereby improving efficiency by avoiding unnecessary pixel shader invocations.[1]

Fragment shading occurs within the compute units (CUs), where pixel shaders execute as 64-wide wavefronts, leveraging the same SIMD hardware as vertex and compute shaders for unified processing.[2] GCN supports multi-sample anti-aliasing (MSAA) up to 8x coverage, with render back-ends (RBEs) equipped with 16 KB color caches per RBE for sample storage and compression, enabling efficient handling of anti-aliased pixels without excessive memory bandwidth demands.[1] Enhanced quality AA (EQAA) extends this to 16x in some configurations using 4 KB depth caches per pixel quad.[1]

Texture sampling is managed by texture fetch units (TFUs) integrated into each CU, typically four per CU in first-generation implementations, which compute up to 16 sampling addresses per cycle and fetch texels from the L1 cache.[17] These units support bilinear, trilinear, and anisotropic filtering up to 16x, with anisotropic modes incurring up to N times the cost of bilinear filtering based on the anisotropy factor to enhance texture clarity at oblique angles.[18]

Following shading, fragments undergo depth and stencil testing in the RBEs, which apply configurable tests to determine visibility and resolve multi-sample coverage.[1] Blending operations then combine fragment colors with framebuffer data using coverage-weighted accumulation, supporting formats like RGBA8 and advanced blending modes for final pixel output.[1] Pixel exports from CUs route directly to these RBEs, bypassing the L2 cache in some cases for optimized framebuffer access.[2]

GCN integrates dedicated multimedia accelerators for audio and video processing. The Video Coding Engine (VCE) provides hardware-accelerated encoding and decoding, starting with H.264/AVC support at 1080p/60 fps in first-generation GCN via VCE 1.0, and evolving to include HEVC (H.265) in VCE 3.0 (third-generation) and VCE 4.0 (fifth-generation Vega).[19] TrueAudio, introduced in second-generation GCN, is a dedicated ASIC co-processor that simulates spatial audio effects, enhancing realism by processing 3D soundscapes in real-time alongside graphics rendering.[20]

Compute and asynchronous operations
Graphics Core Next (GCN) architectures introduced robust support for compute shaders, enabling general-purpose computing on graphics processing units (GPGPU) through APIs such as OpenCL 1.2 and DirectCompute 11, which provide CUDA-like programmability for parallel workloads.[1] These compute shaders incorporate synchronization primitives including barriers for intra-work-group coordination and atomic operations (e.g., compare-and-swap, max, min) on local and global memory to ensure data consistency across threads.[1] Barriers are implemented via the S_BARRIER instruction supporting up to 16 wavefronts per work-group, while atomics leverage the 64 KB local data share (LDS) with 32-bit wide entries for efficient thread-level operations.[1]

A key innovation in GCN is the Asynchronous Compute Engines (ACEs), which manage compute workloads independently from graphics processing to enable overlapping execution of graphics and compute tasks on the same hardware resources.[1] Each ACE handles multiple task queues with priority-based scheduling (ranging from background to real-time), each supporting up to 8 queues, with high-end implementations featuring multiple ACEs for greater parallelism (up to 64 queues total), facilitating concurrent dispatch without stalling the graphics pipeline.[12] This asynchronous model supports out-of-order completion of tasks, synchronized through mechanisms like cache coherence, LDS, or the global data share (GDS), thereby maximizing CU utilization during idle periods in graphics rendering.[1]

Compute wavefronts, groups of 64 threads executed in lockstep, are dispatched directly to CUs by the ACEs, bypassing the graphics command processor and fixed-function stages to streamline non-graphics workloads.[1] Each CU can schedule up to 40 wavefronts (10 per SIMD unit across 4 SIMDs), enabling high throughput for compute-intensive kernels while sharing resources with graphics shaders when possible.[1] This direct path allows for efficient multitasking, where compute operations fill gaps left by graphics latency, such as during vertex or pixel processing waits.

GCN supports large work-group sizes of up to 1024 threads per group, divided into multiple wavefronts for execution, providing flexibility for algorithms requiring extensive intra-group communication.[12] Shared memory is facilitated by the 64 KB LDS per CU, banked into 16 or 32 partitions to minimize contention and support fast atomic accesses within a work-group.[1] Occupancy is tuned by factors like vector general-purpose register (VGPR) usage, with maximum waves per SIMD reaching 10 for low-register kernels (≤24 VGPRs) but dropping to 1 for high-register ones (>128 VGPRs).[12]

These features enable diverse applications in GPGPU tasks, such as physics simulations in game engines that leverage async queues for real-time particle effects and collision detection.[1] In machine learning, GCN facilitates inference workloads through compute shaders, though performance is limited without dedicated tensor cores, relying instead on general matrix multiplications via OpenCL or DirectCompute.[12] Overall, the asynchronous model enhances efficiency in heterogeneous computing scenarios, allowing seamless integration with CPU-driven systems via shared memory models such as those in Heterogeneous System Architecture (HSA).[1]

Memory and System Features
Memory and System Features

Unified virtual memory
Graphics Core Next (GCN) introduces unified virtual memory (UVM) to enable seamless sharing of a single address space between the CPU and GPU, eliminating the need for explicit data copies in heterogeneous computing applications. This allows pointers allocated by the CPU to be accessed directly by GPU kernels, facilitating fine-grained data sharing and improving programmability (see the example at the end of this section). Implemented from the first-generation GCN architecture onward, UVM leverages hardware and driver support to manage memory virtualization, supporting a 40-bit virtual address space that accommodates 1 TiB of addressable memory for 3D resources and textures.[1]

The GPU's memory management unit (MMU) handles page table management, using 4 KB pages compatible with x86 addressing for transparent translation of virtual to physical addresses. This setup supports variable page sizes, including optional 4 KB sub-pages within 64 KB frames, ensuring efficient mapping for frame buffers and other resources. Page tables are populated by the driver, with the GPU MMU performing on-demand translations to maintain compatibility with the host system's virtual memory model.[1]

Pointer manipulation is handled by the scalar ALU, which processes 64-bit pointer values from registers to enable dynamic address calculation during kernel execution. This allows fine-grained memory access patterns, with vector memory instructions operating at granularities from 32 bits to 128 bits, supporting atomic operations and variable data structures without fixed alignment constraints. These mechanisms let CPU-allocated data structures be referenced directly on the GPU, providing zero-copy semantics.[1]

Cache coherency in GCN's UVM is maintained through the L2 cache hierarchy and integration with the input-output memory management unit (IOMMU), which translates x86 virtual addresses for direct memory access (DMA) transfers between CPU and GPU. The IOMMU ensures consistent visibility of shared memory pools across the system, preventing stale-data issues by coordinating cache invalidations and flushes. This hardware-assisted coherency model supports system-level memory pools, allowing the GPU to access host memory transparently while minimizing synchronization overhead.[1] Integration with the Heterogeneous System Architecture (HSA) further extends UVM for coherent, multi-device environments.[1]

The primary benefit of GCN's UVM lies in heterogeneous computing, where direct pointer-based sharing drastically cuts data-transfer overhead compared to traditional copy-based models. This both improves application performance and simplifies development by abstracting memory-management complexities.[1]
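In application code, this zero-copy model surfaces through APIs such as OpenCL 2.0 shared virtual memory (SVM). The sketch below assumes a device with fine-grained SVM support and omits error checking: the CPU initializes a buffer through an ordinary pointer, and the GPU kernel operates on the same virtual address with no explicit copy issued:

```cpp
// Zero-copy sharing via OpenCL 2.0 fine-grained SVM.
#define CL_TARGET_OPENCL_VERSION 200
#include <CL/cl.h>
#include <cstdio>

static const char* kSrc = R"CLC(
__kernel void scale(__global float* data) {
    data[get_global_id(0)] *= 2.0f;    // dereferences the CPU's pointer
}
)CLC";

int main() {
    cl_platform_id plat; cl_device_id dev;
    clGetPlatformIDs(1, &plat, nullptr);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, nullptr);
    cl_context ctx = clCreateContext(nullptr, 1, &dev, nullptr, nullptr, nullptr);
    cl_command_queue q = clCreateCommandQueueWithProperties(ctx, dev, nullptr, nullptr);

    // One allocation, visible at the same virtual address to both agents.
    const size_t n = 256;
    float* shared = (float*)clSVMAlloc(
        ctx, CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER, n * sizeof(float), 0);
    for (size_t i = 0; i < n; ++i) shared[i] = (float)i;   // plain CPU writes

    cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, nullptr, nullptr);
    clBuildProgram(prog, 1, &dev, "-cl-std=CL2.0", nullptr, nullptr);
    cl_kernel k = clCreateKernel(prog, "scale", nullptr);
    clSetKernelArgSVMPointer(k, 0, shared);                // pass the raw pointer

    clEnqueueNDRangeKernel(q, k, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);
    clFinish(q);
    printf("shared[3] = %.1f (expected 6.0)\n", shared[3]); // CPU reads the result
    clSVMFree(ctx, shared);
}
```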
Heterogeneous System Architecture

Heterogeneous System Architecture (HSA) serves as the foundational framework for Graphics Core Next (GCN) to enable unified computing between CPUs and GPUs, allowing seamless integration and task orchestration across heterogeneous agents without traditional operating-system intervention.[21] Developed by the HSA Foundation in collaboration with AMD, this architecture defines specifications for user-mode operations, shared memory models, and a portable intermediate language, optimizing GCN for applications requiring tight CPU-GPU collaboration.[1] By abstracting hardware differences, HSA facilitates efficient workload distribution, reducing latency and power overhead in systems such as AMD's Accelerated Processing Units (APUs).[21]

At the core of HSA's integration model are user-level queues, which allow direct signaling between CPU and GPU agents in user space, bypassing kernel-mode switches for lower-latency communication.[21] These queues are runtime-allocated memory structures that hold command packets, enabling applications to enqueue tasks efficiently without OS involvement, as specified in the HSA Platform System Architecture.[21] In GCN implementations, these queues support priority-based scheduling, from background to real-time tasks, enhancing multi-tasking in heterogeneous environments.[1]

Dispatch from the CPU to the GPU occurs through Architected Queuing Language (AQL) packets enqueued on these user-level queues, supporting fine-grained work dispatch for kernels and agents.[21] AQL packets, such as the kernel-dispatch type, specify launch dimensions, code handles, arguments, and completion signals, allowing agents to build and enqueue their own commands for fast, low-power execution on GCN hardware (a sketch of the packet layout appears at the end of this section).[21] This mechanism reduces launch latency by enabling direct enqueuing of tasks to kernel agents, with support for dependencies and out-of-order completion.[1]

HSA leverages shared virtual memory with coherent caching to enable zero-copy data sharing between CPU and GPU, utilizing the unified virtual address space for direct access without data movement.[21] All agents access global memory coherently, with automatic cache maintenance ensuring consistency across the system, as mandated by the HSA specifications.[21] This model, compatible with GCN's virtual addressing, promotes efficient data-parallel computing by allowing pointers to be passed directly between processing elements.[1]

The HSA Intermediate Language (HSAIL) provides a portable virtual ISA that is compiled to the native GCN instruction set architecture (ISA) by a finalizer, ensuring hardware-agnostic code generation for heterogeneous execution.[22] HSAIL, a RISC-like language supporting data-parallel kernels with grids, work-groups, and work-items, translates operations such as arithmetic, memory loads/stores, and synchronization into optimized GCN instructions, with features like relaxed memory ordering and acquire/release semantics.[22] The finalizer handles optimizations such as register allocation and wavefront packing tailored to GCN's SIMD execution model.[22]

HSA adoption in GCN-based APUs began with the Kaveri series (GCN 2.0), the first to implement full HSA features, including user-level queues and shared memory, for seamless CPU-GPU task assignment.[23] Later generations extended this to Ryzen APUs with Vega graphics (GCN 5.0), supporting advanced HSA capabilities through the ROCm software stack, which builds on HSA for high-performance computing workloads.[24] These implementations enable features such as heterogeneous queuing and unified memory in consumer and professional systems, driving applications in compute-intensive domains.[23]
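The AQL kernel-dispatch packet is a fixed 64-byte record. The standalone C++ below mirrors its publicly specified layout for exposition only; production code would use hsa_kernel_dispatch_packet_t from the HSA runtime's hsa.h and write the packet into a user-level queue, and the launch dimensions shown here are arbitrary:

```cpp
// Illustrative mirror of the 64-byte AQL kernel-dispatch packet
// (field layout per the public HSA Platform System Architecture spec).
#include <cstdint>
#include <cstdio>

struct AqlKernelDispatchPacket {
    uint16_t header;                // packet type, barrier bit, memory scopes
    uint16_t setup;                 // number of grid dimensions (1-3)
    uint16_t workgroup_size_x, workgroup_size_y, workgroup_size_z;
    uint16_t reserved0;
    uint32_t grid_size_x, grid_size_y, grid_size_z;
    uint32_t private_segment_size;  // per-work-item scratch, in bytes
    uint32_t group_segment_size;    // LDS required by the kernel, in bytes
    uint64_t kernel_object;         // handle of the finalized GCN machine code
    uint64_t kernarg_address;       // pointer to the kernel's argument block
    uint64_t reserved2;
    uint64_t completion_signal;     // signaled by the GPU on completion
};
static_assert(sizeof(AqlKernelDispatchPacket) == 64, "AQL packets are 64 bytes");

int main() {
    // Describe a 1-D launch of 1024 work-items in groups of 256.
    AqlKernelDispatchPacket pkt = {};
    pkt.setup = 1;                          // one grid dimension
    pkt.workgroup_size_x = 256; pkt.workgroup_size_y = 1; pkt.workgroup_size_z = 1;
    pkt.grid_size_x = 1024;     pkt.grid_size_y = 1;      pkt.grid_size_z = 1;
    pkt.group_segment_size = 256 * sizeof(float);  // LDS the kernel will use
    // kernel_object, kernarg_address, and completion_signal would be filled
    // in from the loaded code object and a runtime-created signal.
    printf("dispatching %u work-items in %u-wide groups\n",
           pkt.grid_size_x, pkt.workgroup_size_x);
}
```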
Lossless compression and accelerators

Graphics Core Next (GCN) incorporates Delta Color Compression (DCC) as a lossless compression technique designed for color buffers in 3D rendering pipelines. DCC exploits data coherence by dividing color buffers into blocks and encoding one full-precision pixel value per block, with the remaining pixels represented as deltas using fewer bits when colors are similar (a toy illustration appears at the end of this section). This delta encoding enables compression that can reduce memory-bandwidth usage by up to 2x on coherent data, such as skies or gradients, while remaining fully lossless to preserve rendering accuracy. Introduced with GCN 1.2 (third-generation) architectures, DCC allows shader cores to read compressed data directly, bypassing decompression overhead in render-to-texture operations and improving overall efficiency.[25]

The Primitive Discard Accelerator (PDA) is a hardware mechanism that culls inefficient primitives early in the graphics pipeline, particularly benefiting tessellation-heavy workloads. The PDA identifies and discards small or degenerate (zero-area) triangles that do not contribute to the final image, preventing unnecessary processing in the compute units and reducing wasted cycles. The accelerator becomes increasingly effective as triangle density rises, enabling up to 3.5x higher geometry throughput in dense scenes compared to prior implementations. Debuting in GCN 4.0 (Polaris), the PDA improves pre-rasterization efficiency by filtering occluded or irrelevant geometry without affecting visible output.[15]

GCN supports standard block-based texture compression formats, including the BCn (Block Compression) variants BC1 through BC7, which reduce texture memory footprint by encoding 4x4-pixel blocks into fixed-size outputs of 64 or 128 bits. These formats are decompressed on the fly within the texture mapping units (TMUs), allowing efficient sampling of up to four texels per clock while minimizing bandwidth demands on main memory. Complementing this, fast clear operations optimize framebuffer initialization by rapidly setting surfaces to common values such as 0.0 or 1.0, leveraging compression to avoid full buffer writes and achieving far higher speeds than traditional clears—often orders of magnitude faster in bandwidth-constrained scenarios. This combination is integral to GCN's render back-ends, where hierarchical Z-testing further aids in discarding occluded pixels after a clear.[1][25]

To improve power efficiency, GCN implements ZeroCore Power, a power-gating technology that aggressively reduces leakage in idle components. When the GPU enters a long-idle state—such as displaying a static screen—ZeroCore gates clocks and powers down compute units, caches, and other blocks, dropping idle power draw from around 15 W to under 3 W. Available from GCN 1.0 (Southern Islands chips such as Tahiti), this feature achieves up to a 90% reduction in static power leakage by isolating unused hardware, improving efficiency in discrete GPU deployments without compromising resume latency.[1][26]
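A toy model of the delta encoding behind DCC (the actual hardware block format is not publicly documented; the block size and 8-bit delta width here are arbitrary choices): one anchor pixel is stored at full precision, and the block compresses only when every other pixel fits in a small delta:

```cpp
// Toy delta encoder: anchor value plus narrow signed deltas.
#include <cstdint>
#include <cstdio>

// Try to encode a block as anchor + 8-bit deltas; return false if any
// delta overflows, in which case the block would stay uncompressed.
bool encode_block(const int32_t* px, int n, int32_t& anchor, int8_t* deltas) {
    anchor = px[0];
    for (int i = 1; i < n; ++i) {
        int32_t d = px[i] - anchor;
        if (d < -128 || d > 127) return false;   // incoherent block: give up
        deltas[i - 1] = (int8_t)d;
    }
    return true;
}

int main() {
    // A coherent block, like a patch of sky: values cluster around 5000.
    int32_t sky[8] = {5000, 5001, 5003, 4999, 5002, 5004, 5000, 4998};
    int32_t anchor; int8_t deltas[7];
    if (encode_block(sky, 8, anchor, deltas)) {
        // 8 x 4 bytes -> 4 + 7 bytes: bandwidth saved losslessly.
        printf("compressed: 32 bytes -> %zu bytes\n",
               sizeof(anchor) + sizeof(deltas));
    }
}
```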
Generations

First generation (GCN 1.0)
The first generation of the Graphics Core Next (GCN 1.0) architecture, codenamed Southern Islands, debuted with AMD's Radeon HD 7000 series GPUs, marking a shift to a more compute-oriented design than the prior VLIW-based architectures. Announced on December 22, 2011, and available starting January 9, 2012, these GPUs were fabricated on a 28 nm process node by TSMC, enabling higher transistor density and improved power efficiency. The architecture introduced foundational support for unified virtual memory (UVM), allowing shared virtual address spaces between CPU and GPU for simplified heterogeneous computing, though limited to 64 KB pages with 4 KB sub-pages in initial implementations.[9][1]

Key innovations included the ZeroCore Power technology, which powers down idle hardware to reduce leakage during low-activity periods, available across the Radeon HD 7900, 7800, and 7700 series. Double-precision floating-point (FP64) performance was configured at 1/4 the rate of single precision (FP32) on consumer GPUs, prioritizing graphics workloads over high-end compute tasks. The architecture supported DirectX 11 and OpenCL 1.2, enabling advanced tessellation, compute shaders, and general-purpose GPU computing, but lacked full asynchronous-compute optimization in early drivers, relying on two asynchronous compute engines (ACEs) for basic concurrent execution.[1][2]

Representative implementations included the flagship Tahiti GPU in the Radeon HD 7970, featuring 32 compute units (CUs), 2048 stream processors, and 3.79 TFLOPS of FP32 performance at a 250 W TDP, paired with 3 GB of GDDR5 memory on a 384-bit bus. Lower-end models used the Cape Verde GPU, as in the Radeon HD 7770 GHz Edition with 10 CUs, 640 stream processors, over 1 TFLOPS FP32 at a 1000 MHz core clock, and an 80 W TDP, targeting mainstream desktops with 1 GB of GDDR5 on a 128-bit bus. These discrete GPUs powered high-end gaming and early professional visualization, with PCI Express 3.0 connectivity and features such as AMD Eyefinity multi-display support at up to 4K resolution.[9][27][1]
Second generation (GCN 2.0)

The second generation of Graphics Core Next (GCN 2.0), known as the Sea Islands architecture, was introduced in 2013 with the launch of the AMD Radeon R9 200 series graphics cards.[28] This generation built upon the foundational GCN design with optimizations for compute workloads, including an expanded complement of Asynchronous Compute Engines (ACEs), each managing up to eight independent compute queues for concurrent graphics and compute operations.[29] These enhancements allowed more efficient multi-tasking, with support for additional instructions such as 64-bit floating-point operations (e.g., V_ADD_F64 and V_MUL_F64) and improved memory addressing via unified system and device address spaces.[29]

Key discrete GPU implementations included the high-end Hawaii chip in the Radeon R9 290X, featuring 44 compute units (2,816 stream processors), peak single-precision compute performance of up to 5.6 TFLOPS at a 1 GHz engine clock, and fabrication on a 28 nm process node.[28] Mid-range offerings utilized the Bonaire GPU, as seen in the Radeon R7 260X, while low-end models such as the Radeon R7 240 employed the Oland chip, all leveraging the 28 nm process for improved power efficiency over prior generations through refined power gating and clock management.[29] Additionally, Sea Islands introduced the Video Coding Engine (VCE) 2.0 hardware for H.264 encoding, supporting features such as B-frames and YUV 4:4:4 intra-frame encoding to accelerate video compression tasks.[30]

Integrated graphics in APUs previewed Heterogeneous System Architecture (HSA) capabilities, with the Kaveri family (launched in early 2014) incorporating up to eight GCN 2.0 compute units alongside Steamroller CPU cores for unified memory access and seamless CPU-GPU task offloading.[31] This generation also added support for DirectX 11.2 and OpenCL 2.0, enabling broader compatibility with emerging compute standards while maintaining a 1:8 ratio of double- to single-precision floating-point performance on consumer parts.[28]
Third generation (GCN 3.0)

The third generation of Graphics Core Next (GCN 3.0), codenamed Volcanic Islands, was introduced in 2014 with the Tonga-based Radeon R9 285 and expanded in 2015 across AMD's Radeon R9 300 series and Fury lineup, bringing refinements aimed at improving efficiency and scaling from mid-range to high-end applications.[32] This iteration built on prior generations by enhancing arithmetic precision and resource management, with improved fused multiply-add (FMA) handling for FP32 computations to boost floating-point throughput without intermediate rounding errors.[3] It also introduced Delta Color Compression (DCC), the lossless color-buffer compression described above, reducing memory-bandwidth demands in rasterization workloads.

Prominent implementations included the Tonga GPU, used in cards such as the Radeon R9 285, fabricated on a 28 nm process with 32 compute units for mid-range performance, and the flagship Fiji GPU in the Radeon R9 Fury X, featuring 64 compute units, 8.6 TFLOPS of single-precision compute performance, 4 GB of HBM1 memory, and a 275 W TDP.[34] The Fiji variant, also on 28 nm, emphasized high-bandwidth memory integration for reduced latency in demanding scenarios, while the series as a whole supported partial H.265 (HEVC) video-decode acceleration, improving handling of 4K content through enhanced format conversions and buffer operations.[3] These chips delivered notable efficiency improvements, with power-optimized designs allowing sustained performance in 4K gaming environments.[32]

GCN 3.0 also extended to accelerated processing units (APUs), notably the Carrizo family, where up to eight compute units provided capable integrated graphics alongside Excavator CPU cores on a 28 nm process, supporting DirectX 12 and heterogeneous computing for mainstream laptops.[35] The Fury X's liquid-cooled thermal solution further exemplified these refinements, maintaining lower temperatures under load than air-cooled predecessors, which aided stable clock speeds and reduced throttling during extended sessions. Overall, these advancements balanced compute density with power efficiency, enabling broader adoption in gaming and multimedia without a process-node shrink.[2]
Fourth generation (GCN 4.0)

The fourth generation of the Graphics Core Next (GCN 4.0) architecture, codenamed Polaris, was introduced in 2016 with the Radeon RX 400 series graphics cards, emphasizing substantial improvements in power efficiency and mainstream performance. Fabricated on a 14 nm FinFET process by GlobalFoundries, Polaris delivered up to 2.5 times the performance per watt of the previous generation, enabling better thermal management and lower power consumption for gaming and compute tasks. Key enhancements included refined clock gating, improved branch handling in the compute units, and support for DirectX 12, Vulkan, and asynchronous shaders, alongside FreeSync for adaptive-sync displays and HDR10 output. The architecture maintained a 1:16 FP64-to-FP32 ratio for consumer products and added full hardware HEVC (H.265) decode and encode up to 4K resolution through its updated UVD and VCE video blocks.[36][37][38]

Prominent discrete implementations featured the Polaris 10 GPU in the Radeon RX 480, with 36 compute units (2,304 stream processors), up to 5.8 TFLOPS of single-precision performance at a 1,266 MHz boost clock, 8 GB of GDDR5 memory on a 256-bit bus delivering 256 GB/s of bandwidth, and a 150 W TDP. Higher-clocked variants such as the RX 580 (the Polaris 20 refresh) achieved 6.17 TFLOPS at a 1,340 MHz boost with similar memory configurations, targeting 1080p gaming. The mid-range RX 470 used a cut-down Polaris 10 with 32 CUs (2,048 SPs) and around 4.9 TFLOPS, while entry-level Polaris 11 powered the RX 460 with 14 CUs and about 2.2 TFLOPS; the smaller Polaris 12 later served cards such as the RX 550. All supported PCIe 3.0 and multi-monitor setups of up to five displays. The RX 500 series of 2017 refreshed these designs with higher clocks for modest performance uplifts.[39][40]

Unlike earlier generations, GCN 4.0 saw no APU implementation: the contemporaneous Bristol Ridge family (launched mid-2016) paired Excavator CPU cores with up to eight third-generation GCN compute units on 28 nm, enabling 1080p gaming without a discrete GPU and HSA-compliant task sharing. Polaris itself was positioned as a cost-effective solution for VR-ready computing and 4K video playback, bridging the gap to higher-end architectures.[41]
Fifth generation (GCN 5.0)

The fifth generation of the Graphics Core Next (GCN 5.0) architecture, codenamed Vega, was introduced by AMD in 2017, debuting with the consumer-oriented Radeon RX Vega series, professional-grade Vega 20 GPUs on 7 nm, and integrated variants in Ryzen APUs. This generation focused on high-bandwidth memory integration, greater compute density for AI and HPC, and compatibility with the Heterogeneous System Architecture (HSA), while supporting DirectX 12 and emerging machine-learning workloads. Implementations spanned 14 nm and 7 nm processes, with FP64 ratios varying from 1:16 for consumer products up to 1:2 for professional accelerators.[42][43][44]

The flagship consumer model, the Radeon RX Vega 64 based on Vega 10 (14 nm FinFET), featured 64 compute units and 4,096 stream processors, delivering peak single-precision performance of about 12.7 TFLOPS at a 1,546 MHz boost clock within a 295 W TDP for the air-cooled variant (the liquid-cooled edition reached 13.7 TFLOPS at 1,677 MHz). It utilized 8 GB of High Bandwidth Memory 2 (HBM2) on a 2,048-bit interface for up to 484 GB/s of bandwidth, addressing data bottlenecks in 1440p and 4K gaming. Innovations such as enhanced Delta Color Compression reduced render-target bandwidth by exploiting pixel coherence, while Rapid Packed Math doubled FP16 and 16-bit integer throughput—roughly 25.3 TFLOPS FP16 at the air-cooled clocks—aiding half-precision tasks without dedicated tensor cores (see the sketch at the end of this section). Vega excelled in bandwidth-limited scenarios but faced thermal challenges under sustained loads.[45][46]

Professional extensions included the 7 nm Vega 20 in the Radeon Instinct MI50 (November 2018), with 60 CUs (3,840 stream processors), 13.3 TFLOPS FP32 and 6.7 TFLOPS FP64 at a 1,725 MHz peak clock, 16 or 32 GB of HBM2 on a 4,096-bit interface (1 TB/s bandwidth), and a 300 W TDP. The full-die MI60 variant enabled all 64 CUs for 14.7 TFLOPS FP32 and 7.4 TFLOPS FP64, optimized for datacenter simulations and machine learning with a 1:2 FP64:FP32 ratio. Updated video blocks enabled full HEVC/H.265 4K@60 fps encode and decode with 10-bit support, while the High Bandwidth Cache Controller (HBCC) extended virtual addressing to 49 bits, giving access to up to 512 TiB for large datasets.[47]

Integrated graphics in Ryzen APUs, such as Raven Ridge (2018, 14 nm) with Radeon Vega 8–11 graphics (8–11 CUs, up to roughly 1.8 TFLOPS FP32 at 1,250 MHz, sharing DDR4 system memory), and the 12 nm Picasso refresh (2019), provided entry-discrete-class performance for mainstream tasks. These solutions highlighted GCN 5.0's versatility in heterogeneous computing, paving the way for later architecture transitions while ensuring backward compatibility.[48]
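The packed-math idea can be shown with plain integers (a toy model, not GCN code): two independent 16-bit lanes ride through one 32-bit operation, which is how packed instructions such as v_pk_add_u16 let a single 32-bit ALU lane retire two 16-bit results per clock:

```cpp
// Toy model of Vega's Rapid Packed Math: two independent 16-bit integer
// additions performed in one 32-bit operation, with masking so a carry
// out of the low lane cannot corrupt the high lane. The hardware does
// this natively; this C++ only illustrates the lane arithmetic.
#include <cstdint>
#include <cstdio>

uint32_t pk_add_u16(uint32_t a, uint32_t b) {
    uint32_t lo = (a & 0xFFFFu) + (b & 0xFFFFu);          // low 16-bit lane
    uint32_t hi = (a & 0xFFFF0000u) + (b & 0xFFFF0000u);  // high 16-bit lane
    return (hi & 0xFFFF0000u) | (lo & 0xFFFFu);           // drop inter-lane carries
}

int main() {
    uint32_t a = (7u << 16) | 40000u;     // lanes: hi=7,  lo=40000
    uint32_t b = (3u << 16) | 30000u;     // lanes: hi=3,  lo=30000
    uint32_t r = pk_add_u16(a, b);        // lanes: hi=10, lo=4464 (70000 mod 65536)
    printf("hi=%u lo=%u\n", r >> 16, r & 0xFFFFu);
}
```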
Performance and Implementations

Chip implementations across generations
The Graphics Core Next (GCN) architecture powered a wide array of AMD GPU implementations from 2012 to 2021, encompassing discrete graphics cards, integrated graphics processing units (iGPUs) in accelerated processing units (APUs), and professional-grade accelerators. These chips were fabricated primarily on TSMC and GlobalFoundries process nodes ranging from 28 nm to 7 nm, with memory configurations evolving from GDDR5 to high-bandwidth memory (HBM) and HBM2 for enhanced performance in compute-intensive applications. Over 50 distinct chip variants were released, reflecting AMD's strategy to scale GCN across consumer, mobile, and enterprise segments.[49]

Discrete GPUs
Discrete GCN implementations targeted gaming and high-performance computing, featuring large die sizes to accommodate numerous compute units (CUs). Key examples include the first-generation Tahiti die, used in the Radeon HD 7970 series, which was built on a 28 nm process node, measured 352 mm², and contained 4.31 billion transistors while supporting GDDR5 memory.[50] In the third generation, the Fiji die, employed in the Radeon R9 Fury series, represented a significant scale-up on the same 28 nm node, with a 596 mm² die and 8.9 billion transistors, paired with 4 GB of HBM for superior bandwidth. The fifth-generation Vega 10, found in the Radeon RX Vega 64, shifted to a 14 nm GlobalFoundries process, achieving a 486 mm² die with 12.5 billion transistors and up to 8 GB of HBM2 memory for greater compute throughput.[51] Other notable discrete dies spanned the intervening generations, such as Bonaire (GCN 2.0) and Polaris 10 (GCN 4.0, a 230 mm² die on 14 nm with GDDR5).[52]

| Generation | Key Die | Process Node | Die Size (mm²) | Transistors (Billions) | Memory Type |
|---|---|---|---|---|---|
| GCN 1.0 | Tahiti | 28 nm | 352 | 4.31 | GDDR5 |
| GCN 3.0 | Fiji | 28 nm | 596 | 8.9 | HBM |
| GCN 5.0 | Vega 10 | 14 nm | 486 | 12.5 | HBM2 |
Integrated APUs
GCN iGPUs were embedded in AMD's A-Series, Ryzen, and other APUs to enable heterogeneous computing on mainstream platforms, typically with fewer CUs than their discrete counterparts for power efficiency. Early low-power examples include the Kabini APUs (e.g., the A4-5000, 2013), integrating two GCN compute units on a 28 nm process with shared DDR3 memory.[53] On the desktop, the Kaveri APUs, such as the A10-7850K (2014), featured an 8-CU Radeon R7 iGPU on a 28 nm process, supporting DDR3 at up to 2133 MHz for improved graphics performance in compact systems. By the fifth generation, Raven Ridge APUs like the Ryzen 5 2400G (2018) incorporated up to 11 CUs in a Vega-based iGPU on a 14 nm process, utilizing dual-channel DDR4 memory to deliver entry-level discrete-class graphics for gaming and content creation. These integrated solutions prioritized shared memory access over dedicated VRAM, enabling seamless CPU-GPU collaboration.[54]

Professional GPUs
AMD extended GCN to workstation and data center markets through the FirePro and Instinct lines, optimizing for stability and parallel processing. The FirePro W9000, based on the GCN 1.0 Tahiti die, offered 6 GB of GDDR5 on a 28 nm process for CAD and visualization tasks, delivering up to 3.9 TFLOPS of single-precision compute.[55] Later, the Instinct MI series leveraged GCN 5.0, with the MI25 using a Vega 10 die (16 GB HBM2, 14 nm) for deep-learning acceleration, and the MI50 employing Vega 20 (32 GB HBM2, 7 nm) to support high-performance computing clusters.[47] These professional variants emphasized ECC memory support and multi-GPU scaling, distinct from consumer-focused discrete cards.[56]

Comparison of key specifications
The key specifications of Graphics Core Next (GCN) architectures evolved across generations, with progressive advances in compute density, memory subsystems, and power efficiency driven by process-node shrinks and architectural refinements. The flagship implementations below, selected as representative high-end parts in consumer or compute roles, demonstrate these trends through increased compute units (CUs), higher floating-point throughput, and greater memory bandwidth, while maintaining compatibility with the unified GCN instruction set.[27][57][34][58][59]

| Generation | Flagship Product | CUs | FP32 TFLOPS | FP64 TFLOPS | Memory Bandwidth (GB/s) | Process Node | TDP (W) |
|---|---|---|---|---|---|---|---|
| GCN 1.0 | Radeon HD 7970 | 32 | 3.79 | 0.95 (1:4 ratio) | 264 | 28 nm | 250 |
| GCN 2.0 | Radeon R9 290X | 44 | 5.63 | 0.70 (1:8 ratio) | 320 | 28 nm | 290 |
| GCN 3.0 | Radeon R9 Fury X | 64 | 8.60 | 0.54 (1:16 ratio) | 512 | 28 nm | 275 |
| GCN 4.0 | Radeon RX 480 | 36 | 5.83 | 0.36 (1:16 ratio) | 256 | 14 nm | 150 |
| GCN 5.0 | Radeon Instinct MI25 | 64 | 12.3 | 0.77 (1:16 ratio) | 484 | 14 nm | 300 |
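The FP32 column follows directly from each part's shader count and clock: peak FLOPS = stream processors × 2 (one fused multiply-add per cycle) × clock frequency. A short check of the table values, using each part's reference boost or engine clock:

```cpp
// Sanity check of the FP32 column above:
// TFLOPS = stream processors x 2 ops/clock (FMA) x clock in GHz / 1000.
#include <cstdio>

double tflops(int stream_processors, double clock_ghz) {
    return stream_processors * 2 * clock_ghz / 1000.0;  // GFLOPS -> TFLOPS
}

int main() {
    printf("HD 7970 : %.2f TFLOPS\n", tflops(2048, 0.925));  // ~3.79
    printf("R9 290X : %.2f TFLOPS\n", tflops(2816, 1.000));  // ~5.63
    printf("Fury X  : %.2f TFLOPS\n", tflops(4096, 1.050));  // ~8.60
    printf("RX 480  : %.2f TFLOPS\n", tflops(2304, 1.266));  // ~5.83
    printf("MI25    : %.2f TFLOPS\n", tflops(4096, 1.500));  // ~12.3
}
```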
