Graphics Core Next
from Wikipedia

Graphics Core Next (GCN)[1] is the codename for a series of microarchitectures and an instruction set architecture that were developed by AMD for its GPUs as the successor to its TeraScale microarchitecture. The first product featuring GCN was launched on January 9, 2012.[2]

GCN is a reduced instruction set SIMD microarchitecture, contrasting with the very long instruction word SIMD architecture of TeraScale.[3] GCN requires considerably more transistors than TeraScale, but offers advantages for general-purpose GPU (GPGPU) computation because the compiler can be simpler.

GCN graphics chips were fabricated with CMOS at 28 nm, and with FinFET at 14 nm (by Samsung Electronics and GlobalFoundries) and 7 nm (by TSMC), available on selected models in AMD's Radeon HD 7000, HD 8000, 200, 300, 400, 500 and Vega series of graphics cards, including the separately released Radeon VII. GCN was also used in the graphics portion of Accelerated Processing Units (APUs), including those in the PlayStation 4 and Xbox One.

GCN was succeeded by the RDNA microarchitecture and instruction set architecture in 2019.

Instruction set

The GCN instruction set is owned by AMD and was developed specifically for GPUs. It has no micro-operation for division.

AMD has published instruction set architecture documentation for each generation of GCN.

An LLVM compiler back end is available for the GCN instruction set.[5] It is used by Mesa 3D.

The GNU Compiler Collection (GCC) has supported GCN 3 and GCN 5 since 2019 (GCC 9)[6] for single-threaded, stand-alone programs, with GCC 10 adding offloading via OpenMP and OpenACC.[7]

MIAOW is an open-source RTL implementation of the AMD Southern Islands GPGPU microarchitecture.

In November 2015, AMD announced its Boltzmann Initiative, which aims to enable the porting of CUDA-based applications to a common C++ programming model.[8]

At the Supercomputing 15 (SC15) event, AMD demonstrated a Heterogeneous Compute Compiler (HCC); a headless Linux driver and HSA runtime infrastructure for cluster-class, high-performance computing; and a Heterogeneous-compute Interface for Portability (HIP) tool for porting CUDA applications to the aforementioned common C++ model.

Microarchitectures

As of July 2017, the Graphics Core Next instruction set has seen five iterations. The differences between the first four generations are rather minimal, but the fifth-generation GCN architecture features heavily modified stream processors to improve performance and support the simultaneous processing of two lower-precision numbers in place of a single higher-precision number.[9]

Command processing

GCN command processing: each Asynchronous Compute Engine (ACE) can parse incoming commands and dispatch work to the Compute Units (CUs). Each ACE can manage up to 8 independent queues. The ACEs can operate in parallel with the graphics command processor and two DMA engines. The graphics command processor handles graphics queues, the ACEs handle compute queues, and the DMA engines handle copy queues. Each queue can dispatch work items without waiting for other tasks to complete, allowing independent command streams to be interleaved on the GPU's shaders.
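
As a rough illustration of the queue model described above, the following Python sketch shows independent queues handing work to the CUs without waiting on one another. The class names, the engine counts, and the first-non-empty selection policy are illustrative assumptions, not AMD's actual arbitration logic.

    # Toy model of GCN command dispatch: each ACE manages up to 8 queues
    # and dispatches independently of the graphics command processor.
    from collections import deque

    class Engine:
        def __init__(self, name, max_queues):
            self.name = name
            self.queues = [deque() for _ in range(max_queues)]

        def submit(self, queue_index, command):
            self.queues[queue_index].append(command)

        def dispatch_one(self):
            # Pick the first non-empty queue; queues never wait on each other.
            for q in self.queues:
                if q:
                    return q.popleft()
            return None

    gcp = Engine("graphics", max_queues=1)                       # graphics queues
    aces = [Engine(f"ACE{i}", max_queues=8) for i in range(8)]   # assumed ACE count

    aces[0].submit(0, "compute kernel A")
    aces[1].submit(3, "compute kernel B")
    gcp.submit(0, "draw call 1")

    # One simulated cycle: every engine may hand work to the CUs in parallel.
    for engine in [gcp] + aces:
        cmd = engine.dispatch_one()
        if cmd:
            print(engine.name, "->", cmd)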

Graphics Command Processor

The Graphics Command Processor (GCP) is a functional unit of the GCN microarchitecture. Among other tasks, it is responsible for the handling of asynchronous shaders.[10]

Asynchronous Compute Engine

The Asynchronous Compute Engine (ACE) is a distinct functional block serving compute workloads. Its role is analogous to that of the Graphics Command Processor, but for the compute queues rather than the graphics queues.

Schedulers

Since the third iteration of GCN, the hardware contains two schedulers: one to schedule "wavefronts" during shader execution (the CU Scheduler, or Compute Unit Scheduler) and the other to schedule the execution of draw and compute queues. The latter helps performance by executing compute operations when the compute units (CUs) are underutilized because graphics commands are limited by fixed-function pipeline speed or bandwidth. This functionality is known as Async Compute.

For a given shader, the GPU drivers may also schedule instructions on the CPU to minimize latency.

Geometric processor


The geometry processor contains a Geometry Assembler, a Tessellator, and a Vertex Assembler.

The Tessellator is capable of doing tessellation in hardware as defined by Direct3D 11 and OpenGL 4.5 (see AMD January 21, 2017),[11] and succeeded ATI TruForm and the hardware tessellation of TeraScale as AMD's then-latest semiconductor intellectual property core for tessellation.

Compute units

One compute unit (CU) combines 64 shader processors with 4 texture mapping units (TMUs).[12][13] The compute units are separate from, but feed into, the render output units (ROPs).[13] Each compute unit consists of the following:

  • a CU scheduler
  • a Branch & Message Unit
  • 4 16-lane-wide SIMD Vector Units (SIMD-VUs)
  • 4 64 KiB vector general-purpose register (VGPR) files
  • 1 scalar unit (SU)
  • an 8 KiB scalar GPR file[14]
  • a local data share of 64 KiB
  • 4 Texture Filter Units
  • 16 Texture Fetch Load/Store Units
  • a 16 KiB level 1 (L1) cache

Four compute units are wired to share a 32 KiB L1 instruction cache and a 16 KiB scalar L1 data cache, both of which are read-only. A SIMD-VU operates on 16 elements at a time (per cycle), while a SU can operate on one element at a time (one per cycle). In addition, the SU handles some other operations, such as branching.[15]

Every SIMD-VU has some private memory where it stores its registers. There are two types of registers: scalar registers (S0, S1, etc.), which each hold a 4-byte number, and vector registers (V0, V1, etc.), which each represent a set of 64 4-byte numbers. On the vector registers, every operation is done in parallel on the 64 numbers, which correspond to 64 inputs. For example, a wavefront may work on 64 different pixels at a time; for each of them the inputs are slightly different, and thus a slightly different color results.
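
A small Python model of these register semantics, purely illustrative (real hardware executes the 64 lanes in parallel; the loops below only simulate that):

    WAVEFRONT_SIZE = 64

    s0 = 7                                  # scalar register: one 4-byte value
    v0 = list(range(WAVEFRONT_SIZE))        # vector register: 64 4-byte values
    v1 = [x * 2 for x in v0]

    # A vector add applies to all 64 lanes "at once"; one result per input.
    v2 = [a + b for a, b in zip(v0, v1)]

    # A scalar operand is broadcast across every lane.
    v3 = [a + s0 for a in v0]
    print(v2[:4], v3[:4])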

Every SIMD-VU has room for 512 scalar registers and 256 vector registers.

AMD has claimed that each GCN compute unit (CU) has 64 KiB Local Data Share (LDS).[16]

CU scheduler

The CU scheduler is the hardware functional block that chooses which wavefronts the SIMD-VUs execute. It picks one SIMD-VU per cycle for scheduling. This is not to be confused with other hardware or software schedulers.

Wavefront

A shader is a small program written in GLSL that performs graphics processing, and a kernel is a small program written in OpenCL that performs GPGPU processing. These programs don't need that many registers, but they do need to load data from system or graphics memory, an operation that comes with significant latency. AMD and Nvidia chose similar approaches to hide this unavoidable latency: the grouping of multiple threads. AMD calls such a group a "wavefront", whereas Nvidia calls it a "warp". A group of threads is the most basic unit of scheduling of GPUs that implement this approach. It is the minimum size of the data processed in SIMD fashion, the smallest executable unit of code, and the way to process a single instruction over all of the threads in it at the same time.

In all GCN GPUs, a "wavefront" consists of 64 threads, and in all Nvidia GPUs, a "warp" consists of 32 threads.

AMD's solution is to assign multiple wavefronts to each SIMD-VU. The hardware distributes the registers to the different wavefronts, and when one wavefront is waiting for a result from memory, the CU Scheduler assigns the SIMD-VU another wavefront. Wavefronts are assigned per SIMD-VU; SIMD-VUs do not exchange wavefronts. A maximum of 10 wavefronts can be assigned per SIMD-VU (thus 40 per CU).

AMD CodeXL shows tables relating the number of SGPRs and VGPRs used to the achievable number of wavefronts: essentially, between 104 and 512 SGPRs and up to 256 VGPRs are available per wavefront, depending on how many wavefronts are resident.
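
Under these limits, register pressure bounds how many wavefronts can be resident. A minimal Python sketch of that arithmetic follows; the flat 512-SGPR budget and plain integer division are simplifying assumptions (real allocation has granularity rules):

    VGPRS_PER_SIMD = 256
    SGPRS_PER_SIMD = 512          # assumed budget; actual limits vary by generation
    MAX_WAVES_PER_SIMD = 10

    def waves_per_simd(vgprs_per_wave, sgprs_per_wave):
        by_vgpr = VGPRS_PER_SIMD // max(vgprs_per_wave, 1)
        by_sgpr = SGPRS_PER_SIMD // max(sgprs_per_wave, 1)
        return min(MAX_WAVES_PER_SIMD, by_vgpr, by_sgpr)

    # A kernel using 32 VGPRs and 40 SGPRs per wavefront:
    print(waves_per_simd(32, 40))   # -> 8 per SIMD-VU, i.e. 32 per CU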

Note that in connection with CPU SIMD instruction sets such as SSE, this concept of the most basic level of parallelism is often called the "vector width". The vector width is characterized by the total number of bits in it.

SIMD Vector Unit

Each SIMD Vector Unit has:

  • a 16-lane integer and floating point vector Arithmetic Logic Unit (ALU)
  • 64 KiB Vector General Purpose Register (VGPR) file
  • 10× 48-bit Program Counters
  • Instruction buffer for 10 wavefronts (each wavefront is a group of 64 threads, or the size of one logical VGPR)
  • A 64-thread wavefront issues to a 16-lane SIMD Unit over four cycles

Each SIMD-VU has 10 wavefront instruction buffers, and it takes 4 cycles to execute one instruction across a whole wavefront.
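
The four-cycle figure follows directly from the widths involved, as this small Python check illustrates:

    WAVEFRONT_SIZE = 64
    SIMD_LANES = 16

    cycles = WAVEFRONT_SIZE // SIMD_LANES            # -> 4
    for cycle in range(cycles):
        first = cycle * SIMD_LANES
        print(f"cycle {cycle}: threads {first}-{first + SIMD_LANES - 1}")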

Audio and video acceleration blocks

Many implementations of GCN are accompanied by several of AMD's other ASIC blocks, including but not limited to the Unified Video Decoder, Video Coding Engine, and AMD TrueAudio.

Video Coding Engine

The Video Coding Engine is a video encoding ASIC, first introduced with the Radeon HD 7000 series.[17]

The initial version of the VCE added support for encoding H.264 I- and P-frames in the YUV420 pixel format, along with SVC temporal encode and a Display Encode Mode, while the second version added B-frame support for YUV420 and support for YUV444 I-frames.

VCE 3.0 formed a part of the third generation of GCN, adding high-quality video scaling and the HEVC (H.265) codec.

VCE 4.0 was part of the Vega architecture, and was subsequently succeeded by Video Core Next.

TrueAudio

Unified virtual memory

In a preview in 2011, AnandTech wrote about the unified virtual memory support planned for Graphics Core Next.[18]

Heterogeneous System Architecture (HSA)

GCN includes special-purpose function blocks to be used by HSA. Support for these function blocks is available through amdkfd since Linux kernel 3.19.[20]

Some of the specific HSA features implemented in the hardware need support from the operating system's kernel (its subsystems) and/or from specific device drivers. For example, in July 2014, AMD published a set of 83 patches to be merged into Linux kernel mainline 3.17 for supporting their Graphics Core Next-based Radeon graphics cards. The so-called HSA kernel driver resides in the directory /drivers/gpu/hsa, while the DRM graphics device drivers reside in /drivers/gpu/drm[21] and augment the already existing DRM drivers for Radeon cards.[22] This very first implementation focuses on a single "Kaveri" APU and works alongside the existing Radeon kernel graphics driver (kgd).

Lossless Delta Color Compression

Hardware schedulers

Hardware schedulers are used to perform scheduling[23] and to offload the assignment of compute queues to the ACEs from the driver to the hardware. To do this, they buffer the queues until there is at least one empty queue in at least one ACE; the hardware scheduler (HWS) then immediately assigns buffered queues to the ACEs until all queues are full or there are no more queues to safely assign.[24]

Part of the scheduling work performed includes prioritized queues, which allow critical tasks to run at a higher priority than other tasks without requiring the lower-priority tasks to be preempted. The high-priority tasks are scheduled to hog the GPU as much as possible, while the lower-priority tasks run concurrently using whatever resources the high-priority tasks leave free.[23] These hardware schedulers are essentially Asynchronous Compute Engines that lack dispatch controllers.[23] They were first introduced in the fourth-generation GCN microarchitecture,[23] but were already present in the third-generation GCN microarchitecture for internal testing purposes.[25] A driver update has enabled the hardware schedulers in third-generation GCN parts for production use.[23]
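
The concurrency idea can be sketched in Python as follows; the CU bookkeeping and the heap-based priority queue are illustrative stand-ins for the HWS logic, not its actual implementation:

    # High-priority work is dispatched first whenever resources free up,
    # but running low-priority tasks are never preempted in this model.
    import heapq

    TOTAL_CUS = 64
    free_cus = TOTAL_CUS
    pending = []   # (priority, name, cus_needed); lower number = higher priority

    def enqueue(priority, name, cus_needed):
        heapq.heappush(pending, (priority, name, cus_needed))

    def dispatch():
        global free_cus
        while pending and pending[0][2] <= free_cus:
            prio, name, need = heapq.heappop(pending)
            free_cus -= need
            print(f"running {name} (priority {prio}) on {need} CUs")

    enqueue(1, "latency-critical shader", 16)  # hogs what it needs, first
    enqueue(5, "background compute", 40)       # runs concurrently on leftover CUs
    dispatch()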

Primitive Discard Accelerator

This unit discards degenerate triangles before they enter the vertex shader and triangles that do not cover any fragments before they enter the fragment shader.[26] This unit was introduced with the fourth generation GCN microarchitecture.[26]

Generations

Graphics Core Next 1

AMD Graphics Core Next 1
Release date: January 2012
Predecessor: TeraScale 3
Successor: Graphics Core Next 2
Support status: Unsupported since mid-2022 (final Windows driver version 22.6.1 for Windows 7 and 10)

The GCN 1 microarchitecture was used in several Radeon HD 7000 series graphics cards.

Die shot of the Tahiti GPU used in Radeon HD 7950 GHz Edition graphics cards

There are Asynchronous Compute Engines controlling computation and dispatching.[15][30]

ZeroCore Power

ZeroCore Power is a long-idle power-saving technology that shuts off functional units of the GPU when they are not in use.[31] AMD ZeroCore Power technology supplements AMD PowerTune.

Chips

Discrete GPUs (Southern Islands family):

  • Hainan
  • Oland
  • Cape Verde
  • Pitcairn
  • Tahiti

Graphics Core Next 2

AMD Graphics Core Next 2
Release date: September 2013
Predecessor: Graphics Core Next 1
Successor: Graphics Core Next 3
Support status: Unsupported since mid-2022 (final Windows driver version 22.6.1 for Windows 7 and 10)

AMD PowerTune "Bonaire"
Die shot of the Hawaii GPU used in Radeon R9 290 graphics cards

The 2nd generation of GCN was introduced with the Radeon HD 7790 and is also found in the Radeon HD 8770, R7 260/260X, R9 290/290X, R9 295X2, R7 360, and R9 390/390X, as well as in Steamroller-based desktop and mobile "Kaveri" APUs and in the Puma-based "Beema" and "Mullins" APUs. It has multiple advantages over the original GCN, including FreeSync support, AMD TrueAudio, and a revised version of AMD PowerTune technology.

The 2nd generation of GCN introduced an entity called the "Shader Engine" (SE). A Shader Engine comprises one geometry processor, up to 44 CUs (on the Hawaii chip), rasterizers, ROPs, and L1 cache. Not part of a Shader Engine are the Graphics Command Processor, the 8 ACEs, the L2 cache and memory controllers, the audio and video accelerators, the display controllers, the 2 DMA controllers, and the PCIe interface.

The A10-7850K "Kaveri" contains 8 CUs (compute units) and 8 Asynchronous Compute Engines for independent scheduling and work item dispatching.[32]

At the AMD Developer Summit (APU) in November 2013, Michael Mantor presented the Radeon R9 290X.[33]

Chips

Discrete GPUs (Sea Islands family):

  • Bonaire
  • Hawaii

integrated into APUs:

  • Temash
  • Kabini
  • Liverpool (i.e. the APU found in the PlayStation 4)
  • Durango (i.e. the APU found in the Xbox One and Xbox One S)
  • Kaveri
  • Godavari
  • Mullins
  • Beema
  • Carrizo-L

Graphics Core Next 3

AMD Graphics Core Next 3
Release date: June 2015
Predecessor: Graphics Core Next 2
Successor: Graphics Core Next 4
Support status: Supported, with less regular Windows driver update schedule

Die shot of the Fiji GPU used in Radeon R9 Nano graphics cards

GCN 3rd generation[34] was introduced in 2014 with the Radeon R9 285 and R9 M295X, which have the "Tonga" GPU. It features improved tessellation performance, lossless delta color compression to reduce memory bandwidth usage, an updated and more efficient instruction set, a new high-quality scaler for video, HEVC encoding (VCE 3.0) and HEVC decoding (UVD 6.0), and a new multimedia engine (video encoder/decoder). Delta color compression is supported in Mesa.[35] However, its double-precision performance is worse than that of the previous generation.[36]

Chips

discrete GPUs:

  • Tonga (Volcanic Islands family), comes with UVD 5.0 (Unified Video Decoder)
  • Fiji (Pirate Islands family), comes with UVD 6.0 and High Bandwidth Memory (HBM 1)

integrated into APUs:

  • Carrizo, comes with UVD 6.0
  • Bristol Ridge[37]
  • Stoney Ridge[37]

Graphics Core Next 4

AMD Graphics Core Next 4
Release date: June 2016
Predecessor: Graphics Core Next 3
Successor: Graphics Core Next 5
Support status: Supported, with less regular Windows driver update schedule

Die shot of the Polaris 11 GPU used in Radeon RX 460 graphics cards
Die shot of the Polaris 10 GPU used in Radeon RX 470 graphics cards

GPUs of the Arctic Islands family were introduced in Q2 2016 with the AMD Radeon 400 series. The 3D engine (i.e., the GCA (Graphics and Compute Array), or GFX) is identical to that found in the Tonga chips,[38] but Polaris features a newer Display Controller engine and UVD version 6.3, among other updates.

All Polaris-based chips other than Polaris 30 are produced on the 14 nm FinFET process, developed by Samsung Electronics and licensed to GlobalFoundries.[39] The slightly newer, refreshed Polaris 30 is built on the 12 nm LP FinFET process node, developed by Samsung and GlobalFoundries. The fourth-generation GCN instruction set architecture is compatible with the third generation; it is an optimization for the 14 nm FinFET process, enabling higher GPU clock speeds than the 3rd GCN generation.[40] Architectural improvements include new hardware schedulers, a new primitive discard accelerator, a new display controller, and an updated UVD that can decode HEVC at 4K resolution at 60 frames per second with 10 bits per color channel.

Chips

discrete GPUs:[41]

  • Polaris 10 (also codenamed Ellesmere) found on "Radeon RX 470" and "Radeon RX 480"-branded graphics cards
  • Polaris 11 (also codenamed Baffin) found on "Radeon RX 460"-branded graphics cards (also Radeon RX 560D)
  • Polaris 12 (also codenamed Lexa) found on "Radeon RX 550" and "Radeon RX 540"-branded graphics cards
  • Polaris 20, which is a refreshed (14 nm LPP Samsung/GloFo FinFET process) Polaris 10 with higher clocks, used for "Radeon RX 570" and "Radeon RX 580"-branded graphics cards[42]
  • Polaris 21, which is a refreshed (14 nm LPP Samsung/GloFo FinFET process) Polaris 11, used for "Radeon RX 560"-branded graphics cards
  • Polaris 22, found on "Radeon RX Vega M GH" and "Radeon RX Vega M GL"-branded graphics cards (as part of Kaby Lake-G)
  • Polaris 23, which is a refreshed (14 nm LPP Samsung/GloFo FinFET process) Polaris 12, used for "Radeon Pro WX 3200" and "Radeon RX 540X"-branded graphics cards (also Radeon RX 640)[43]
  • Polaris 30, which is a refreshed (12 nm LP GloFo FinFET process) Polaris 20 with higher clocks, used for "Radeon RX 590"-branded graphics cards[44]

In addition to dedicated GPUs, Polaris is used in the APUs of the PlayStation 4 Pro and Xbox One X, codenamed "Neo" and "Scorpio", respectively.

Precision performance

FP64 performance of all GCN 4th-generation GPUs is 1/16 of FP32 performance.

Graphics Core Next 5

AMD Graphics Core Next 5
Release date: June 2017
Predecessor: Graphics Core Next 4
Successor: CDNA 1, RDNA 1
Support status: Supported, with less regular Windows driver update schedule

Die shot of the Vega 10 GPU used in Radeon RX Vega 64 graphics cards

AMD began releasing details of its next generation of the GCN architecture, termed the "Next-Generation Compute Unit", in January 2017.[40][45][46] The new design was expected to increase instructions per clock and clock speeds, and to add support for HBM2 and a larger memory address space. The discrete graphics chipsets also include an "HBCC (High Bandwidth Cache Controller)", which is absent when the design is integrated into APUs.[47] Additionally, the new chips were expected to include improvements in the rasterisation and render output units. The stream processors are heavily modified from the previous generations to support packed math ("Rapid Packed Math") technology for 8-bit, 16-bit, and 32-bit numbers. This provides a significant performance advantage when lower precision is acceptable (for example, processing two half-precision numbers at the same rate as a single single-precision number).
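
A back-of-envelope illustration of the packed-math effect: two FP16 operations fit in the slot of one FP32 operation, doubling peak throughput when half precision suffices. The shader count and clock below are hypothetical figures, not a specific product.

    stream_processors = 4096
    clock_ghz = 1.5
    flops_per_clock = 2            # a fused multiply-add counts as two FLOPs

    fp32_tflops = stream_processors * flops_per_clock * clock_ghz / 1000
    fp16_tflops = fp32_tflops * 2  # two packed FP16 values per 32-bit lane
    print(f"FP32: {fp32_tflops:.1f} TFLOPS, FP16: {fp16_tflops:.1f} TFLOPS")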

Nvidia introduced tile-based rasterization and binning with Maxwell,[48] and this was a big reason for Maxwell's efficiency increase. In January 2017, AnandTech expected that Vega would finally catch up with Nvidia regarding energy-efficiency optimizations because of the new "DSBR (Draw Stream Binning Rasterizer)" to be introduced with Vega.[49]

Vega also added support for a new shader stage, Primitive Shaders.[50][51] Primitive shaders provide more flexible geometry processing and replace the vertex and geometry shaders in a rendering pipeline. As of December 2018, the primitive shaders could not be used because the required API changes had yet to be made.[52]

Vega 10 and Vega 12 use the 14 nm FinFET process, developed by Samsung Electronics and licensed to GlobalFoundries. Vega 20 uses the 7 nm FinFET process developed by TSMC.

Chips

discrete GPUs:

  • Vega 10 (14 nm Samsung/GloFo FinFET process) (also codenamed Greenland[53]) found on "Radeon RX Vega 64", "Radeon RX Vega 56", "Radeon Vega Frontier Edition", "Radeon Pro V340", Radeon Pro WX 9100, and Radeon Pro WX 8200 graphics cards[54]
  • Vega 12 (14 nm Samsung/GloFo FinFET process) found on "Radeon Pro Vega 20" and "Radeon Pro Vega 16"-branded mobile graphics cards[55]
  • Vega 20 (7 nm TSMC FinFET process) found on "Radeon Instinct MI50" and "Radeon Instinct MI60"-branded accelerator cards,[56] "Radeon Pro Vega II", and "Radeon VII"-branded graphics cards.[57]

integrated into APUs:

  • Raven Ridge[58] came with VCN 1, which supersedes VCE and UVD and allows full fixed-function VP9 decode.
  • Picasso
  • Renoir
  • Cezanne

Precision performance

Double-precision floating-point (FP64) performance of all GCN 5th-generation GPUs, except for Vega 20, is one-sixteenth of FP32 performance. For Vega 20 as Radeon Instinct it is one-half of FP32 performance; for Vega 20 as Radeon VII it is one-quarter.[59] All GCN 5th-generation GPUs support half-precision floating-point (FP16) calculations at twice the FP32 performance.
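
The stated ratios can be applied to a peak FP32 figure as a quick Python sketch; the FP32 number is a placeholder, and only the ratios come from the text above:

    ratios = {
        "GCN 5 (general)": 1 / 16,
        "Vega 20 (Radeon Instinct)": 1 / 2,
        "Vega 20 (Radeon VII)": 1 / 4,
    }
    fp32_tflops = 13.8   # assumed peak FP32 throughput, for illustration only
    for product, ratio in ratios.items():
        print(f"{product}: {fp32_tflops * ratio:.2f} TFLOPS FP64")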

Comparison of GCN GPUs

  • Table contains only discrete GPUs (including mobile). APU(IGP) and console SoCs are not listed.
Microarchitecture[60] GCN 1 | GCN 2 | GCN 3 | GCN 4 | GCN 5
Die Tahiti[61] | Pitcairn[62] | Cape Verde[63] | Oland[64] | Hainan[65] | Bonaire[66] | Hawaii[67] | Topaz[68] | Tonga[69] | Fiji[70] | Ellesmere[71] | Baffin[72] | Lexa[73] | Vega 10[74] | Vega 12[75] | Vega 20[76]
Code name¹ ? ? ? Tiran ? ? Ibiza Iceland ? ? Polaris 10 Polaris 11 Polaris 12 Greenland Treasure Refresh Moonshot
Variant(s) New Zealand, Malta, Wimbledon, Curaçao, Neptune, Trinidad, Chelsea, Heathrow, Venus, Tropo, Mars, Opal, Litho, Sun, Jet, Exo, Banks, Saturn, Tobago, Strato, Emerald, Vesuvius, Grenada, Meso, Weston, Polaris 24, Amethyst, Antigua, Capsaicin, Polaris 20, Polaris 30, Polaris 21, Polaris 23
Fab TSMC 28 nm | GlobalFoundries 14 nm / 12 nm (Polaris 30) | TSMC 7 nm
Die size (mm²) 352 / 365 (Malta) | 212 | 123 | 77 | 56 | 160 | 438 | 125 | 366 | 596 | 232 | 123 | 103 | 495 | Unknown | 331
Transistors (million) 4,313 | 2,800 | 1,500 | 950 | 690 | 2,080 | 6,200 | 1,550 | 5,000 | 8,900 | 5,700 | 3,000 | 2,200 | 12,500 | Unknown | 13,230
Transistor density (MTr/mm²) 12.3 / 12.8 (Malta) | 13.2 | 12.2 | 12.3 | 12.3 | 13.0 | 14.2 | 12.4 | 13.7 | 14.9 | 24.6 | 24.4 | 21.4 | 25.3 | Unknown | 40.0
Asynchronous compute engines 2 8 ? 8 4 ? 4
Geometry engines 2 1 2 ? 4 ? 4
Shader engines 4 ? 4 2
Hardware schedulers 2 ? 2
Compute units 32 | 20 | 10 / 8 (Chelsea) | 6 | 5 / 6 (Jet) | 14 | 44 | 6 | 32 | 64 | 36 | 16 | 10 | 64 | 20 | 64
Stream processors 2048 | 1280 | 640 / 512 (Chelsea) | 384 | 320 / 384 (Jet) | 896 | 2816 | 384 | 2048 | 4096 | 2304 | 1024 | 640 | 4096 | 1280 | 4096
Texture mapping units 128 | 80 | 40 / 32 (Chelsea) | 24 | 20 / 24 (Jet) | 56 | 176 | 24 | 128 | 256 | 144 | 64 | 40 | 256 | 80 | 256
Render output units 32 16 8 16 64 8 32 64 32 16 64 32 64
Z/Stencil OPS 128 64 16 64 256 16 128 256
L1 cache (KB) 16 per compute unit (CU)
L2 cache (KB) 768 512 256 128 / 256 (Jet) 256 1024 256 768 2048 1024 512 4096 1024 4096
Display Core Engine 6.0 6.4 8.2 8.5 10.0 11.2 12.0 12.1
Unified Video Decoder 3.2 4.0 4.2 5.0 6.0 6.3 7.0 7.2
Video Coding Engine 1.0 2.0 3.0 3.4 4.0 4.1
Launch² Dec 2011 | Mar 2012 | Feb 2012 | Jan 2013 | May 2015 | Mar 2013 | Oct 2013 | 2014 | Aug 2014 | Jun 2015 | Jun 2016 | Aug 2016 | Apr 2017 | Jun 2017 | Nov 2018 | Nov 2018
Series (Family) Southern Islands | Sea Islands | Volcanic Islands | Pirate Islands | Arctic Islands | Vega | Vega II
Notes mobile/OEM mobile/OEM mobile

¹ Old code names such as Treasure (Lexa) or Hawaii Refresh (Ellesmere) are not listed.
² Initial launch date. Launch dates of variant chips such as Polaris 20 (April 2017) are not listed.

from Grokipedia
Graphics Core Next (GCN) is a family of graphics processing unit (GPU) microarchitectures developed by Advanced Micro Devices (AMD), introduced in 2011 with the Radeon HD 7000 series (Southern Islands) graphics cards. It marked a significant redesign from AMD's prior TeraScale architectures, shifting from vector-oriented very long instruction word (VLIW) processing to a scalar, CPU-like instruction set to improve programmability and performance predictability for both graphics and general-purpose computing tasks. Key innovations include the introduction of dedicated Compute Units (CUs) as the core building blocks, support for unified virtual addressing, and coherent L1 and L2 caching to enable seamless data sharing between CPU and GPU in heterogeneous systems.

GCN's Compute Unit design features four 16-wide SIMD engines, delivering 64 stream processors per CU, along with a 64 KB local data share (LDS) for fast thread-local memory access and dedicated hardware for branch execution and scalar operations. Each CU includes 16 KB of read/write vector cache and supports up to 40 concurrent wavefronts (groups of 64 threads), optimizing for high-throughput parallel workloads while maintaining IEEE-754 compliance. The architecture also incorporates Asynchronous Compute Engines (ACEs), allowing independent execution of graphics and compute pipelines to boost overall system efficiency.

Over its lifespan, GCN evolved across five generations, starting with the first-generation implementation in 28 nm process technology for products like the Radeon HD 7970 (Tahiti GPU), and progressing to more efficient variants on later nodes, including integrated graphics in Ryzen APUs up to the Cezanne APU in 2021. Subsequent generations, such as GCN 2.0 (Sea Islands, 2013), GCN 3.0 (Volcanic Islands, 2015), GCN 4.0 (2016), and GCN 5.0 (2017), introduced refinements like improved power efficiency, higher clock speeds, and enhanced support for APIs including OpenCL 1.2, DirectCompute 11, and C++ AMP. This progression powered AMD's discrete GPUs through the Vega series (2017) and Radeon VII (2019) and served as the foundation for compute-focused derivatives like the CDNA architecture in Instinct accelerators. Driver support for GCN-based products ended in mid-2022, though the architecture's legacy persists in AMD's ecosystem for backward compatibility and specialized applications.

GCN's emphasis on compute density and bandwidth (exemplified by high L2 cache throughput in early implementations like Tahiti's 710 GB/s) excelled in benchmarks such as FFT operations, though it faced competition in pure graphics rasterization from rivals like NVIDIA's Kepler architecture. The memory subsystem features a unified L2 cache organized in 64-128 KB slices and 40-bit virtual addressing using 64 KB pages, facilitating integration with x86 CPUs for advanced uses such as GPU-accelerated scientific simulations. GCN was eventually succeeded by the RDNA architecture in 2019.

Overview

Development and history

AMD's development of the Graphics Core Next (GCN) architecture was rooted in its 2006 acquisition of ATI Technologies, which expanded its graphics expertise and spurred internal research and development into general-purpose computing on GPUs. The acquisition, completed on October 25, 2006, for approximately $5.4 billion, integrated ATI's graphics IP with AMD's CPU technology, enabling later advancements and setting the stage for a unified approach to CPU-GPU integration. Following this, AMD shifted its GPU design from the VLIW-based TeraScale architecture to a SIMD-based model with GCN, aiming to improve programmability, power efficiency, and performance consistency for both graphics and general-purpose compute workloads.

In 2011, AMD demonstrated its next-generation 28 nm graphics processor, previewing the GCN architecture as a successor to TeraScale intended to deliver enhanced compute performance and full API support. The architecture was formally detailed in December 2011, emphasizing its design for scalable compute capabilities in discrete GPUs and integrated solutions. Initial silicon for the Southern Islands family was taped out in 2011, with the first product, the Radeon HD 7970, announced on December 22, 2011, as AMD's flagship single-GPU card built on GCN 1.0.

GCN evolved through several iterations, starting with Southern Islands (GCN 1.0) in 2011-2012, followed by Sea Islands (GCN 2.0) in 2013 with products like the Radeon R9 290 series, and Volcanic Islands (GCN 3.0) continuing into 2015 via the Radeon 300 series. Later generations included Vega (GCN 5.0), launched in August 2017 with the Radeon RX Vega series, and Vega 20 (GCN 5.1) in November 2018, marking the final major update before the transition to the RDNA architecture in 2019. These milestones reflected incremental improvements in efficiency, feature support, and process nodes while maintaining the core SIMD design.

GCN played a pivotal role in AMD's strategy for integrated accelerated processing units (APUs) and GPUs, enabling seamless CPU-GPU collaboration through features like unified virtual memory. First integrated into APUs with the Kabini/Temash series in 2013, GCN powered subsequent designs like Kaveri (2014) and later APUs, enhancing everyday computing and thin-client applications. In the professional and server markets, GCN underpinned GPUs such as the FirePro and Radeon Instinct series, with the Instinct MI25 (Vega-based) launching in June 2017 to target machine-learning workloads. This versatility solidified GCN's importance in AMD's push toward heterogeneous systems and expanded market presence beyond consumer graphics.

Key innovations and design goals

Graphics Core Next (GCN) represented a fundamental shift in AMD's GPU design philosophy, moving away from the very long instruction word (VLIW) architecture of the preceding TeraScale generation to a single-instruction, multiple-data (SIMD) model. This transition aimed to enhance efficiency across diverse workloads by enabling better utilization of hardware resources through wavefront-based execution, where groups of 64 threads (a wavefront) are processed in a more predictable manner. The SIMD approach allowed up to five instructions to be issued per clock cycle across vector and scalar pipelines, improving instruction throughput and reducing the scheduling inefficiencies associated with VLIW's multi-issue dependencies.

A core design goal of GCN was to elevate general-purpose GPU (GPGPU) computing, with full support for OpenCL 1.2 and later standards, alongside DirectCompute 11.1 and C++ AMP, to facilitate heterogeneous computing applications. This emphasis targeted at least twice the shader performance of TeraScale architectures, achieved through optimized compute units that balanced graphics and parallel-processing demands. The architecture integrated graphics and compute pipelines into a unified framework, supporting DirectX 11 and preparing for DirectX 12 feature levels, while enabling compatibility with AMD's Heterogeneous System Architecture (HSA) for seamless CPU-GPU collaboration via shared virtual memory.

Power efficiency was another paramount objective, addressed through innovations like ZeroCore Power, which powers down idle GPU components to under 3 W during long idle periods, a feature first implemented in GCN 1.0. Complementary technologies such as fine-grained clock gating and PowerTune for dynamic voltage and frequency management further optimized energy use, enabling configurations from low-power parts consuming 2-3 W to high-end discrete GPUs delivering over 3 TFLOPS at 250 W. This scalability was inherent in GCN's modular compute-unit design, allowing flexible integration across market segments while maintaining consistent architectural principles.

Core Microarchitecture

Instruction set

The Graphics Core Next (GCN) instruction set architecture (ISA) is a 32-bit RISC-like design optimized for both graphics and general-purpose computing workloads, featuring distinct scalar (S) and vector (V) instruction types that enable efficient ALU operations across wavefronts. Scalar instructions operate on a single value per wavefront for control flow and address calculations, while vector instructions process one value per thread, supporting up to three operands in formats like VOP2 (two inputs) and VOP3 (up to three inputs, including 64-bit operations). This separation allows scalar units to handle program control independently from vector units focused on data-parallel computation.

Key instruction categories encompass arithmetic operations such as S_ADD_I32 for scalar addition and V_ADD_F32 or V_ADD_F64 for vector floating-point addition; bitwise operations including S_AND_B32 and V_AND_B32; and transcendental functions like V_SIN_F32 or V_LOG_F32 for approximations of sine, cosine, and logarithms in the vector ALU (VALU). Control flow is managed primarily through scalar instructions such as S_BRANCH for unconditional jumps and S_CBRANCH for conditional branches based on execution masks, alongside barriers and synchronization primitives to coordinate thread groups. These categories support a wavefront-based execution model in which each wavefront comprises 64 threads (organized as 16 work-items across 4 components for vector4 operations), enabling SIMD processing of instructions across the group.

From GCN 1.0 onward, the ISA includes native support for 64-bit integers (e.g., via V_ADD_U64) and double-precision floating-point operations (e.g., V_FMA_F64 for fused multiply-add), ensuring IEEE-754 compliance for compute-intensive tasks. Starting with GCN 3.0, the ISA includes half-precision floating-point (FP16) instructions like V_ADD_F16 and V_FMA_F16 for improved efficiency in reduced-precision workloads, alongside packed-math features in GCN 4.0 such as V_CVT_PK_U8_F32 for converting multiple low-precision values in a single operation. The ISA maintains broad compatibility across GCN generations (1.0 through 5.0), with new capabilities added via minor extensions rather than breaking changes, facilitating binary portability for shaders and kernels.
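
The split between scalar control flow and per-lane vector execution can be modeled in a few lines of Python. The mask handling below is purely illustrative; it reflects neither real instruction encodings nor timing:

    WAVE = 64
    v0 = list(range(WAVE))
    v1 = [0] * WAVE

    # "if (lane_id < 32)" becomes an execution-mask update, not a jump:
    exec_mask = [lane < 32 for lane in range(WAVE)]

    # A V_ADD_F32-style op: only lanes enabled by the mask write results.
    v1 = [a + 100 if active else old
          for a, old, active in zip(v0, v1, exec_mask)]

    # An S_CBRANCH-style scalar skip: if no lane is active, the wavefront
    # can branch over the masked vector block entirely.
    if not any(exec_mask):
        print("branch taken: skip vector block")
    print(v1[30:34])   # -> [130, 131, 0, 0]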

Command processing and schedulers

The Graphics Command Processor (GCP) serves as the front-end unit in the Graphics Core Next (GCN) architecture responsible for parsing high-level commands from the host driver, such as draw calls and state changes, and mapping them to the appropriate processing elements in the GPU. It coordinates the traditional rendering pipeline by distributing workloads across shader stages and fixed-function hardware units, enabling efficient handling of graphics-specific tasks like vertex processing and rasterization setup. The GCP processes separate command streams for different workload types, which facilitates multitasking and improves overall pipeline utilization by allowing concurrent execution of graphics operations.

Complementing the GCP, Asynchronous Compute Engines (ACEs) manage independent compute queues, allowing compute workloads to execute in parallel with graphics tasks for better resource overlap. Each ACE fetches commands from dedicated queues, forms prioritized task lists ranging from background to real-time levels, and dispatches workgroups to compute units (CUs) while checking for resource availability. GCN supports up to eight ACEs in later generations, enabling multiple independent queues that share hardware with the graphics pipeline but operate asynchronously, with graphics typically holding priority during contention. This design reduces idle time on CUs by interleaving compute shaders with graphics rendering, though it incurs a small overhead known as the "async tax" due to synchronization and context switching.

The scheduler hierarchy in GCN begins with a global command processor that dispatches work packets from user-visible queues in DRAM to workload managers, which then distribute tasks across shader engines and CUs. These managers route commands to per-SIMD schedulers within each CU, where four SIMD units per CU each maintain a scheduler partition buffering up to 10 wavefronts for round-robin execution. This tiered structure supports dispatching one workgroup per cycle per ACE or GCP, with up to five instructions issued per CU cycle across the different pipelines to maximize throughput.

Hardware schedulers within the ACEs and per-SIMD units handle thread management by prioritizing queues and enabling preemption for efficient workload balancing. Priority queuing allows higher-priority tasks to preempt lower-priority ones by flushing active workgroups and switching contexts via a dedicated cache, supporting out-of-order completion while ensuring ordering through fences or similar primitives. This mechanism accommodates up to 81,920 in-flight work items across 32 CUs, promoting high occupancy and reducing latency in heterogeneous systems.

Introduced in the fourth generation of GCN (GCN 4.0), the Primitive Discard Accelerator (PDA) enhances command processing through early rejection of degenerate or small primitives before they reach the vertex shader or rasterizer. It filters triangles with zero area or no sample coverage during input assembly, avoiding unnecessary vertex fetches and improving effective primitive throughput by up to 3.5 times in high-density scenarios. The PDA integrates into the front end to cull non-contributing primitives efficiently, improving energy efficiency and performance in graphics-heavy applications without impacting valid geometry.

Compute units and wavefront execution

The compute unit (CU) serves as the fundamental processing element in the Graphics Core Next (GCN) architecture, comprising 64 shader processors organized into four 16-wide SIMD units. Each SIMD unit handles 16 work-items simultaneously, enabling the CU to process a full wavefront of 64 threads by executing it across four clock cycles in a pipelined manner. This structure emphasizes massive parallelism while maintaining scalar control for branch handling.

At the heart of execution is the wavefront, the basic scheduling unit consisting of 64 threads that operate in lockstep across the SIMD units. These threads execute vector instructions synchronously, with the hardware decomposing each wavefront into four groups of 16 lanes processed sequentially over four cycles to accommodate the 16-wide SIMD width. GCN supports dual-issue capability, allowing the scheduler to dispatch one scalar instruction alongside a vector instruction in the same cycle, which enhances throughput for mixed workloads involving uniform operations and per-thread computations. The CU scheduler oversees wavefront dispatch using round-robin arbitration across up to six execution pipelines, managing instruction buffers and ensuring balanced utilization while tracking outstanding operations like vector ALU counts.

The SIMD vector arithmetic logic unit (VALU) within each CU performs core floating-point and integer operations, supporting full IEEE-754 compliance for FP32 and INT32 at a rate of one operation per lane per cycle, yielding 64 FP32 operations per CU clock in the base configuration. Export units integrated into the CU handle output from wavefronts, facilitating stores to global buffers via vector memory instructions and raster operations such as exporting colors or positions to render targets. These units support compression for efficiency and are shared across wavefronts to synchronize data flow with downstream graphics or compute pipelines.

Double-precision floating-point performance evolved significantly across GCN generations to better support scientific computing. In GCN 1.0, double-precision operations ran at as little as 1/16 the rate of single-precision on most parts due to shared hardware resources prioritizing FP32 workloads. Subsequent iterations, starting with GCN 2.0, improved this to 1/4 the single-precision rate on selected parts through dedicated ALU enhancements and optimized instructions like V_FMA_F64, enabling higher throughput for applications requiring FP64 arithmetic without compromising the core scalar-vector balance.

Graphics and Compute Pipeline

Geometric processing

In the Graphics Core Next (GCN) architecture, geometric processing encompasses the initial stages of the rendering pipeline, handling vertex-data ingestion, programmable shading for transformations, and fixed-function optimization to prepare primitives for rasterization. The pipeline begins with vertex fetch, where vertex attributes are retrieved from vertex buffers in system memory using buffer load instructions such as TBUFFER_LOAD_FORMAT, which access data through a unified read/write cache hierarchy including a 16 KB L1 cache per compute unit (CU) and a shared 768 KB L2 cache. Primitive assembly follows, where fetched vertices are grouped into primitives (such as triangles, lines, or points) by dual geometry engines capable of processing up to two primitives per clock cycle, enabling high throughput: for instance, 1.85 billion primitives per second on the Radeon HD 7970 at 925 MHz.

The programmable vertex-shader stage transforms these vertices using shaders executed on the scalable array of CUs, where each CU contains four 16-wide SIMD units that process 64-element wavefronts in parallel via a non-VLIW instruction set architecture (ISA) with vector ALU (VALU) operations for tasks like position calculations and attribute interpolation. This design allows flexible control flow and IEEE-754-compliant floating-point arithmetic, distributing workloads across up to 32 CUs for efficient parallel execution without the rigid bundling of prior VLIW architectures. Tessellation and geometry shaders extend this programmability, with a dedicated hardware tessellator performing efficient domain subdivision, generating 2 to 64 patches per invocation, up to four times faster than previous generations through improved parameter caching and vertex reuse that spills to the coherent L2 cache when needed. Geometry shaders, also run on CUs, enable primitive amplification and manipulation using instructions like S_SENDMSG for task signaling, supporting advanced effects such as fur or grass generation.

Fixed-function clipping and culling stages then optimize the workload by rejecting unnecessary primitives, including backface culling to discard triangles facing away from the viewer and view-frustum culling to eliminate those outside the camera's view volume, reducing downstream computational load. The setup engine concludes pre-raster processing by converting assembled primitives into a standardized form (typically triangles, but also points or lines) for handover to the rasterizer, which generates up to 16 pixels per cycle per primitive while integrating hierarchical Z-testing for early occlusion detection. These stages collectively leverage GCN's unified virtual addressing and scalable design, supporting up to 1 terabyte of addressable memory to handle complex scenes efficiently across generations.
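
The quoted primitive rate is straightforward arithmetic over the dual geometry engines, as this short Python check shows:

    primitives_per_clock = 2           # dual geometry engines, one each
    clock_hz = 925e6                   # Radeon HD 7970 engine clock
    print(primitives_per_clock * clock_hz / 1e9, "billion primitives/s")  # 1.85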

Rasterization and pixel processing

In the Graphics Core Next (GCN) architecture, the rasterization stage converts primitives into fragments by scanning screen-space tiles, with each rasterizer unit processing one triangle per clock cycle and generating up to 16 pixels per cycle. This target-independent rasterization offloads anti-aliasing computations to fixed-function hardware, reducing overhead on programmable shaders. Hierarchical Z-testing is integrated early in the pipeline, performing coarse depth comparisons on tile-level buffers to cull occluded fragments before they reach the shading stage, thereby improving efficiency by avoiding unnecessary pixel-shader invocations.

Fragment shading occurs within the compute units (CUs), where pixel shaders execute as 64-wide wavefronts, leveraging the same SIMD hardware as vertex and compute shaders for unified processing. GCN supports multi-sample anti-aliasing (MSAA) up to 8x coverage, with render back-ends (RBEs) equipped with 16 KB color caches per RBE for sample storage and compression, enabling efficient handling of anti-aliased targets without excessive bandwidth demands. Enhanced-quality AA (EQAA) extends this to 16x in some configurations using 4 KB depth caches per quad.

Texture sampling is managed by texture fetch units integrated into each CU, typically four per CU in first-generation implementations, which compute up to 16 sampling addresses per cycle and fetch texels from the L1 cache. These units support bilinear, trilinear, and anisotropic filtering up to 16x, with anisotropic modes incurring up to N times the cost of bilinear filtering depending on the anisotropy factor, to enhance texture clarity at oblique angles.

Following shading, fragments undergo depth and stencil testing in the RBEs, which apply configurable tests to determine visibility and resolve multi-sample coverage. Blending operations then combine fragment colors with render-target data using coverage-weighted accumulation, supporting formats like RGBA8 and advanced blending modes for final pixel output. Pixel exports from CUs route directly to these RBEs, bypassing the L2 cache in some cases for optimized access.

GCN integrates dedicated multimedia accelerators for audio and video processing. The Video Coding Engine (VCE) provides hardware-accelerated video encoding, starting with H.264/AVC support at 1080p/60 fps in first-generation GCN via VCE 1.0, and evolving to include HEVC (H.265) in VCE 3.0 (third generation) and VCE 4.0 (fifth-generation Vega). TrueAudio, introduced in second-generation GCN, is a dedicated ASIC co-processor that simulates spatial audio effects, enhancing realism by processing 3D soundscapes in real time alongside graphics rendering.

Compute and asynchronous operations

Graphics Core Next (GCN) architectures introduced robust support for compute shaders, enabling general-purpose computing on graphics processing units (GPGPU) through APIs such as OpenCL 1.2 and DirectCompute 11, which provide CUDA-like programmability for parallel workloads. These compute shaders incorporate synchronization primitives including barriers for intra-work-group coordination and atomic operations (e.g., add, max, min) on local and global memory to ensure data consistency across threads. Barriers are implemented via the S_BARRIER instruction supporting up to 16 wavefronts per work-group, while atomics leverage the 64 KB local data share (LDS) with 32-bit-wide entries for efficient thread-level operations.

A key innovation in GCN is the Asynchronous Compute Engines (ACEs), which manage compute workloads independently from graphics processing to enable overlapping execution of graphics and compute tasks on the same hardware resources. Each ACE handles multiple task queues with priority-based scheduling (ranging from background to real-time), each supporting up to 8 queues, with high-end implementations featuring multiple ACEs for greater parallelism (up to 64 queues total), facilitating concurrent dispatch without stalling the graphics pipeline. This asynchronous model supports out-of-order completion of tasks, synchronized through mechanisms like atomics, the LDS, or the global data share (GDS), thereby maximizing CU utilization during idle periods in rendering.

Compute wavefronts, groups of 64 threads executed in lockstep, are dispatched directly to CUs by the ACEs, bypassing the command processor and fixed-function stages to streamline non-graphics workloads. Each CU can schedule up to 40 wavefronts (10 per SIMD unit across 4 SIMDs), enabling high throughput for compute-intensive kernels while sharing resources with graphics shaders when possible. This direct path allows for efficient multitasking, where compute operations fill gaps left by graphics latency, such as waits during vertex or pixel processing.

GCN supports large work-group sizes of up to 1024 threads per group, divided into multiple wavefronts for execution, providing flexibility for algorithms requiring extensive intra-group communication. Shared memory is provided by the 64 KB LDS per CU, banked into 16 or 32 partitions to minimize contention and support fast atomic accesses within a work-group. Occupancy is tuned by factors like vector general-purpose register (VGPR) usage, with the maximum number of waves per SIMD reaching 10 for low-register kernels (≤24 VGPRs) but dropping to 1 for high-register ones (>128 VGPRs).

These features enable diverse GPGPU applications, such as physics simulations in game engines that leverage async queues for real-time particle effects. In machine learning, GCN facilitates inference workloads through compute shaders, though performance is limited without dedicated tensor cores, relying instead on general matrix multiplications via OpenCL or DirectCompute. Overall, the asynchronous model enhances efficiency in mixed graphics-compute scenarios, allowing seamless integration with CPU-driven systems via models like the Heterogeneous System Architecture (HSA).

Memory and System Features

Unified virtual memory

Graphics Core Next (GCN) introduces unified virtual memory (UVM) to enable seamless sharing of a single address space between the CPU and GPU, eliminating the need for explicit data copies in applications. This allows pointers allocated by the CPU to be directly accessed by GPU kernels, facilitating fine-grained data sharing and improving programmability. Implemented starting with the first-generation GCN hardware, UVM leverages hardware and driver support to manage address translation, supporting a 40-bit virtual address space that accommodates 1 TiB of addressable memory for 3D resources and textures.

The GPU's memory management unit (MMU) handles page management, using pages compatible with x86 addressing for transparent translation of virtual to physical addresses. This setup supports variable page sizes, including optional 4 KB sub-pages within 64 KB frames, ensuring efficient mapping for frame buffers and other resources. Page tables are populated by the driver, with the GPU MMU performing on-demand translations to maintain compatibility with the host system's memory model.

Pointer manipulation is facilitated by the scalar ALU, which processes 64-bit pointer values from registers to enable dynamic address computation during kernel execution. This allows for fine-grained access patterns, where vector instructions operate at granularities ranging from 32 bits to 128 bits, supporting atomic operations and variable structures without fixed alignment constraints. Such mechanisms ensure that CPU-allocated structures can be directly referenced on the GPU, promoting zero-copy semantics for enhanced efficiency.

Cache coherency in GCN's UVM is maintained through the L2 cache hierarchy and integration with the input-output memory management unit (IOMMU), which translates x86 virtual addresses for direct memory access (DMA) transfers between CPU and GPU. The IOMMU ensures consistent visibility of shared memory pools across the system, preventing stale-data issues by coordinating cache invalidations and flushes. This hardware-assisted coherency model supports system-level memory pools, allowing the GPU to access host memory transparently while minimizing synchronization overhead.

From GCN 1.0 onward, UVM has been a core feature, and integration with the Heterogeneous System Architecture (HSA) further extends its capabilities for coherent, multi-device environments. The primary benefit of GCN's UVM lies in heterogeneous computing, where it drastically cuts data-transfer overhead by enabling direct pointer-based sharing compared to traditional copy-based models. This not only boosts application performance but also simplifies development by abstracting memory-management complexities.
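
The 40-bit/64 KB split can be illustrated with a small Python translation helper. The page-table contents are invented for the example; real driver-managed tables are considerably more elaborate:

    PAGE_BITS = 16                     # 64 KiB pages
    VA_BITS = 40                       # 1 TiB of addressable space

    def translate(virtual_addr, page_table):
        assert virtual_addr < (1 << VA_BITS)
        page = virtual_addr >> PAGE_BITS
        offset = virtual_addr & ((1 << PAGE_BITS) - 1)
        return (page_table[page] << PAGE_BITS) | offset

    page_table = {0x12345: 0x00042}    # virtual page -> physical frame
    va = (0x12345 << PAGE_BITS) | 0xBEEF
    print(hex(translate(va, page_table)))   # -> 0x42beef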

Heterogeneous System Architecture

Heterogeneous System Architecture (HSA) serves as the foundational framework for Graphics Core Next (GCN) to enable unified computing between CPUs and GPUs, allowing seamless integration and task orchestration across heterogeneous agents without traditional operating-system intervention. Developed by the HSA Foundation in collaboration with AMD, this architecture defines specifications for user-mode queuing, memory models, and a portable intermediate language, optimizing GCN for applications requiring tight CPU-GPU collaboration. By abstracting hardware differences, HSA facilitates efficient workload distribution, reducing latency and power overhead in systems like AMD's Accelerated Processing Units (APUs).

At the core of HSA's integration model are user-level queues, known as hqueues, which allow direct signaling between CPU and GPU agents in user space, bypassing kernel-mode switches for lower-latency communication. These queues are runtime-allocated structures that hold command packets, enabling applications to enqueue tasks efficiently without OS involvement, as specified in the HSA Platform System Architecture. In GCN implementations, hqueues support priority-based scheduling, from background to real-time tasks, enhancing multitasking in heterogeneous environments.

Dispatch from the CPU to the GPU occurs through Architected Queuing Language (AQL) packets enqueued on these user-level queues, supporting fine-grained work dispatch for kernels and agents. AQL packets, such as kernel-dispatch types, specify launch dimensions, code handles, arguments, and completion signals, allowing agents to build and enqueue their own commands for fast, low-power execution on GCN hardware. This mechanism reduces launch latency by enabling direct enqueuing of tasks to kernel agents, with support for dependencies and out-of-order completion.

HSA leverages shared virtual memory with coherent caching to enable data sharing between CPU and GPU, utilizing the unified address space for direct access without data movement. All agents access global memory coherently, with automatic cache maintenance ensuring consistency across the system, as mandated by the HSA specifications. This model, compatible with GCN's virtual addressing, promotes efficient data-parallel computing by allowing pointers to be passed directly between processing elements.

AMD's HSA Intermediate Language (HSAIL) provides a portable virtual ISA that is compiled to the native GCN instruction set architecture (ISA) via a finalizer, ensuring hardware-agnostic code generation for heterogeneous execution. HSAIL, a RISC-like virtual ISA supporting data-parallel kernels with grids, work-groups, and work-items, translates operations like arithmetic, memory loads/stores, and branches into optimized GCN instructions, with features like relaxed memory ordering and acquire/release semantics. The finalizer handles optimizations such as register allocation and wavefront packing tailored to GCN's SIMD execution model.

HSA adoption in GCN-based APUs began with the Kaveri series (GCN 2.0), the first to implement full HSA features, including hqueues and coherent shared memory, for seamless CPU-GPU task assignment. Later generations extended this to APUs with Vega graphics (GCN 5.0), supporting advanced HSA capabilities through the ROCm software stack, which builds on HSA for compute workloads. These implementations enable features like heterogeneous queuing and unified memory in such systems, driving applications in compute-intensive domains.

Lossless compression and accelerators

Graphics Core Next (GCN) incorporates Delta Color Compression (DCC) as a lossless technique specifically designed for color buffers in rendering pipelines. DCC exploits data coherence by dividing color buffers into blocks and encoding one full-precision pixel value per block, with the remaining pixels represented as deltas using fewer bits when colors are similar. This method enables compression ratios that can reduce memory-bandwidth usage by up to 2x in scenarios with coherent data, such as skies or gradients, while remaining fully lossless to preserve rendering accuracy. Introduced in GCN 1.2 architectures, DCC allows shader cores to read compressed data directly, bypassing decompression overhead in render-to-texture operations and improving overall efficiency.

The Primitive Discard Accelerator (PDA) serves as a hardware mechanism to cull inefficient primitives early in the pipeline, particularly benefiting tessellation-heavy workloads. The PDA identifies and discards small or degenerate (zero-area) triangles that do not contribute to the final image, preventing unnecessary work in the compute units and reducing wasted cycles. This accelerator becomes increasingly effective as triangle density rises, enabling up to 3.5x higher throughput in dense scenes compared to prior implementations. Debuting in GCN 4.0 (Polaris), the PDA enhances pre-rasterization efficiency by filtering occluded or irrelevant geometry without impacting visible output.

GCN supports standard block-based texture compression formats, including the BCn (Block Compression) variants BC1 through BC7, which reduce texture memory footprint by encoding 4x4 blocks into fixed-size outputs of 64 or 128 bits. These formats are decompressed on the fly within the texture mapping units (TMUs), allowing efficient sampling of up to four texels per clock while minimizing bandwidth demands on main memory. Complementing this, fast clear operations optimize buffer initialization by rapidly setting surfaces to common values like 0.0 or 1.0, leveraging compression to avoid full buffer writes and achieving significantly higher speeds than traditional clears, often orders of magnitude faster in bandwidth-constrained scenarios. This combination is integral to GCN's render back-ends, where hierarchical Z-testing further aids in discarding occluded fragments after a clear.

To enhance power efficiency, GCN implements ZeroCore Power, a power-gating technology that aggressively reduces leakage in idle components. When the GPU enters a long-idle mode, such as a static-screen state, ZeroCore gates clocks and powers down compute units, caches, and other blocks, dropping idle power draw from around 15 W to under 3 W. Available from GCN 1.0 onward, this feature achieves up to a 90% reduction in static power leakage by isolating unused hardware, improving energy efficiency in discrete GPU deployments without compromising resume latency.
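
A minimal Python sketch of the delta-encoding idea behind DCC; the block size, anchor choice, and bit accounting are illustrative assumptions, and AMD's actual block formats differ:

    def encode_block(pixels):
        anchor = pixels[0]
        deltas = [p - anchor for p in pixels]
        # Losslessly representable in fewer bits only if deltas stay small.
        bits = max(d.bit_length() + 1 for d in deltas)   # +1 for the sign
        return anchor, deltas, bits

    sky = [200, 201, 201, 202, 203, 203, 204, 204]   # coherent gradient
    anchor, deltas, bits = encode_block(sky)
    raw_bits = len(sky) * 8
    dcc_bits = 8 + len(sky) * bits
    print(f"raw {raw_bits} bits -> compressed {dcc_bits} bits ({bits}-bit deltas)")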

Generations

First generation (GCN 1.0)

The first generation of the Graphics Core Next (GCN 1.0) architecture, codenamed Southern Islands, debuted with AMD's Radeon HD 7000 series GPUs in late 2011, marking a shift to a more compute-oriented design compared to prior VLIW-based architectures. Announced on December 22, 2011, and available starting January 9, 2012, these GPUs were fabricated on a 28 nm process node by TSMC, enabling higher density and improved power efficiency. The architecture introduced foundational support for unified virtual memory, allowing shared virtual address spaces between CPU and GPU for simplified memory management, though limited to 64 KB pages with 4 KB sub-pages in initial implementations.

Key innovations included ZeroCore Power technology, which dynamically powers down idle compute units to reduce leakage power during low-activity periods, a feature at the time exclusive to the HD 7900, 7800, and 7700 series. Double-precision floating-point (FP64) performance was configured at 1/4 the single-precision (FP32) rate on consumer GPUs, prioritizing graphics workloads over high-end compute tasks. The architecture supported DirectX 11 and OpenCL 1.2, enabling advanced rendering, compute shaders, and general-purpose GPU computing, but lacked full asynchronous compute optimization in early drivers, relying on two asynchronous compute engines (ACEs) for basic concurrent execution.

Representative implementations included the flagship Tahiti GPU in the Radeon HD 7970, featuring 32 compute units (CUs), 2,048 stream processors, and 3.79 TFLOPS of FP32 performance at a 250 W TDP, paired with 3 GB of GDDR5 memory on a 384-bit bus. Lower-end models used the Cape Verde GPU, as in the Radeon HD 7770 GHz Edition with 10 CUs, 640 stream processors, over 1 TFLOPS of FP32 at a 1,000 MHz core clock, and an 80 W TDP, targeting mainstream desktops with 1 GB of GDDR5 on a 128-bit bus. These discrete GPUs powered high-end gaming and early professional visualization, with PCI Express 3.0 connectivity and features such as AMD Eyefinity multi-display support at up to 4K resolution.
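The quoted throughput figures follow directly from the execution model: each GCN stream processor can retire one fused multiply-add (two floating-point operations) per clock, so peak FP32 equals stream processors x 2 x clock. A small C sketch of that arithmetic, taking the HD 7970's 925 MHz reference clock as an assumption, reproduces the numbers above.

    /* Back-of-envelope check of GCN peak-throughput figures. */
    #include <stdio.h>

    static double peak_tflops(int stream_processors, double clock_ghz) {
        /* SPs x 2 FLOPs x GHz gives GFLOPS; divide by 1000 for TFLOPS. */
        return stream_processors * 2.0 * clock_ghz / 1000.0;
    }

    int main(void) {
        printf("HD 7970: %.2f TFLOPS FP32\n", peak_tflops(2048, 0.925)); /* ~3.79 */
        /* Consumer Tahiti runs FP64 at 1/4 the FP32 rate: */
        printf("HD 7970: %.2f TFLOPS FP64\n", peak_tflops(2048, 0.925) / 4.0);
        printf("HD 7770: %.2f TFLOPS FP32\n", peak_tflops(640, 1.000));  /* ~1.28 */
        return 0;
    }

The same formula, with the appropriate stream-processor counts, clocks, and FP64 ratios, accounts for the figures cited for every later GCN generation in this article.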

Second generation (GCN 2.0)

The second generation of Graphics Core Next (GCN 2.0), known as the Sea Islands architecture, was introduced in 2013 with the launch of the AMD Radeon R9 200 series graphics cards. This generation built upon the foundational GCN design with optimizations for compute workloads, expanding the Asynchronous Compute Engines (ACEs) to eight, each managing up to eight independent compute queues for concurrent graphics and compute operations. These enhancements allowed more efficient multitasking, with support for instructions such as the 64-bit floating-point operations V_ADD_F64 and V_MUL_F64 and improved memory addressing via unified system and device address spaces.

Key discrete GPU implementations included the high-end Hawaii chip in the Radeon R9 290X, featuring 44 compute units (2,816 stream processors), peak single-precision compute performance of up to 5.6 TFLOPS at an engine clock of 1 GHz, and fabrication on a 28 nm process node. Within the R9 200 lineup, mid-range offerings such as the Radeon R9 270 used the Curacao GPU and low-end models like the Radeon R7 240 employed the Oland chip (both rebranded first-generation GCN designs), all on the 28 nm process with refined power gating and clock management. Additionally, the generation introduced Video Coding Engine (VCE) 2.0 hardware for H.264 encoding, adding features such as B-frames and YUV intra-frame encoding to accelerate video compression tasks.

Integrated graphics in APUs previewed Heterogeneous System Architecture (HSA) capabilities, with the Kaveri family (launched in early 2014) incorporating up to eight GCN 2.0 compute units alongside CPU cores for unified memory access and seamless CPU-GPU task offloading. This generation also added support for DirectX 11.2 and OpenCL 2.0, broadening compatibility with emerging compute standards, while flagship consumer parts such as the R9 290X ran double precision at 1/8 of the single-precision rate.

Third generation (GCN 3.0)

The third generation of Graphics Core Next (GCN 3.0), codenamed Volcanic Islands, was introduced in 2014 with the Tonga-based Radeon R9 285 and carried into AMD's R9 300 series and Fury lineup in 2015, bringing refinements aimed at improving efficiency and scaling from mid-range to high-end applications. This iteration enhanced arithmetic precision and resource management, and it introduced lossless Delta Color Compression (DCC) for color buffers, reducing memory-bandwidth consumption in rasterization workloads.

Prominent implementations included the Tonga GPU, used in cards such as the Radeon R9 285 and fabricated on a 28 nm process with 32 compute units for mid-range performance, and the flagship Fiji GPU in the Radeon R9 Fury X, featuring 64 compute units, 8.6 TFLOPS of single-precision compute performance, 4 GB of HBM1 memory, and a 275 W TDP. Fiji, also on 28 nm, emphasized high-bandwidth memory integration to reduce memory bottlenecks in demanding scenarios, while the generation as a whole supported partial (hybrid) H.265 (HEVC) video decode acceleration, improving the handling of 4K content. These chips delivered notable efficiency improvements, with power-optimized designs allowing sustained performance in 4K gaming environments.

GCN 3.0 also extended to accelerated processing units (APUs), notably the Carrizo family, where up to eight compute units provided capable integrated graphics alongside CPU cores on a 28 nm process, supporting DirectX 12 and full HSA 1.0 compliance for mainstream laptops. The Fury X's liquid-cooled thermal solution further exemplified the generation's refinements, maintaining lower temperatures under load than air-cooled predecessors and thereby aiding stable clock speeds and reduced throttling during extended sessions. Overall, these advancements balanced compute density with power efficiency, enabling broader adoption in gaming and multimedia without a process-node shrink.

Fourth generation (GCN 4.0)

The fourth generation of the Graphics Core Next (GCN 4.0) architecture, codenamed Polaris, was introduced in 2016 with the Radeon RX 400 series graphics cards, emphasizing substantial improvements in power efficiency and mainstream performance. Fabricated on a 14 nm FinFET process by GlobalFoundries, Polaris delivered up to 2.5 times the performance per watt of the previous generation, enabling better thermal management and lower power consumption for gaming and compute tasks. Key enhancements included refined clock gating, the Primitive Discard Accelerator, improved instruction prefetch and branch handling in the compute units, and support for DirectX 12, Vulkan, and asynchronous shaders, alongside FreeSync for adaptive-sync displays and HDR10 for enhanced visuals. The architecture maintained a 1:16 FP64 to FP32 ratio for consumer products and added hardware HEVC (H.265) decode and encode at up to 4K resolution through its updated video blocks.

Prominent discrete implementations featured the Polaris 10 GPU in the Radeon RX 480, with 36 compute units (2,304 stream processors), up to 5.8 TFLOPS of single-precision performance at a boost clock of 1,266 MHz, 8 GB of GDDR5 memory on a 256-bit bus delivering 224 GB/s of bandwidth, and a 150 W TDP. The higher-clocked RX 580 (the Polaris 20 refresh) reached 6.17 TFLOPS at a 1,340 MHz boost with a similar memory configuration. Mid-range options used a cut-down Polaris 10 in the RX 470, while the smaller Polaris 11 powered the RX 460 with 14 CUs and roughly 2.2 TFLOPS; all supported PCIe 3.0 and multi-monitor configurations of up to five displays. The RX 500 series in 2017 refreshed these designs with higher clocks for modest performance uplifts.

On the integrated side, the contemporaneous Bristol Ridge APUs (launched mid-2016) paired Excavator CPU cores with up to eight third-generation GCN compute units on 28 nm for laptops and desktops, enabling 1080p gaming without a discrete GPU and HSA-compliant task sharing; Polaris itself remained a discrete-only design. These advancements positioned Polaris as a cost-effective solution for VR-ready computing and 4K video playback, bridging the gap to higher-end architectures.

Fifth generation (GCN 5.0)

The fifth generation of the Graphics Core Next (GCN 5.0) architecture, codenamed Vega, was introduced by AMD in 2017, debuting with the consumer-oriented Radeon RX Vega series on 14 nm and later joined by professional-grade Vega 20 GPUs on 7 nm and integrated variants in Ryzen APUs. This generation focused on high-bandwidth memory integration, enhanced compute density for AI and HPC, and compatibility with Heterogeneous System Architecture (HSA), while supporting DirectX 12 and emerging compute workloads. Implementations spanned 14 nm and 7 nm processes, with FP64 ratios varying by segment: 1:16 for consumer products and up to 1:2 for professional accelerators.

The flagship consumer model, the Radeon RX Vega 64 based on Vega 10 (14 nm FinFET), featured 64 compute units, 4,096 stream processors, peak single-precision performance of 12.7 TFLOPS at a 1,546 MHz boost clock (13.7 TFLOPS for the liquid-cooled variant), and a 295 W TDP for air-cooled versions. It used 8 GB of High Bandwidth Memory 2 (HBM2) on a 2,048-bit interface for up to 484 GB/s of bandwidth, addressing data bottlenecks in 1440p and 4K gaming. Innovations such as enhanced Delta Color Compression reduced render-target bandwidth by exploiting pixel coherence, while Rapid Packed Math doubled FP16 throughput, to as much as 27.5 TFLOPS, aiding half-precision tasks without dedicated tensor cores. Vega excelled in bandwidth-limited scenarios but faced thermal challenges under sustained loads.

Professional extensions included the 7 nm Vega 20 in the Radeon Instinct MI50 (November 2018), with 60 CUs (3,840 stream processors), 13.3 TFLOPS FP32 and 6.7 TFLOPS FP64 at a 1,725 MHz peak clock, 16 or 32 GB of HBM2 on a 4,096-bit interface (1 TB/s bandwidth), and a 300 W TDP. The MI60 variant enabled all 64 CUs for 14.7 TFLOPS FP32 and 7.4 TFLOPS FP64, optimized for datacenter simulations and machine learning with a 1:2 FP64:FP32 ratio. Updated video blocks enabled full HEVC/H.265 4K@60fps encode and decode with 10-bit support, while the High Bandwidth Cache Controller (HBCC) extended virtual addressing to 49 bits, allowing access to up to 512 TB for large datasets.

Integrated graphics in Ryzen APUs, such as Raven Ridge (2018, 14 nm) with Vega 8–11 graphics (8–11 CUs, up to roughly 1.76 TFLOPS FP32 at 1,250 MHz on shared DDR4 memory) and the 12 nm Picasso refresh (2019), provided near-discrete-level graphics for mainstream tasks. These solutions highlighted GCN 5.0's versatility across market segments, paving the way for the transition to the RDNA architecture while maintaining software continuity.
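The bandwidth figures above follow from the memory interface arithmetic: bandwidth = (bus width in bits / 8) x effective data rate per pin. The C sketch below assumes a 1.89 Gbit/s/pin HBM2 rate for the RX Vega 64 and 2.0 Gbit/s/pin for Vega 20, values consistent with the ~484 GB/s and 1 TB/s figures cited in this section.

    /* Sketch: deriving HBM2 bandwidth from bus width and pin rate. */
    #include <stdio.h>

    static double bandwidth_gbs(int bus_width_bits, double gbps_per_pin) {
        return bus_width_bits / 8.0 * gbps_per_pin;  /* bytes/s in GB/s */
    }

    int main(void) {
        printf("RX Vega 64: %.0f GB/s\n", bandwidth_gbs(2048, 1.89)); /* ~484  */
        printf("MI50/MI60:  %.0f GB/s\n", bandwidth_gbs(4096, 2.0));  /* ~1024 */
        return 0;
    }

The wide-but-slow trade-off is the defining property of HBM: Vega 10's 2,048-bit interface runs each pin at under 2 Gbit/s yet outpaces the RX 480's 256-bit GDDR5 bus at 8 Gbit/s per pin, while consuming less interface power.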

Performance and implementations

Chip implementations across generations

The Graphics Core Next (GCN) architecture powered a wide array of GPU implementations from 2012 to 2021, encompassing discrete graphics cards, integrated graphics processing units (iGPUs) in accelerated processing units (APUs), and professional-grade accelerators. These chips were fabricated primarily by TSMC, GlobalFoundries, and Samsung on process nodes ranging from 28 nm to 7 nm, with memory configurations evolving from GDDR5 to high-bandwidth memory (HBM and HBM2) for enhanced performance in compute-intensive applications. Over 50 distinct chip variants were released, reflecting AMD's strategy of scaling GCN across desktop, mobile, and enterprise segments.

Discrete GPUs

Discrete GCN implementations targeted gaming and professional graphics, with large dies accommodating numerous compute units (CUs). Key examples include the first-generation Tahiti die, used in the Radeon HD 7970 series, which was built on a 28 nm process node, measured 352 mm², and contained 4.31 billion transistors while supporting GDDR5 memory. In the third generation, the Fiji die, employed in the Radeon R9 Fury series, represented a significant scale-up on the same 28 nm node with a 596 mm² die and 8.9 billion transistors, paired with 4 GB of HBM for superior bandwidth in demanding workloads. The fifth-generation Vega 10, found in the Radeon RX Vega 64, shifted to a 14 nm GlobalFoundries process, achieving a 486 mm² die with 12.5 billion transistors and up to 8 GB of HBM2 to boost compute throughput. Other notable discrete dies included Hawaii (GCN 2.0, 28 nm) and Polaris 10 (GCN 4.0, roughly 230 mm² on 14 nm with GDDR5).
Generation | Key Die | Process Node | Die Size (mm²) | Transistors (Billions) | Memory Type
GCN 1.0    | Tahiti  | 28 nm        | 352            | 4.31                   | GDDR5
GCN 3.0    | Fiji    | 28 nm        | 596            | 8.9                    | HBM
GCN 5.0    | Vega 10 | 14 nm        | 486            | 12.5                   | HBM2

Integrated APUs

GCN iGPUs were embedded in AMD's A-Series, Ryzen, and other APUs to enable heterogeneous computing on mainstream platforms, typically with fewer CUs than discrete counterparts for power efficiency. Early low-power examples include the Kabini APUs (e.g., the A4-5000 series, 2013), integrating two second-generation GCN compute units on a 28 nm process with shared DDR3 memory. On the desktop, Kaveri APUs such as the A10-7850K (2014) featured an eight-CU Radeon R7 iGPU on a 28 nm process, supporting up to 2,133 MHz DDR3 for improved graphics performance in compact systems. By the fifth generation, Raven Ridge APUs like the Ryzen 5 2400G (2018) incorporated up to 11 CUs in a Vega-based iGPU on a 14 nm process, using dual-channel DDR4 memory to deliver near-discrete graphics for gaming and content creation. These integrated solutions prioritized shared memory access over dedicated VRAM, enabling seamless CPU-GPU collaboration.

Professional GPUs

AMD extended GCN to workstation and datacenter markets through the FirePro and Radeon Instinct lines, optimized for stability and parallel processing. The FirePro W9000, based on the GCN 1.0 Tahiti die, offered 6 GB of GDDR5 on a 28 nm process for CAD and visualization tasks, delivering up to 3.9 TFLOPS of single-precision compute. Later, the Radeon Instinct MI series leveraged GCN 5.0, with the MI25 using a Vega 10 die (16 GB HBM2, 14 nm) for deep-learning acceleration and the MI50 employing Vega 20 (16 or 32 GB HBM2, 7 nm) to serve HPC clusters. These professional variants emphasized ECC memory support and multi-GPU scaling, distinguishing them from consumer-focused discrete cards.

Comparison of key specifications

The key specifications of Graphics Core Next (GCN) architectures evolved across generations, with progressive advances in compute density, memory subsystems, and power efficiency driven by process-node shrinks and architectural refinements. The flagship implementations below, selected as representative high-end parts in consumer or compute roles, demonstrate these trends through increased compute units (CUs), higher floating-point throughput, and growing memory bandwidth, while maintaining compatibility with the unified GCN instruction set.
Generation | Flagship Chip        | CUs | FP32 TFLOPS | FP64 TFLOPS       | Memory Bandwidth (GB/s) | Process Node | TDP (W)
GCN 1.0    | Radeon HD 7970       | 32  | 3.79        | 0.95 (1:4 ratio)  | 264                     | 28 nm        | 250
GCN 2.0    | Radeon R9 290X       | 44  | 5.63        | 0.70 (1:8 ratio)  | 320                     | 28 nm        | 290
GCN 3.0    | Radeon R9 Fury X     | 64  | 8.60        | 0.54 (1:16 ratio) | 512                     | 28 nm        | 275
GCN 4.0    | Radeon RX 480        | 36  | 5.83        | 0.36 (1:16 ratio) | 256                     | 14 nm        | 150
GCN 5.0    | Radeon Instinct MI25 | 64  | 12.30       | 0.77 (1:16 ratio) | 484                     | 14 nm        | 300
Performance trends in GCN show substantial uplift in single-precision (FP32) compute capability, scaling from approximately 3.8 TFLOPS in the GCN 1.0 HD 7970 to 12.3 TFLOPS in the GCN 5.0 Instinct MI25 (24.6 TFLOPS of FP16 with Rapid Packed Math), enabled by denser CU integration and clock optimizations. Efficiency improved even more markedly: in peak FP32 per watt, the GCN 4.0 RX 480 (5.83 TFLOPS at 150 W) roughly doubles the GCN 2.0 R9 290X (5.63 TFLOPS at 290 W), aided by the 14 nm FinFET shrink and refinements such as improved clock gating and delta color compression, while GCN 5.0 (Vega) pushed further with its High Bandwidth Cache Controller, yielding more than 2x better performance per watt in compute workloads relative to GCN 1.0 baselines.

GCN's evolution in concurrent processing and memory technology follows a similar arc. Asynchronous compute via Asynchronous Compute Engines (ACEs) was present from the start, with two ACEs in GCN 1.0 scaling to eight in GCN 2.0 and later for better GPU utilization in heterogeneous workloads. High-bandwidth memory support debuted in GCN 3.0 with HBM1 for reduced latency in bandwidth-intensive tasks, followed by HBM2 in GCN 5.0. Precision ratios varied by product segment: consumer GPUs moved from 1:4 (FP64:FP32) in early generations to 1:16 in later consumer models to prioritize FP32 throughput, while compute-oriented Vega 20 accelerators such as the MI50 and MI60 offered a full 1:2 ratio for double-precision workloads.

GCN's strengths lie in its compute scalability, facilitated by the uniform wavefront execution model and support for APIs such as Vulkan, OpenCL, and DirectX 12, enabling integration in high-performance computing (HPC) and machine-learning pipelines with up to 64 CUs per die in later generations. A notable limitation is the absence of dedicated ray-tracing hardware; ray tracing must instead be emulated in compute shaders, incurring higher overhead than the specialized accelerators found in subsequent architectures.
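The efficiency trend can be checked numerically from the comparison table above. The short C sketch below, with values copied straight from the table, prints peak FP32 GFLOPS per watt for each flagship.

    /* Performance per watt across the GCN flagship table: FP32 / TDP. */
    #include <stdio.h>

    struct gpu { const char *name; double tflops_fp32; double tdp_w; };

    int main(void) {
        struct gpu g[] = {
            { "HD 7970 (GCN 1.0)",   3.79, 250 },
            { "R9 290X (GCN 2.0)",   5.63, 290 },
            { "R9 Fury X (GCN 3.0)", 8.60, 275 },
            { "RX 480 (GCN 4.0)",    5.83, 150 },
            { "MI25 (GCN 5.0)",     12.30, 300 },
        };
        for (int i = 0; i < 5; i++)
            printf("%-22s %5.1f GFLOPS/W\n",
                   g[i].name, g[i].tflops_fp32 * 1000.0 / g[i].tdp_w);
        return 0;
    }

The output (roughly 15, 19, 31, 39, and 41 GFLOPS/W) makes the generational pattern concrete: the biggest single jump coincides with the 28 nm to 14 nm FinFET transition in GCN 4.0, with architectural refinements contributing the remainder.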
