Heterogeneous System Architecture
Heterogeneous System Architecture (HSA) is a cross-vendor set of specifications that allows for the integration of central processing units and graphics processors on the same bus, with shared memory and tasks.[1] The HSA is being developed by the HSA Foundation, which includes (among many others) AMD and ARM. The platform's stated aim is to reduce communication latency between CPUs, GPUs and other compute devices, and to make these various devices more compatible from a programmer's perspective,[2]: 3 [3] relieving the programmer of the task of planning the moving of data between devices' disjoint memories (as must currently be done with OpenCL or CUDA).[4]
CUDA, OpenCL, and most other sufficiently advanced programming frameworks can use HSA to increase their execution performance.[5] Heterogeneous computing is widely used in system-on-chip devices such as tablets, smartphones, other mobile devices, and video game consoles.[6] HSA allows programs to use the graphics processor for floating-point calculations without separate memory or scheduling.[7]
Rationale
The rationale behind HSA is to ease the burden on programmers when offloading calculations to the GPU. Originally driven solely by AMD and called the FSA, the idea was extended to encompass processing units other than GPUs, such as other manufacturers' DSPs.
[Figure: Steps performed when offloading calculations to the GPU on a non-HSA system]
[Figure: Steps performed when offloading calculations to the GPU on an HSA system, using the HSA functionality]
Modern GPUs are very well suited to single instruction, multiple data (SIMD) and single instruction, multiple threads (SIMT) workloads, while modern CPUs remain optimized for branch-heavy, serial code.
Overview
Sharing system memory directly between multiple system actors, an approach pioneered in embedded designs such as the Cell Broadband Engine, has helped make heterogeneous computing more mainstream. Heterogeneous computing itself refers to systems that contain multiple processing units – central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), or any type of application-specific integrated circuits (ASICs). The system architecture allows any accelerator, for instance a graphics processor, to operate at the same processing level as the system's CPU.
Among its main features, HSA defines a unified virtual address space for compute devices: where GPUs traditionally have their own memory, separate from the main (CPU) memory, HSA requires these devices to share page tables so that devices can exchange data by sharing pointers. This is to be supported by custom memory management units.[2]: 6–7 To render interoperability possible and also to ease various aspects of programming, HSA is intended to be ISA-agnostic for both CPUs and accelerators, and to support high-level programming languages.
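A minimal sketch of what this pointer sharing looks like in practice, using the standard HSA runtime API (the region-selection logic and function names here are illustrative assumptions, not part of any specification):

```c
#include <hsa.h>

// Illustrative sketch: allocate a buffer once in a global region of an HSA
// agent; on an HSA system the same pointer is valid on the CPU and in GPU
// kernel arguments, so no host-to-device copy is required.
static hsa_status_t find_global_region(hsa_region_t region, void *data) {
    hsa_region_segment_t segment;
    hsa_region_get_info(region, HSA_REGION_INFO_SEGMENT, &segment);
    if (segment == HSA_REGION_SEGMENT_GLOBAL) {
        *(hsa_region_t *)data = region; // remember the first global region
        return HSA_STATUS_INFO_BREAK;   // stop iterating
    }
    return HSA_STATUS_SUCCESS;
}

float *allocate_shared_buffer(hsa_agent_t agent, size_t n) {
    hsa_region_t region;
    hsa_agent_iterate_regions(agent, find_global_region, &region);
    void *ptr = NULL;
    hsa_memory_allocate(region, n * sizeof(float), &ptr); // single allocation
    return (float *)ptr; // CPU writes here; a kernel reads the same pointer
}
```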
So far, the HSA specifications cover:
HSA Intermediate Layer
HSAIL (Heterogeneous System Architecture Intermediate Language), a virtual instruction set for parallel programs:
- similar to LLVM Intermediate Representation and SPIR (used by OpenCL and Vulkan)
- finalized to a specific instruction set by a JIT compiler (a minimal sketch of this step follows the list)
- makes late decisions on which core(s) should run a task
- explicitly parallel
- supports exceptions, virtual functions and other high-level features
- debugging support
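A minimal sketch of this finalization step, assuming a BRIG module already produced by a front-end compiler (the wrapper function is hypothetical; the hsa_ext_* calls are from the HSA 1.0 finalizer extension):

```c
#include <string.h>
#include <hsa.h>
#include <hsa_ext_finalize.h>

// Sketch: finalize vendor-neutral BRIG (HSAIL bytecode) into native code for
// one agent's instruction set, as a JIT compiler or loader would.
hsa_code_object_t finalize_brig(hsa_agent_t agent, hsa_ext_module_t brig_module) {
    hsa_isa_t isa;
    hsa_agent_get_info(agent, HSA_AGENT_INFO_ISA, &isa); // target ISA

    hsa_ext_program_t program;
    hsa_ext_program_create(HSA_MACHINE_MODEL_LARGE, HSA_PROFILE_FULL,
                           HSA_DEFAULT_FLOAT_ROUNDING_MODE_DEFAULT, NULL, &program);
    hsa_ext_program_add_module(program, brig_module); // attach the BRIG module

    hsa_ext_control_directives_t directives;
    memset(&directives, 0, sizeof(directives)); // no special directives

    hsa_code_object_t code_object;
    hsa_ext_program_finalize(program, isa, 0, directives, "",
                             HSA_CODE_OBJECT_TYPE_PROGRAM, &code_object);
    hsa_ext_program_destroy(program);
    return code_object;
}
```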
HSA memory model
- compatible with C++11, OpenCL, Java and .NET memory models (see the C11 sketch after this list)
- relaxed consistency
- designed to support both managed languages (e.g. Java) and unmanaged languages (e.g. C)
- makes it much easier to develop third-party compilers for a wide range of heterogeneous products programmed in Fortran, C++, C++ AMP, Java, et al.
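Because the model aligns with C11/C++11 acquire-release semantics, ordinary host-side synchronization idioms carry over directly. A short illustrative C11 sketch (plain C11 atomics, not HSA API code):

```c
#include <stdatomic.h>
#include <stdbool.h>

// Illustrative producer/consumer under acquire-release consistency:
// the release store to the flag makes the buffer write visible to any
// observer (CPU thread or other agent) that acquires the flag.
float buffer[256];
atomic_bool ready = false;

void producer(void) {
    buffer[0] = 42.0f;                                          // plain data write
    atomic_store_explicit(&ready, true, memory_order_release);  // publish
}

bool consumer(void) {
    if (atomic_load_explicit(&ready, memory_order_acquire)) {   // acquire
        return buffer[0] == 42.0f;                               // guaranteed visible
    }
    return false;
}
```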
HSA dispatcher and run-time
- designed to enable heterogeneous task queueing: a work queue per core, distribution of work into queues, load balancing by work stealing (a minimal queue-per-device sketch follows this list)
- any core can schedule work for any other, including itself
- significant reduction of overhead when scheduling work for a core
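A minimal sketch of the queue-per-device idea using the HSA runtime API (agent discovery is assumed; MAX_AGENTS and the helper names are illustrative):

```c
#include <hsa.h>

#define MAX_AGENTS 8

// Sketch: create one user-mode queue per kernel-dispatch-capable agent so
// that work can be enqueued to any device; the packet processors consume
// packets from their queues without operating-system involvement.
typedef struct { hsa_queue_t *queues[MAX_AGENTS]; int count; } queue_set_t;

static hsa_status_t make_queue(hsa_agent_t agent, void *data) {
    queue_set_t *set = (queue_set_t *)data;
    uint32_t features = 0;
    hsa_agent_get_info(agent, HSA_AGENT_INFO_FEATURE, &features);
    if ((features & HSA_AGENT_FEATURE_KERNEL_DISPATCH) && set->count < MAX_AGENTS) {
        // HSA_QUEUE_TYPE_MULTI lets any thread (or agent) enqueue packets.
        hsa_queue_create(agent, 256, HSA_QUEUE_TYPE_MULTI, NULL, NULL,
                         UINT32_MAX, UINT32_MAX, &set->queues[set->count++]);
    }
    return HSA_STATUS_SUCCESS;
}

// Usage: queue_set_t set = {0}; hsa_iterate_agents(make_queue, &set);
```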
Mobile devices are one of the HSA's application areas, in which it yields improved power efficiency.[6]
Block diagrams
The illustrations below compare CPU-GPU coordination under HSA versus under traditional architectures.
[Figure: Standard architecture with a discrete GPU attached to the PCI Express bus. Zero-copy between the GPU and CPU is not possible due to distinct physical memories.]
[Figure: HSA brings unified virtual memory and facilitates passing pointers over PCI Express instead of copying the entire data.]
[Figure: In partitioned main memory, one part of the system memory is exclusively allocated to the GPU. As a result, zero-copy operation is not possible.]
[Figure: Unified main memory, where GPU and CPU are HSA-enabled. This makes zero-copy operation possible.[8]]
Software support
Some of the HSA-specific features implemented in the hardware need to be supported by the operating system kernel and specific device drivers. For example, support for AMD Radeon and AMD FirePro graphics cards, and for APUs based on Graphics Core Next (GCN), was merged into version 3.19 of the Linux kernel mainline, released on 8 February 2015.[10] This first implementation, known as amdkfd, focuses on "Kaveri" and "Berlin" APUs and works alongside the existing Radeon kernel graphics driver; programs do not interact with amdkfd directly, but queue their jobs using the HSA runtime.[11]
Additionally, amdkfd supports heterogeneous queuing (HQ), which aims to simplify the distribution of computational jobs among multiple CPUs and GPUs from the programmer's perspective. Support for heterogeneous memory management (HMM), suited only for graphics hardware featuring version 2 of AMD's IOMMU, was accepted into Linux kernel mainline version 4.14.[12]
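As an illustration of this division of labor, a user-space program links only against the HSA runtime and never opens the kernel driver directly; a minimal probe might look like this (a sketch assuming the standard hsa.h header):

```c
#include <stdio.h>
#include <hsa.h>

// Sketch: user space initializes the HSA runtime (which talks to amdkfd on
// Linux) and reports the runtime version; no direct driver ioctls are needed.
int main(void) {
    if (hsa_init() != HSA_STATUS_SUCCESS) {
        fprintf(stderr, "No HSA runtime/driver available\n");
        return 1;
    }
    uint16_t major = 0, minor = 0;
    hsa_system_get_info(HSA_SYSTEM_INFO_VERSION_MAJOR, &major);
    hsa_system_get_info(HSA_SYSTEM_INFO_VERSION_MINOR, &minor);
    printf("HSA runtime %u.%u\n", major, minor);
    hsa_shut_down();
    return 0;
}
```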
Integrated support for HSA platforms has been announced for the "Sumatra" release of OpenJDK, due in 2015.[13]
AMD APP SDK is AMD's proprietary software development kit targeting parallel computing, available for Microsoft Windows and Linux. Bolt is a C++ template library optimized for heterogeneous computing.[14]
GPUOpen comprises several other software tools related to HSA. CodeXL version 2.0 includes an HSA profiler.[15]
Hardware support
AMD
As of February 2015, only AMD's "Kaveri" A-series APUs (cf. "Kaveri" desktop processors and "Kaveri" mobile processors) and Sony's PlayStation 4 allowed the integrated GPU to access memory via version 2 of AMD's IOMMU. Earlier APUs (Trinity and Richland) included the version 2 IOMMU functionality, but only for use by an external GPU connected via PCI Express.[citation needed]
Post-2015 Carrizo and Bristol Ridge APUs also include the version 2 IOMMU functionality for the integrated GPU.[citation needed]
The following table shows features of AMD's processors with 3D graphics, including APUs (see also: List of AMD processors with 3D graphics).
[Table: Features of AMD processors with 3D graphics (APUs), covering the high/standard/low-power line from Llano (August 2011) through Raphael and Phoenix (2022–2023) and the low/ultra-low-power line from Ontario/Zacate (January 2011) through Mendocino (November 2022). Compared per generation: CPU microarchitecture (K10, Piledriver, Steamroller, Excavator, and Zen through Zen 4; Bobcat, Jaguar, Puma in the low-power line), x86-64 ISA level and instruction-set extensions, socket, PCI Express version (2.0–5.0), fabrication process (GlobalFoundries 32 nm SHP through TSMC 4 nm), die area, TDP range, clock speeds, IOMMU version (v1 or v2, the latter required for HSA memory sharing), cache hierarchy (L1–L3, 3D V-Cache), DRAM support (DDR3-1866 through DDR5-5600/LPDDR5x-7500), GPU microarchitecture (TeraScale 2/3, GCN 2nd–5th gen, RDNA 2/3), video decode/encode blocks (UVD, VCE, VCN), display outputs, and Linux kernel driver support (/drm/radeon, /drm/amdgpu).]
ARM
ARM's Bifrost microarchitecture, as implemented in the Mali-G71,[30] is fully compliant with the HSA 1.1 hardware specifications. As of June 2016, ARM has not announced software support that would use this hardware feature.
See also
- General-purpose computing on graphics processing units (GPGPU)
- Non-Uniform Memory Access (NUMA)
- OpenMP
- Shared memory
- Zero-copy
- A technique enabling zero-copy operation for a CPU and a parallel accelerator[31]
References
- ^ Tarun Iyer (30 April 2013). "AMD Unveils its Heterogeneous Uniform Memory Access (hUMA) Technology". Tom's Hardware.
- ^ a b George Kyriazis (30 August 2012). Heterogeneous System Architecture: A Technical Review (PDF) (Report). AMD. Archived from the original (PDF) on 28 March 2014. Retrieved 26 May 2014.
- ^ "What is Heterogeneous System Architecture (HSA)?". AMD. Archived from the original on 21 June 2014. Retrieved 23 May 2014.
- ^ Joel Hruska (26 August 2013). "Setting HSAIL: AMD explains the future of CPU/GPU cooperation". ExtremeTech. Ziff Davis.
- ^ Linaro (21 March 2014). "LCE13: Heterogeneous System Architecture (HSA) on ARM". slideshare.net.
- ^ a b "Heterogeneous System Architecture: Purpose and Outlook". gpuscience.com. 9 November 2012. Archived from the original on 1 February 2014. Retrieved 24 May 2014.
- ^ "Heterogeneous system architecture: Multicore image processing using a mix of CPU and GPU elements". Embedded Computing Design. Retrieved 23 May 2014.
- ^ "Kaveri microarchitecture". SemiAccurate. 15 January 2014.
- ^ Michael Larabel (21 July 2014). "AMDKFD Driver Still Evolving For Open-Source HSA On Linux". Phoronix. Retrieved 21 January 2015.
- ^ a b "Linux kernel 3.19, Section 1.3. HSA driver for AMD GPU devices". kernelnewbies.org. 8 February 2015. Retrieved 12 February 2015.
- ^ "HSA-Runtime-Reference-Source/README.md at master". github.com. 14 November 2014. Retrieved 12 February 2015.
- ^ "Linux Kernel 4.14 Announced with Secure Memory Encryption and More". 13 November 2017. Archived from the original on 13 November 2017.
- ^ Alex Woodie (26 August 2013). "HSA Foundation Aims to Boost Java's GPU Prowess". HPCwire.
- ^ "Bolt on github". GitHub. 11 January 2022.
- ^ AMD GPUOpen (19 April 2016). "CodeXL 2.0 includes HSA profiler". Archived from the original on 27 June 2018. Retrieved 21 April 2016.
- ^ "AMD Announces the 7th Generation APU: Excavator mk2 in Bristol Ridge and Stoney Ridge for Notebooks". 31 May 2016. Retrieved 3 January 2020.
- ^ "AMD Mobile "Carrizo" Family of APUs Designed to Deliver Significant Leap in Performance, Energy Efficiency in 2015" (Press release). 20 November 2014. Retrieved 16 February 2015.
- ^ "The Mobile CPU Comparison Guide Rev. 13.0 Page 5 : AMD Mobile CPU Full List". TechARP.com. Retrieved 13 December 2017.
- ^ a b "AMD VEGA10 and VEGA11 GPUs spotted in OpenCL driver". VideoCardz.com. Retrieved 6 June 2017.
- ^ Cutress, Ian (1 February 2018). "Zen Cores and Vega: Ryzen APUs for AM4 – AMD Tech Day at CES: 2018 Roadmap Revealed, with Ryzen APUs, Zen+ on 12nm, Vega on 7nm". Anandtech. Retrieved 7 February 2018.
- ^ Larabel, Michael (17 November 2017). "Radeon VCN Encode Support Lands in Mesa 17.4 Git". Phoronix. Retrieved 20 November 2017.
- ^ a b "AMD Ryzen 5000G 'Cezanne' APU Gets First High-Res Die Shots, 10.7 Billion Transistors In A 180mm2 Package". wccftech. 12 August 2021. Retrieved 25 August 2021.
- ^ Tony Chen; Jason Greaves, "AMD's Graphics Core Next (GCN) Architecture" (PDF), AMD, retrieved 13 August 2016
- ^ "A technical look at AMD's Kaveri architecture". Semi Accurate. Retrieved 6 July 2014.
- ^ "How do I connect three or More Monitors to an AMD Radeon™ HD 5000, HD 6000, and HD 7000 Series Graphics Card?". AMD. Retrieved 8 December 2014.
- ^ Airlie, David (26 November 2009). "DisplayPort supported by KMS driver mainlined into Linux kernel 2.6.33". Retrieved 16 January 2016.
- ^ "Radeon feature matrix". freedesktop.org. Retrieved 10 January 2016.
- ^ Deucher, Alexander (16 September 2015). "XDC2015: AMDGPU" (PDF). Retrieved 16 January 2016.
- ^ a b Michel Dänzer (17 November 2016). "[ANNOUNCE] xf86-video-amdgpu 1.2.0". lists.x.org.
- ^ "ARM Bifrost GPU Architecture". 30 May 2016. Archived from the original on 10 September 2016.
- ^ Computer memory architecture for hybrid serial and parallel computing systems, US patents 7,707,388, 2010 and 8,145,879, 2012. Inventor: Uzi Vishkin
External links
- HSA Heterogeneous System Architecture Overview on YouTube by Vinod Tipparaju at SC13 in November 2013
- HSA and the software ecosystem
- 2012 – HSA by Michael Houston Archived 5 March 2016 at the Wayback Machine
Heterogeneous System Architecture
Introduction
Definition and Scope
Heterogeneous System Architecture (HSA) is an open, cross-vendor industry standard developed to integrate central processing units (CPUs), graphics processing units (GPUs), and other compute accelerators into a single, coherent computing system, enabling seamless parallel processing across diverse hardware components.[1] This architecture addresses the challenges of heterogeneous computing by providing a unified programming model that allows developers to write code once and deploy it across multiple device types without explicit data transfers or device-specific optimizations.[3] The scope of HSA primarily targets applications requiring high-performance parallel computation, such as graphics rendering, artificial intelligence, machine learning, and scientific simulations, where workloads can be dynamically distributed among CPUs, GPUs, and specialized processors like digital signal processors (DSPs).[1] It emphasizes a system-level approach to heterogeneous computing, while abstracting hardware differences to promote portability and efficiency.[3] At its core, HSA relies on principles like cache-coherent shared virtual memory for unified access to system resources, low-latency inter-device communication at the user level without operating system intervention, and hardware abstraction to hide vendor-specific details from programmers.[2] Key specifications defining HSA include version 1.0, released in March 2015, which established foundational elements such as the Heterogeneous System Architecture Intermediate Language (HSAIL), a portable, virtual instruction set architecture (ISA) that preserves parallelism information, and the Heterogeneous Compute (HC) language for high-level programming support.[5] This version also introduced runtime application programming interfaces (APIs) for resource management and task dispatching.[6] HSA 1.1, released in May 2016, extended these with multi-vendor interoperability interfaces, enhancing support for integrating IP blocks from different manufacturers while maintaining the unified memory model for coherent data sharing across agents.[7] The latest version, 1.2, was released in 2021 and refined aspects of the system architecture, runtime, and programmer's reference manual, with no major updates as of November 2025.[2]
Historical Development
The Heterogeneous System Architecture (HSA) initiative originated from efforts to standardize heterogeneous computing, beginning with the formation of the HSA Foundation in June 2012 as a non-profit consortium dedicated to developing open standards for integrating CPUs, GPUs, and other accelerators on a single chip.[8] The founding members included AMD, ARM, Imagination Technologies, MediaTek, Qualcomm, Samsung, and Texas Instruments, with the goal of creating a unified programming model to simplify development for system-on-chip (SoC) designs and reduce reliance on proprietary interfaces.[9][10] Additional early members, such as Vivante Corporation, joined shortly after in August 2012, expanding the consortium's focus on mobile and embedded hybrid compute platforms.[11] Key milestones in HSA's development included the release of the initial Programmer's Reference Manual version 0.95 in May 2013, which outlined the foundational HSA Intermediate Language (HSAIL) and runtime APIs.[12] This progressed to the HSA 1.0 specification in March 2015, enabling certification of compliant systems and marking the first complete standard for unified memory access and task dispatching across heterogeneous processors.[13] The specification advanced further with HSA 1.1 in May 2016, introducing enhancements like finalizer passes for HSAIL to support more flexible code generation and versioning for compiler toolchains.[7] HSA 1.2 followed in 2021 as the most recent major update.[2] HSA evolved from proprietary approaches, notably AMD's Fusion System Architecture announced in 2011, which integrated CPU and GPU cores but lacked broad industry interoperability; the 2012 rebranding to HSA and foundation formation shifted it toward open standards.[14] This transition facilitated integration with open-source compilers like LLVM, enabling HSAIL as a portable intermediate representation for heterogeneous code optimization starting around 2013. However, after peak activity around 2017, when surveys highlighted heterogeneous systems' growing importance, foundation updates slowed, though the HSA Foundation continues to maintain the existing specifications.[15]
Motivations and Benefits
Rationale for HSA
Prior to the development of Heterogeneous System Architecture (HSA), traditional heterogeneous computing environments, particularly those integrating CPUs and GPUs, suffered from significant inefficiencies in data management and processing workflows. A primary challenge was the requirement for explicit data copying between separate CPU and GPU memory spaces, which treated the GPU as a remote device and incurred substantial overhead in terms of time and power consumption.[16] Additionally, the use of distinct address spaces for each processor led to high latency during data transfers and task dispatching, often involving operating system kernel transitions and driver interventions that disrupted seamless execution.[16] These issues were exacerbated by vendor-specific programming models, such as NVIDIA's CUDA, which offered high performance but locked developers into proprietary ecosystems, and OpenCL, intended as a cross-vendor standard yet requiring tedious and error-prone manual porting efforts between implementations, thereby hindering application portability across diverse hardware.[17] The emergence of these challenges coincided with broader industry trends in the early 2010s, particularly around 2010-2012, as heterogeneous systems gained prominence in mobile, embedded, and high-performance computing domains. The proliferation of power-constrained devices, such as smartphones and tablets, alongside the demands of data centers for energy-efficient scaling, underscored the need for architectures that could harness increasing levels of parallelism without proportional rises in power usage.[18] Innovations like AMD's Accelerated Processing Units (APUs) and ARM's big.LITTLE architecture highlighted the shift toward integrated CPU-GPU designs, but the lack of standardized interfaces limited their potential for widespread adoption in handling complex workloads like multimedia processing and scientific simulations.[18] This period also saw GPUs evolving from specialized graphics accelerators to general-purpose compute engines, amplifying the urgency for unified frameworks to manage diverse processing elements beyond traditional CPUs and GPUs.[16] In response, HSA was designed with core goals to address these pain points by enabling seamless task offloading across processors without constant CPU oversight, thereby minimizing dispatch latency and data movement overhead.[19] It sought to reduce programming complexity through a more unified approach, allowing developers to target multiple accelerators (such as GPUs, DSPs, and future extensions) with greater portability and less vendor dependency.[19] Ultimately, these objectives aimed to foster an ecosystem where heterogeneous computing could be leveraged efficiently for emerging applications, promoting innovations in areas like real-time AI and edge processing.[1]
Key Advantages
Heterogeneous System Architecture (HSA) delivers substantial performance benefits by enabling seamless collaboration between CPU and GPU through coherent shared memory, which eliminates the need for explicit data copies and reduces transfer overheads. In benchmarks such as the Haar Face Detect algorithm implemented on an AMD A10 4600M APU, HSA achieved a 2.3x speedup over traditional OpenCL-based CPU/GPU setups by leveraging unified memory and low-overhead task dispatching. This coherent memory model significantly improves data transfer efficiency for workloads involving frequent CPU-GPU data sharing, such as parallel processing tasks, compared to legacy systems requiring manual synchronization and copying. Furthermore, HSA's fine-grained task dispatching via user-level queues reduces dispatch latency in integrated systems, contrasting with higher delays in PCIe-based discrete GPU configurations where kernel launches and data staging add significant overhead.[19] Efficiency gains in HSA stem from optimized resource utilization and reduced overheads in integrated system-on-chips (SoCs), allowing processors to share data pointers without cache flushes or synchronization barriers. For the same Haar Face Detect workload, HSA demonstrated a 2.4x reduction in power consumption relative to conventional CPU/GPU approaches, attributed to minimized memory operations and efficient workload distribution. This leads to better overall system efficiency, particularly in power-constrained environments like mobile devices, where CPU-GPU collaboration avoids redundant computations and enables dynamic load balancing without OS intervention.[19] HSA enhances usability by providing a portable programming model with a unified virtual address space, enabling developers to write vendor-agnostic code that runs across diverse hardware without vendor-specific APIs. This simplifies debugging, as pointers and data structures are shared seamlessly between compute units, reducing errors from memory management. The architecture supports heterogeneous workloads, including machine learning inference, through libraries like AMD's MIGraphX in the ROCm ecosystem, which leverages HSA's runtime for efficient model deployment on integrated CPU-GPU systems.[1][20] Real-world applications illustrate these advantages: in gaming, HSA accelerates graphics rendering on AMD APUs by enabling direct CPU-GPU task handoff, improving frame rates without data staging overheads. Similarly, scientific simulations benefit from faster execution, as unified memory allows iterative computations to proceed without intermediate data transfers, enhancing throughput in fields like computational biology and physics modeling.[21]
Core Concepts
Unified Memory Model
The unified memory model in Heterogeneous System Architecture (HSA) establishes a shared virtual address space accessible by all agents, including CPUs, GPUs, and other compute units, enabling seamless data sharing without the need for explicit memory transfers. This model mandates a minimum 48-bit virtual address space for 64-bit systems and 32-bit for 32-bit systems, allowing applications to allocate memory once and access it uniformly across heterogeneous processors.[2] Fine-grained coherence is enforced at the cache-line level for the global memory segment in the base profile, ensuring that modifications by one agent are visible to others in a consistent manner.[2] Central to this model is the use of shared physical memory with a relaxed consistency guarantee, which adopts acquire-release semantics to balance performance and correctness in parallel executions. Under this semantics, loads and stores are ordered relative to synchronization operations, such as atomic instructions, preventing unnecessary barriers while maintaining sequential consistency for properly synchronized code. Synchronization between agents is facilitated through HSA signals and queues, which provide low-overhead mechanisms for notifying completion and coordinating data access without requiring explicit data copies between device and host memory. This eliminates the traditional copy-in/copy-out overheads seen in discrete GPU programming models, allowing developers to treat memory as a unified resource.[2] Coherence protocols in HSA are hardware-managed, supporting mechanisms like snooping or directory-based approaches to maintain consistency across multiple agents in multi-socket or multi-device configurations. In snooping protocols, caches monitor bus traffic to invalidate or update shared lines, while directory-based methods use a central directory to track cache states, reducing bandwidth in scalable systems. The model also accommodates heterogeneous page sizes through the HSA Memory Management Unit (MMU), ensuring compatibility across agents with varying hardware capabilities, though all agents must support the same page sizes for global memory mappings. These features collectively form the foundation for efficient heterogeneous computing, with runtime queues integrating synchronization to dispatch tasks across agents.[2]
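A small sketch of this signal-based handshake using the HSA runtime API (buffer allocation and signal creation are assumed to have happened already):

```c
#include <hsa.h>

// Sketch: a CPU producer fills a shared buffer and releases it to other
// agents through a signal; a consumer that acquires the signal observes the
// writes without any explicit copy or cache flush.
void publish(hsa_signal_t sig, float *shared, int n) {
    for (int i = 0; i < n; ++i) shared[i] = (float)i;  // plain stores
    hsa_signal_store_screlease(sig, 1);                // release: publish writes
}

void consume(hsa_signal_t sig, const float *shared) {
    // Acquire: block until the signal value is 1, with acquire ordering.
    hsa_signal_wait_scacquire(sig, HSA_SIGNAL_CONDITION_EQ, 1,
                              UINT64_MAX, HSA_WAIT_STATE_BLOCKED);
    float first = shared[0];  // guaranteed to observe the producer's stores
    (void)first;
}
```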
Intermediate Layer (HSAIL)
The Heterogeneous System Architecture Intermediate Language (HSAIL) serves as a portable intermediate representation for compute kernels in heterogeneous computing environments, functioning as a virtual instruction set architecture (ISA) that abstracts hardware-specific details to enable cross-vendor compatibility.[22] Designed for parallel processing, HSAIL is based on a subset of LLVM Intermediate Representation (IR) augmented with extensions for heterogeneous features, such as support for diverse processor types including CPUs and GPUs.[22] It allows developers to write kernels once and compile them into platform-independent bytecode, which can then be optimized for specific hardware targets without altering the source code.[22] HSAIL includes key instruction categories tailored for efficient kernel execution, such as memory access operations like ld (load) and st (store) that specify address spaces including global, group, private, and flat to manage data locality in heterogeneous systems.[22] Control flow is handled through instructions like brn for unconditional branches and cbr for conditional branches, enabling structured program flow within parallel work-items.[22] Vector operations support packed data manipulation, with instructions such as combine and expand for rearranging elements in vectors, alongside modifiers like width(n) to specify execution granularity and reduce overhead in SIMD-like environments.[22] These components are defined in a RISC-like syntax using registers (e.g., $s0 for scalar values) and directives for pragmas, ensuring a low-level yet abstract representation suitable for optimization.[22]
The compilation process for HSAIL begins with high-level source code, such as C++ or OpenCL, which front-end compilers translate into HSAIL text format.[22] This text is then encoded into BRIG (Binary Representation of HSAIL), a platform-independent bytecode format using little-endian C-style structures for sections like code, directives, and operands, facilitating portability across HSA-compliant systems.[22] Vendor-specific finalizers subsequently apply hardware-optimized passes, translating BRIG into native machine code either statically, at load time, or dynamically, while performing tasks like register allocation and instruction scheduling to match target ISA constraints.[22]
Unique to HSAIL is its support for dynamic parallelism, where kernels can launch additional work-groups or work-items at runtime through scalable data-parallel constructs, using execution widths (e.g., width(64)) and fine-grained barriers for synchronization within wavefronts or subsets of threads.[22] Error handling addresses invalid memory accesses, such as unaligned addresses or out-of-bounds operations, via exception policies like DETECT (to identify issues) or BREAK (to halt execution), ensuring robust behavior in heterogeneous runtime environments.[22] This integration allows HSAIL kernels to interact seamlessly with the HSA runtime for dispatch, though detailed execution mechanics are managed externally.[22]
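A sketch of the load-time path that follows finalization, in which a code object is bound into an executable and a kernel dispatch handle is extracted (the symbol name is hypothetical):

```c
#include <hsa.h>

// Sketch: load a finalized code object and look up a kernel's dispatch handle.
uint64_t load_kernel(hsa_agent_t agent, hsa_code_object_t code_object) {
    hsa_executable_t exe;
    hsa_executable_create(HSA_PROFILE_FULL, HSA_EXECUTABLE_STATE_UNFROZEN, "", &exe);
    hsa_executable_load_code_object(exe, agent, code_object, ""); // bind to agent
    hsa_executable_freeze(exe, "");                               // ready to run

    hsa_executable_symbol_t symbol;
    // "&__vector_add_kernel" is a hypothetical symbol name.
    hsa_executable_get_symbol(exe, NULL, "&__vector_add_kernel", agent, 0, &symbol);

    uint64_t kernel_object = 0; // value for the AQL packet's kernel_object field
    hsa_executable_symbol_get_info(symbol,
        HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_OBJECT, &kernel_object);
    return kernel_object;
}
```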
Runtime System and Dispatcher
The HSA runtime system provides a standardized library interface, defined in the header file hsa.h, that enables developers to initialize execution contexts, manage heterogeneous agents such as CPUs and GPUs, and create command queues for workload orchestration.[6] Initialization occurs through the hsa_init() function, which establishes a reference-counted runtime environment that must precede other API calls, while shutdown is handled by hsa_shut_down() to release resources.[6] Agents, representing compute-capable hardware components, are managed via APIs that allow querying their capabilities, such as kernel dispatch support, ensuring seamless integration across CPU and GPU devices.[6] At the core of dispatch operations is the command queue mechanism, which facilitates asynchronous execution through user-mode queues populated with Architected Queuing Language (AQL) packets.[6] Queues are created using hsa_queue_create(), supporting single-producer (HSA_QUEUE_TYPE_SINGLE) or multi-producer (HSA_QUEUE_TYPE_MULTI) configurations, with sizes as powers of two (e.g., 256 packets) to optimize hardware doorbell signaling.[6] Dispatch involves reserving a packet ID, writing the AQL packet to the queue, and ringing the doorbell to notify the agent, enabling non-blocking submission of workloads.[6] Packet types include kernel dispatch (HSA_PACKET_TYPE_KERNEL_DISPATCH) for launching HSAIL kernels on compute units, and barrier packets such as HSA_PACKET_TYPE_BARRIER_AND (acquire-and) for synchronization waiting on all dependencies or HSA_PACKET_TYPE_BARRIER_OR (acquire-or) for any dependency completion.[6] Priority levels for workloads are managed through queue creation parameters or packet header bits, allowing agents to prioritize tasks based on latency or throughput requirements.[6] Key runtime processes include agent discovery, which uses hsa_iterate_agents() to enumerate available CPUs and GPUs, filtering by features like HSA_AGENT_FEATURE_KERNEL_DISPATCH to identify suitable dispatch targets.[6] Memory allocation is supported via hsa_memory_allocate(), which assigns regions in the global or fine-grained segments associated with specific agents, ensuring coherent access across the heterogeneous system.[6] Signal handling provides completion notification through hsa_signal_create() for generating signals, hsa_signal_add_release() or similar for dependency tracking, and hsa_signal_wait_scacquire() for blocking waits, allowing efficient synchronization without polling.[6] These signals integrate with queue packets to signal dispatch completion, enabling the runtime to orchestrate complex dependency graphs. The runtime's scalability is enhanced by support for agents comprising multiple compute units, queried via hsa_agent_get_info() with HSA_AGENT_INFO_COMPUTE_UNIT_COUNT, allowing kernels to distribute across parallel hardware resources.[6] Load balancing is achieved through the creation of multiple queues per agent and multi-producer support, permitting concurrent submissions from various host threads to distribute workloads dynamically across available compute units.[6] This design enables efficient scaling in multi-agent environments, where HSAIL kernels are dispatched to optimal hardware without host intervention for low-level scheduling.[6]
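As a sketch of dependency tracking with barrier packets, the fragment below enqueues a barrier-AND packet that holds back subsequent packets in the queue until five dependent signals reach zero (queue and signals are assumed to exist):

```c
#include <string.h>
#include <hsa.h>

// Sketch: enqueue a barrier-AND packet; the packet processor will not
// proceed past it until all non-null dep_signal entries reach zero.
void enqueue_barrier(hsa_queue_t *queue, hsa_signal_t deps[5]) {
    uint64_t id = hsa_queue_add_write_index_relaxed(queue, 1);
    hsa_barrier_and_packet_t *pkt =
        &((hsa_barrier_and_packet_t *)queue->base_address)[id & (queue->size - 1)];
    memset(pkt, 0, sizeof(*pkt));
    for (int i = 0; i < 5; ++i) pkt->dep_signal[i] = deps[i]; // wait on all

    uint16_t header = (HSA_PACKET_TYPE_BARRIER_AND << HSA_PACKET_HEADER_TYPE)
                    | (1 << HSA_PACKET_HEADER_BARRIER);       // fence the queue
    __atomic_store_n((uint16_t *)pkt, header, __ATOMIC_RELEASE);
    hsa_signal_store_screlease(queue->doorbell_signal, (hsa_signal_value_t)id);
}
```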
System Architecture
Component Diagrams
Heterogeneous System Architecture (HSA) employs block diagrams to depict the high-level system-on-chip (SoC) layout, illustrating the integration of central processing units (CPUs), graphics processing units (GPUs), input-output memory management units (IOMMUs), and the shared memory hierarchy. A representative simple HSA platform diagram shows a single node configuration where the CPU and integrated GPU act as agents connected via hubs, with unified memory accessible through a flat address space and IOMMU handling translations for coherent access across components.[2] In more advanced topologies, diagrams extend to multi-socket CPUs or application processing units (APUs) paired with discrete multi-board GPUs, incorporating multiple memory nodes and interconnect hubs to manage data movement and synchronization.[2] Central to these diagrams are agents, which represent computational units such as CPUs and GPUs capable of issuing and consuming architecture queue language (AQL) packets for task dispatch, and hubs, which serve as interconnects facilitating communication between agents, memory resources, and I/O devices.[2] HSA defines device profiles to standardize component capabilities: the full profile supports advanced features like multiple active queues and a minimum 4 KB kernarg segment for kernel arguments, while the minimal profile (or base profile) limits devices to one active queue but maintains the same kernarg size for basic compatibility.[2] These elements ensure scalable integration, with diagrams highlighting how agents interact within a unified virtual address space of at least 48 bits on 64-bit systems.[2] Flowcharts in HSA documentation outline the dispatch process from host to agents, beginning with the host allocating an AQL packet slot in a queue by incrementing a write index, populating the packet with task details like kernel objects and arguments, and signaling a doorbell to notify the packet processor.[2] A descriptive walkthrough of data flow from a CPU queue to a GPU execution unit involves the CPU enqueuing a kernel dispatch packet in user-mode queue format, which includes fields for grid and workgroup sizes, private and group segment sizes, kernarg address, and a completion signal; the packet processor then launches the task with an acquire fence for memory ordering, the GPU executes the kernel, and completion triggers a release fence followed by signaling back to the host.[2] For instance, a simple kernel dispatch diagram might illustrate this as a linear flowchart: host packet creation → queue submission → processor launch → agent execution → completion notification, emphasizing the asynchronous nature without CPU intervention during execution.[2] Diagrams also account for variations between integrated and discrete GPU setups. In integrated configurations, a single-node diagram depicts the CPU and GPU sharing low-latency memory directly via hubs, promoting tight coupling for efficient data sharing.[2] Conversely, discrete GPU diagrams show multi-node arrangements where the GPU resides on a separate board, relying on IOMMUs and higher-latency interconnects for memory access across distinct pools, as seen in multi-board topologies.[2] These visual representations underscore HSA's flexibility in supporting diverse hardware layouts while maintaining a coherent system view.[2]
Hardware-Software Interfaces
The hardware-software interfaces in Heterogeneous System Architecture (HSA) are defined primarily through the HSA Runtime API and the HSA Platform System Architecture Specification, which provide standardized mechanisms for software to discover, query, and interact with hardware agents such as CPUs and GPUs. Central to these interfaces is agent enumeration, achieved via thehsa_iterate_agents function, which allows applications to traverse all available agents by invoking a user-provided callback for each one, enabling identification of kernel-capable agents through checks like HSA_AGENT_FEATURE_KERNEL_DISPATCH. Once enumerated, the hsa_agent_get_info function queries detailed capabilities, such as agent type (HSA_AGENT_INFO_DEVICE), supported features (HSA_AGENT_INFO_FEATURE), node affiliation (HSA_AGENT_INFO_NODE), and compute unit count (HSA_AGENT_INFO_COMPUTE_UNITS), facilitating topology-aware software configuration without vendor-specific code. These APIs ensure that software can dynamically adapt to the underlying hardware, supporting unified access across heterogeneous components.[6]
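A short example of this discovery pattern, selecting the first agent that supports kernel dispatch:

```c
#include <stdio.h>
#include <hsa.h>

// Sketch: visit every agent, print its type, and remember the first one
// that can execute kernel dispatch packets (i.e., a compute-capable agent).
static hsa_status_t pick_kernel_agent(hsa_agent_t agent, void *data) {
    hsa_device_type_t type;
    hsa_agent_get_info(agent, HSA_AGENT_INFO_DEVICE, &type);

    uint32_t features = 0;
    hsa_agent_get_info(agent, HSA_AGENT_INFO_FEATURE, &features);
    printf("agent type %d, dispatch-capable: %d\n",
           (int)type, (features & HSA_AGENT_FEATURE_KERNEL_DISPATCH) != 0);

    if (features & HSA_AGENT_FEATURE_KERNEL_DISPATCH) {
        *(hsa_agent_t *)data = agent;
        return HSA_STATUS_INFO_BREAK; // stop iteration: found our target
    }
    return HSA_STATUS_SUCCESS;
}

// Usage: hsa_agent_t gpu; hsa_iterate_agents(pick_kernel_agent, &gpu);
```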
HSA specifies two compliance profiles to balance functionality and implementation complexity: the Full Profile and the Minimal Profile. The Full Profile (HSA_PROFILE_FULL) mandates support for advanced features, including coherent shared virtual memory across all agents, fine-grained memory access semantics for kernel arguments from any region, indirect function calls, image objects, and sampler resources, along with the ability to process multiple active queue packets simultaneously and detect floating-point exceptions. In contrast, the Minimal Profile (HSA_PROFILE_BASE) provides core compute capabilities with restrictions, such as limiting fine-grained memory semantics to HSA-allocated buffers, supporting only a single active queue packet per queue, and omitting advanced constructs like images or full exception detection, making it suitable for basic heterogeneous acceleration without requiring platform-wide coherence. Profile support for an agent's instruction set architecture (ISA) is queried via HSA_ISA_INFO_PROFILES using hsa_isa_get_info, allowing software to select compatible code paths. Kernel agents must support floating-point operations compliant with IEEE 754-2008 in both profiles, though the Full Profile requires additional exception handling via the DETECT mode.[6][2]
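Profile-dependent code paths can then branch on a simple query, for instance:

```c
#include <hsa.h>

// Sketch: choose a code path based on the agent's supported profile.
int supports_full_profile(hsa_agent_t agent) {
    hsa_profile_t profile;
    hsa_agent_get_info(agent, HSA_AGENT_INFO_PROFILE, &profile);
    return profile == HSA_PROFILE_FULL; // otherwise HSA_PROFILE_BASE
}
```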
Extensions in HSA introduce optional features to extend base functionality while maintaining core compatibility, queried through hsa_system_get_info with HSA_SYSTEM_INFO_EXTENSIONS or hsa_system_extension_supported for specific support. Examples include the Images extension for texture handling via hsa_ext_sampler_create, performance counters for runtime profiling, and profile events for tracking execution. Debug support is provided optionally through infrastructure for heterogeneous debugging, such as DWARF extensions integrated with HSA agents. Versioning ensures backward compatibility, with runtime and agent versions accessible via HSA_SYSTEM_INFO_VERSION_MAJOR/MINOR and HSA_AGENT_INFO_VERSION_MAJOR/MINOR in hsa_agent_get_info, while extensions use versioned function tables (e.g., hsa_ext_finalizer_1_00_pfn_t) and macros (e.g., #define hsa_ven_hal_foo 001001) to allow incremental adoption without breaking existing code.[6][2]
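A minimal sketch of such an extension probe via the system-level query:

```c
#include <stdbool.h>
#include <hsa.h>

// Sketch: test whether the images extension (version 1.0) is available
// before using any image or sampler functionality.
bool has_images_extension(void) {
    bool result = false;
    hsa_system_extension_supported(HSA_EXTENSION_IMAGES, 1, 0, &result);
    return result;
}
```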
These interfaces promote interoperability and portability by standardizing interactions across compliant hardware from multiple vendors, using mechanisms like Architected Queuing Language (AQL) packets for queue-based dispatch (hsa_queue_create), signals for synchronization (hsa_signal_create with consumer agents), and a flat memory model for consistent access. For instance, signals specify consuming agents during creation to enforce visibility and ordering, enabling cross-agent completion notifications without CPU intervention. This design abstracts hardware differences, allowing a single HSA-compliant application to run portably on diverse platforms, such as AMD or ARM-based systems, by relying on runtime queries and standard APIs rather than vendor-specific drivers. Runtime initialization, handled via the HSA dispatcher, briefly leverages these interfaces for initial agent discovery but defers detailed operations to application code.[6][2]
Software Ecosystem
Programming Models and APIs
Heterogeneous System Architecture (HSA) provides programming models that enable developers to write portable code for heterogeneous systems, integrating CPUs, GPUs, and other accelerators through a unified approach. The primary model leverages standard languages like C/C++, with support for parallelism through frameworks such as HIP (Heterogeneous-compute Interface for Portability) and SYCL, which map to HSA runtime APIs. This unified model treats all compute agents uniformly, using shared pointers and a single address space to simplify development across diverse hardware.[22] HSA also supports kernel-based programming reminiscent of OpenCL, where developers define kernels in HSA Intermediate Language (HSAIL) for data-parallel execution. Kernels are structured with work-groups and work-items in up to three dimensions, supporting features like dynamic shared memory allocation in group segments and parallel loop pragmas (e.g., #pragma hsa loop parallel). These kernels handle vector operations, image processing, and other compute-intensive tasks, with arguments passed via kernel argument blocks for efficient dispatch.[22]
The core HSA runtime APIs form the foundation for application development, providing functions to initialize the environment, manage queues, and load executables. Initialization begins with hsa_init(), which prepares the runtime by incrementing a reference counter, followed by hsa_shut_down() to release resources upon completion. Queue creation uses hsa_queue_create(), specifying an agent, queue size (a power of 2), type (e.g., single or multi), and optional callbacks for event handling. Kernel loading and execution are enabled via hsa_executable_create(), which assembles code objects into an executable for a target profile (e.g., full or base) and state (e.g., unfrozen for loading). These APIs ensure low-overhead dispatch of Architecture Queue Language (AQL) packets for kernels or barriers.[6]
A representative example is dispatching a vector addition kernel, which demonstrates queue setup, packet preparation, and signal-based synchronization. The following C code snippet initializes the runtime, creates a queue on a kernel agent, dispatches the kernel over a one-dimensional grid of 256 work-items in a single 256-wide work-group, and waits for completion using a signal:
```c
#include <string.h>
#include <hsa.h>

// The agent is assumed to have been obtained earlier via hsa_iterate_agents
// (a kernel-dispatch-capable agent); hsa_init here just bumps the refcount.
hsa_status_t vector_add_example(hsa_agent_t agent) {
    hsa_status_t status = hsa_init();
    if (status != HSA_STATUS_SUCCESS) return status;

    hsa_queue_t *queue;
    status = hsa_queue_create(agent, 1024, HSA_QUEUE_TYPE_SINGLE, NULL, NULL,
                              UINT32_MAX, UINT32_MAX, &queue);
    if (status != HSA_STATUS_SUCCESS) {
        hsa_shut_down();
        return status;
    }

    // Reserve a packet slot; the queue is a ring buffer, so mask by size - 1.
    uint64_t packet_id = hsa_queue_add_write_index_relaxed(queue, 1);
    hsa_kernel_dispatch_packet_t *packet =
        &((hsa_kernel_dispatch_packet_t *)queue->base_address)[packet_id & (queue->size - 1)];

    memset(packet, 0, sizeof(*packet));
    packet->setup = 1 << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS; // 1-D grid
    packet->workgroup_size_x = 256;
    packet->workgroup_size_y = 1;
    packet->workgroup_size_z = 1;
    packet->grid_size_x = 256;
    packet->grid_size_y = 1;
    packet->grid_size_z = 1;
    packet->kernel_object = 0; // Placeholder: a real handle comes from a frozen executable
    packet->private_segment_size = 0;
    packet->group_segment_size = 0;

    hsa_signal_t signal;
    status = hsa_signal_create(1, 0, NULL, &signal);
    if (status != HSA_STATUS_SUCCESS) {
        hsa_queue_destroy(queue);
        hsa_shut_down();
        return status;
    }
    packet->completion_signal = signal;

    // Publish the packet: write the header last, with release semantics,
    // then ring the doorbell so the packet processor picks it up.
    uint16_t header = (HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE)
                    | (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_SCACQUIRE_FENCE_SCOPE)
                    | (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_SCRELEASE_FENCE_SCOPE);
    __atomic_store_n((uint16_t *)packet, header, __ATOMIC_RELEASE);
    hsa_signal_store_screlease(queue->doorbell_signal, (hsa_signal_value_t)packet_id);

    // Block until the packet processor decrements the completion signal to 0.
    hsa_signal_wait_scacquire(signal, HSA_SIGNAL_CONDITION_EQ, 0,
                              UINT64_MAX, HSA_WAIT_STATE_ACTIVE);

    hsa_signal_destroy(signal);
    hsa_queue_destroy(queue);
    hsa_shut_down();
    return HSA_STATUS_SUCCESS;
}
```
In this example, hsa_signal_create initializes a completion signal, hsa_signal_store_screlease triggers dispatch via the queue doorbell, and hsa_signal_wait_scacquire blocks until the kernel finishes, ensuring ordered memory access across agents.[6]
HSA's APIs promote portability by abstracting hardware variations through agent queries (e.g., via hsa_iterate_agents), standardized memory segments (global, private, group), and profile-based guarantees for features like image support or wavefront sizes. This abstraction allows code to run unchanged across vendors, with integration into higher-level frameworks like HIP or SYCL, which map their dispatches to HSA queues and executables for broader ecosystem compatibility.[22][6]
