Advanced Matrix Extensions
Advanced Matrix Extensions (AMX), also known as Intel Advanced Matrix Extensions (Intel AMX), are extensions to the x86 instruction set architecture (ISA) for Intel microprocessors, designed to operate on matrices in order to accelerate artificial intelligence (AI) and machine learning (ML) workloads.[1] In particular, they perform matrix multiplication at the hardware level, making them well suited to problems and algorithms built around matrix multiplication.
Extensions
AMX was introduced by Intel in June 2020 and first supported by Intel with the Sapphire Rapids microarchitecture for Xeon servers, released in January 2023.[2][3] It introduced 2-dimensional registers called tiles upon which accelerators can perform operations. It is intended as an extensible architecture; the first accelerator implemented is called the tile matrix multiply unit (TMUL).[4][5]
In Intel Architecture Instruction Set Extensions and Future Features revision 46, published in September 2022, a new AMX-FP16 extension was documented. This extension adds support for half-precision floating-point numbers. In revision 48 from March 2023, AMX-COMPLEX was documented, adding support for half-precision floating-point complex numbers. Both extensions are available in the Granite Rapids line of server processors, with AMX-COMPLEX support only available in Granite Rapids-D.[6]
Tile matrix multiply unit
The TMUL unit supports BF16 and INT8 input types.[7] AMX-FP16 and AMX-COMPLEX also add support for real and complex FP16 numbers. The register file consists of 8 tiles, each with 16 rows of 64 bytes (32 BF16/FP16 or 64 INT8 elements). The only supported operation is matrix multiply and accumulate (MMA), which is the extension of the fused multiply–add (FMA) operation for scalar values as applied to matrix operands:[8]

$C_{i,j} \leftarrow C_{i,j} + \sum_{k} A_{i,k} B_{k,j}$
A 4th Gen Intel Xeon Scalable processor core can perform 2048 INT8 or 1024 BF16 operations per cycle:[9][10] the maximal input sizes are 16 × J elements for A and J × 16 for B, where J is 64 for INT8 and 32 for BF16. The matrix multiplication requires 16 × 16 × J multiplications and as many additions, thus performing 2 × 16 × 16 × J operations (32,768 for INT8, 16,384 for BF16) in 16 cycles.[10]
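As an illustration of the arithmetic involved, the following C sketch performs the same multiply-accumulate that a single TMUL INT8 tile operation carries out; it mirrors the 16 × 16 × 64 shape described above but says nothing about the packed memory layout the real instruction expects.

```c
#include <stdint.h>

/* Scalar reference for the INT8 tile multiply-accumulate that TMUL performs
 * in hardware: C (16x16 int32) += A (16x64 int8) * B (64x16 int8).
 * Dimension names follow the text above (J = 64 for INT8); this illustrates
 * the arithmetic only, not how the instruction is invoked or how B is
 * re-packed in memory. */
void mma_int8_reference(int32_t C[16][16],
                        const int8_t A[16][64],
                        const int8_t B[64][16])
{
    for (int i = 0; i < 16; i++)
        for (int j = 0; j < 16; j++)
            for (int k = 0; k < 64; k++)
                C[i][j] += (int32_t)A[i][k] * (int32_t)B[k][j];
}
```

The triple loop performs 16 × 16 × 64 = 16,384 multiplications and as many additions, matching the 32,768 operations the text attributes to 16 cycles of TMUL execution.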
Software support
- Compiler and assembler support
- Operating system support
- glibc support for detecting the AMX feature in CPUs, committed on 25 June 2020[19]
- Linux kernel support since version 5.16[20]
- VMware vSphere support for AMX in virtual machines released in ESXi version 8.0u1 for VMs using Hardware Version 20[21]
References
- ^ Hemsoth, Nicole (August 19, 2021). "With AMX, Intel Adds AI/ML Sparkle to Sapphire Rapids". The Next Platform.
- ^ online, heise (28 June 2020). "Intel AMX: Erste Informationen zur Advanced Matrix Extensions Architecture". heise online.
- ^ Cutress, Ian. "Intel Xeon Sapphire Rapids: How To Go Monolithic with Tiles". AnandTech. Archived from the original on August 31, 2021.
- ^ a b "Intel® Architecture Instruction Set Extensions and Future Features".
- ^ Schor, David (June 29, 2020). "The x86 Advanced Matrix Extension (AMX) Brings Matrix Operations; To Debut with Sapphire Rapids".
- ^ Larabel, Michael (July 12, 2023). "Intel Granite Rapids D Support Merged Into GCC 14". Phoronix.
- ^ "Advanced Matrix Extension (AMX) - x86 - WikiChip". en.wikichip.org.
- ^ "Generalized acceleration of matrix multiply accumulate operations". patents.google.com.
- ^ "Accelerate Artificial Intelligence (AI) Workloads with Intel Advanced Matrix Extensions (Intel AMX)" (PDF). Intel. Retrieved 2023-04-13.
- ^ a b "Intel® 64 and IA-32 Architectures Optimization Reference Manual Volume 1". Intel.
- ^ "What's New in LLVM for 4th Gen Intel® Xeon® & Max Series CPUs". Retrieved 21 April 2023.
- ^ Larabel, Michael (2020-07-02). "Intel AMX Support Begins Landing In LLVM". Phoronix. Retrieved 2020-07-02.
- ^ "[X86-64] Support Intel AMX instructions". GitHub. 2020-07-02. Retrieved 2020-07-02.
- ^ a b Larabel, Michael (2020-07-02). "Intel AMX Support Lands In The GNU Assembler". Phoronix. Retrieved 2020-07-02.
- ^ "GCC 11 Release Series — Changes, New Features, and Fixes - GNU Project". Retrieved 21 April 2023.
- ^ "[PATCH] Enable GCC support for AMX". 2020-07-06. Retrieved 2020-07-09.
- ^ "Enable GCC support for AMX-TILE, AMX-INT8, AMX-BF16. · gcc-mirror/gcc@5c60984". GitHub. Retrieved 2022-09-05.
- ^ "commits with Intel AMX". 2020-07-02. Retrieved 2020-07-02.
- ^ "x86: Detect Intel Advanced Matrix Extensions". 2020-07-02. Retrieved 2020-07-02.
- ^ "Linux 5.16 Features Include FUTEX2, Intel AMX, Folios, DG2/Alchemist, More Apple Silicon Support". Phoronix.
- ^ "Accessing Sapphire Rapids AMX instructions on vSphere". Earl C. Ruby III. 2023-08-24.
Advanced Matrix Extensions
History
Announcement and Development
Advanced Matrix Extensions (AMX) were first introduced by Intel in June 2020, with further details presented at the company's Architecture Day event on August 13, 2020, where the technology was described as a key component of built-in AI acceleration for future processors.[5] Developed as an extension to the x86 instruction set architecture, AMX aims to accelerate the matrix multiplication and accumulation operations central to artificial intelligence and machine learning workloads on general-purpose CPUs. It specifically addresses the inefficiencies of prior vector-based instructions like AVX-512 in handling the dense, two-dimensional data patterns common in deep learning models, such as those involving neural network training.[6]

The primary motivations for AMX's creation were to enable competitive CPU-based performance for AI training and inference, diminishing reliance on discrete accelerators and supporting cost-effective deployments in data centers and edge computing scenarios.[1] From its inception, AMX was designed for integration into Intel's oneAPI software ecosystem to facilitate portable, high-level programming across heterogeneous hardware. Early development included prototypes in Intel's labs, with the company expressing readiness to sample hardware to partners shortly after the announcement, emphasizing a tile-based approach for improved data handling efficiency.[7][8]

Initial Release
The initial hardware implementation of Advanced Matrix Extensions (AMX) debuted with the 4th Generation Intel Xeon Scalable processors, codenamed Sapphire Rapids, which were released on January 10, 2023.[3] These server-grade processors marked the first integration of AMX into production silicon, providing dedicated acceleration for matrix multiplication operations critical to AI and high-performance computing workloads.[9]

AMX is implemented as a dedicated hardware block embedded within each CPU core of the Sapphire Rapids architecture, operating alongside existing execution units to handle tiled matrix computations efficiently. This integration relies on updated CPU microcode for instruction decoding and execution, with activation controlled through operating system flags and BIOS configurations to ensure compatibility and security.[10] The rollout of AMX in Sapphire Rapids aligned with Intel's strategic emphasis on accelerating AI inference and training in data center environments, enabling scalable performance gains for enterprise server deployments without requiring discrete accelerators.[3] Enabling AMX typically involves selecting the "Intel Advanced Matrix Extensions" option in the system BIOS under processor features, followed by verification in supported operating systems like Linux, where kernel versions 5.16 and later detect and enable the feature at runtime without mandatory boot parameters.[10][11]

Subsequent Extensions
Following the initial release of Advanced Matrix Extensions (AMX) with the 4th Generation Intel Xeon Scalable processors (Sapphire Rapids) in early 2023, Intel introduced enhancements to broaden its applicability in AI and high-performance computing workloads. In September 2022, Intel documented the AMX-FP16 extension in revision 46 of the Intel Architecture Instruction Set Extensions and Future Features Programming Reference, adding support for half-precision floating-point (FP16) operations to accelerate AI inference tasks while balancing precision and efficiency.[12] This extension enables processing of FP16-trained models, complementing the original INT8 and BF16 formats by reducing memory bandwidth demands in neural network deployments.[1] In December 2022, Intel further expanded AMX capabilities with the AMX-COMPLEX extension, documented in revision 047 of the same reference manual, which introduces operations on half-precision complex numbers.[12] Designed for signal processing, scientific simulations, and quantum computing applications, this extension supports complex matrix multiplications essential for Fourier transforms and other domain-specific computations.[13] AMX-COMPLEX first became available in the Granite Rapids-D processors, a variant of the 6th Generation Intel Xeon Scalable family launched in 2025.[14]

AMX support expanded across subsequent processor generations, enhancing scalability in data center environments. The 5th Generation Intel Xeon Scalable processors (Emerald Rapids), announced in December 2023, integrated baseline AMX along with FP16 capabilities on every core, enabling seamless upgrades from prior generations while optimizing for AI acceleration. Building on this, the 6th Generation Intel Xeon Scalable processors (Granite Rapids), released in September 2024, incorporated full AMX support including the FP16 and COMPLEX extensions, delivering up to 2,048 FP16 operations per cycle per core for improved throughput in mixed-precision workloads.[14] By 2025, AMX integration extended to Xeon 6 processors with P-cores, as seen in new SKUs like the Xeon 6776P announced in May 2025, which added enhanced FP16 support to facilitate on-chip AI processing in edge and cloud scenarios.[15] In October 2025, Intel and AMD announced an agreement to support AMX in future x86 processors, promoting standardization and wider adoption across the ecosystem.[16]

As of November 2025, Intel's oneAPI Base Toolkit version 2025 includes optimizations for AMX interoperability with AVX10.2, such as early access to extended instruction sets in the oneDNN library, to streamline development for hybrid CPU-GPU AI pipelines.[17] However, no major new AMX extensions have been announced, with focus shifting toward software ecosystem maturity and compatibility with emerging standards like APX.[18]

Architecture
Tile Registers
The tile registers form the core storage mechanism in Intel's Advanced Matrix Extensions (AMX), providing a dedicated register file for holding matrix data during computations. This register file consists of eight two-dimensional registers named tmm0 through tmm7, each capable of storing up to 1 KB (1024 bytes) of data, for a total capacity of 8 KB across all tiles.[19][1] These registers are used to hold sub-matrices representing source operands A and B, as well as the accumulator matrix C, enabling efficient data staging for operations like matrix multiplication.[19]

Each tile register supports a configurable two-dimensional layout, allowing flexible matrix shapes tailored to specific workloads. The dimensions are defined in terms of rows and columns measured in bytes, with a maximum of 16 rows and 64 bytes per row (e.g., 16 rows by 64 bytes for matrix A).[19] This configuration enables tiles to represent portions of larger matrices, with the actual element count per row depending on the data format used by the compute instructions.[19] The registers are volatile, meaning their contents are not preserved across context switches or interrupts unless explicitly saved.[20]

The tile configuration is managed through a 64-byte tile configuration state (TILECFG), which specifies the palette in use and the row and column-byte dimensions for each individual tile.[19] This configuration is loaded using the LDTILECFG instruction or restored via XRSTOR, and the supported palettes and their limits can be queried through CPUID leaf 1DH (sub-leaf ECX=1 for palette 1).[19] Each palette (e.g., palette 1 for standard AMX configurations) defines maximum dimensions and total tile bytes to optimize for various matrix sizes.[19]

Data is loaded into the tile registers primarily from system memory using the TILELOADD instruction, which transfers elements into the specified tile based on the configured dimensions and starting address.[19] For example, TILELOADD can load a rectangular block of data directly into tmm0 from a memory operand, supporting both contiguous and strided access patterns to accommodate row-major or column-major matrix layouts.[19] This mechanism allows efficient population of tiles before computations, minimizing data movement overhead.[1]
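The 64-byte configuration block that LDTILECFG consumes (and STTILECFG produces) can be modeled as a plain C struct. The sketch below follows the palette-1 layout summarized above; field names are illustrative, and the offsets should be checked against Intel's Software Developer's Manual before relying on them.

```c
#include <stdint.h>

/* Sketch of the 64-byte tile-configuration block consumed by LDTILECFG
 * (and produced by STTILECFG). Offsets follow Intel's published palette-1
 * layout; verify against the SDM before use. */
typedef struct __attribute__((packed)) {
    uint8_t  palette_id;     /* 0 = init/clear, 1 = standard palette        */
    uint8_t  start_row;      /* used to restart interrupted loads/stores    */
    uint8_t  reserved[14];
    uint16_t colsb[16];      /* bytes per row for tmm0..tmm7 (rest unused)  */
    uint8_t  rows[16];       /* rows for tmm0..tmm7 (rest unused)           */
} tile_config_t;

_Static_assert(sizeof(tile_config_t) == 64, "config block must be 64 bytes");
```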
Matrix Multiply Unit
The Tile Matrix Multiply Unit (TMUL) serves as the dedicated hardware accelerator within Intel's Advanced Matrix Extensions (AMX) for executing dense matrix multiply-accumulate (MMA) operations, enabling efficient computation of large-scale matrix products essential for AI workloads.[1] This unit processes data stored in the AMX tile registers, focusing on fused multiply-add operations to minimize latency and maximize throughput in deep learning training and inference tasks.[6]

In terms of operation flow, the TMUL takes two input tiles—A with dimensions M × K and B with dimensions K × N—and performs the accumulation C ← C + A·B, where the inner dimension K aligns with the configured tile widths for seamless multiplication without intermediate transpositions.[6] This design supports block-wise computation of larger matrices by iterating over tiled submatrices, leveraging the 2D structure of the tiles to handle up to 1 KB per tile in a single instruction.[1]

The TMUL is implemented as a pipelined execution block integrated directly into each CPU core of supported processors, such as the 4th Generation Intel Xeon Scalable family (Sapphire Rapids microarchitecture), allowing concurrent operation alongside scalar and vector execution pipelines.[9] It functions independently of the AVX-512 vector units, dedicating separate hardware resources—including a grid of fused multiply-add units—to matrix operations, which isolates AI compute from general-purpose processing and reduces contention for shared resources like caches.[21]

Regarding throughput, in 4th Generation Xeon implementations, the TMUL delivers up to 2048 INT8 operations per cycle for inference-focused workloads or 1024 BF16 operations per cycle for training, representing an 8x improvement for INT8 and 16x for BF16 over AVX-512 VNNI and FP32 equivalents, respectively.[21] This performance stems from the unit's parallel processing of tile elements, with each cycle advancing the multiply-accumulate across the full tile dimensions without scalar overhead.[1]
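The block-wise decomposition described above can be sketched as a plain loop nest. The call to tmul_tile_kernel below is a hypothetical placeholder for the per-tile load/compute/store sequence shown later in the Instruction Set section; only the blocking structure is the point here.

```c
/* Schematic blocking loop: a large C (M x N) += A (M x K) * B (K x N) is
 * decomposed into TMUL-sized sub-problems. BLK_K is 32 for BF16 or 64 for
 * INT8, following the tile limits described above. */
enum { BLK_M = 16, BLK_N = 16, BLK_K = 32 };

/* hypothetical per-tile step: load A/B sub-tiles, TDP-accumulate, store C */
static void tmul_tile_kernel(int i, int j, int k)
{
    (void)i; (void)j; (void)k;   /* placeholder body */
}

void blocked_gemm(int M, int N, int K)
{
    for (int i = 0; i < M; i += BLK_M)
        for (int j = 0; j < N; j += BLK_N)
            for (int k = 0; k < K; k += BLK_K)
                tmul_tile_kernel(i, j, k);   /* accumulate one tile block */
}
```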
Data Formats
Advanced Matrix Extensions (AMX) primarily support bfloat16 (BF16) and 8-bit integer (INT8) as the core data formats for matrix elements, optimized for deep learning workloads. BF16 is a 16-bit floating-point format featuring 1 sign bit, 8 exponent bits, and 7 mantissa bits, preserving the exponent range of single-precision floating-point (FP32) to handle the dynamic ranges common in neural network training and inference while reducing memory bandwidth and computational overhead.[22] INT8, available in signed and unsigned variants, facilitates quantized inference by representing weights and activations with lower precision, achieving significant speedups in post-training quantization scenarios without substantial accuracy degradation in many models.[1][8]

Subsequent extensions expand format support: AMX-FP16 introduces IEEE 754 half-precision floating-point (FP16), a 16-bit format with 1 sign bit, 5 exponent bits, and 10 mantissa bits, suitable for scenarios requiring higher precision in the mantissa than BF16. These extensions, including AMX-FP16 and AMX-COMPLEX, were introduced in the 6th Generation Intel Xeon Scalable processors (Granite Rapids) in 2024.[1][23] AMX-COMPLEX enables operations on complex numbers, where each complex element packs two FP16 values—one for the real part and one for the imaginary part—into a 32-bit dword, targeting applications in signal processing and scientific computing that involve Fourier transforms or other complex-domain computations.[23]

Matrix elements are packed into tiles in row-major order, with each row spanning up to 64 bytes (32 BF16/FP16 elements or 64 INT8 elements), ensuring contiguous storage for efficient loading from memory.[24] Accumulations in matrix multiply operations incorporate saturation modes for INT8 to clamp results within representable ranges, preventing overflow, and round-to-nearest-even rounding for BF16 outputs, with denormals flushed to zero independent of the floating-point environment.[25] These formats are configured per tile to match the element type, influencing the interpretable dimensions and palette used for loading data.[26]

The selection of BF16 and INT8 aligns closely with requirements in major AI frameworks like TensorFlow and PyTorch, where BF16 supports mixed-precision training to accelerate convergence and inference while mitigating the precision loss seen in traditional FP16, and INT8 enables deployment of compressed models on resource-constrained systems.[27][8]
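A small, self-contained illustration of the BF16 layout described above: converting between FP32 and BF16 by keeping the high 16 bits, with round-to-nearest-even. This is a generic sketch of the format, not Intel library code, and NaN handling is omitted for brevity.

```c
#include <stdint.h>
#include <string.h>

/* BF16 keeps the upper 16 bits of an IEEE-754 binary32 value
 * (1 sign bit, 8 exponent bits, 7 mantissa bits). */
static inline uint16_t fp32_to_bf16(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof(bits));
    /* round to nearest even on the 16 bits being discarded (NaNs not
     * treated specially in this sketch) */
    uint32_t rounding = 0x7FFFu + ((bits >> 16) & 1u);
    return (uint16_t)((bits + rounding) >> 16);
}

static inline float bf16_to_fp32(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;   /* re-expand by zero-filling */
    float f;
    memcpy(&f, &bits, sizeof(f));
    return f;
}
```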
Instruction Set
Configuration Instructions
The configuration of Advanced Matrix Extensions (AMX) tiles begins with the LDTILECFG instruction, which loads a 64-byte tile configuration descriptor from memory into the TILECFG state register.[28] This descriptor specifies the palette identifier (0 for initialization or 1 for standard configuration), the starting row index (initially 0), and per-tile parameters including the number of rows and bytes per row for each of the eight source or accumulator tiles.[26] The palette determines the maximum tile dimensions, such as up to 16 rows and 64 bytes per row, enabling flexible matrix shapes while adhering to the hardware's 1 KB per tile limit (8 tiles total, 8 KB aggregate).[28] Upon execution, LDTILECFG validates the descriptor; invalid palettes or dimensions trigger a general protection fault (#GP), and all tiles are zeroed if the palette is 0, preparing the architecture for subsequent operations.[28] The complementary STTILECFG instruction stores the current TILECFG state to a 64-byte memory location, writing zeros if no configuration is active, which is essential for querying or persisting setup details.[29]

Data movement into and out of tiles is handled by dedicated load and store instructions that support both 1D (contiguous) and 2D (strided) memory layouts via Scale-Index-Base (SIB) addressing. The TILELOADD instruction loads tile data from memory into a specified tile register (TMM), using the base address plus an optional index register as the row stride; if no index is provided, a stride of zero implies 1D loading.[30] It respects the current TILECFG dimensions, filling rows sequentially from the start_row index (updated on partial loads due to page faults for restartability), and zeros any unfilled upper rows beyond the configured size.[30] A variant, TILELOADDT1, adds a temporal data hint to optimize caching for repeated accesses. For example, loading a tile of 16 rows of 32 BF16 elements uses a stride matching the row byte count (64 bytes), allowing efficient transfer of up to 1 KB per tile. The TILESTORED instruction performs the reverse, storing a tile's contents to memory starting from the configured start_row, again using SIB for strided 2D access and resetting start_row to zero upon completion.[31] Both load and store operations are synchronous with the instruction stream, maintain cache coherency, and raise AMX-E3 exceptions for tile architecture issues, with Intel TSX transactions aborting on execution.[31]

To prepare tiles for computation, particularly accumulators, the TILEZERO instruction clears all bytes in a specified tile register to zero, using the current palette to determine the extent (rows and bytes per row).[32] This operation is atomic, affects no flags, and is crucial for initializing accumulators before matrix multiplications, ensuring no residual data interferes with results; it raises AMX-E5 exceptions if the tile state is invalid. For instance, zeroing an accumulator tile configured with 8 rows of 64 bytes resets its 512 bytes efficiently in a single instruction.[32]

State management for context switching in multitasking environments involves saving and restoring both the configuration and tile data, either through the XSAVE feature set (the XTILECFG and XTILEDATA state components) or explicitly in software. The TILECFG state is persisted via STTILECFG to memory, while tile data requires explicit stores with TILESTORED to capture the full 8 KB register file. Restoration reverses this: LDTILECFG reloads the configuration, followed by TILELOADD to repopulate tiles from saved memory. This approach, with the XTILECFG and XTILEDATA components enabled by the operating system (via arch_prctl or equivalent), ensures complete AMX state migration without hardware DMA, though it incurs latency due to the large data volume; Intel recommends minimizing switches in performance-critical paths.[26]
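A minimal sketch of the configuration and data-movement flow using the intrinsics that GCC and Clang expose in immintrin.h (compile with -mamx-tile on a supported toolchain). Tile numbers, shapes, and buffer names are illustrative, and on Linux the process must first request AMX permission as shown in the operating-system example further below.

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

#define ROWS  16
#define COLSB 64             /* bytes per row: 64 INT8 or 32 BF16 elements */

static void configure_tiles(void)
{
    uint8_t cfg[64];
    memset(cfg, 0, sizeof(cfg));
    cfg[0] = 1;                          /* palette 1 */
    /* tiles 0..2: 16 rows x 64 bytes each (offsets per palette-1 layout) */
    for (int t = 0; t < 3; t++) {
        cfg[16 + 2 * t] = COLSB;         /* colsb[t], low byte */
        cfg[48 + t]     = ROWS;          /* rows[t] */
    }
    _tile_loadconfig(cfg);               /* LDTILECFG */
}

/* src and dst must each provide ROWS * COLSB bytes */
void tile_copy_example(const int8_t *src, int8_t *dst)
{
    configure_tiles();
    _tile_zero(0);                       /* TILEZERO  tmm0            */
    _tile_loadd(1, src, COLSB);          /* TILELOADD tmm1 <- memory  */
    _tile_stored(1, dst, COLSB);         /* TILESTORED memory <- tmm1 */
    _tile_release();                     /* clear tile state          */
}
```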
Compute Instructions
The compute instructions in Advanced Matrix Extensions (AMX) form the core of its arithmetic capabilities, enabling efficient matrix multiply-accumulate operations on tile registers. These instructions, part of the TDP (Tile Dot Product) family, perform the fundamental computation C ← C + A·B, where A, B, and C represent matrices stored in 2D tile registers (TMM0 through TMM7), with dimensions configured via the tile palette. The accumulation occurs element-wise after multiplying corresponding rows of A (size M × K) and columns of B (size K × N), producing an updated C of size M × N. All operations use round-to-nearest-even (RNE) rounding and flush denormals to zero for both inputs and outputs, without raising exceptions or updating the MXCSR register.[19]

The TDP family includes variants tailored to specific data precisions, supporting low-precision formats common in AI and high-performance computing workloads. For bfloat16 (BF16) inputs, the TDPBF16PS instruction computes the dot product of BF16 elements from two source tiles and accumulates the results into a single-precision floating-point (FP32) tile. For example, each pair of BF16 values from the source tiles is multiplied and summed across the inner dimension K, with the output converted to FP32 for accumulation. Similarly, the TDPFP16PS instruction (part of the AMX-FP16 extension) handles FP16 inputs, performing the same multiply-accumulate semantics but treating FP16 pairs as the operands, again accumulating into FP32. These floating-point variants enable high-throughput operations for neural network layers, where BF16 and FP16 reduce memory bandwidth without significant accuracy loss.[33][19]

Integer variants target quantized models, with the AMX-INT8 instructions providing four options: TDPBSSD for signed-by-signed byte multiplications accumulating to 32-bit integers; TDPBSUD for signed-by-unsigned; TDPBUSD for unsigned-by-signed; and TDPBUUD for unsigned-by-unsigned. Each processes byte elements (four per 32-bit dword) from the source tiles, computing dot products with saturation to keep results within the representable range of the INT32 accumulator. For complex arithmetic (AMX-COMPLEX), the instructions TCMMRLFP16PS and TCMMIMFP16PS handle real and imaginary parts separately using FP16 components, accumulating the real and imaginary outputs into FP32 tiles and effectively realizing C ← C + A·B for complex-valued matrices.[34][19]

Subsequent extensions as of November 2025 include AMX-FP8 for 8-bit floating-point operations (e.g., the TDPBF8PS and TDPHF8PS instructions supporting the E4M3 and E5M2 formats) and AMX-TF32 for TensorFloat-32, enabling further optimizations for AI workloads on newer processors such as Intel Xeon 6 (Granite Rapids). AMX-TRANSPOSE was proposed for efficient matrix transposition but later discontinued.[35][36]

All TDP-family instructions are encoded with the VEX prefix in the 0F38 opcode map, with tile registers specified via the ModR/M and VEX.vvvv fields (tmm1 as destination, tmm2 and tmm3 as sources). The encoding embeds the 2D tile semantics, distinguishing AMX from scalar or 1D vector instructions.
Prior tile configuration via the configuration instructions is required to define the palette and dimensions before executing these compute operations. Support for each variant is enumerated via CPUID leaf 07H: the base features (AMX-TILE in EDX bit 24, AMX-BF16 in EDX bit 22, AMX-INT8 in EDX bit 25) at sub-leaf ECX=0, and AMX-FP16 (EAX bit 21) and AMX-COMPLEX (EDX bit 8) at sub-leaf ECX=1.[19][37]
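Assuming tiles have been configured and OS permission obtained as in the earlier sketches, the compute step itself reduces to a load/dot-product/store sequence. The function names and tile assignments below are illustrative; the intrinsics map to the TILELOADD, TDPBF16PS/TDPBSSD, and TILESTORED instructions described above (compile with -mamx-tile -mamx-bf16 -mamx-int8).

```c
#include <immintrin.h>
#include <stdint.h>

/* One BF16 tile step: tmm0 (FP32 accumulator) += tmm1 (BF16 A) * tmm2 (B,
 * stored in the re-packed layout the instruction expects). */
void tile_mma_bf16(const void *a, const void *b, float *c,
                   long stride_a, long stride_b, long stride_c)
{
    _tile_loadd(1, a, stride_a);     /* TILELOADD: A sub-tile            */
    _tile_loadd(2, b, stride_b);     /* TILELOADD: re-packed B sub-tile  */
    _tile_dpbf16ps(0, 1, 2);         /* TDPBF16PS: tmm0 += tmm1 * tmm2   */
    _tile_stored(0, c, stride_c);    /* TILESTORED: FP32 accumulator out */
}

/* Same shape for signed INT8 inputs accumulating into INT32. */
void tile_mma_int8(const void *a, const void *b, int32_t *c,
                   long stride_a, long stride_b, long stride_c)
{
    _tile_loadd(1, a, stride_a);
    _tile_loadd(2, b, stride_b);
    _tile_dpbssd(0, 1, 2);           /* TDPBSSD: signed x signed INT8    */
    _tile_stored(0, c, stride_c);
}
```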
Software Support
Operating Systems
Linux kernel version 5.16, released in January 2022, introduced support for Advanced Matrix Extensions (AMX), enabling the operating system to manage AMX state components through the XSAVE feature and allowing per-process enablement via the arch_prctl(2) system call with ARCH_REQ_XCOMP_PERM to request permission for dynamic XSTATE components like TILE_DATA.[38][39] This support ensures proper context switching and signal handling for AMX usage, with the kernel detecting AMX-capable processors at runtime and setting the AMX bits in XCR0 during boot.[40]

VMware ESXi 8.0 Update 1, released in 2023, provides full passthrough support for AMX instructions to virtual machines configured with virtual hardware version 20 or higher, allowing guest operating systems like Linux kernel 5.19 or later to utilize the extension without host intervention.[41] Windows 11, along with Windows Server 2022, enables AMX support on compatible Intel Xeon processors through kernel updates that handle AMX context saving and restoration, requiring driver-level feature detection to ensure compatibility; Windows 10 lacks this context save capability, preventing AMX usage.[42] Oracle Linux's Unbreakable Enterprise Kernel 7 Update 1, released in April 2023, includes runtime support for AMX on capable processors, automatically detecting the feature and enabling userspace access without additional configuration.[11] FreeBSD provides basic enablement for AMX through Intel microcode updates via the cpu-microcode-intel port, which applies processor firmware updates at boot to ensure compatibility on supported hardware, though full kernel integration for AMX state management remains limited.[43]

Across these operating systems, AMX features are detected using the CPUID instruction with leaf 7 and subleaf 0, where EDX bit 24 indicates support for the core AMX-TILE functionality, including tile multiply accumulate (TMUL) operations, and bit 22 signals the AMX-BF16 variant.[44][45]
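On Linux, user-space enablement therefore amounts to one arch_prctl(2) request plus a CPUID check. The sketch below uses the constants documented for the kernel ABI (ARCH_REQ_XCOMP_PERM, XTILEDATA component 18); treat it as an illustration rather than production-grade feature detection.

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <cpuid.h>

#define ARCH_GET_XCOMP_PERM 0x1022
#define ARCH_REQ_XCOMP_PERM 0x1023
#define XFEATURE_XTILEDATA  18

/* Ask the kernel for permission to use the dynamic XTILEDATA component. */
static int request_amx_permission(void)
{
    if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA)) {
        perror("ARCH_REQ_XCOMP_PERM");
        return 0;
    }
    return 1;
}

/* Check CPUID leaf 7, sub-leaf 0: EDX bits 24 (TILE), 22 (BF16), 25 (INT8). */
static int cpu_has_amx(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return ((edx >> 24) & 1) && ((edx >> 22) & 1) && ((edx >> 25) & 1);
}

int main(void)
{
    if (!cpu_has_amx() || !request_amx_permission())
        return 1;
    puts("AMX available and enabled for this process");
    return 0;
}
```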
Compilers and Libraries
Support for Advanced Matrix Extensions (AMX) in compilers primarily comes through intrinsic functions that map directly to the AMX instruction set architecture, enabling developers to write low-level code for tile-based operations.[26] The GNU Compiler Collection (GCC) version 11 and later provides AMX support via the -mamx-tile flag, which exposes intrinsics such as _tile_loadconfig for loading tile configuration descriptors.[46][47] Similarly, LLVM-based Clang compilers, starting from version 13, include AMX intrinsics through headers like amxintrin.h, activated by the -mamx-tile option, allowing for tile loading, computation, and storage operations.[48] These intrinsics facilitate direct access to AMX features without requiring assembly code, though they demand explicit management of tile registers and shapes.[49]

Optimized libraries abstract AMX usage for higher-level programming, reducing the need for manual intrinsic calls. Intel's oneDNN (oneAPI Deep Neural Network Library), formerly known as MKL-DNN, integrates automatic AMX offload for primitives like matrix multiplication and convolution, detecting hardware availability at runtime via its CPU dispatcher to select AMX-accelerated kernels. This enables seamless acceleration of deep learning workloads on supported Intel Xeon processors without code modifications. For heterogeneous computing, the oneAPI DPC++ compiler supports AMX through its integration with oneMKL (Math Kernel Library), allowing SYCL-based code to target AMX for matrix operations alongside GPU or FPGA offloads.

Major deep learning frameworks leverage AMX via dedicated extensions. The Intel Extension for TensorFlow, integrated since TensorFlow 2.11 in early 2023, incorporates AMX-optimized kernels for operations like GEMM and batch normalization, automatically dispatching to AMX on compatible hardware.[50] Likewise, the Intel Extension for PyTorch, available from version 1.13, provides AMX support for training and inference kernels, including fused operations that utilize tile registers for improved throughput.[8] These extensions maintain framework compatibility while enabling AMX acceleration for models like transformers.

As of 2025, the oneAPI Base Toolkit release enhances AMX integration with support for AVX10.2 instructions on upcoming processors such as Diamond Rapids, extending AMX capabilities to newer vector-matrix hybrid operations within the oneDNN and DPC++ environments.[17] This update ensures backward compatibility with existing AMX code while optimizing for advanced ISA features in future Intel architectures.
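One common pattern, sketched below under the assumption that the toolchain is a recent GCC or Clang, is to enable the AMX intrinsics for a single function with the target attribute rather than passing -mamx-* flags for the whole build; the function body and tile numbers are illustrative, and on older toolchains the per-file flags may still be required.

```c
/* Per-function enablement via the target attribute; alternatively build the
 * whole file with, e.g.:  gcc -O2 -mamx-tile -mamx-bf16 -c kernel.c */
#include <immintrin.h>

__attribute__((target("amx-tile,amx-bf16")))
void bf16_tile_step(const void *a, const void *b, long stride)
{
    _tile_loadd(1, a, stride);      /* load A sub-tile into tmm1  */
    _tile_loadd(2, b, stride);      /* load B sub-tile into tmm2  */
    _tile_dpbf16ps(0, 1, 2);        /* accumulate into tmm0       */
}
```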
Performance and Applications
Performance Metrics
Advanced Matrix Extensions (AMX) provide significant throughput improvements for matrix operations on Intel Xeon processors. On 4th-generation Intel Xeon Scalable processors, each core can perform up to 2,048 INT8 operations per cycle and 1,024 BF16 operations per cycle, enabling high-efficiency AI workloads.[51] These specifications represent an 8x increase in INT8 throughput and a 16x increase in BF16 throughput compared to AVX-512 VNNI, which delivers 256 INT8 and 64 FP32 operations per cycle, primarily due to AMX's 2D tile register architecture that reduces data movement overhead.[51]

Benchmark results demonstrate substantial speedups in AI tasks. For instance, 5th-generation Intel Xeon processors achieve up to 14x better training and inference performance compared to 3rd-generation Intel Xeon processors, leveraging AMX for accelerated matrix multiplications.[1] Specific workloads show even higher gains: real-time speech recognition inference (RNN-T) yields up to 10.7x speedup using BF16 on 5th-generation Intel Xeon Platinum 8592+ versus FP32 on 3rd-generation models.[1] Additionally, GPT-J-6B inference performance doubles on Intel Xeon 6 processors compared to 5th-generation Intel Xeon, with BF16 data format.[1]

Efficiency considerations highlight trade-offs in precision and power. BF16 operations offer a balance of precision and performance suitable for training and high-accuracy inference, while INT8 enables higher throughput for quantized models at reduced precision, often with lower power consumption.[51] For example, AMX delivers up to 7.9x higher performance per watt in speech recognition on 5th-generation Intel Xeon versus 3rd-generation, emphasizing energy-efficient scaling for data center deployments.[1]

| Data Type | AMX Throughput (ops/cycle per core) | AVX-512 VNNI Throughput (ops/cycle per core) | Speedup Factor |
|---|---|---|---|
| INT8 | 2,048 | 256 | 8x |
| BF16 | 1,024 | 64 (FP32 equivalent) | 16x |
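The per-core figures in the table translate into socket-level peak throughput only when multiplied by core count and clock frequency; the short sketch below does that arithmetic with deliberately hypothetical core-count and frequency values rather than the specifications of any particular SKU.

```c
#include <stdio.h>

/* Back-of-the-envelope peak throughput from the per-core figures above. */
int main(void)
{
    const double int8_ops_per_cycle = 2048.0;  /* per core, from the table */
    const double bf16_ops_per_cycle = 1024.0;
    const double cores = 32.0;                 /* hypothetical core count  */
    const double ghz   = 2.0;                  /* hypothetical sustained clock */

    printf("peak INT8: %.1f TOPS\n",
           int8_ops_per_cycle * cores * ghz * 1e9 / 1e12);
    printf("peak BF16: %.1f TFLOPS\n",
           bf16_ops_per_cycle * cores * ghz * 1e9 / 1e12);
    return 0;
}
```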
Use Cases
Advanced Matrix Extensions (AMX) have found significant application in artificial intelligence and machine learning workloads, particularly for accelerating deep learning training and inference directly on CPUs without relying on dedicated GPUs. This capability is especially beneficial for transformer-based models such as BERT and GPT variants, where matrix multiplications form the core of operations like attention mechanisms. For instance, optimized implementations using AMX on Intel Xeon processors enable efficient natural language processing tasks, including text classification and generation, by leveraging low-precision formats like BF16 and INT8 to reduce computational overhead while maintaining model accuracy.[52][53]

In scientific computing, AMX supports matrix operations critical to simulations and numerical methods, such as those involving mixed-precision floating-point computations for high-fidelity modeling in fields like renewable energy and physics. These extensions facilitate faster execution of linear algebra routines, which underpin algorithms for data analysis and predictive modeling, allowing researchers to process large datasets on standard server hardware.[24]

AMX enables efficient AI inference in both edge and cloud environments, particularly on Intel Xeon-based servers, where power and space constraints demand integrated acceleration. In cloud deployments, it supports scalable AI services for hyperscale data centers, reducing costs for lighter workloads like recommendation systems and image recognition by avoiding GPU provisioning. As of 2025, adoption has grown in enterprise and cloud infrastructures: platforms like Nutanix and VMware integrate AMX for on-premises, hybrid, and edge AI applications, and as of July 2025 Google uses AMX together with Intel TDX for confidential AI workloads, promoting cost-effective scaling for distributed computing needs.[54][55][41][56]

References
- https://en.wikichip.org/wiki/x86/amx
