FMA instruction set
from Wikipedia

The FMA instruction set is an extension to the 128- and 256-bit Streaming SIMD Extensions instructions in the x86 microprocessor instruction set to perform fused multiply–add (FMA) operations.[1] There are two variants:

  • FMA4, supported in AMD processors starting with the Bulldozer architecture (2011) and not supported by any Intel processor.
  • FMA3, supported in AMD processors starting with the Piledriver architecture (2012) and in Intel processors starting with Haswell (2013).

Instructions


FMA3 and FMA4 instructions have almost identical functionality, but are not compatible. Both contain fused multiply–add (FMA) instructions for floating-point scalar and SIMD operations, but FMA3 instructions have three operands, while FMA4 ones have four. The FMA operation has the form d = round(a · b + c), where the round function performs a rounding to allow the result to fit within the destination register if there are too many significant bits to fit within the destination.

The four-operand form (FMA4) allows a, b, c and d to be four different registers, while the three-operand form (FMA3) requires that d be the same register as a, b or c. The three-operand form makes the code shorter and the hardware implementation slightly simpler, while the four-operand form provides more programming flexibility.
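As a rough illustration (a sketch assuming GCC or Clang with the appropriate -mfma / -mfma4 flags; the helper function names here are made up), the same d = a · b + c maps onto either form, with FMA3 forcing the destination to overlap one source register while FMA4 can write a separate destination:

c

#include <immintrin.h>   /* FMA3 intrinsics */
#include <x86intrin.h>   /* FMA4 intrinsics in GCC/Clang (requires -mfma4) */

/* FMA3: compiles to one vfmadd{132,213,231}pd; the destination register
   must be one of the three sources. */
__m128d fma3_example(__m128d a, __m128d b, __m128d c) {
    return _mm_fmadd_pd(a, b, c);      /* d = a * b + c */
}

/* FMA4: compiles to one vfmaddpd with four operands; the destination can
   be a register distinct from a, b and c. */
__m128d fma4_example(__m128d a, __m128d b, __m128d c) {
    return _mm_macc_pd(a, b, c);       /* d = a * b + c */
}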

See XOP instruction set for more discussion of compatibility issues between Intel and AMD.

FMA3 instruction set


CPUs with FMA3

  • AMD
    • Piledriver (2012) and newer microarchitectures[3]
      • 2nd gen APUs, "Trinity" (32nm), May 15, 2012
      • 2nd gen "Bulldozer" (bdver2) with Piledriver cores, October 23, 2012
  • Intel
    • Haswell (2013) and newer microarchitectures

Excerpt from FMA3


Supported instructions include:

Mnemonic     Operation
VFMADD       result = + a · b + c
VFNMADD      result = − a · b + c
VFMSUB       result = + a · b − c
VFNMSUB      result = − a · b − c
VFMADDSUB    result = a · b + c for elements i = 1, 3, ...; result = a · b − c for elements i = 0, 2, ...
VFMSUBADD    result = a · b − c for elements i = 1, 3, ...; result = a · b + c for elements i = 0, 2, ...
Note
  • VFNMADD is result = − a · b + c, not result = − (a · b + c).
  • VFNMSUB generates a −0 when all inputs are zero.

Explicit order of operands is included in the mnemonic using numbers "132", "213", and "231":

Postfix   Operation       Possible memory operand   Overwrites
132       a = a · c + b   c (factor)                a (other factor)
213       a = b · a + c   c (summand)               a (factor)
231       a = b · c + a   c (factor)                a (summand)

as well as operand format (packed or scalar) and size (single or double).

Postfix   Precision   Size
SS        single      32 bit
SD        double      64 bit
PSx       single      4 × 32 bit
PDx       double      2 × 64 bit
PSy       single      8 × 32 bit
PDy       double      4 × 64 bit
PSz       single      16 × 32 bit
PDz       double      8 × 64 bit

This results in

Encoding Mnemonic Operands Operation
VEX.256.66.0F38.W1 98 /r VFMADD132PDy ymm, ymm, ymm/m256 a = a · c + b
VEX.256.66.0F38.W0 98 /r VFMADD132PSy
VEX.128.66.0F38.W1 98 /r VFMADD132PDx xmm, xmm, xmm/m128
VEX.128.66.0F38.W0 98 /r VFMADD132PSx
VEX.LIG.66.0F38.W1 99 /r VFMADD132SD xmm, xmm, xmm/m64
VEX.LIG.66.0F38.W0 99 /r VFMADD132SS xmm, xmm, xmm/m32
VEX.256.66.0F38.W1 A8 /r VFMADD213PDy ymm, ymm, ymm/m256 a = b · a + c
VEX.256.66.0F38.W0 A8 /r VFMADD213PSy
VEX.128.66.0F38.W1 A8 /r VFMADD213PDx xmm, xmm, xmm/m128
VEX.128.66.0F38.W0 A8 /r VFMADD213PSx
VEX.LIG.66.0F38.W1 A9 /r VFMADD213SD xmm, xmm, xmm/m64
VEX.LIG.66.0F38.W0 A9 /r VFMADD213SS xmm, xmm, xmm/m32
VEX.256.66.0F38.W1 B8 /r VFMADD231PDy ymm, ymm, ymm/m256 a = b · c + a
VEX.256.66.0F38.W0 B8 /r VFMADD231PSy
VEX.128.66.0F38.W1 B8 /r VFMADD231PDx xmm, xmm, xmm/m128
VEX.128.66.0F38.W0 B8 /r VFMADD231PSx
VEX.LIG.66.0F38.W1 B9 /r VFMADD231SD xmm, xmm, xmm/m64
VEX.LIG.66.0F38.W0 B9 /r VFMADD231SS xmm, xmm, xmm/m32

FMA4 instruction set


CPUs with FMA4

  • AMD
    • "Heavy Equipment" processors
    • Zen (first generation) and Zen+: WikiChip's testing shows FMA4 still appears to work (under the conditions of the tests) despite not being officially supported and not even reported by CPUID; this has also been confirmed by Agner Fog.[8] Other tests, however, gave wrong results.[9] AMD's official web site has noted FMA4 support for several Zen CPUs, including the Threadripper 1900X, Ryzen 7 Pro 1800 and 1700, Ryzen 5 Pro 1600 and 1500, Ryzen 3 Pro 1300 and 1200, Ryzen 3 2200G, and Ryzen 5 2400G.[10][11][12]
  • Intel
    • Intel has not released CPUs with support for FMA4.

Excerpt from FMA4

Mnemonic (AT&T) Operands Operation
VFMADDPDx xmm, xmm, xmm/m128, xmm/m128 a = b·c + d
VFMADDPDy ymm, ymm, ymm/m256, ymm/m256
VFMADDPSx xmm, xmm, xmm/m128, xmm/m128
VFMADDPSy ymm, ymm, ymm/m256, ymm/m256
VFMADDSD xmm, xmm, xmm/m64, xmm/m64
VFMADDSS xmm, xmm, xmm/m32, xmm/m32

History


The incompatibility between Intel's FMA3 and AMD's FMA4 is due to both companies changing plans without coordinating coding details with each other. AMD changed their plans from FMA3 to FMA4 while Intel changed their plans from FMA4 to FMA3 almost at the same time. The history can be summarized as follows:

  • August 2007: AMD announces the SSE5 instruction set, which includes 3-operand FMA instructions. A new coding scheme (DREX) is introduced for allowing instructions to have three operands.[13]
  • April 2008: Intel announces their AVX and FMA instruction sets, including 4-operand FMA instructions. The coding of these instructions uses the new VEX coding scheme,[14] which is more flexible than AMD's DREX scheme.
  • December 2008: Intel changes the specification for their FMA instructions from 4-operand to 3-operand instructions. The VEX coding scheme is still used.[15]
  • May 2009: AMD changes the specification of their FMA instructions from the 3-operand DREX form to the 4-operand VEX form, compatible with the April 2008 Intel specification rather than the December 2008 Intel specification.[16]
  • October 2011: AMD Bulldozer processor supports FMA4.[17]
  • January 2012: AMD announces FMA3 support in future processors codenamed Trinity and Vishera; they are based on the Piledriver architecture.[18]
  • May 2012: AMD Piledriver processor supports both FMA3 and FMA4.[17]
  • June 2013: Intel Haswell processor supports FMA3.[19]
  • February 2017: The first generation of AMD Ryzen processors officially supports FMA3, but not FMA4 according to the CPUID instruction.[2] There has been confusion regarding whether FMA4 was implemented or not on this processor due to errata in the initial patch to the GNU Binutils package that has since been rectified.[20][21] One unconfirmed report of wrong results[9] led to some doubt, but Mysticial (Alexander Yee, developer of y-cruncher) debunked it:[22] FMA4 worked for bit-exact bignum calculations on his Zen 1 system for years, and the one report on Reddit never had any followup investigation to rule out mistakes in the testing software before being widely repeated. The initial Ryzen CPUs could be crashed by a particular sequence of FMA3 instructions, but updated CPU microcode fixes the problem.[23]
  • July 2019: AMD Zen 2 and later Ryzen processors don't support FMA4 at all.[24] They continue to support FMA3. Only Zen 1 and Zen+ have unofficial FMA4 support.

Compiler and assembler support


Different compilers and assemblers provide different levels of support for FMA3 and FMA4; details are covered in the Software Support sections below.

from Grokipedia
The fused multiply-add (FMA) is a floating-point operation that computes the expression a × b + c (or variants such as a × b − c) in a single step, applying rounding only once at the end, and therefore maintains higher precision than separate multiply and add instructions, which incur two rounding errors. This fused approach reduces latency, boosts throughput in numerical algorithms such as dot products and matrix multiplication, and is particularly valuable in digital signal processing, scientific simulations, and machine-learning workloads. The FMA operation was first implemented in hardware in the IBM POWER1 processor (the original RS/6000 CPU) in 1990, marking a significant advancement in reduced instruction set computing (RISC) floating-point units by enabling a two-cycle pipeline latency for the combined operation. It was later formalized as a recommended operation in the IEEE 754-2008 standard for floating-point arithmetic, which specifies the fusedMultiplyAdd function to compute (x · y) + z with an extended-range intermediate result before final rounding to the destination format. Early adopters included architectures such as HP PA-RISC and Intel Itanium, but widespread integration accelerated in the 2010s as vector extensions evolved to leverage FMA for parallel processing. FMA instructions have been incorporated into major processor instruction sets, supporting single- and double-precision floating-point operations across scalar and vector registers. In the x86 architecture, Intel introduced three-operand FMA (FMA3) alongside the AVX2 extension in Haswell processors (2013), supporting 256-bit operations on YMM registers, and expanded it to 512-bit ZMM registers in AVX-512, starting with Knights Landing (2016) and later mainstream CPUs such as Skylake-X. AMD implemented FMA4 (a four-operand variant) in Bulldozer (2011) and FMA3 in Piledriver (2012), with ongoing FMA3 support in the Zen architectures; ARM added FMA intrinsics for NEON SIMD units in ARMv8-A (2013), performing the operation in a single step. Support for these extensions is typically detected via CPUID feature bits, and the fused operations can roughly double floating-point throughput in multiply-accumulate-heavy workloads compared to non-fused alternatives.

Fundamentals

Fused Multiply-Add Operation

The fused multiply-add (FMA) operation is a ternary instruction that computes the expression a × b + c (or variants such as a × b − c) as a single fused step, performing the multiplication and addition without an intermediate rounding after the multiplication. This design, standardized in IEEE 754-2008, ensures the intermediate product keeps full precision before the addition, with rounding applied only once at the end to the destination format. Execution of an FMA typically proceeds through distinct stages in the processor pipeline: first, a multiplication stage computes the exact product a × b using extended internal precision; second, an addition stage incorporates the third operand c (or subtracts it in the subtract variants); and finally, a single rounding stage converts the accumulated result to the target precision. This fused approach complies with IEEE 754 by treating the entire computation as if performed with unbounded range and precision, minimizing opportunities for rounding error. FMA supports common data types including single precision (binary32, 32 bits), double precision (binary64, 64 bits), and extended-precision formats (such as 80-bit) where hardware implementations provide them. The operation can be expressed in pseudocode as result = round(a × b + c), where round applies the active rounding mode (round to nearest with ties to even, round toward zero, round toward positive infinity, or round toward negative infinity) to fit the result into the destination format, fully aligning with IEEE 754 requirements. In contrast to non-fused multiply-add sequences, which require two separate instructions (one for the multiplication and one for the addition) and thus incur additional latency from instruction dispatch and the intermediate result, FMA consolidates the computation into one instruction, often with latency comparable to a single multiply or add on capable processors. This reduction in instruction count enhances throughput in compute-intensive applications while preserving or improving numerical accuracy.
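As a small illustration (a sketch assuming a correctly rounded C library fma, as provided by glibc), the single final rounding honours the currently selected rounding mode:

c

#include <fenv.h>
#include <math.h>
#include <stdio.h>

/* Illustrative sketch: a*b + c is just above 1.0, so the printed result depends
   only on how the single final rounding is performed. Strictly portable code
   would also use #pragma STDC FENV_ACCESS ON. */
int main(void) {
    double a = 1.0, b = 1e-30, c = 1.0;

    fesetround(FE_TONEAREST);
    printf("to nearest: %.17g\n", fma(a, b, c));   /* 1.0 */

    fesetround(FE_UPWARD);
    printf("upward:     %.17g\n", fma(a, b, c));   /* next double above 1.0 */

    fesetround(FE_TONEAREST);                      /* restore default mode */
    return 0;
}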

Precision Benefits and Performance Advantages

The fused multiply-add (FMA) operation enhances numerical precision by performing the multiplication and the addition in a single step, applying rounding only once to the final result rather than twice (once after the multiplication and again after the addition) as occurs with separate multiply and add instructions. The fused intermediate product preserves more significant bits, reducing accumulated error, particularly in iterative computations where errors compound over many operations. Under IEEE 754-2008, FMA guarantees a result correctly rounded as if computed with unbounded intermediate precision before the final rounding, making it especially valuable for algorithms sensitive to precision loss, such as those involving subtractive cancellation where small differences between large values can otherwise lead to catastrophic loss of accuracy.

A concrete illustration of this benefit appears in double-precision arithmetic. Consider computing a × b + c with a = 2^27 + 1, b = 2^27 − 1, and c = −2^54. The exact product a × b is 2^54 − 1, so the exact result of the whole expression is −1. With separate multiply and add operations, the intermediate product 2^54 − 1 is not representable in binary64 (the unit in the last place at that magnitude is 2) and rounds to 2^54, so adding c yields 0. FMA, in contrast, forms the product exactly (effectively with a double-width significand before the addition), adds c, and only then rounds once, yielding the correct result −1. This discrepancy illustrates how FMA avoids the loss of low-order bits that can occur in non-fused sequences.

Beyond precision, FMA offers performance advantages by fusing two operations into one instruction, reducing the instruction count from two (multiply followed by add) to one, which lowers overhead in pipelined execution and improves code density. On modern x86 processors, FMA instructions typically exhibit a latency of about 4-5 cycles for 256-bit double-precision vector operations, compared with roughly 8-12 cycles for a dependent multiply-then-add sequence, while offering comparable or better throughput. This efficiency is amplified in vector units, where dual FMA execution ports on Intel Haswell and later (and comparable AMD designs) enable up to two fused operations per cycle, boosting throughput in SIMD-heavy workloads. These benefits are particularly important in applications built on repeated multiply-add patterns, such as linear algebra routines like matrix multiplication, where FMA accelerates the multiply-accumulate operations central to kernels such as DGEMM in BLAS libraries. Polynomial evaluation similarly benefits from FMA in Horner's method, reducing error accumulation across nested multiplications and additions; fast Fourier transforms (FFTs) leverage it for the multiplications and accumulations in butterfly computations; and gradient computations in machine learning, built on dot products and weight updates, see improved speed and stability in frameworks such as TensorFlow or PyTorch on FMA-enabled hardware. Despite these advantages, FMA has potential drawbacks, including slightly higher power consumption per operation due to the more complex fused arithmetic unit, which performs more work in a single cycle than separate, simpler multiply and add units.
Additionally, because floating-point arithmetic is not associative, contracting an expression into round(a × b + c) changes the result relative to code written with an explicit intermediate rounding, so FMA can break bit-exact reproducibility with non-fused code and must be avoided (or applied consistently) where a specific evaluation order is required.
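The example above can be reproduced with the C standard library's fma function (a minimal sketch; whether the non-fused expression stays non-fused depends on the compiler's contraction settings):

c

#include <math.h>
#include <stdio.h>

int main(void) {
    double a = 134217729.0;            /* 2^27 + 1 */
    double b = 134217727.0;            /* 2^27 - 1 */
    double c = -18014398509481984.0;   /* -2^54   */

    /* The product 2^54 - 1 is rounded to 2^54 before the add; compile with
       -ffp-contract=off so the compiler does not silently fuse this itself. */
    double separate = a * b + c;       /* 0.0  */
    double fused    = fma(a, b, c);    /* -1.0 */

    printf("separate = %.1f, fused = %.1f\n", separate, fused);
    return 0;
}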

FMA3 Instruction Set

Technical Description and Instructions

The FMA3 instruction set, introduced alongside the AVX2 extension, provides three-operand fused multiply-add operations for floating-point values in x86 processors. These instructions compute expressions of the form (A × B) ± C in a single operation, fusing the multiplication and addition (or subtraction) steps to minimize rounding errors and improve computational efficiency compared to separate multiply and add instructions. This fusion ensures that the intermediate product is not rounded before the addition, preserving precision equivalent to a single extended-precision operation rounded once to the destination format. FMA3 supports both single-precision (32-bit) and double-precision (64-bit) floating-point data, operating on 128-bit, 256-bit, and (in AVX-512 contexts) 512-bit vector registers, enabling SIMD parallelism across multiple elements. The instructions use VEX (for AVX/AVX2) or EVEX (for AVX-512) encoding prefixes to specify vector lengths, operand orders, and additional features such as writemasking. In VEX encoding, a 2- or 3-byte prefix (C4H or C5H) is followed by an opcode from the 0F 38H map, a ModRM byte for addressing, and an optional SIB byte or displacement for memory access. The three operands are DEST (which is both read and written), SRC1, and SRC2, with the result overwriting DEST; this design allows in-place computation but requires careful register management to avoid clobbering values that are still needed. EVEX encoding extends this with a 4-byte prefix supporting 512-bit vectors, embedded writemasks (k registers), zeroing or merging behavior, broadcast from memory, and embedded rounding control (with SAE to suppress exceptions). FMA3 instructions can raise SIMD floating-point exceptions (#XM) for invalid operations, denormals, overflows, underflows, and inexact results, controlled through the MXCSR register or EVEX.SAE, and can generate general-protection (#GP) faults for invalid memory operands. FMA3 includes variants for different operation signs and patterns, all following the general fused multiply-add paradigm. The core instructions are prefixed with "V" to indicate vector operation and suffixed to denote precision and type (PD for packed double-precision, PS for packed single-precision, SD/SS for scalar double/single); negated and subtract variants carry "N" or "SUB" in the mnemonic. For example, VFMADD213PD computes packed double-precision (A × B) + C, where A, B, and C are vectors of four 64-bit elements in 256-bit registers. Elementwise, the operation is DEST[i] = round(SRC1[i] × SRC2[i] + SRC3[i]), using the current rounding mode (from MXCSR or EVEX.RC). Similarly, the VFMSUB forms perform (A × B) − C, and the VFNMADD forms compute −(A × B) + C. Alternating add/subtract patterns are handled by the VFMADDSUB and VFMSUBADD forms, which apply + and − to alternating element indices within the vector. Scalar versions such as VFMADD213SD operate only on the low-order element of XMM registers. In VEX encoding, scalar operations on XMM registers zero the bits of the wider register above the 128-bit lane; in EVEX encoding, behavior can be controlled with merging or zeroing via writemasks. The table below lists the instruction families without the 132/213/231 operand-order suffix.
Instruction | Operation (per element) | Precision | Vector size | Key notes
VFMADD{132,213,231}PD/PS | a · b + c | double/single (packed) | 128/256/512-bit | Standard add variant; one source may be a memory operand.
VFMADD{132,213,231}SD/SS | a · b + c (low element only) | double/single (scalar) | 128-bit (XMM) | Upper elements pass through from the first source; bits above 127 are zeroed.
VFMSUB{132,213,231}PD/PS | a · b − c | double/single (packed) | 128/256/512-bit | Subtract variant; useful for difference computations.
VFMSUB{132,213,231}SD/SS | a · b − c (low element only) | double/single (scalar) | 128-bit (XMM) | Scalar subtract; exceptions same as packed.
VFMADDSUB{132,213,231}PD/PS | a · b + c in odd-indexed elements, a · b − c in even-indexed elements | double/single (packed) | 128/256/512-bit | For complex or paired operations; index-based signing.
VFNMADD{132,213,231}PD/PS | −(a · b) + c | double/single (packed) | 128/256/512-bit | Negated multiply-add; the product, not the whole sum, is negated.
VFNMSUB{132,213,231}PD/PS | −(a · b) − c | double/single (packed) | 128/256/512-bit | Negated multiply-subtract; common in polynomial evaluation.
VFNMSUB{132,213,231}SD/SS | −(a · b) − c (low element only) | double/single (scalar) | 128-bit (XMM) | Scalar negated subtract.
Operand order variants (132, 213, 231) allow flexible register usage: for example, VFMADD132PD computes DEST = (DEST × SRC2) + SRC1, so the different forms let the programmer or compiler choose which operand is overwritten without extra register moves. In assembly, an example for 256-bit packed double-precision is vfmadd231pd ymm0, ymm1, ymm2, computing ymm0 = (ymm1 × ymm2) + ymm0. Intrinsics in C/C++ (via <immintrin.h>) hide the operand-order choice, such as _mm256_fmadd_pd(a, b, c) for the packed double-precision VFMADD family. Detection requires CPUID function 01H, ECX bit 12 (FMA) set to 1, with support on Intel Haswell and later and AMD Piledriver and later architectures. Advanced variants such as V4FMADDPS (the AVX512_4FMAPS extension introduced on the Knights Mill Xeon Phi) perform four sequential single-precision FMAs in a 512-bit operation, useful for matrix multiplications, but are not part of baseline FMA3. All instructions maintain IEEE 754 compliance for exceptions and rounding, with EVEX.SAE able to suppress exception reporting in non-critical paths. Performance characteristics vary by microarchitecture; on Haswell, for instance, a 256-bit FMA has a latency of 5 cycles and a throughput of two instructions per cycle, substantially raising achievable FLOPS in vectorized code.

Supported Processors

The FMA3 instruction set, which provides three-operand fused multiply-add operations and was introduced alongside AVX2, first appeared in Intel's Haswell microarchitecture in 2013. It is supported across all subsequent Intel client and server processor families, including Broadwell (2014), Skylake (2015), and later generations such as Alder Lake, Raptor Lake, Meteor Lake, and Granite Rapids. Specific examples include the 4th-generation Core i7-4770K desktop processor and the Xeon E5-2600 v3 family for servers; support is indicated by CPUID leaf 01H ECX bit 12 (FMA), alongside leaf 07H EBX bit 5 for AVX2. Low-end models such as certain Pentium and Celeron processors from these generations may lack AVX2/FMA3 support due to cost optimizations, but mainstream Core i3/i5/i7/i9 and Xeon lines include it universally. AMD introduced FMA3 support starting with its Piledriver microarchitecture in 2012, alongside continued support for the earlier FMA4 variant, enabling compatibility with both three- and four-operand operations. This extends to all later AMD architectures, including Steamroller (2014), Excavator (2015), Zen (2017), and subsequent Zen 2 through Zen 5 families as of 2025. Representative processors include the FX-8350 from the Piledriver-based FX series and modern examples such as the Ryzen 9 7950X (Zen 4) and EPYC 9755 (Zen 5), where FMA3 is likewise enumerated via CPUID function 01H ECX bit 12. AMD's implementation follows the same x86-64 definition as Intel's, enhancing floating-point performance in high-performance computing workloads.
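A minimal runtime check of the corresponding CPUID bits, assuming GCC or Clang on x86-64 and using the <cpuid.h> helper, might look like this sketch:

c

#include <cpuid.h>
#include <stdio.h>

int main(void) {
    unsigned int eax, ebx, ecx, edx;

    /* CPUID leaf 01H: ECX bit 12 indicates FMA3 support. */
    int fma3 = __get_cpuid(1, &eax, &ebx, &ecx, &edx) && (ecx & (1u << 12));

    /* CPUID leaf 8000_0001H: ECX bit 16 indicates FMA4 support (pre-Zen AMD). */
    int fma4 = __get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx) && (ecx & (1u << 16));

    printf("FMA3: %s, FMA4: %s\n", fma3 ? "yes" : "no", fma4 ? "yes" : "no");
    return 0;
}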

Implementation Examples

FMA3 instructions enable efficient implementation of fused multiply-add operations in both assembly and higher-level languages via intrinsics, particularly for vectorized computations in scientific and numerical applications. These instructions, introduced alongside the AVX2 extension, operate on 128-bit (XMM) or 256-bit (YMM) registers, performing operations such as a × b + c with a single rounding step for improved precision and reduced latency compared to separate multiply and add instructions. Implementations typically target processors supporting AVX2 and FMA, such as Haswell and later, where an FMA instruction has a latency of roughly 4-5 cycles and a throughput of up to two instructions per cycle on the floating-point units. In C/C++, FMA3 is accessed through intrinsics defined in <immintrin.h>, allowing developers to write portable vector code without inline assembly. For example, the _mm_fmadd_ps intrinsic multiplies two packed single-precision floating-point vectors and adds a third, storing the result in a 128-bit vector. A basic usage for computing a fused multiply-add and extracting the lowest element might appear as follows:

c

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128 a = _mm_set_ps1(2.0f);                  // vector with all elements 2.0
    __m128 b = _mm_set_ps1(3.0f);                  // vector with all elements 3.0
    __m128 c = _mm_set_ps1(1.0f);                  // vector with all elements 1.0
    __m128 result = _mm_fmadd_ps(a, b, c);         // computes a * b + c element-wise
    float scalar_result = _mm_cvtss_f32(result);   // extract lowest element: 7.0
    printf("%f\n", scalar_result);
    return 0;
}

This computes 2.0 × 3.0 + 1.0 = 7.0 for each element, with the full vector operation leveraging SIMD parallelism. For 256-bit vectors, _mm256_fmadd_ps extends this to eight single-precision elements, ideal for matrix operations. Similar variants such as _mm_fmsub_ps (for a × b − c) and _mm_fnmadd_ps (for −(a × b) + c) support diverse arithmetic patterns. Compilation requires flags such as -mavx2 -mfma in GCC to enable code generation. Assembly-level implementations provide finer control, especially in performance-critical kernels such as DAXPY (y = a · x + y) over double-precision vectors. On x86 processors, FMA3 provides instructions such as vfmadd231pd (multiply and add into an accumulator), which leave the two multiplicand sources unchanged. The following assembly sketch implements a DAXPY loop on 128-bit XMM registers, processing two double-precision elements per iteration, which can sustain roughly one iteration every couple of cycles on Haswell and later microarchitectures:

assembly

section .text
extern  n                        ; number of elements
extern  X                        ; input array X
extern  Y                        ; input/output array Y
extern  DA                       ; scalar multiplier a

DAXPY:
    mov      rax, [n]            ; load n (assume even for simplicity)
    shl      rax, 3              ; convert to bytes (8 bytes per double)
    lea      rsi, [X + rax]      ; point to end of X
    lea      rdi, [Y + rax]      ; point to end of Y
    neg      rax                 ; negative offset for backward loop
    vmovddup xmm2, [DA]          ; broadcast scalar a into both lanes of xmm2
L1:
    vmovapd  xmm1, [rdi + rax]   ; load two doubles from Y
    vfmadd231pd xmm1, xmm2, [rsi + rax] ; Y[i] = Y[i] + a * X[i]
    vmovapd  [rdi + rax], xmm1   ; store back to Y
    add      rax, 16             ; advance by 16 bytes (two doubles)
    jl       L1
    ret

This code uses a backward loop with a negative index to keep the loop overhead low, with vfmadd231pd fusing the multiply and add in one instruction. For wider 256-bit YMM registers, replace xmm with ymm (broadcasting the scalar with vbroadcastsd) and adjust the stride to 32 bytes, processing four doubles per iteration. Such implementations are common in linear algebra libraries, where FMA3 reduces the instruction count by up to 50% in multiply-accumulate-heavy workloads compared to SSE2. Beyond basic arithmetic, FMA3 facilitates alternating add/subtract patterns for pairwise computations via intrinsics such as _mm_fmaddsub_ps, which adds the third operand in odd-indexed lanes and subtracts it in even-indexed lanes. In practice, these are integrated into optimized routines for dot products or convolutions, where issuing to multiple FMA units (e.g., two per core on Haswell) sustains high throughput. Developers must ensure runtime CPU detection (e.g., via __builtin_cpu_supports("fma") in GCC) to fall back to non-FMA code on older hardware.
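For instance, a simple dot-product kernel built on the 256-bit FMA3 intrinsics might look like the following sketch (assuming n is a multiple of four; compile with -mavx2 -mfma):

c

#include <immintrin.h>
#include <stddef.h>

/* Double-precision dot product; the accumulation typically maps to vfmadd231pd. */
double dot_fma3(const double *x, const double *y, size_t n) {
    __m256d acc = _mm256_setzero_pd();
    for (size_t i = 0; i < n; i += 4) {
        __m256d vx = _mm256_loadu_pd(&x[i]);
        __m256d vy = _mm256_loadu_pd(&y[i]);
        acc = _mm256_fmadd_pd(vx, vy, acc);            // acc += x[i..i+3] * y[i..i+3]
    }
    __m128d lo   = _mm256_castpd256_pd128(acc);        // lower two lanes
    __m128d hi   = _mm256_extractf128_pd(acc, 1);      // upper two lanes
    __m128d sum2 = _mm_add_pd(lo, hi);                 // pairwise sum
    __m128d sum1 = _mm_add_sd(sum2, _mm_unpackhi_pd(sum2, sum2));
    return _mm_cvtsd_f64(sum1);
}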

FMA4 Instruction Set

Technical Description and Instructions

The FMA4 instruction set, introduced by AMD alongside the XOP extension, provides four-operand fused multiply-add operations for floating-point values in x86 processors. These instructions compute expressions of the form DEST = (SRC1 × SRC2) ± SRC3 in a single operation, fusing the multiplication and addition (or subtraction) steps to minimize rounding errors and improve computational efficiency compared to separate multiply and add instructions. Unlike the three-operand FMA3, FMA4 uses a dedicated destination register, allowing non-destructive updates to source operands and more flexible programming. This fusion ensures that the intermediate product is not rounded before the addition, preserving precision equivalent to a single extended-precision operation rounded once to the destination format. FMA4 supports both single-precision (32-bit) and double-precision (64-bit) floating-point data, operating on 128-bit XMM and 256-bit YMM vector registers, enabling SIMD parallelism across multiple elements. The floating-point FMA4 instructions use VEX encoding with opcodes in the 0F 3AH map, a ModRM byte for addressing, an optional SIB byte or displacement for memory access, and the fourth register operand encoded in an immediate byte; AMD's related XOP integer instructions instead use the three-byte XOP prefix (8FH). The four operands are DEST (written) and SRC1, SRC2, and SRC3 (read), with the result stored in DEST without overwriting the sources; this supports accumulation without temporary registers. FMA4 instructions can raise SIMD floating-point exceptions (#XF) for invalid operations, denormals, overflows, underflows, and inexact results, controlled by the MXCSR register, and can generate general-protection (#GP) faults for invalid memory operands. FMA4 includes variants for different operation signs and patterns, all following the general fused multiply-add paradigm with four operands. The core instructions are prefixed with "V" to indicate vector operation and suffixed to denote precision and type (PD for packed double-precision, PS for packed single-precision, SD/SS for scalar double/single); negated and subtract variants carry "N" or "SUB" in the mnemonic. For example, VFMADDPD computes packed double-precision DEST = (SRC1 × SRC2) + SRC3, where SRC1, SRC2, and SRC3 are vectors of four 64-bit elements in 256-bit registers. Elementwise, the operation is DEST[i] = round(SRC1[i] × SRC2[i] + SRC3[i]), using the current rounding mode from MXCSR. Similarly, VFMSUBPD performs DEST = (SRC1 × SRC2) − SRC3, and VFNMADDPD computes DEST = −(SRC1 × SRC2) + SRC3. Alternating add/subtract patterns are handled by VFMADDSUBPD/PS, which applies + to odd-indexed elements and − to even-indexed ones in the vector. Scalar versions such as VFMADDSD operate only on the low-order element of XMM registers, zeroing the upper bits.
Instruction | Operation | Precision | Vector size | Key notes
VFMADDPD/PS | DEST = (SRC1 × SRC2) + SRC3 | double/single (packed) | 128/256-bit | Standard add variant; supports a memory operand for one source.
VFMADDSD/SS | DEST = (SRC1 × SRC2) + SRC3 (low element only) | double/single (scalar) | 128-bit (XMM) | Upper bits zeroed.
VFMSUBPD/PS | DEST = (SRC1 × SRC2) − SRC3 | double/single (packed) | 128/256-bit | Subtract variant; useful for difference computations.
VFMSUBSD/SS | DEST = (SRC1 × SRC2) − SRC3 (low element only) | double/single (scalar) | 128-bit (XMM) | Scalar subtract; exceptions same as packed.
VFMADDSUBPD/PS | DEST = (SRC1 × SRC2) + SRC3 in odd-indexed elements, − SRC3 in even-indexed elements | double/single (packed) | 128/256-bit | For complex or paired operations; index-based signing.
VFNMADDPD/PS | DEST = −(SRC1 × SRC2) + SRC3 | double/single (packed) | 128/256-bit | Negated multiply-add; the product, not the whole sum, is negated.
VFNMSUBPD/PS | DEST = −(SRC1 × SRC2) − SRC3 | double/single (packed) | 128/256-bit | Negated multiply-subtract; common in polynomial evaluation.
VFNMSUBSD/SS | DEST = −(SRC1 × SRC2) − SRC3 (low element only) | double/single (scalar) | 128-bit (XMM) | Scalar negated subtract.
VFMSUBADDPD/PS | DEST = (SRC1 × SRC2) − SRC3 in odd-indexed elements, + SRC3 in even-indexed elements | double/single (packed) | 128/256-bit | Inverse alternating pattern.
The operand order is fixed: SRC1 and SRC2 are multiplied, SRC3 is then added or subtracted, and the result is written to DEST. In assembly, an example for 256-bit packed double-precision is vfmaddpd ymm0, ymm1, ymm2, ymm3, computing (ymm1 × ymm2) + ymm3 and storing the result in ymm0. Intrinsics in C/C++ (via fma4intrin.h, included through x86intrin.h in GCC) mirror this, such as _mm256_macc_pd(a, b, c) for VFMADDPD. Detection requires CPUID function 8000_0001H, ECX bit 16 (FMA4) set to 1, with support on Bulldozer and the later "Heavy Equipment" architectures. The related XOP instructions, such as VPMACSDD for packed 32-bit signed multiply-accumulate, extend fused operation to integers but are not part of floating-point FMA4. All instructions maintain IEEE 754 compliance for exceptions and rounding. Performance characteristics vary by microarchitecture; on AMD Bulldozer, for instance, an FMA4 operation completes with latency comparable to a single floating-point multiply, boosting achievable FLOPS in vectorized code.

Supported Processors

The FMA4 instruction set, providing four-operand fused multiply-add operations and introduced alongside XOP, first appeared in AMD's Bulldozer microarchitecture in 2011. It is supported in the subsequent "Heavy Equipment" architectures, including Piledriver (2012), Steamroller (2014), and Excavator (2015), where it is enumerated via CPUID function 8000_0001H ECX bit 16. In first-generation Zen and Zen+ processors (2017-2018), FMA4 instructions still execute in hardware despite not being reported by CPUID, so using them requires bypassing normal feature detection; from Zen 2 (2019) onward the instructions are no longer supported at all. Representative FMA4-capable processors include the FX-8150 (Bulldozer), the FX-8350 (Piledriver), and APU models such as the A10-7870K, while first-generation Ryzen parts such as the Ryzen Threadripper 1900X retain only the unofficial, unadvertised support noted above. Intel processors do not support FMA4.

Implementation Examples

FMA4 instructions enable efficient implementation of fused multiply-add operations in both assembly and higher-level languages via intrinsics, particularly for vectorized computations in scientific and numerical applications. These instructions, introduced alongside the XOP extension, operate on 128-bit (XMM) or 256-bit (YMM) registers, performing operations such as dest = a × b + c with a single rounding step for improved precision and reduced latency compared to separate multiply and add instructions. Implementations target processors supporting FMA4, such as Bulldozer and the later "Heavy Equipment" cores, where the instruction achieves low latency and high throughput on the floating-point units. In C/C++, FMA4 is accessed through intrinsics defined in fma4intrin.h (included via x86intrin.h in GCC and Clang), allowing developers to write portable vector code without inline assembly. For example, the _mm_macc_ps intrinsic multiplies two packed single-precision floating-point vectors and adds a third, returning the result as a new vector. A basic usage for computing a fused multiply-add and extracting the lowest element might appear as follows:

c

#include <x86intrin.h>   // provides the FMA4 intrinsics; compile with -mfma4
#include <stdio.h>

int main(void) {
    __m128 a = _mm_set_ps1(2.0f);                  // vector with all elements 2.0
    __m128 b = _mm_set_ps1(3.0f);                  // vector with all elements 3.0
    __m128 c = _mm_set_ps1(1.0f);                  // vector with all elements 1.0
    __m128 result = _mm_macc_ps(a, b, c);          // computes a * b + c element-wise
    float scalar_result = _mm_cvtss_f32(result);   // extract lowest element: 7.0
    printf("%f\n", scalar_result);
    return 0;
}

This computes 2.0 × 3.0 + 1.0 = 7.0 for each element, with the full vector operation leveraging SIMD parallelism. For 256-bit vectors, _mm256_macc_ps extends this to eight single-precision elements, ideal for matrix operations. Similar variants such as _mm_msub_ps (for a × b − c) and _mm_nmacc_ps (for −(a × b) + c) support diverse arithmetic patterns. Compilation requires a flag such as -mfma4 in GCC (plus -mxop if the XOP integer instructions are also used) to enable code generation. Assembly-level implementations provide finer control, especially in performance-critical kernels such as DAXPY (y = a · x + y) over double-precision vectors. With FMA4, the kernel uses four-operand instructions such as vfmaddpd (multiply and add into a separate destination) or vfnmaddpd (fused negate-multiply-add). The following assembly sketch implements a DAXPY loop on 128-bit XMM registers, processing two double-precision elements per iteration:

assembly

section .text
extern  n                        ; number of elements
extern  X                        ; input array X
extern  Y                        ; input/output array Y
extern  DA                       ; scalar multiplier a

DAXPY:
    mov      rax, [n]            ; load n (assume even for simplicity)
    shl      rax, 3              ; convert to bytes (8 bytes per double)
    lea      rsi, [X + rax]      ; point to end of X
    lea      rdi, [Y + rax]      ; point to end of Y
    neg      rax                 ; negative offset for backward loop
    vmovddup xmm2, [DA]          ; broadcast scalar a into both lanes of xmm2
L1:
    vmovapd  xmm0, [rdi + rax]   ; load two doubles from Y
    vmovapd  xmm1, [rsi + rax]   ; load two doubles from X
    vfmaddpd xmm0, xmm1, xmm2, xmm0 ; Y[i] = a * X[i] + Y[i] (four-operand FMA4)
    vmovapd  [rdi + rax], xmm0   ; store back to Y
    add      rax, 16             ; advance by 16 bytes (two doubles)
    jl       L1
    ret

This code uses a backward loop with a negative index to keep the loop overhead low, with vfmaddpd fusing the multiply and add in one instruction using the four-operand form, so the loaded Y value need not share a register with a multiplicand. For wider 256-bit YMM registers, replace xmm with ymm (broadcasting the scalar with vbroadcastsd) and adjust the stride to 32 bytes, processing four doubles per iteration. Such implementations are common in linear algebra libraries, where FMA4 reduces the instruction count in multiply-accumulate-heavy workloads compared to SSE2. Beyond basic arithmetic, FMA4 facilitates alternating add/subtract patterns for pairwise computations via intrinsics such as _mm_maddsub_ps, which adds the third operand in odd-indexed lanes and subtracts it in even-indexed lanes. In practice, these are integrated into optimized routines for dot products or convolutions, sustaining high throughput on the FP units. Developers must perform runtime CPU detection (via CPUID function 8000_0001H ECX bit 16), keep in mind that Zen-based processors do not report FMA4 even where it happens to execute, and provide a fallback to non-FMA code on unsupported hardware.

Historical Development

Origins and Early Proposals

The conceptual foundations of the fused multiply-add (FMA) operation emerged in the mid-20th century amid efforts to optimize floating-point hardware for scientific computing. During the IBM Stretch project in the 1950s, designers proposed a "cumulative multiply" instruction that fused multiplication and addition into a single operation, expanding the traditional three-register datapath with a dedicated "factor" register to enable efficient matrix multiplications and other iterative calculations. This early idea, formalized by 1956, aimed to reduce instruction counts in high-performance scientific applications, though the Stretch system itself was delivered only in 1961 after significant delays. By the late 1980s, renewed interest in fused operations arose from the need to minimize rounding errors in floating-point computations, particularly as processor designs evolved toward reduced instruction set computing (RISC). The first commercial hardware implementation of FMA debuted in 1990 with IBM's POWER1 processor (the original RS/6000 CPU), where it was integrated into a RISC floating-point unit capable of executing the operation a × b + c with a single final rounding step, improving both speed and accuracy for common algorithms such as polynomial evaluation. This design, which supported IEEE 754-compliant single- and double-precision formats, marked a pivotal shift by demonstrating FMA's potential to execute up to twice as many floating-point operations per cycle as separate multiply and add instructions. Hewlett-Packard followed with support in its PA-RISC architecture during the 1990s, notably in the PA-8000 processor released in 1996, which featured two fused multiply-accumulate (FMAC) units handling floating-point multiply instructions in a superscalar pipeline. Standardization efforts gained momentum through revisions to the floating-point standards, emphasizing operations that preserve accuracy by limiting intermediate roundings. Discussions within the IEEE 754 revision committee highlighted FMA's role in reducing error propagation, leading to its formal inclusion in IEEE 754-2008 as the fusedMultiplyAdd operation, defined to compute (x × y) + z with unbounded intermediate precision and a single rounding to the destination format. Early adopters beyond IBM and HP included the Intel Itanium architecture (2001) and MIPS64, extending FMA to EPIC and other RISC designs. FMA entered the x86 ecosystem in the late 2000s through competing proposals: AMD announced the SSE5 extension in August 2007, incorporating fused multiply-add instructions to accelerate multimedia and high-performance-computing workloads, while Intel responded in early 2008 with a four-operand FMA proposal as part of its Advanced Vector Extensions (AVX), enabling more flexible operand handling without overwriting inputs.

Evolution and Industry Adoption

The development of FMA instructions during the 2010s reflected a pivotal shift in x86 architecture, as Intel and AMD navigated competing extensions before aligning on a unified approach. AMD's original SSE5 proposal contained early FMA concepts but was reworked in favor of AVX-compatible encodings, while Intel moved from its initial four-operand FMA specification to the three-operand FMA3 form within the AVX framework. Intel's plans culminated in 2013 with the introduction of FMA3 alongside AVX2 in the Haswell microarchitecture, providing hardware support for three-operand fused multiply-add operations to boost floating-point throughput in scientific and engineering applications. AMD pursued a parallel path, shipping four-operand FMA4 in 2011 with its Bulldozer microarchitecture to enhance SIMD capability. Despite its flexibility for certain computations, FMA4 saw little industry adoption, primarily because Intel never implemented it, which limited software optimization and fragmented the ecosystem. Convergence accelerated when AMD first implemented FMA3 in 2012 with the Piledriver microarchitecture (e.g., the Trinity APUs), bridging the gap with Intel's offerings and promoting cross-vendor compatibility. The trend solidified in 2017 with AMD's Zen microarchitecture, which standardized FMA3 support across its processor lineup, enabling seamless integration in shared software bases. The broader impact extended to software ecosystems: the SIMD constructs of OpenMP 4.0 (2013) facilitated vectorized parallel programming, CUDA exposed fused multiply-add for GPU-accelerated heterogeneous computing, and ARMv8-A (2013) incorporated FMA to expand its utility across architectures. From a 2025 perspective, FMA is a foundational element in AI and high-performance-computing pipelines, where its precision-preserving operations underpin matrix multiplications and model training. Intel's AVX-512 further amplifies FMA capabilities with 512-bit vectors for massive parallelism, while AMD's Zen line refines FMA execution for energy-efficient scaling in data centers and edge devices.

Software Support

Compiler and Intrinsic Support

Compilers provide support for the fused multiply-add (FMA) instruction set through command-line flags that enable instruction generation and through optimization passes that automatically insert FMA operations where beneficial. In the GNU Compiler Collection (GCC), the -mfma flag explicitly enables the use of FMA instructions, targeting processors that support the FMA3 extension, such as Haswell and later architectures. This flag allows GCC to generate FMA instructions during code optimization, particularly when combined with architecture-specific options like -march=haswell. Additionally, GCC can automatically contract floating-point expressions into FMA operations under settings such as -ffp-contract=fast or -ffast-math, which permit contraction (and, with -ffast-math, reassociation) for performance gains while potentially altering strict IEEE 754 semantics. The Clang compiler, part of the LLVM project, offers compatible support for FMA through flags mirroring GCC's, including -mfma to enable FMA3 instructions on supported targets. Like GCC, Clang generates FMA operations automatically in optimized code when targeting FMA-capable architectures (e.g., via -march=native on compatible hardware) and with floating-point contraction enabled via -ffp-contract=on or -ffast-math. This integration ensures portability across GCC and LLVM toolchains. For Intel-specific compilation, the Intel C++ Compiler (ICC, now part of the oneAPI DPC++/C++ Compiler) supports FMA via -xCORE-AVX2 or higher, which includes FMA3, and provides aggressive auto-vectorization that frequently inserts FMA for multiply-add patterns in loops. Microsoft Visual C++ (MSVC) enables FMA support through the /arch:AVX2 option, which activates FMA3 instructions alongside 256-bit AVX2 vector operations and has been available since Visual Studio 2013 Update 2. MSVC's optimizer can generate FMA instructions automatically for expressions like a * b + c under /O2 or higher, though this behavior is influenced by the floating-point model: /fp:fast allows contraction similar to -ffast-math in GCC/Clang, while under the stricter /fp:precise model MSVC historically avoided FMA to preserve exact IEEE semantics, with more recent versions (e.g., Visual Studio 2019) inserting them selectively. Starting with Visual Studio 2022, the /fp:contract option provides explicit control over contraction under /fp:precise. Intrinsic functions provide low-level access to FMA instructions, allowing developers to invoke them explicitly without inline assembly. For FMA3, the Intel Intrinsics Guide lists a comprehensive set of functions in <immintrin.h>, such as _mm_fmadd_ps for single-precision packed floats (performing a * b + c on 128-bit vectors) and _mm256_fmadd_pd for double-precision 256-bit vectors. These are supported across GCC, Clang, ICC, and MSVC when the appropriate headers are included and target flags are set, enabling portable SIMD code for FMA3 operations; GCC additionally exposes the underlying __builtin_ia32_* built-ins used to implement these intrinsics. For the AMD-specific FMA4 extension, support is more limited but available in GCC via the -mfma4 flag, which enables dedicated built-in functions such as __builtin_ia32_vfmaddps for 128-bit and 256-bit vectors, generating FMA4 instructions on compatible hardware such as Bulldozer and Piledriver processors. AMD documents the corresponding intrinsics in fma4intrin.h, including _mm_macc_ps and related multiply-add/subtract variants.
Clang supports the FMA4 intrinsics through LLVM's x86 backend, though their use is now rare because FMA4 was dropped in favor of FMA3 in modern AMD architectures such as Zen. MSVC does not expose distinct FMA4 intrinsics, relying instead on the general AVX/FMA3 headers on supported targets. Runtime detection of FMA support, such as via __builtin_cpu_supports("fma") in GCC or explicit CPUID checks, is recommended to ensure portability across FMA3 and FMA4 hardware. Support for FMA extends beyond x86 to other architectures such as ARM. For AArch64 targets, GCC and Clang generate fused operations for NEON and SVE units in optimized code, and ARM's intrinsics, documented in the ACLE (ARM C Language Extensions), include functions such as vfmaq_f32 for quad-word single-precision vectors.
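As a small illustration of contraction behavior (a sketch; exact defaults vary by compiler version and language mode):

c

#include <math.h>

/* With -mfma and -ffp-contract=fast (or -ffast-math), the compiler may contract
   this expression into a single FMA instruction; with -ffp-contract=off it must
   emit a separate multiply and add. */
double maybe_contracted(double a, double b, double c) {
    return a * b + c;
}

/* The C99 fma() call always requests fused semantics; with -mfma it typically
   compiles to a single FMA3 instruction rather than a library call. */
double always_fused(double a, double b, double c) {
    return fma(a, b, c);
}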

Assembler and Low-Level Usage

The fused multiply-add (FMA) instructions can be invoked directly in assembly language using their specific mnemonics, enabling low-level programmers to perform high-precision floating-point operations without an intermediate rounding. In assembly code, FMA instructions are written with operands specifying registers (e.g., XMM, YMM, or ZMM) or memory locations, and they use the VEX or EVEX prefixes of the vector extensions. Support must be verified at runtime using CPUID function 01H with ECX bit 12 for FMA3 (the form shared by Intel and AMD) or extended function 8000_0001H with ECX bit 16 for FMA4 (the AMD-only variant). FMA3 instructions, introduced alongside Intel's AVX2 and extended under AVX-512, use a three-operand format where the destination register also serves as one source, and the numeric suffix (132, 213, or 231) indicates the operand order: for example, VFMADD132PD computes dest = (dest × src3) + src2, where src2 and src3 denote the second and third operands. These instructions support packed single-precision (PS), packed double-precision (PD), scalar single-precision (SS), and scalar double-precision (SD) operations, with vector lengths of 128, 256, or 512 bits depending on the register size. A basic assembly example for packed double-precision fused multiply-add using YMM registers is:

vfmadd132pd ymm0, ymm1, ymm2 ; ymm0 = (ymm0 * ymm2) + ymm1

This performs the operation across four double-precision elements, leveraging VEX encoding for 256-bit vectors. For scalar operations, such as VFMADD132SD, the elements above the low double are carried over from the first source (which is also the destination), allowing integration with existing SSE-style code. Advanced features such as writemasking (e.g., {k1}{z}) and embedded rounding are available in the EVEX-encoded variants, as in:

vfmadd132pd zmm0 {k1}{z}, zmm1, zmm2 ; Conditional fused multiply-add with zeroing

These enable precise control over exceptions and over broadcasts from memory operands. In contrast, FMA4 instructions, introduced by AMD in the Bulldozer architecture and also supported by Piledriver cores, though deprecated and not officially supported in later architectures such as Zen, employ a four-operand format that avoids overwriting the sources: dest = (src1 * src2) +/- src3, encoded with VEX prefixes for the floating-point variants, while the related integer extensions use the XOP prefix. This design facilitates non-destructive operation, with one of the source operands allowed to be a memory location. An example for packed single-precision is:

vfmaddps xmm0, xmm1, xmm2, xmm3 ; xmm0 = (xmm1 * xmm2) + xmm3

For 256-bit YMM operations, the VEX.L bit is set to 1, as in VFMADDPS ymm0, ymm1, [mem256], ymm3, which loads one multiplicand from memory. Variants such as VFMSUBPD subtract the third source instead of adding it, and the VFNMADD forms negate the product, with the final result rounded according to the mode set in the MXCSR register. The related XOP integer instructions, such as VPMACSDD for packed 32-bit signed multiply-accumulate, extend fused usage to fixed-point computations but are less common in floating-point assembler code. Low-level usage in assemblers such as NASM, GAS, or MASM requires an assembler recent enough to recognize the FMA mnemonics (and, when intrinsics or inline assembly are compiled from C, flags such as -mfma), as well as attention to the alignment of memory operands used with aligned load/store instructions to avoid #GP faults. Programmers must also manage floating-point exceptions via MXCSR and dispatch conditionally between FMA3 and FMA4 code paths based on CPUID, since executing an unsupported encoding raises an invalid-opcode (#UD) exception. This approach is particularly useful in performance-critical kernels for linear algebra or signal processing, where manual scheduling of FMA instructions can optimize throughput on supported hardware.

Library and Framework Integration

The fused multiply-add (FMA) instruction set is integrated into various mathematical libraries, particularly those implementing the Basic Linear Algebra Subprograms (BLAS) and LAPACK standards, to enhance floating-point performance in vector and matrix operations. OpenBLAS, an open-source BLAS library, leverages FMA instructions on supported architectures such as Intel Haswell processors with AVX2, enabling fused operations in Level-2 and Level-3 BLAS routines for improved throughput in multiply-accumulate tasks common in linear algebra. Similarly, Intel's oneAPI Math Kernel Library (oneMKL) utilizes FMA for optimized vector scaling and addition, such as in the cblas_daxpy routine, where a · x + y is computed with a single rounding step per element to reduce error accumulation when the target CPU supports the instructions. ATLAS, another auto-tuned BLAS implementation, detects FMA availability during code generation and incorporates it into kernel optimizations for dense linear algebra, though its usage depends on the host architecture's capabilities. Higher-level scientific computing libraries build on these BLAS backends and so indirectly benefit from FMA acceleration. NumPy, a foundational Python library for numerical computing, employs SIMD optimizations including FMA3 in its universal function (ufunc) loops, such as for dot products and element-wise multiplications, with runtime CPU dispatch to select FMA paths on compatible x86_64 systems. This integration, outlined in NumPy Enhancement Proposal 38, aims for portable performance gains while keeping results within 1-3 units in the last place (ULP). SciPy, which extends NumPy with advanced scientific routines, inherits FMA support through its reliance on optimized BLAS implementations such as OpenBLAS or oneMKL, particularly in sparse linear algebra and optimization modules where fused operations reduce computational overhead. In machine-learning frameworks, FMA integration focuses on accelerating tensor operations and training. TensorFlow can be compiled with flags like -mfma to enable FMA instructions in performance-critical paths, such as matrix multiplications via its Eigen backend or oneDNN primitives, yielding performance improvements for CPU inference on supported hardware. PyTorch similarly exploits FMA through its CPU backend and the optional Intel Extension for PyTorch (IPEX), which uses oneDNN for fused multiply-add in bfloat16 (BF16) accumulations on recent Intel processors, enhancing deep-learning workloads with minimal precision loss. Both frameworks warn users if the binary was built without FMA support on a CPU that provides it, recommending source builds for optimal utilization.
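For example, application code typically reaches FMA hardware transparently through a BLAS call rather than explicit intrinsics (a sketch assuming a CBLAS implementation such as OpenBLAS or oneMKL is installed):

c

#include <cblas.h>

/* y := a*x + y; an FMA-capable BLAS backend implements the multiply-accumulate
   core of this routine with fused instructions on supported CPUs. */
void scaled_add(int n, double a, const double *x, double *y) {
    cblas_daxpy(n, a, x, 1, y, 1);
}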
