Recent from talks
Nothing was collected or created yet.
Quadruple-precision floating-point format
View on Wikipedia| Floating-point formats |
|---|
| IEEE 754 |
|
| Other |
| Alternatives |
| Tapered floating point |
| Computer architecture bit widths |
|---|
| Bit |
| Application |
| Binary floating-point precision |
| Decimal floating-point precision |
In computing, quadruple precision (or quad precision) is a binary floating-point–based computer number format that occupies 16 bytes (128 bits) with precision at least twice the 53-bit double precision.
This 128-bit quadruple precision is designed for applications needing results in higher than double precision,[1] and as a primary function, to allow computing double precision results more reliably and accurately by minimising overflow and round-off errors in intermediate calculations and scratch variables. William Kahan, primary architect of the original IEEE 754 floating-point standard noted, "For now the 10-byte Extended format is a tolerable compromise between the value of extra-precise arithmetic and the price of implementing it to run fast; very soon two more bytes of precision will become tolerable, and ultimately a 16-byte format ... That kind of gradual evolution towards wider precision was already in view when IEEE Standard 754 for Floating-Point Arithmetic was framed."[2]
In IEEE 754-2008 the 128-bit base-2 format is officially referred to as binary128.
IEEE 754 quadruple-precision binary floating-point format: binary128
[edit]The IEEE 754 standard specifies a binary128 as having:
- Sign bit: 1 bit
- Exponent width: 15 bits
- Significand precision: 113 bits (112 explicitly stored)
The sign bit determines the sign of the number (including when this number is zero, which is signed). "1" stands for negative.
This gives from 33 to 36 significant decimal digits precision. If a decimal string with at most 33 significant digits is converted to the IEEE 754 quadruple-precision format, giving a normal number, and then converted back to a decimal string with the same number of digits, the final result should match the original string. If an IEEE 754 quadruple-precision number is converted to a decimal string with at least 36 significant digits, and then converted back to quadruple-precision representation, the final result must match the original number.[3]
The format is written with an implicit lead bit with value 1 unless the exponent is stored with all zeros (used to encode subnormal numbers and zeros). Thus only 112 bits of the significand appear in the memory format, but the total precision is 113 bits (approximately 34 decimal digits: log10(2113) ≈ 34.016) for normal values; subnormals have gracefully degrading precision down to 1 bit for the smallest non-zero value. The bits are laid out as:
Exponent encoding
[edit]The quadruple-precision binary floating-point exponent is encoded using an offset binary representation, with the zero offset being 16383; this is also known as exponent bias in the IEEE 754 standard.
- Emin = 000116 − 3FFF16 = −16382
- Emax = 7FFE16 − 3FFF16 = 16383
- Exponent bias = 3FFF16 = 16383
Thus, as defined by the offset binary representation, in order to get the true exponent, the offset of 16383 has to be subtracted from the stored exponent.
The stored exponents 000016 and 7FFF16 are interpreted specially.
| Exponent | Significand zero | Significand non-zero | Equation |
|---|---|---|---|
| 000016 | 0, −0 | subnormal numbers | (−1)signbit × 2−16382 × 0.significandbits2 |
| 000116, ..., 7FFE16 | normalized value | (−1)signbit × 2exponentbits2 − 16383 × 1.significandbits2 | |
| 7FFF16 | ±∞ | NaN (quiet, signaling) | |
The minimum strictly positive (subnormal) value is 2−16494 ≈ 10−4965 and has a precision of only one bit. The minimum positive normal value is 2−16382 ≈ 3.3621 × 10−4932 and has a precision of 113 bits, i.e. ±2−16494 as well. The maximum representable value is 216384 − 216271 ≈ 1.1897 × 104932.
Quadruple precision examples
[edit]These examples are given in bit representation, in hexadecimal, of the floating-point value. This includes the sign, (biased) exponent, and significand.
|
0000 0000 0000 0000 0000 0000 0000 000116 = 2−16382 × 2−112 = 2−16494 0000 ffff ffff ffff ffff ffff ffff ffff16 = 2−16382 × (1 − 2−112) 0001 0000 0000 0000 0000 0000 0000 000016 = 2−16382 7ffe ffff ffff ffff ffff ffff ffff ffff16 = 216383 × (2 − 2−112) 3ffe ffff ffff ffff ffff ffff ffff ffff16 = 1 − 2−113 3fff 0000 0000 0000 0000 0000 0000 000016 = 1 (one) 3fff 0000 0000 0000 0000 0000 0000 000116 = 1 + 2−112 4000 0000 0000 0000 0000 0000 0000 000016 = 2 0000 0000 0000 0000 0000 0000 0000 000016 = 0 7fff 0000 0000 0000 0000 0000 0000 000016 = infinity 3ffd 5555 5555 5555 5555 5555 5555 555516 ≈ 0.3333333333333333333333333333333333173 4000 921f b544 42d1 8469 898c c517 01b816 ≈ 3.1415926535897932384626433832795027975 4008 74d9 9564 5aa0 0c11 d0cc 9770 5e5b16 ≈ 745.69987158227021999999999999999997147 |
By default, 1/3 rounds down like double precision, because of the odd number of bits in the significand. Thus, the bits beyond the rounding point are 0101... which is less than 1/2 of a unit in the last place.
Double-double arithmetic
[edit]A common software technique to implement nearly quadruple precision using pairs of double-precision values is sometimes called double-double arithmetic.[4][5][6] Using pairs of IEEE double-precision values with 53-bit significands, double-double arithmetic provides operations on numbers with significands of at least[4] 2 × 53 = 106 bits (actually 107 bits[7] except for some of the largest values, due to the limited exponent range), only slightly less precise than the 113-bit significand of IEEE binary128 quadruple precision. The range of a double-double remains essentially the same as the double-precision format because the exponent has still 11 bits,[4] significantly lower than the 15-bit exponent of IEEE quadruple precision (a range of 1.8 × 10308 for double-double versus 1.2 × 104932 for binary128).
In particular, a double-double/quadruple-precision value q in the double-double technique is represented implicitly as a sum q = x + y of two double-precision values x and y, each of which supplies half of q's significand.[5] That is, the pair (x, y) is stored in place of q, and operations on q values (+, −, ×, ...) are transformed into equivalent (but more complicated) operations on the x and y values. Thus, arithmetic in this technique reduces to a sequence of double-precision operations; since double-precision arithmetic is commonly implemented in hardware, double-double arithmetic is typically substantially faster than more general arbitrary-precision arithmetic techniques.[4][5]
Note that double-double arithmetic has the following special characteristics:[8]
- As the magnitude of the value decreases, the amount of extra precision also decreases. Therefore, the smallest number in the normalized range is narrower than double precision. The smallest number with full precision is 1000...02 (106 zeros) × 2−1074, or 1.000...02 (106 zeros) × 2−968. Numbers whose magnitude is smaller than 2−1021 will not have additional precision compared with double precision.
- The actual number of bits of precision can vary. In general, the magnitude of the low-order part of the number is no greater than a half ULP of the high-order part. If the low-order part is less than half ULP of the high-order part, significant bits (either all 0s or all 1s) are implied between the significand of the high-order and low-order numbers. Certain algorithms that rely on having a fixed number of bits in the significand can fail when using 128-bit long double numbers.
- Because of the reason above, it is possible to represent values like 1 + 2−1074, which is the smallest representable number greater than 1.
In addition to the double-double arithmetic, it is also possible to generate triple-double or quad-double arithmetic if higher precision is required without any higher precision floating-point library. They are represented as a sum of three (or four) double-precision values respectively. They can represent operations with at least 159/161 and 212/215 bits respectively. A natural extension to an arbitrary number of terms (though limited by the exponent range) is called floating-point expansions.
A similar technique can be used to produce a double-quad arithmetic, which is represented as a sum of two quadruple-precision values. They can represent operations with at least 226 (or 227) bits.[9]
Implementations
[edit]Quadruple precision is often implemented in software by a variety of techniques (such as the double-double technique above, although that technique does not implement IEEE quadruple precision), since direct hardware support for quadruple precision is, as of 2016[update], less common (see "Hardware support" below). One can use general arbitrary-precision arithmetic libraries to obtain quadruple (or higher) precision, but specialized quadruple-precision implementations may achieve higher performance.
Computer-language support
[edit]A separate question is the extent to which quadruple-precision types are directly incorporated into computer programming languages.
Quadruple precision is specified in Fortran by the real(real128) (module iso_fortran_env from Fortran 2008 must be used, the constant real128 is equal to 16 on most processors), or as real(selected_real_kind(33, 4931)), or in a non-standard way as REAL*16. (Quadruple-precision REAL*16 is supported by the Intel Fortran Compiler[10] and by the GNU Fortran compiler[11] on x86, x86-64, and Itanium architectures, for example.)
For the C programming language, ISO/IEC TS 18661-3 (floating-point extensions for C, interchange and extended types) specifies _Float128 as the type implementing the IEEE 754 quadruple-precision format (binary128).[12] Alternatively, in C/C++ with a few systems and compilers, quadruple precision may be specified by the long double type, but this is not required by the language (which only requires long double to be at least as precise as double), nor is it common.
As of C++23, the C++ language defines a <stdfloat> header that contains fixed-width floating-point types. Implementations of these are optional, but if supported, std::float128_t corresponds to quadruple precision.
On x86 and x86-64, the most common C/C++ compilers implement long double as either 80-bit extended precision (e.g. the GNU C Compiler gcc[13] and the Intel C++ Compiler with a /Qlong‑double switch[14]) or simply as being synonymous with double precision (e.g. Microsoft Visual C++[15]), rather than as quadruple precision. The procedure call standard for the ARM 64-bit architecture (AArch64) specifies that long double corresponds to the IEEE 754 quadruple-precision format.[16] On a few other architectures, some C/C++ compilers implement long double as quadruple precision, e.g. gcc on PowerPC (as double-double[17][18][19]) and SPARC,[20] or the Sun Studio compilers on SPARC.[21] Even if long double is not quadruple precision, however, some C/C++ compilers provide a nonstandard quadruple-precision type as an extension. For example, gcc provides a quadruple-precision type called __float128 for x86, x86-64 and Itanium CPUs,[22] and on PowerPC as IEEE 128-bit floating-point using the -mfloat128-hardware or -mfloat128 options;[23] and some versions of Intel's C/C++ compiler for x86 and x86-64 supply a nonstandard quadruple-precision type called _Quad.[24]
Zig provides support for it with its f128 type.[25]
Google's work-in-progress language Carbon provides support for it with the type called f128.[26]
As of 2024, Rust is currently working on adding a new f128 type for IEEE quadruple-precision 128-bit floats.[27]
Libraries and toolboxes
[edit]- The GCC quad-precision math library, libquadmath, provides
__float128and__complex128operations. - The Boost multiprecision library Boost.Multiprecision provides unified cross-platform C++ interface for
__float128and_Quadtypes, and includes a custom implementation of the standard math library.[28] - The Multiprecision Computing Toolbox for MATLAB allows quadruple-precision computations in MATLAB. It includes basic arithmetic functionality as well as numerical methods, dense and sparse linear algebra.[29]
- The DoubleFloats[30] package provides support for double-double computations for the Julia programming language.
- The doubledouble.py[31] library enables double-double computations in Python. [citation needed]
- Mathematica supports IEEE quad-precision numbers: 128-bit floating-point values (Real128), and 256-bit complex values (Complex256).[citation needed]
Hardware support
[edit]IEEE quadruple precision was added to the IBM System/390 G5 in 1998,[32] and is supported in hardware in subsequent z/Architecture processors.[33][34] The IBM POWER9 CPU (Power ISA 3.0) has native 128-bit hardware support.[23]
Native support of IEEE 128-bit floats is defined in PA-RISC 1.0,[35] and in SPARC V8[36] and V9[37] architectures (e.g. there are 16 quad-precision registers %q0, %q4, ...), but no SPARC CPU implements quad-precision operations in hardware as of 2004[update].[38]
Non-IEEE extended-precision (128 bits of storage, 1 sign bit, 7 exponent bits, 112 fraction bits, 8 bits unused) was added to the IBM System/370 series (1970s–1980s) and was available on some System/360 models in the 1960s (System/360-85,[39] -195, and others by special request or simulated by OS software).
The Siemens 7.700 and 7.500 series mainframes and their successors support the same floating-point formats and instructions as the IBM System/360 and System/370.
The VAX processor implemented non-IEEE quadruple-precision floating point as its "H Floating-point" format. It had one sign bit, a 15-bit exponent and 112-fraction bits, however the layout in memory was significantly different from IEEE quadruple precision and the exponent bias also differed. Only a few of the earliest VAX processors implemented H Floating-point instructions in hardware, all the others emulated H Floating-point in software.
The NEC Vector Engine architecture supports adding, subtracting, multiplying and comparing 128-bit binary IEEE 754 quadruple-precision numbers.[40] Two neighboring 64-bit registers are used. Quadruple-precision arithmetic is not supported in the vector register.[41]
The RISC-V architecture specifies a "Q" (quad-precision) extension for 128-bit binary IEEE 754-2008 floating-point arithmetic.[42] The "L" extension (not yet certified) will specify 64-bit and 128-bit decimal floating point.[43]
Quadruple-precision (128-bit) hardware implementation should not be confused with "128-bit FPUs" that implement SIMD instructions, such as Streaming SIMD Extensions or AltiVec, which refers to 128-bit vectors of four 32-bit single-precision or two 64-bit double-precision values that are operated on simultaneously.
See also
[edit]- IEEE 754, IEEE standard for floating-point arithmetic
- ISO/IEC 10967, Language independent arithmetic
- Primitive data type
- Q notation (scientific notation)
References
[edit]- ^ Bailey, David H.; Borwein, Jonathan M. (July 6, 2009). "High-Precision Computation and Mathematical Physics" (PDF).
- ^ Higham, Nicholas (2002). "Designing stable algorithms" in Accuracy and Stability of Numerical Algorithms (2 ed). SIAM. p. 43.
- ^ Kahan, Wiliam (1 October 1987). "Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic" (PDF).
- ^ a b c d Yozo Hida, X. Li, and D. H. Bailey, Quad-Double Arithmetic: Algorithms, Implementation, and Application, Lawrence Berkeley National Laboratory Technical Report LBNL-46996 (2000). Also Y. Hida et al., Library for double-double and quad-double arithmetic (2007).
- ^ a b c J. R. Shewchuk, Adaptive Precision Floating-Point Arithmetic and Fast Robust Geometric Predicates, Discrete & Computational Geometry 18: 305–363, 1997.
- ^ Knuth, D. E. The Art of Computer Programming (2nd ed.). chapter 4.2.3. problem 9.
- ^ Robert Munafo. F107 and F161 High-Precision Floating-Point Data Types (2011).
- ^ 128-Bit Long Double Floating-Point Data Type.
- ^ sourceware.org Re: The state of glibc libm
- ^ "Intel Fortran Compiler Product Brief (archived copy on web.archive.org)" (PDF). Su. Archived from the original on October 25, 2008. Retrieved 2010-01-23.
- ^ "GCC 4.6 Release Series - Changes, New Features, and Fixes". Retrieved 2010-02-06.
- ^ "ISO/IEC TS 18661-3" (PDF). 2015-06-10. Retrieved 2019-09-22.
- ^ i386 and x86-64 Options (archived copy on web.archive.org), Using the GNU Compiler Collection.
- ^ Intel Developer Site.
- ^ MSDN homepage, about Visual C++ compiler.
- ^ "Procedure Call Standard for the ARM 64-bit Architecture (AArch64)" (PDF). 2013-05-22. Archived from the original (PDF) on 2019-10-16. Retrieved 2019-09-22.
- ^ RS/6000 and PowerPC Options, Using the GNU Compiler Collection.
- ^ Inside Macintosh – PowerPC Numerics. Archived October 9, 2012, at the Wayback Machine.
- ^ 128-bit long double support routines for Darwin Archived 2017-11-07 at the Wayback Machine.
- ^ SPARC Options, Using the GNU Compiler Collection.
- ^ The Math Libraries, Sun Studio 11 Numerical Computation Guide (2005).
- ^ Additional Floating Types, Using the GNU Compiler Collection
- ^ a b "GCC 6 Release Series - Changes, New Features, and Fixes". Retrieved 2016-09-13.
- ^ Intel C++ Forums (2007).
- ^ "Floats". ziglang.org. Retrieved 7 January 2024.
- ^ "Carbon Language's main repository - Language design". GitHub. 2022-08-09. Retrieved 2022-09-22.
- ^ Cross, Travis. "Tracking Issue for f16 and f128 float types". GitHub. Retrieved 2024-07-05.
- ^ "Boost.Multiprecision – float128". Retrieved 2015-06-22.
- ^ Holoborodko, Pavel (2013-01-20). "Fast Quadruple Precision Computations in MATLAB". Retrieved 2015-06-22.
- ^ "DoubleFloats.jl". GitHub.
- ^ "doubledouble.py". GitHub.
- ^ Schwarz, E. M.; Krygowski, C. A. (September 1999). "The S/390 G5 floating-point unit". IBM Journal of Research and Development. 43 (5/6): 707–721. CiteSeerX 10.1.1.117.6711. doi:10.1147/rd.435.0707.
- ^ Gerwig, G.; Wetter, H.; Schwarz, E. M.; Haess, J.; Krygowski, C. A.; Fleischer, B. M.; Kroener, M. (May 2004). "The IBM eServer z990 floating-point unit. IBM J. Res. Dev. 48". pp. 311–322.
- ^ Schwarz, Eric (June 22, 2015). "The IBM z13 SIMD Accelerators for Integer, String, and Floating-Point" (PDF). Archived from the original (PDF) on July 13, 2015. Retrieved July 13, 2015.
- ^ "Implementor support for the binary interchange formats". IEEE. Archived from the original on 2017-10-27. Retrieved 2021-07-15.
- ^ The SPARC Architecture Manual: Version 8 (archived copy on web.archive.org) (PDF). SPARC International, Inc. 1992. Archived from the original (PDF) on 2005-02-04. Retrieved 2011-09-24.
SPARC is an instruction set architecture (ISA) with 32-bit integer and 32-, 64-, and 128-bit IEEE Standard 754 floating-point as its principal data types.
- ^ Weaver, David L.; Germond, Tom, eds. (1994). The SPARC Architecture Manual: Version 9 (archived copy on web.archive.org) (PDF). SPARC International, Inc. Archived from the original (PDF) on 2012-01-18. Retrieved 2011-09-24.
Floating-point: The architecture provides an IEEE 754-compatible floating-point instruction set, operating on a separate register file that provides 32 single-precision (32-bit), 32 double-precision (64-bit), 16 quad-precision (128-bit) registers, or a mixture thereof.
- ^ "SPARC Behavior and Implementation". Numerical Computation Guide — Sun Studio 10. Sun Microsystems, Inc. 2004. Retrieved 2011-09-24.
There are four situations, however, when the hardware will not successfully complete a floating-point instruction: ... The instruction is not implemented by the hardware (such as ... quad-precision instructions on any SPARC FPU).
- ^ Padegs, A. (1968). "Structural aspects of the System/360 Model 85, III: Extensions to floating-point architecture". IBM Systems Journal. 7: 22–29. doi:10.1147/sj.71.0022.
- ^ Vector Engine AssemblyLanguage Reference Manual, Chapter4 Assembler Syntax page 23.
- ^ SX-Aurora TSUBASA Architecture Guide Revision 1.1, pp. 38, 60.
- ^ RISC-V ISA Specification v. 20191213, Chapter 13, “Q” Standard Extension for Quad-Precision Floating-Point, page 79.
- ^ [1] Chapter 15, p. 95.
External links
[edit]- High-Precision Software Directory
- QPFloat, a free software (GPL) software library for quadruple-precision arithmetic
- HPAlib, a free software (LGPL) software library for quad-precision arithmetic
- libquadmath, the GCC quad-precision math library
- IEEE-754 Analysis, interactive web page for examining binary32, binary64, and binary128 floating-point values
Quadruple-precision floating-point format
View on GrokipediaOverview
Definition and Standards
Quadruple-precision floating-point format, also known as binary128, is a binary floating-point representation that utilizes 128 bits to encode numerical values with high accuracy, providing approximately 34 decimal digits of precision.[5] This format builds on the fundamental structure of floating-point numbers, which consist of a sign bit to indicate positive or negative values, an exponent to scale the magnitude, and a significand (also called mantissa) to represent the fractional part of the number.[1] The IEEE 754-1985 standard specified single (binary32) and double (binary64) precisions, along with an optional extended precision format without a rigidly specified bit layout. It was not until the IEEE 754-2008 revision that quadruple precision was formally defined as the binary128 interchange format, establishing precise encoding rules to ensure portability and consistency across computing systems.[5][3] This revision defined binary128 as a standardized format suitable for applications requiring greater numerical fidelity than double precision's roughly 15-16 decimal digits.[5] In the IEEE 754-2008 standard, binary128 serves as both an arithmetic format and an interchange format, mandating exact representation for data exchange between implementations to prevent loss of precision during transfers.[1] The subsequent IEEE 754-2019 revision retained this definition with minor clarifications, reinforcing binary128's role in supporting high-precision computations while emphasizing compatibility with prior formats.[1] These standards collectively ensure that quadruple precision adheres to uniform rules for operations, rounding, and exception handling, promoting reliability in scientific and engineering applications.[5]Precision and Range Characteristics
The quadruple-precision floating-point format, as defined in IEEE 754-2008 under the binary128 interchange format, provides a significand of 113 bits, consisting of 112 explicitly stored bits and 1 implicit leading bit for normalized numbers. This configuration delivers approximately 34 decimal digits of precision, enabling highly accurate representations in computations where double-precision (about 15-16 decimal digits) is insufficient. The dynamic range is determined by a 15-bit exponent field with a bias of 16383, allowing unbiased exponents from -16382 to +16383 for normalized finite values. Consequently, the smallest positive normalized value is approximately 3.4 × 10^{-4932}, while the largest is approximately 1.2 × 10^{4932}. The numerical value of a finite non-zero binary128 number is given by the formula: where is the sign bit (0 or 1), is the 112-bit mantissa (fraction), and is the biased exponent (an integer from 1 to 32766).[6][7] Relative precision is characterized by the machine epsilon, which is , representing the spacing between representable numbers around 1.0 and the maximum relative error bound in rounding operations. Additions to 1.0 below this threshold may not be distinguishable in the format. Such precision supports applications in scientific simulations demanding extreme accuracy, though hardware support remains limited.Applications and Use Cases
Quadruple-precision floating-point format finds primary application in high-precision scientific computing domains where double-precision arithmetic (approximately 15-16 decimal digits) proves insufficient for maintaining accuracy over extended computations. In celestial mechanics, it enables precise solutions to the Kepler equation and long-term orbit integrations by minimizing accumulation of rounding errors in iterative solvers. For instance, specialized algorithms for hyperbolic Kepler equation solutions achieve superior precision in quadruple mode compared to double, supporting accurate trajectory predictions in space mission planning.[8] Similarly, in quantum chemistry computations, quadruple precision facilitates reliable evaluation of complex integrals and eigenvalue problems, where small numerical instabilities can propagate significantly in methods like coupled-cluster theory.[9] Numerical analysis tasks, such as solving stiff differential equations via high-order Runge-Kutta-Nyström integrators, also benefit from its extended range and precision, allowing stable simulations of quantum mechanical systems with up to 34 decimal digits of accuracy. The format's advantages stem from its capacity to reduce rounding errors in iterative algorithms, thereby enabling longer stable computations without precision loss. In orbital propagation, hybrid-precision approaches combining double and quadruple arithmetic for increment calculations preserve dynamical fidelity over millennia-scale integrations, outperforming pure double-precision methods in symplectic solvers for N-body problems. This error mitigation is particularly valuable in applications requiring verifiable high accuracy, such as recognizing numerical constants or discovering mathematical identities through extended arithmetic. Despite these benefits, quadruple precision introduces challenges, including substantially increased computational cost and memory usage, often necessitating trade-offs in large-scale simulations. Software implementations typically run 5 to 10 times slower than double precision due to the complexity of 128-bit operations, while memory demands double compared to 64-bit formats, impacting scalability in memory-bound workloads. In climate modeling, where global simulations involve vast grids and iterative atmospheric dynamics, the heightened precision helps mitigate instabilities in sensitive parametrizations but at the expense of prolonged run times, prompting hybrid strategies to balance accuracy and efficiency. Particle physics simulations similarly employ it for high-accuracy integrators in event generators, though the overhead limits routine use to critical subroutines, as seen in precision-sensitive particle-in-cell methods. Specific projects illustrate these applications: In quantum chemistry, software frameworks supporting extended precision, like those integrating quadruple arithmetic for correlated methods, enable benchmark calculations that validate theoretical models against experimental data. As of 2024, libraries like NumPy have introduced native support for quadruple precision, facilitating its use in Python-based scientific workflows.[10]IEEE 754 Binary128 Format
Bit Layout and Components
The binary128 format, as specified in the IEEE 754 standard, allocates its 128 bits into three primary fields: a 1-bit sign field, a 15-bit exponent field, and a 112-bit mantissa field (also referred to as the significand or fraction). This structure enables representation of a wide range of real numbers with high precision, where the total bit width balances sign determination, scaling, and fractional detail. The sign bit, occupying the most significant position, determines the polarity of the represented value: 0 denotes a non-negative number, while 1 indicates a negative number. The exponent field provides the scaling factor for the number's magnitude, allowing adjustment of the position of the binary point. The mantissa field stores the fractional digits, capturing the significant bits that define the number's precision beyond the scaling. Visually, the bit layout is structured as:| Sign (1 bit) | Exponent (15 bits) | Mantissa (112 bits) |
| Sign (1 bit) | Exponent (15 bits) | Mantissa (112 bits) |
Exponent Encoding and Bias
In the IEEE 754 binary128 quadruple-precision format, the exponent is represented by a 15-bit field, allowing for 2^{15} = 32,768 possible encoded values ranging from 0 to 32,767.[1] The all-zero encoding (0) and all-one encoding (32,767) are reserved for special values, leaving the range 1 through 32,766 for finite numbers.[1] This design follows the general IEEE 754 convention for binary floating-point formats, where the exponent field width determines the bias as 2^{14} - 1 = 16,383 to enable symmetric representation around zero.[1] For normal numbers, the actual exponent e is obtained by subtracting the bias from the encoded exponent value E, yielding e = E - 16,383, where E ranges from 1 to 32,766.[1] This biasing shifts the exponent range to positive encoded values, facilitating unsigned binary storage while supporting both positive and negative powers of two from e = -16,382 to e = 16,383.[1] The bias ensures that the smallest normal exponent aligns efficiently with the format's precision requirements. Denormalized numbers, also known as subnormals, are encoded with E = 0 and a non-zero trailing significand, representing the smallest magnitudes without underflow to zero.[1] In this case, the effective exponent is fixed at e = -16,382 (equivalently, 1 - 16,383), and the significand is interpreted with an explicit leading zero bit rather than the implicit one used for normals, allowing gradual underflow.[1] This mechanism preserves relative precision near zero by scaling the significand appropriately with the exponent. Overflow and underflow conditions are handled through the reserved encodings: E = 32,767 (all ones) with a zero significand denotes infinity (±∞ depending on the sign bit), while the same encoding with a non-zero significand indicates a NaN (not-a-number).[1] Conversely, E = 0 with a zero significand represents zero (±0), distinguishing signed zeros for certain computations.[1] These conventions ensure consistent handling of exceptional cases across IEEE 754 formats.Mantissa Representation and Normalization
In the IEEE 754 binary128 quadruple-precision format, the mantissa, also known as the significand, is represented by a 112-bit fraction field stored explicitly in the bit layout. For normalized numbers, an implicit leading bit of 1 is assumed before the fraction, resulting in a total significand precision of 113 bits. This structure allows for high-fidelity representation of the fractional part, where the value of the significand is , with denoting the 112-bit fraction and the biased exponent (bias = 16383).[3] Normalization in binary128 ensures that the significand is always in the range for finite nonzero normal numbers by hiding the leading 1 bit, which optimizes storage and maintains consistent precision across the exponent range. The normalization process involves left-shifting the mantissa until the leading bit is 1, adjusting the exponent accordingly to preserve the value. This implicit bit convention, inherited from earlier IEEE 754 formats, maximizes the effective precision without dedicating a bit to the leading 1.[3] Denormalized (subnormal) numbers occur when the exponent field is all zeros (biased exponent ) and the fraction is nonzero, in which case the implicit leading bit is taken as 0 rather than 1. This design enables gradual underflow, allowing representation of values smaller than the smallest normalized number () down to approximately , but at the cost of reduced precision. The effective number of significant bits decreases progressively as more leading zeros appear in the significand; for instance, near zero, the precision can drop to as few as 1 bit when only the least significant bit of the fraction is set.[3] IEEE 754 mandates support for five rounding modes in arithmetic operations involving binary128 numbers, with rounding to nearest, ties to even as the default mode to minimize bias and ensure reproducibility. In this mode, results are rounded to the nearest representable value, and ties (exactly halfway between two representable values) are resolved by rounding to the even (least significant bit 0) mantissa. Other modes include directed rounding (toward positive or negative infinity, or toward zero) and ties away from zero, which may be selected for specific computational needs but can introduce directional bias if used pervasively. These modes apply uniformly to the 113-bit significand during normalization and post-operation adjustments.[3]Special Values and Edge Cases
In the IEEE 754 binary128 format, zero is represented by setting the exponent field to all zeros and the significand field to all zeros, with the sign bit specifying either positive zero (sign bit 0, all other bits 0) or negative zero (sign bit 1). These representations allow for signed zeros, which are treated as equivalent in most arithmetic operations and comparisons, such as addition or equality checks, to preserve mathematical consistency, though they may differ in specific contexts like reciprocals (1 / +0 yields +∞, while 1 / -0 yields -∞).[11] Infinite values are encoded with the 15-bit exponent field set to all ones (binary 111111111111111, or 32767 in decimal) and the 112-bit significand field set to zero, where the sign bit determines positive infinity or negative infinity. This encoding is used to represent the result of overflow conditions, such as when the magnitude of a computation exceeds the maximum representable finite value (approximately 1.18973149535723176502 × 10^{4932}), depending on the rounding mode and exception handling.[2] Not-a-Number (NaN) values are indicated by an exponent field of all ones and a non-zero significand field, allowing a vast payload for diagnostic information (up to 111 bits after the leading bit). Within this, quiet NaNs have the most significant bit of the significand set to 1, propagating through operations without signaling an invalid-operation exception, while signaling NaNs have this bit set to 0 and are intended to trigger such exceptions upon use in arithmetic.[12] Subnormal numbers, an edge case for gradual underflow, are represented with an exponent field of zero and a non-zero significand, effectively using the minimum exponent (–16382) without an implicit leading 1, filling the gap between zero and the smallest normalized number (approximately 3.36210314311209350626 × 10^{-4932}). This mechanism ensures smoother transitions near underflow thresholds compared to abrupt flushing to zero.Alternative High-Precision Methods
Double-Double Arithmetic
Double-double arithmetic is a software-emulated technique for achieving higher precision by representing numbers as the unevaluated sum of two IEEE 754 binary64 (double-precision) floating-point values, denoted as , where is the high part and is the low part with to ensure non-overlapping significands and maintain a canonical form.[13] This pairing effectively extends the significand to approximately 106 bits, providing about 32 decimal digits of precision, which approximates the capabilities of quadruple-precision formats like IEEE 754 binary128 without requiring native hardware support.[13] The representation allows for portable implementation across systems lacking 128-bit floating-point units, with renormalization steps applied after operations to preserve the bound on the low part.[13] Key operations in double-double arithmetic rely on error-free transformations to avoid precision loss. Addition of two double-double numbers begins by summing the high parts and low parts separately using floating-point addition, followed by an error-free summation algorithm such as TWO-SUM or FAST-TWO-SUM to combine the results exactly.[14] The TWO-SUM algorithm, for inputs and , computes , then derives the roundoff error such that exactly, ensuring non-overlapping components; FAST-TWO-SUM is a variant assuming for efficiency, using , , yielding the pair .[14] These steps, applied iteratively, bound the roundoff error in the sum by times the result magnitude, with subsequent renormalization to canonical form.[13] Multiplication employs a similar strategy, computing the product via the expansion , where each cross-term product uses a TWO-PROD algorithm analogous to TWO-SUM for exact decomposition.[13] This approach resembles a two-term Karatsuba multiplication, reducing the number of full-precision multiplications while capturing roundoff errors, resulting in an error bound of approximately for the overall operation.[13] Conversion to and from single double-precision involves rounding the combined value, preserving the effective unit roundoff of .[13] Compared to native quadruple-precision implementations, double-double arithmetic offers portability on hardware without 128-bit support, enabling high-precision computations in standard double-precision environments with performance closer to native doubles than arbitrary-precision libraries, albeit with operations roughly 4-10 times slower due to multiple steps.[13] This method has been foundational in libraries for scientific computing, providing rigorous error control for applications requiring extended precision.[13]Paired and Block Floating-Point Techniques
Paired floating-point techniques extend precision beyond standard double-precision formats by representing numbers as sums of multiple lower-precision components, often two or more, to achieve effective quadruple or higher accuracy without native hardware support. These methods, known as floating-point expansions, store a number as an unevaluated sum of non-overlapping machine-precision floats, allowing adaptive refinement of computations to control roundoff errors. For instance, a quad-precision value can be approximated using expansions of four double-precision numbers, enabling operations like addition and multiplication through algorithms such as EXPANSION-SUM and TWO-PRODUCT, which refine results by isolating error terms.[15][16] This technique, exemplified by algorithms that maintain a compensation accumulator for roundoff residuals, improves the accuracy of sums by up to the full extended precision available, though it requires careful control of register precision to avoid unintended intermediate rounding. Double-double arithmetic represents a related pairing approach, using two doubles to simulate higher precision, but paired methods generalize to longer expansions for broader applications.[17] Block floating-point techniques apply a shared exponent across a block or array of fixed-point mantissas, providing high dynamic range in signal processing applications where individual exponents would be inefficient. This format scales the entire block dynamically based on the maximum magnitude, preserving precision within fixed-point hardware while extending the effective range to handle signals with wide amplitude variations, such as in FFT computations. For example, in DSP implementations like the TMS320C54x, block floating-point improves signal-to-noise ratio by 41% to 58% over pure fixed-point methods for 64-point complex FFTs, by tracking and adjusting exponents stage-by-stage to prevent overflow. Roundoff error analysis in such systems shows that the shared exponent introduces bounded quantization noise, but the overall dynamic range gain makes it suitable for audio and radar processing.[18][19] Other techniques for quadruple-like precision involve multiple accumulators, where parallel single- or double-precision units sum partial results to form higher-precision outputs, particularly in fast Fourier transform (FFT) implementations requiring robust geometric or numerical predicates. In adaptive arithmetic frameworks, these accumulators refine expansions during FFT stages, ensuring exactness in predicates like orientation tests by eliminating zero components and normalizing sums, though at a computational cost of several times the base flops. Such methods are applied in robust geometric computing, where expansions of up to 16 terms handle ill-conditioned inputs without full quad-precision hardware.[16] These non-native approaches incur higher overhead than IEEE 754 binary128, and block formats limited to array-based domains like signal processing. They excel in specific areas such as audio processing for dynamic range in filters and cryptography for precise modular arithmetic, but their domain specificity and error management needs restrict general-purpose use.[15][18]Implementations and Support
Programming Language Integration
Several programming languages provide native support for quadruple-precision floating-point arithmetic, primarily through compiler extensions that implement the IEEE 754 binary128 format. In Fortran, the REAL(KIND=16) type specifier declares 128-bit floating-point variables, enabling high-precision computations with full support for arithmetic operations, including addition, multiplication, and division, as well as intrinsic functions like SQRT and EXP. This feature has been available in major Fortran compilers such as gfortran and Intel Fortran since the Fortran 95 standard, which introduced selected real kinds for extended precision. For input/output, Fortran's standard formatted I/O handles REAL(KIND=16) values, though precision control may require explicit format specifiers to avoid truncation. In C and C++, the GNU Compiler Collection (GCC) and Clang provide native support via the non-standard __float128 type, introduced in GCC version 4.3 in 2008 to facilitate quadruple-precision calculations on platforms like x86, x86-64, and PowerPC. This type supports standard arithmetic operators (+, -, *, /) and comparisons, with literals denoted by suffixes like 'q' (e.g., 1.0q). Input/output requires specialized functions from the libquadmath library, such as quadmath_snprintf for formatted printing, as standard printf does not directly support __float128. The ISO/IEC 9899:2023 (C23) standard later formalized optional interchange types like _Float128, which aliases to __float128 in compatible compilers, promoting greater portability in future codebases. Despite this support, quadruple-precision integration remains non-standard across most languages, leading to portability challenges; for instance, code using __float128 must be compiled with specific GCC or Clang flags, such as -mfloat128 on PowerPC Linux targets, and may fall back to software emulation on hardware without native 128-bit instructions. Languages like Java and Python lack built-in quadruple-precision types, often relying on external libraries for emulation.Software Libraries and Toolboxes
The GNU Multiple Precision Floating-Point Reliable (MPFR) library is a C library that implements arbitrary-precision binary floating-point arithmetic with correct rounding, allowing users to configure the mantissa precision to 128 bits for computations equivalent to quadruple precision.[20] It supports all standard IEEE 754 operations, including special values like infinities and NaNs, and is built on the GMP library for efficient handling of large exponents.[21] MPFR's emulation mode ensures portability across platforms without native hardware support, with performance optimizations such as assembly-optimized routines for basic arithmetic on x86 architectures. The QD library provides double-double and quad-double arithmetic types, where the double-double format—representing numbers as unevaluated sums of two IEEE double-precision values—delivers at least 106 bits of significand precision, approximating the 113-bit mantissa of quadruple precision while offering about 34 decimal digits of accuracy.[13] Implemented in ANSI C++ with full operator and function overloading, it enables seamless mixing of double, double-double, and quad-double types in expressions, supporting algebraic operations (e.g., roots and powers), transcendental functions (e.g., sine and logarithm), and input/output conversions.[13] For performance, QD employs error-free transformations and renormalization to minimize rounding errors, making it suitable for emulation on standard hardware; it has been integrated into C++ scientific applications, such as parallel simulations in computational fluid dynamics, where high accuracy is critical for long-running calculations.[22] Boost.Multiprecision is a C++ header-only library that includes thefloat128 type for direct support of the IEEE 754 binary128 quadruple-precision format, featuring a 113-bit mantissa, 15-bit exponent, and full compliance with std::numeric_limits for properties like maximum digits (36 decimal).[23] It offers drop-in replacement for built-in float types, with constexpr support for arithmetic and comparisons (from C++14 with GCC), and integrates with Boost.Math for special functions like gamma.[23] Emulation is handled via backends like GCC's __float128 or Intel's _Quad, with optimizations for conversion and I/O to reduce overhead in numerical algorithms.
MATLAB's Symbolic Math Toolbox incorporates variable-precision arithmetic through the vpa function, which evaluates expressions to a user-specified number of significant digits—such as 34 for quadruple-precision equivalence—using arbitrary-precision floating-point internally.[24] This toolbox supports numeric, symbolic, vector, and matrix operations, restoring exact forms (e.g., rational or pi-based) when possible and employing guard digits to mitigate round-off, though it relies on software emulation for precisions beyond native double.[24]
In Python environments, the mpmath library delivers arbitrary-precision real and complex floating-point arithmetic, configurable to 128 bits or higher (e.g., via mp.dps = 34 for quadruple-level decimal precision), and extends to NumPy through array conversions or wrappers like mpmath.matrix for vectorized high-precision computations.[25] It includes optimized algorithms for integration, special functions, and linear algebra, with emulation based on Python's arbitrary-precision integers or optional GMP backends for speed.[26]
Post-2020 developments have enhanced GPU compatibility for these libraries, including NVIDIA CUDA's addition of __float128 mathematical functions (e.g., sine, square root) for device code on platforms with host compiler support, enabling quadruple-precision emulation in parallel computing workflows.[27] Extensions like GQD, a GPU-ported version of the QD library, leverage CUDA for double-double and quad-double operations on NVIDIA hardware, achieving up to several times the throughput of CPU baselines for basic arithmetic in scientific simulations.[28]
