Hubbry Logo
Double-precision floating-point formatDouble-precision floating-point formatMain
Open search
Double-precision floating-point format
Community hub
Double-precision floating-point format
logo
7 pages, 0 posts
0 subscribers
Be the first to start a discussion here.
Be the first to start a discussion here.
Double-precision floating-point format
Double-precision floating-point format
from Wikipedia

Double-precision floating-point format (sometimes called FP64 or float64) is a floating-point number format, usually occupying 64 bits in computer memory; it represents a wide range of numeric values by using a floating radix point.

Double precision may be chosen when the range or precision of single precision would be insufficient.

In the IEEE 754 standard, the 64-bit base-2 format is officially referred to as binary64; it was called double in IEEE 754-1985. IEEE 754 specifies additional floating-point formats, including 32-bit base-2 single precision and, more recently, base-10 representations (decimal floating point).

One of the first programming languages to provide floating-point data types was Fortran.[citation needed] Before the widespread adoption of IEEE 754-1985, the representation and properties of floating-point data types depended on the computer manufacturer and computer model, and upon decisions made by programming-language implementers. E.g., GW-BASIC's double-precision data type was the 64-bit MBF floating-point format.

IEEE 754 double-precision binary floating-point format: binary64

[edit]

Double-precision binary floating-point is a commonly used format on PCs, due to its wider range over single-precision floating point, in spite of its performance and bandwidth cost. It is commonly known simply as double. The IEEE 754 standard specifies a binary64 as having:

The sign bit determines the sign of the number (including when this number is zero, which is signed).

The exponent field is an 11-bit unsigned integer from 0 to 2047, in biased form: an exponent value of 1023 represents the actual zero. Exponents range from −1022 to +1023 because exponents of −1023 (all 0s) and +1024 (all 1s) are reserved for special numbers.

The 53-bit significand precision gives from 15 to 17 significant decimal digits precision (2−53 ≈ 1.11 × 10−16). If a decimal string with at most 15 significant digits is converted to the IEEE 754 double-precision format, giving a normal number, and then converted back to a decimal string with the same number of digits, the final result should match the original string. If an IEEE 754 double-precision number is converted to a decimal string with at least 17 significant digits, and then converted back to double-precision representation, the final result must match the original number.[1]

The format is written with the significand having an implicit integer bit of value 1 (except for special data, see the exponent encoding below). With the 52 bits of the fraction (F) significand appearing in the memory format, the total precision is therefore 53 bits (approximately 16 decimal digits, 53 log10(2) ≈ 15.955). The bits are laid out as follows:

The real value assumed by a given 64-bit double-precision datum with a given biased exponent and a 52-bit fraction is

or

Between 252=4,503,599,627,370,496 and 253=9,007,199,254,740,992 the representable numbers are exactly the integers. For the next range, from 253 to 254, everything is multiplied by 2, so the representable numbers are the even ones, etc. Conversely, for the previous range from 251 to 252, the spacing is 0.5, etc.

The spacing as a fraction of the numbers in the range from 2n to 2n+1 is 2n−52. The maximum relative rounding error when rounding a number to the nearest representable one (the machine epsilon) is therefore 2−53.

The 11 bit width of the exponent allows the representation of numbers between 10−308 and 10308, with full 15–17 decimal digits precision. By compromising precision, the subnormal representation allows even smaller values up to about 5 × 10−324.

Exponent encoding

[edit]

The double-precision binary floating-point exponent is encoded using an offset-binary representation, with the zero offset being 1023; also known as exponent bias in the IEEE 754 standard. Examples of such representations would be:

e =000000000012=00116=1: (smallest exponent for normal numbers)
e =011111111112=3ff16=1023: (zero offset)
e =100000001012=40516=1029:
e =111111111102=7fe16=2046: (highest exponent)

The exponents 00016 and 7ff16 have a special meaning:

  • 000000000002=00016 is used to represent a signed zero (if F = 0) and subnormal numbers (if F ≠ 0); and
  • 111111111112=7ff16 is used to represent (if F = 0) and NaNs (if F ≠ 0),

where F is the fractional part of the significand. All bit patterns are valid encoding.

Except for the above exceptions, the entire double-precision number is described by:

In the case of subnormal numbers (e = 0) the double-precision number is described by:

Endianness

[edit]
Although many processors use little-endian storage for all types of data (integer, floating point), there are a number of hardware architectures where floating-point numbers are represented in big-endian form while integers are represented in little-endian form.[2] There are ARM processors that have mixed-endian floating-point representation for double-precision numbers: each of the two 32-bit words is stored as little-endian, but the most significant word is stored first. VAX floating point stores little-endian 16-bit words in big-endian order. Because there have been many floating-point formats with no network standard representation for them, the XDR standard uses big-endian IEEE 754 as its representation. It may therefore appear strange that the widespread IEEE 754 floating-point standard does not specify endianness.[3] Theoretically, this means that even standard IEEE floating-point data written by one machine might not be readable by another. However, on modern standard computers (i.e., implementing IEEE 754), one may safely assume that the endianness is the same for floating-point numbers as for integers, making the conversion straightforward regardless of data type. Small embedded systems using special floating-point formats may be another matter, however.

Double-precision examples

[edit]
  0 01111111111 00000000000000000000000000000000000000000000000000002
≙ 3FF0 0000 0000 000016
≙ +20 × 1
= 1
  0 01111111111 00000000000000000000000000000000000000000000000000012
≙ 3FF0 0000 0000 000116
≙ +20 × (1 + 2−52)
≈ 1.0000000000000002220 (the smallest number greater than 1)
  0 01111111111 00000000000000000000000000000000000000000000000000102
≙ 3FF0 0000 0000 000216
≙ +20 × (1 + 2−51)
≈ 1.0000000000000004441 (the second smallest number greater than 1)
  0 10000000000 00000000000000000000000000000000000000000000000000002
≙ 4000 0000 0000 000016
≙ +21 × 1
= 2
  1 10000000000 00000000000000000000000000000000000000000000000000002
≙ C000 0000 0000 000016
≙ −21 × 1
= −2
  0 10000000000 10000000000000000000000000000000000000000000000000002
≙ 4008 0000 0000 000016
≙ +21 × 1.12
= 112
= 3
  0 10000000001 00000000000000000000000000000000000000000000000000002
≙ 4010 0000 0000 000016
≙ +22 × 1
= 1002
= 4
  0 10000000001 01000000000000000000000000000000000000000000000000002
≙ 4014 0000 0000 000016
≙ +22 × 1.012
= 1012
= 5
  0 10000000001 10000000000000000000000000000000000000000000000000002
≙ 4018 0000 0000 000016
≙ +22 × 1.12
= 1102
= 6
  0 10000000011 01110000000000000000000000000000000000000000000000002
≙ 4037 0000 0000 000016
≙ +24 × 1.01112
= 101112
= 23
  0 01111111000 10000000000000000000000000000000000000000000000000002
≙ 3F88 0000 0000 000016
≙ +2−7 × 1.12
= 0.000000112
= 0.01171875 (3/256)
  0 00000000000 00000000000000000000000000000000000000000000000000012
≙ 0000 0000 0000 000116
≙ +2−1022 × 2−52
= 2−1074
≈ 4.9406564584124654 × 10−324 (smallest positive subnormal number)
  0 00000000000 11111111111111111111111111111111111111111111111111112
≙ 000F FFFF FFFF FFFF16
≙ +2−1022 × (1 − 2−52)
≈ 2.2250738585072009 × 10−308 (largest subnormal number)
  0 00000000001 00000000000000000000000000000000000000000000000000002
≙ 0010 0000 0000 000016
≙ +2−1022 × 1
≈ 2.2250738585072014 × 10−308 (smallest positive normal number)
  0 11111111110 11111111111111111111111111111111111111111111111111112
≙ 7FEF FFFF FFFF FFFF16
≙ +21023 × (2 − 2−52)
≈ 1.7976931348623157 × 10308 (largest normal number)
  0 00000000000 00000000000000000000000000000000000000000000000000002
≙ 0000 0000 0000 000016
≙ +0 (positive zero)
  1 00000000000 00000000000000000000000000000000000000000000000000002
≙ 8000 0000 0000 000016
≙ −0 (negative zero)
  0 11111111111 00000000000000000000000000000000000000000000000000002
≙ 7FF0 0000 0000 000016
≙ +∞ (positive infinity)
  1 11111111111 00000000000000000000000000000000000000000000000000002
≙ FFF0 0000 0000 000016
≙ −∞ (negative infinity)
  0 11111111111 00000000000000000000000000000000000000000000000000012
≙ 7FF0 0000 0000 000116
≙ NaN (sNaN on most processors, such as x86 and ARM)
  0 11111111111 10000000000000000000000000000000000000000000000000012
≙ 7FF8 0000 0000 000116
≙ NaN (qNaN on most processors, such as x86 and ARM)
  0 11111111111 11111111111111111111111111111111111111111111111111112
≙ 7FFF FFFF FFFF FFFF16
≙ NaN (an alternative encoding of NaN)
  0 01111111101 01010101010101010101010101010101010101010101010101012
≙ 3FD5 5555 5555 555516
≙ +2−2 × (1 + 2−2 + 2−4 + ... + 2−52)
≈ 0.33333333333333331483 (closest approximation to 1/3)
  0 10000000000 10010010000111111011010101000100010000101101000110002
≙ 4009 21FB 5444 2D1816
≈ 3.141592653589793116 (closest approximation to π)

Encodings of qNaN and sNaN are not completely specified in IEEE 754 and depend on the processor. Most processors, such as the x86 family and the ARM family processors, use the most significant bit of the significand field to indicate a quiet NaN; this is what is recommended by IEEE 754. The PA-RISC processors use the bit to indicate a signaling NaN.

By default, 1/3 rounds down, instead of up like single precision, because of the odd number of bits in the significand.

In more detail:

Given the hexadecimal representation 3FD5 5555 5555 555516,
  Sign = 0
  Exponent = 3FD16 = 1021
  Exponent Bias = 1023 (constant value; see above)
  Fraction = 5 5555 5555 555516
  Value = 2(Exponent − Exponent Bias) × 1.Fraction – Note that Fraction must not be converted to decimal here
        = 2−2 × (15 5555 5555 555516 × 2−52)
        = 2−54 × 15 5555 5555 555516
        = 0.333333333333333314829616256247390992939472198486328125
        ≈ 1/3

Execution speed with double-precision arithmetic

[edit]

Using double-precision floating-point variables is usually slower than working with their single precision counterparts. One area of computing where this is a particular issue is parallel code running on GPUs. For example, when using Nvidia's CUDA platform, calculations with double precision can take, depending on hardware, from 2 to 32 times as long to complete compared to those done using single precision.[4]

Additionally, many mathematical functions (e.g., sin, cos, atan2, log, exp and sqrt) need more computations to give accurate double-precision results, and are therefore slower.

Precision limitations on integer values

[edit]
  • Integers from −253 to 253 (−9,007,199,254,740,992 to 9,007,199,254,740,992) can be exactly represented.
  • Integers between 253 and 254 = 18,014,398,509,481,984 round to a multiple of 2 (even number).
  • Integers between 254 and 255 = 36,028,797,018,963,968 round to a multiple of 4.
  • Integers between 2n and 2n+1 round to a multiple of 2n−52.

Implementations

[edit]

Doubles are implemented in many programming languages in different ways. On processors with only dynamic precision, such as x86 without SSE2 (or when SSE2 is not used, for compatibility purpose) and with extended precision used by default, software may have difficulties to fulfill some requirements.

C and C++

[edit]

C and C++ offer a wide variety of arithmetic types. Double precision is not required by the standards (except by the optional annex F of C99, covering IEEE 754 arithmetic), but on most systems, the double type corresponds to double precision. However, on 32-bit x86 with extended precision by default, some compilers may not conform to the C standard or the arithmetic may suffer from double rounding.[5]

Fortran

[edit]

Fortran provides several integer and real types, and the 64-bit type real64, accessible via Fortran's intrinsic module iso_fortran_env, corresponds to double precision.

Common Lisp

[edit]

Common Lisp provides the types SHORT-FLOAT, SINGLE-FLOAT, DOUBLE-FLOAT and LONG-FLOAT. Most implementations provide SINGLE-FLOATs and DOUBLE-FLOATs with the other types appropriate synonyms. Common Lisp provides exceptions for catching floating-point underflows and overflows, and the inexact floating-point exception, as per IEEE 754. No infinities and NaNs are described in the ANSI standard, however, several implementations do provide these as extensions.

Java

[edit]

On Java before version 1.2, every implementation had to be IEEE 754 compliant. Version 1.2 allowed implementations to bring extra precision in intermediate computations for platforms like x87. Thus a modifier strictfp was introduced to enforce strict IEEE 754 computations. Strict floating point has been restored in Java 17.[6]

JavaScript

[edit]

As specified by the ECMAScript standard, all arithmetic in JavaScript shall be done using double-precision floating-point arithmetic.[7]

JSON

[edit]

The JSON data encoding format supports numeric values, and the grammar to which numeric expressions must conform has no limits on the precision or range of the numbers so encoded. However, RFC 8259 advises that, since IEEE 754 binary64 numbers are widely implemented, good interoperability can be achieved by implementations processing JSON if they expect no more precision or range than binary64 offers.[8]

Rust and Zig

[edit]

Rust and Zig have the f64 data type.[9][10]

See also

[edit]

Notes and references

[edit]
Revisions and contributorsEdit on WikipediaRead on Wikipedia
from Grokipedia
The , officially known as binary64 in the standard, is a 64-bit binary interchange format for representing real numbers in computer systems, consisting of a 1-bit sign field, an 11-bit biased exponent field, and a 52-bit significand field with an implicit leading 1 for normalized values. This format enables the approximation of a wide range of real numbers using scientific notation in binary, where the value is calculated as (-1)^sign × 2^(exponent - 1023) × (1 + fraction/2^52) for normalized numbers. Key characteristics include a precision of 53 bits for the significand, equivalent to approximately 16 decimal digits, allowing for high-fidelity representations in numerical computations. The of 1023 supports a from the smallest normalized positive value of about 2.225 × 10^{-308} to the largest of approximately 1.798 × 10^{308}, with subnormal numbers extending the underflow range down to around 4.940 × 10^{-324}. The format also accommodates special values such as signed zeros, infinities, and Not-a-Number () payloads for handling exceptional conditions in arithmetic operations. Developed as part of the standard, which was drafted starting in 1978 at the under the leadership of , the double-precision format has become the for in most modern processors and programming languages due to its balance of precision and range for scientific, engineering, and general-purpose computing. Subsequent revisions, including IEEE 754-2008 and IEEE 754-2019, have refined the standard while maintaining backward compatibility for binary64, ensuring its widespread adoption in hardware like x86, , and GPU architectures.

Overview and Fundamentals

Definition and Purpose

Double-precision floating-point format, also known as binary64 in the standard, is a 64-bit computational representation designed to approximate real numbers using a , an 11-bit exponent, and a 52-bit (mantissa). This structure allows for the encoding of a wide variety of numerical values in binary , where the number is expressed as ±(1 + fraction) × 2^(exponent - ), providing a normalized form for most representable values. The format was standardized in to ensure consistent representation and arithmetic operations across different computer systems, promoting portability in software and hardware implementations. The primary purpose of double-precision format in is to achieve a balance between numerical range and precision suitable for scientific simulations, analyses, and general-purpose calculations that require higher accuracy than single-precision alternatives. It supports a of approximately 10^{-308} to 10^{308}, enabling the representation of extremely large or small magnitudes without excessive loss of detail, and offers about 15 decimal digits of precision due to the 53-bit effective (including the implicit leading 1). This precision is sufficient for most applications where relative accuracy matters more than absolute exactness, such as in physics modeling or financial computations. Compared to fixed-point or formats, double-precision floating-point significantly reduces the risks of overflow and underflow by dynamically adjusting the binary point through the exponent, allowing seamless handling of scales from subatomic to astronomical without manual rescaling. This feature makes it indispensable for iterative algorithms and data processing where input values vary widely, ensuring computational stability and efficiency in diverse fields.

Historical Development and Standardization

The development of double-precision floating-point formats emerged in the 1950s and 1960s amid the growing need for mainframe computers to handle scientific and engineering computations requiring greater numerical range and accuracy than single-precision or integer formats could provide. The , introduced in 1954 as the first mass-produced computer with dedicated floating-point hardware, supported single-precision (36-bit) and double-precision (72-bit) binary formats, enabling more reliable processing of complex mathematical operations in fields like physics and . Subsequent systems, such as the 7094 in 1962, expanded these capabilities with hardware support for double-precision operations and index registers, further solidifying as essential for . Pre-IEEE implementations varied significantly, contributing to interoperability challenges, but several influenced the eventual standard. The DEC PDP-11 series, launched in 1970, featured a 64-bit double-precision floating-point format (G-floating) with an 8-bit exponent, 55-bit , and hidden-bit normalization, which bore close resemblance to the later IEEE design despite differences in (129 versus 1023) and handling of subnormals; this format's structure informed discussions on binary representation during standardization efforts. In the , divergent floating-point implementations across architectures led to severe portability issues, such as inconsistent behaviors and unreliable results (e.g., distinct nonzero values yielding zero under ), inflating costs and limiting numerical reliability. To address this "," the IEEE formed the Floating-Point Working Group in 1977 under the Microprocessor Standards Subcommittee, with initial meetings in November of that year; , a key consultant to and co-author of influential drafts, chaired efforts that balanced precision, range, and needs from industry stakeholders. The resulting standard formalized the 64-bit binary double-precision format (binary64), specifying a 1-bit , 11-bit biased exponent, and 52-bit (with implicit leading 1), thereby establishing a portable foundation for in binary systems. The core binary64 format has remained stable through subsequent revisions, which focused on enhancements rather than fundamental changes. IEEE 754-2008 introduced fused multiply-add (FMA) operations—computing x×y+zx \times y + z with a single rounding step to minimize error accumulation—along with decimal formats and refined , while preserving the unchanged structure of binary64 for . The 2019 revision (IEEE 754-2019) delivered minor bug fixes, clarified operations like augmented addition, and ensured upward compatibility, without altering the binary64 specification to maintain ecosystem stability. Adoption accelerated in the late 1980s, driven by hardware implementations that embedded the standard into mainstream processors. The 80387 , released in 1987 as a for the 80386, provided full support for double-precision arithmetic, including 64-bit operations with 53-bit precision and , enabling widespread use in x86-based personal computers and scientific applications by the 1990s.

IEEE 754 Binary64 Specification

Bit Layout and Components

The double-precision floating-point format, designated as binary64 in the standard, employs a fixed 64-bit structure to encode real numbers, balancing range and precision for computational applications. This layout divides the 64 bits into three primary fields: a , an exponent field, and a field. The occupies the most significant position (bit 63), with a value of 0 denoting a positive number and 1 indicating a . The exponent field spans the next 11 bits (bits 62 through 52), serving as an unsigned that scales the overall magnitude. The field, also known as the mantissa or , comprises the least significant 52 bits (bits 51 through 0), capturing the binary digits that define the number's precision. For normalized numbers, the represented value is given by the formula: (1)s×(1+f252)×2e1023(-1)^s \times \left(1 + \frac{f}{2^{52}}\right) \times 2^{e - 1023} where ss is the (0 or 1), ff is the bits interpreted as a 52-bit , and ee is the exponent field value as an 11-bit unsigned . The straightforwardly controls the number's polarity. The exponent field adjusts the binary scale factor, while the encodes the fractional part after the radix point. In normalized form, the assumes an implicit leading 1 (the "hidden bit") before the explicit 52 bits, yielding a total of 53 bits of precision for the mantissa. This hidden bit convention ensures efficient use of storage by omitting the redundant leading 1 in normalized representations. The 64-bit structure, with its 53-bit effective significand precision, supports approximately 15.95 decimal digits of accuracy, calculated as log10(253)\log_{10}(2^{53}).
FieldBit PositionsWidth (bits)Purpose
Sign631Polarity (0: positive, 1: negative)
Exponent62–5211Scaling factor (biased)
Significand51–052Precision digits (with hidden bit)

Exponent Encoding and Bias

In the IEEE 754 binary64 format, the exponent is represented by an 11-bit field that stores a value to accommodate both positive and negative exponents using an unsigned binary encoding. This bias mechanism adds a fixed offset to the true exponent, allowing the field to range from 0 to 2047 while mapping to effective exponents that span negative and positive values. The bias value for binary64 is 1023, calculated as 211112^{11-1} - 1. The true exponent ee is obtained by subtracting the from the encoded exponent EE: e=E1023e = E - 1023 This formula enables the representation of normalized numbers with exponents ranging from 1022-1022 to +1023+1023. For normalized values, EE ranges from 1 to 2046, ensuring an implicit leading bit of 1 and full precision. When the exponent field is all zeros (E=0E = 0), it denotes subnormal (denormalized) numbers, providing gradual underflow toward zero rather than abrupt flushing. In this case, the effective exponent is fixed at [1022](/page/1022)-[1022](/page/1022), and the value is computed as (1)s×(0+f/252)×2[1022](/page/1022)(-1)^s \times (0 + f / 2^{52}) \times 2^{-[1022](/page/1022)}, where ss is the and ff is the 52-bit field interpreted as a less than 1. The all-ones exponent field (E=2047E = 2047) is reserved for special values. The use of allows the exponent to be stored as an , supporting signed effective exponents without the complexities of representation, such as asymmetric ranges or additional hardware for sign handling. This design choice simplifies comparisons of floating-point magnitudes by treating the biased exponents as unsigned values and promotes efficient implementation across diverse hardware architectures.

Significand Representation and Normalization

In the IEEE 754 binary64 format, the significand, also known as the mantissa, consists of 52 explicitly stored bits in the trailing significand field, augmented by an implicit leading bit of 1 for normalized numbers, resulting in an effective precision of 53 bits. This design allows the significand to represent values in the range [1, 2) in binary, where the explicit bits capture the fractional part following the implicit integer bit. The choice of 53 bits provides a relative precision of approximately 2531.11×10162^{-53} \approx 1.11 \times 10^{-16} for numbers near 1, enabling the representation of about 15 to 17 decimal digits of accuracy. Normalization ensures that the is adjusted to have its leading bit as 1, maximizing the use of available bits for precision. During the normalization process, the binary representation of a number is shifted left or right until the most significant bit is 1, with corresponding adjustments to the exponent to maintain the overall value; this implicit leading 1 is not stored, freeing up space for additional al bits. For a normalized binary64 number, the value is thus given by 1.f×2E1.f \times 2^{E}, where ff is the 52-bit and EE is the unbiased exponent (with the biased exponent referenced from the encoding scheme). This normalization applies to all finite nonzero numbers except subnormals, ensuring consistent precision across the representable range. Denormalized (or subnormal) numbers are used to represent values smaller than the smallest , extending the range toward zero without underflow to zero. In this case, the exponent field is set to zero, and there is no implicit leading 1; instead, the is interpreted as 0.f×210220.f \times 2^{-1022}, where ff is the 52-bit , resulting in reduced precision that gradually decreases as more leading zeros appear in the . This mechanism fills the gap between zero and the minimum normalized value of 210222^{-1022}, with the smallest positive subnormal being 210742^{-1074}.

Special Values and Exceptions

In the IEEE 754 binary64 format, zero is encoded with an exponent field of all zeros (unbiased value 0) and a of all zeros, where the determines positive zero (+0) or negative zero (-0). Signed zeros are preserved in arithmetic operations and are significant in contexts like division, where they affect the sign of the resulting (e.g., 1/+0=+1 / +0 = +\infty and 1/0=1 / -0 = -\infty). This distinction ensures consistent handling of directional and branch cuts in complex arithmetic. Positive and negative are represented by setting the exponent field to all ones (biased value 2047) and the to all zeros, with the specifying the direction. These values arise from operations like overflow, where a result exceeds the largest representable finite number (approximately 1.8×103081.8 \times 10^{308}), or , producing ++\infty or -\infty based on signs. propagates through most arithmetic operations, such as +5=\infty + 5 = \infty, maintaining the expected mathematical behavior while signaling potential issues. Not-a-Number (NaN) values indicate indeterminate or invalid results and are encoded with the exponent field set to 2047 and a non-zero significand. NaNs are categorized as quiet NaNs (qNaNs), which have the most significant bit of the significand set to 1 and silently propagate through operations without raising exceptions, or signaling NaNs (sNaNs), which have that bit set to 0 and trigger the invalid operation exception upon use. The remaining 51 bits of the significand serve as a payload, allowing implementations to embed diagnostic information, such as the operation that generated the NaN, in line with IEEE 754 requirements for NaN propagation in arithmetic (e.g., NaN+3=NaN\text{NaN} + 3 = \text{NaN}). IEEE 754 defines five floating-point exceptions to handle edge cases, with default results that maintain computational continuity: the invalid operation exception, triggered by operations like 1\sqrt{-1}
Add your contribution
Related Hubs
User Avatar
No comments yet.