Floating-point arithmetic
from Wikipedia

An early electromechanical programmable computer, the Z3, included floating-point arithmetic (replica on display at Deutsches Museum in Munich).

In computing, floating-point arithmetic (FP) is arithmetic on subsets of real numbers formed by a significand (a signed sequence of a fixed number of digits in some base) multiplied by an integer power of that base. Numbers of this form are called floating-point numbers.[1]: 3 [2]: 10 

For example, the number 2469/200 is a floating-point number in base ten with five digits: 2469/200 = 12.345 = 12345 × 10⁻³, where 12345 is the significand, 10 is the base, and −3 is the exponent. However, 7716/625 = 12.3456 is not a floating-point number in base ten with five digits, because it needs six digits. The nearest floating-point number with only five digits is 12.346. And 1/3 = 0.3333… is not a floating-point number in base ten with any finite number of digits. In practice, most floating-point systems use base two, though base ten (decimal floating point) is also common.

Floating-point arithmetic operations, such as addition and division, approximate the corresponding real number arithmetic operations by rounding any result that is not a floating-point number itself to a nearby floating-point number.[1]: 22 [2]: 10  For example, in a floating-point arithmetic with five base-ten digits, the sum 12.345 + 1.0001 = 13.3451 might be rounded to 13.345.
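
The five-digit base-ten examples above can be reproduced with Python's standard `decimal` module. This is only an illustrative sketch: the names `ctx`, `exact`, `rounded`, and `nearest` are chosen here, and the round-half-to-even mode is an assumption, since the text does not say which rounding rule the toy arithmetic uses.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

# A toy floating-point arithmetic: base ten, five significant digits.
# The rounding mode is an assumption made for this sketch.
ctx = Context(prec=5, rounding=ROUND_HALF_EVEN)

# In the default 28-digit context the sum is computed exactly.
exact = Decimal("12.345") + Decimal("1.0001")

# In the five-digit context the exact sum is rounded to a nearby representable number.
rounded = ctx.add(Decimal("12.345"), Decimal("1.0001"))

# Converting a six-digit value into the five-digit system also rounds it.
nearest = ctx.create_decimal("12.3456")

print(exact)    # 13.3451
print(rounded)  # 13.345
print(nearest)  # 12.346
```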

The term floating point refers to the fact that the number's radix point can "float" anywhere to the left, right, or between the significant digits of the number. This position is indicated by the exponent, so floating point can be considered a form of scientific notation.

A floating-point system can be used to represent, with a fixed number of digits, numbers of very different orders of magnitude, such as the number of meters between galaxies or between protons in an atom. For this reason, floating-point arithmetic is often used where very small and very large real numbers must be processed quickly. A consequence of this dynamic range is that the representable numbers are not uniformly spaced; the difference between two consecutive representable numbers varies with their exponent.[3]

Single-precision floating-point numbers on a number line: the green lines mark representable values.
Augmented version above showing both signs of representable values

Over the years, a variety of floating-point representations have been used in computers. In 1985, the IEEE 754 Standard for Floating-Point Arithmetic was established, and since the 1990s, the most commonly encountered representations are those defined by the IEEE.

The speed of floating-point operations, commonly measured in terms of FLOPS, is an important characteristic of a computer system, especially for applications that involve intensive mathematical calculations.

Floating-point numbers can be computed using software implementations (softfloat) or hardware implementations (hardfloat). Floating-point units (FPUs, colloquially math coprocessors) are specially designed to carry out operations on floating-point numbers and are part of most computer systems. When FPUs are not available, software implementations can be used instead.

Overview

Floating-point numbers

A number representation specifies some way of encoding a number, usually as a string of digits.

There are several mechanisms by which strings of digits can represent numbers. In standard mathematical notation, the digit string can be of any length, and the location of the radix point is indicated by placing an explicit "point" character (dot or comma) there. If the radix point is not specified, then the string implicitly represents an integer and the unstated radix point would be off the right-hand end of the string, next to the least significant digit. In fixed-point systems, a position in the string is specified for the radix point. So a fixed-point scheme might use a string of 8 decimal digits with the decimal point in the middle, whereby "00012345" would represent 0001.2345.

In scientific notation, the given number is scaled by a power of 10, so that it lies within a specific range, typically between 1 and 10, with the radix point appearing immediately after the first digit. As a power of ten, the scaling factor is then indicated separately at the end of the number. For example, the orbital period of Jupiter's moon Io is 152,853.5047 seconds, a value that would be represented in standard-form scientific notation as 1.528535047×10⁵ seconds.

Floating-point representation is similar in concept to scientific notation. Logically, a floating-point number consists of:

  • A signed (meaning positive or negative) digit string of a given length in a given radix (or base). This digit string is referred to as the significand, mantissa, or coefficient.[nb 1] The length of the significand determines the precision to which numbers can be represented. The radix point position is assumed always to be somewhere within the significand—often just after or just before the most significant digit, or to the right of the rightmost (least significant) digit. This article generally follows the convention that the radix point is set just after the most significant (leftmost) digit.
  • A signed integer exponent (also referred to as the characteristic, or scale),[nb 2] which modifies the magnitude of the number.

To derive the value of the floating-point number, the significand is multiplied by the base raised to the power of the exponent, equivalent to shifting the radix point from its implied position by a number of places equal to the value of the exponent—to the right if the exponent is positive or to the left if the exponent is negative.

Using base-10 (the familiar decimal notation) as an example, the number 152,853.5047, which has ten decimal digits of precision, is represented as the significand 1528535047 together with 5 as the exponent. To determine the actual value, a decimal point is placed after the first digit of the significand and the result is multiplied by 10⁵ to give 1.528535047×10⁵, or 152,853.5047. In storing such a number, the base (10) need not be stored, since it will be the same for the entire range of supported numbers, and can thus be inferred.

Symbolically, this final value is:

$$\frac{s}{b^{p-1}} \times b^{e},$$

where s is the significand (ignoring any implied decimal point), p is the precision (the number of digits in the significand), b is the base (in our example, this is the number ten), and e is the exponent.
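
As a quick check of this rule (the significand divided by the base raised to p − 1, then multiplied by the base raised to e), the following sketch evaluates it exactly with Python's `fractions` module. The function name `fp_value` is made up for this example.

```python
from fractions import Fraction

def fp_value(s: int, p: int, b: int, e: int) -> Fraction:
    """Exact value (s / b**(p-1)) * b**e of a floating-point number.

    s: integer significand (radix point ignored), p: number of digits,
    b: base, e: exponent. The helper name is hypothetical.
    """
    return Fraction(s, b ** (p - 1)) * Fraction(b) ** e

# The Io orbital period example: s = 1528535047, p = 10, b = 10, e = 5.
value = fp_value(s=1528535047, p=10, b=10, e=5)

print(value)         # 1528535047/10000
print(float(value))  # 152853.5047
```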

Historically, several number bases have been used for representing floating-point numbers, with base two (binary) being the most common, followed by base ten (decimal floating point), and other less common varieties, such as base sixteen (hexadecimal floating point[4][5][nb 3]), base eight (octal floating point[1][5][6][4][nb 4]), base four (quaternary floating point[7][5][nb 5]), base three (balanced ternary floating point[1]) and even base 256[5][nb 6] and base 65,536.[8][nb 7]

A floating-point number is a rational number, because it can be represented as one integer divided by another; for example 1.45×10³ is (145/100)×1000 or 145,000/100. The base determines the fractions that can be represented; for instance, 1/5 cannot be represented exactly as a floating-point number using a binary base, but 1/5 can be represented exactly using a decimal base (0.2, or 2×10⁻¹). However, 1/3 cannot be represented exactly by either binary (0.010101...) or decimal (0.333...), but in base 3, it is trivial (0.1 or 1×3⁻¹). The occasions on which infinite expansions occur depend on the base and its prime factors.
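
The dependence on the prime factors of the base can be made concrete: a fraction in lowest terms has a finite expansion in a given base exactly when every prime factor of its denominator also divides the base. The sketch below tests that condition; the function name `terminates` is chosen for this example only.

```python
from fractions import Fraction
from math import gcd

def terminates(x: Fraction, base: int) -> bool:
    """True if x has a finite digit expansion in the given base."""
    d = x.denominator            # Fraction keeps this in lowest terms
    g = gcd(d, base)
    while g > 1:                 # strip every prime factor shared with the base
        while d % g == 0:
            d //= g
        g = gcd(d, base)
    return d == 1                # anything left over forces an infinite expansion

print(terminates(Fraction(1, 5), 2), terminates(Fraction(1, 5), 10))   # False True
print(terminates(Fraction(1, 3), 2), terminates(Fraction(1, 3), 10))   # False False
print(terminates(Fraction(1, 3), 3))                                   # True
```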

The way in which the significand (including its sign) and exponent are stored in a computer is implementation-dependent. The common IEEE formats are described in detail later and elsewhere, but as an example, in the binary single-precision (32-bit) floating-point representation, p = 24, and so the significand is a string of 24 bits. For instance, the number π's first 33 bits are:

11001001 00001111 11011010 10100010 0

In this binary expansion, let us denote the positions from 0 (leftmost bit, or most significant bit) to 32 (rightmost bit). The 24-bit significand will stop at position 23, which is the final bit 0 of the third group above. The next bit, at position 24, is called the round bit or rounding bit. It is used to round the 33-bit approximation to the nearest 24-bit number (there are specific rules for halfway values, which is not the case here). This bit, which is 1 in this example, is added to the integer formed by the leftmost 24 bits, yielding:

11001001 00001111 11011011

When this is stored in memory using the IEEE 754 encoding, this becomes the significand s. The significand is assumed to have a binary point to the right of the leftmost bit. So, the binary representation of π is calculated from left-to-right as follows:

$$\left(\sum_{n=0}^{p-1} \text{bit}_n \times 2^{-n}\right) \times 2^{e} = \left(1 \times 2^{0} + 1 \times 2^{-1} + 0 \times 2^{-2} + 0 \times 2^{-3} + 1 \times 2^{-4} + \cdots + 1 \times 2^{-23}\right) \times 2^{1} \approx 1.57079637 \times 2 \approx 3.1415927$$

where p is the precision (24 in this example), n is the position of the bit of the significand from the left (starting at 0 and finishing at 23 here) and e is the exponent (1 in this example).
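
The rounding of π to a 24-bit significand can be replayed in a few lines. This is a sketch under stated assumptions: it starts from Python's double-precision `math.pi` rather than from the true value of π, it handles only positive finite inputs, and the helper name `round_to_bits` is invented for the example. The last line cross-checks the result against the platform's own binary32 conversion through `struct`.

```python
import math
import struct

def round_to_bits(x: float, p: int):
    """Return (s, e) with x ≈ s / 2**(p-1) * 2**e, where s is a p-bit integer.

    Sketch for positive finite x only; the helper name is hypothetical.
    """
    m, e = math.frexp(x)       # x == m * 2**e with 0.5 <= m < 1
    s = round(m * 2 ** p)      # scaling by 2**p is exact; round() breaks ties to even
    if s == 2 ** p:            # rounding carried into a new bit: renormalize
        s //= 2
        e += 1
    return s, e - 1            # shift so the binary point follows the leading bit

s, e = round_to_bits(math.pi, 24)
value = s / 2 ** 23 * 2 ** e   # exact: a 24-bit significand fits in a double

print(format(s, "024b"), e)    # 110010010000111111011011 1
print(value == struct.unpack(">f", struct.pack(">f", math.pi))[0])  # True
```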

It can be required that the most significant digit of the significand of a non-zero number be non-zero (except when the corresponding exponent would be smaller than the minimum one). This process is called normalization. For binary formats (which use only the digits 0 and 1), this non-zero digit is necessarily 1. Therefore, it does not need to be represented in memory, allowing the format to have one more bit of precision. This rule is variously called the leading bit convention, the implicit bit convention, the hidden bit convention,[1] or the assumed bit convention.

Alternatives to floating-point numbers

The floating-point representation is by far the most common way of representing in computers an approximation to real numbers. However, there are alternatives:

  • Fixed-point representation uses integer hardware operations controlled by a software implementation of a specific convention about the location of the binary or decimal point, for example, 6 bits or digits from the right. The hardware to manipulate these representations is less costly than floating point, and it can be used to perform normal integer operations, too. Binary fixed point is usually used in special-purpose applications on embedded processors that can only do integer arithmetic, but decimal fixed point is common in commercial applications.
  • Logarithmic number systems (LNSs) represent a real number by the logarithm of its absolute value and a sign bit. The value distribution is similar to floating point, but the value-to-representation curve (i.e., the graph of the logarithm function) is smooth (except at 0). Conversely to floating-point arithmetic, in a logarithmic number system multiplication, division and exponentiation are simple to implement, but addition and subtraction are complex. The (symmetric) level-index arithmetic (LI and SLI) of Charles Clenshaw, Frank Olver and Peter Turner is a scheme based on a generalized logarithm representation.
  • Tapered floating-point representation, used in Unum formats, including Posit.
  • Some simple rational numbers (e.g., 1/3 and 1/10) cannot be represented exactly in binary floating point, no matter what the precision is. Using a different radix allows one to represent some of them (e.g., 1/10 in decimal floating point), but the possibilities remain limited. Software packages that perform rational arithmetic represent numbers as fractions with integral numerator and denominator, and can therefore represent any rational number exactly. Such packages generally need to use "bignum" arithmetic for the individual integers.
  • Interval arithmetic allows one to represent numbers as intervals and obtain guaranteed bounds on results. It is generally based on other arithmetics, in particular floating point.
  • Computer algebra systems such as Mathematica, Maxima, and Maple can often handle irrational numbers like π or √3 in a completely "formal" way (symbolic computation), without dealing with a specific encoding of the significand. Such a program can evaluate expressions like "sin(3π)" exactly, because it is programmed to process the underlying mathematics directly, instead of using approximate values for each intermediate calculation.
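
To illustrate the rational-arithmetic alternative from the list above, Python's standard `fractions` module stores each number as a pair of arbitrary-size integers. The comparison with binary floating point below is only a small demonstration; the variable names are chosen for the example.

```python
from fractions import Fraction

# Binary floating point: 0.1, 0.2 and 0.3 are each rounded, so the sum drifts.
float_equal = (0.1 + 0.2 == 0.3)

# Rational arithmetic: numerator and denominator are exact integers.
third, tenth = Fraction(1, 3), Fraction(1, 10)
exact_sum = third + tenth
rational_equal = (Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10))

print(float_equal)     # False
print(exact_sum)       # 13/30
print(rational_equal)  # True
```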

History

Leonardo Torres Quevedo, in 1914, published an analysis of floating point based on the analytical engine.

In 1914, the Spanish engineer Leonardo Torres Quevedo published Essays on Automatics,[9] where he designed a special-purpose electromechanical calculator based on Charles Babbage's analytical engine and described a way to store floating-point numbers in a consistent manner. He stated that numbers will be stored in exponential format as $n \times 10^{m}$, and offered three rules by which consistent manipulation of floating-point numbers by machines could be implemented. For Torres, "n will always be the same number of digits (e.g. six), the first digit of n will be of order of tenths, the second of hundredths, etc, and one will write each quantity in the form: n; m." The format he proposed shows the need for a fixed-sized significand as is presently used for floating-point data, fixing the location of the decimal point in the significand so that each representation was unique, and how to format such numbers by specifying a syntax to be used that could be entered through a typewriter, as was the case of his Electromechanical Arithmometer in 1920.[10][11][12]

Konrad Zuse, architect of the Z3 computer, which uses a 22-bit binary floating-point representation

In 1938, Konrad Zuse of Berlin completed the Z1, the first binary, programmable mechanical computer;[13] it uses a 24-bit binary floating-point number representation with a 7-bit signed exponent, a 17-bit significand (including one implicit bit), and a sign bit.[14] The more reliable relay-based Z3, completed in 1941, has representations for both positive and negative infinities; in particular, it implements defined operations with infinity, such as 1/∞ = 0, and it stops on undefined operations, such as 0 × ∞.

Zuse also proposed, but did not complete, carefully rounded floating-point arithmetic that includes ±∞ and NaN representations, anticipating features of the IEEE Standard by four decades.[15] In contrast, von Neumann recommended against floating-point numbers for the 1951 IAS machine, arguing that fixed-point arithmetic is preferable.[15]

The first commercial computer with floating-point hardware was Zuse's Z4 computer, designed in 1942–1945. In 1946, Bell Laboratories introduced the Model V, which implemented decimal floating-point numbers.[16]

The Pilot ACE has binary floating-point arithmetic, and it became operational in 1950 at the National Physical Laboratory, UK. Thirty-three were later sold commercially as the English Electric DEUCE. The arithmetic is actually implemented in software, but with a one megahertz clock rate, floating-point and fixed-point operations on this machine were initially faster than those of many competing computers.

The mass-produced IBM 704 followed in 1954; it introduced the use of a biased exponent. For many decades after that, floating-point hardware was typically an optional feature, and computers that had it were said to be "scientific computers", or to have "scientific computation" (SC) capability (see also Extensions for Scientific Computation (XSC)). It was not until the launch of the Intel i486 in 1989 that general-purpose personal computers had floating-point capability in hardware as a standard feature.

The UNIVAC 1100/2200 series, introduced in 1962, supported two floating-point representations:

  • Single precision: 36 bits, organized as a 1-bit sign, an 8-bit exponent, and a 27-bit significand.
  • Double precision: 72 bits, organized as a 1-bit sign, an 11-bit exponent, and a 60-bit significand.

The IBM 7094, also introduced in 1962, supported single-precision and double-precision representations, but with no relation to the UNIVAC's representations. Indeed, in 1964, IBM introduced hexadecimal floating-point representations in its System/360 mainframes; these same representations are still available for use in modern z/Architecture systems. In 1998, IBM implemented IEEE-compatible binary floating-point arithmetic in its mainframes; in 2005, IBM also added IEEE-compatible decimal floating-point arithmetic.

Initially, computers used many different representations for floating-point numbers. The lack of standardization at the mainframe level was an ongoing problem by the early 1970s for those writing and maintaining higher-level source code; these manufacturer floating-point standards differed in the word sizes, the representations, and the rounding behavior and general accuracy of operations. Floating-point compatibility across multiple computing systems was in desperate need of standardization by the early 1980s, leading to the creation of the IEEE 754 standard once the 32-bit (or 64-bit) word had become commonplace. This standard was significantly based on a proposal from Intel, which was designing the i8087 numerical coprocessor; Motorola, which was designing the 68000 around the same time, gave significant input as well.

William Kahan, principal architect of the IEEE 754 floating-point standard

In 1989, mathematician and computer scientist William Kahan was honored with the Turing Award for being the primary architect behind this proposal; he was aided by his student Jerome Coonen and a visiting professor, Harold Stone.[17]

Among the x86 (more specifically i8087) innovations are these:

  • A precisely specified floating-point representation at the bit-string level, so that all compliant computers interpret bit patterns the same way. This makes it possible to accurately and efficiently transfer floating-point numbers from one computer to another (after accounting for endianness).
  • A precisely specified behavior for the arithmetic operations: A result is required to be produced as if infinitely precise arithmetic were used to yield a value that is then rounded according to specific rules. This means that a compliant computer program would always produce the same result when given a particular input, thus mitigating the almost mystical reputation that floating-point computation had developed for its hitherto seemingly non-deterministic behavior.
  • The ability of exceptional conditions (overflow, divide by zero, etc.) to propagate through a computation in a benign manner and then be handled by the software in a controlled fashion.

These features would be inherited into IEEE 754-1985 (with the exception of the encoding of special values and exceptions), though the extended internal precision of x87 means it requires explicit rounding of exact results directly to the destination precision in order to match standard IEEE 754 results.[18] However, the behavior may not be the same as a rounding to the destination format due to a possible wider exponent range of the extended format.

Range of floating-point numbers

A floating-point number consists of two fixed-point components, whose ranges depend exclusively on the number of bits or digits in their representation. The range of the floating-point number depends linearly on the range of the significand but exponentially on the range of the exponent, which is what gives the number its far wider overall range.

On a typical computer system, a double-precision (64-bit) binary floating-point number has a coefficient of 53 bits (including 1 implied bit), an exponent of 11 bits, and 1 sign bit. Since 2¹⁰ = 1024, the complete range of the positive normal floating-point numbers in this format is from 2⁻¹⁰²² ≈ 2 × 10⁻³⁰⁸ to approximately 2¹⁰²⁴ ≈ 2 × 10³⁰⁸.

The number of normal floating-point numbers in a system (B, P, L, U) where

  • B is the base of the system,
  • P is the precision of the significand (in base B),
  • L is the smallest exponent of the system,
  • U is the largest exponent of the system,

is $2(B-1)(B^{P-1})(U-L+1)$, or $2(B-1)(B^{P-1})(U-L+1)+1$ considering the value 0.

There is a smallest positive normal floating-point number,

Underflow level = UFL = $B^{L}$,

which has a 1 as the leading digit and 0 for the remaining digits of the significand, and the smallest possible value for the exponent.

There is a largest floating-point number,

Overflow level = OFL = $(1-B^{-P})(B^{U+1})$,

which has B − 1 as the value for each digit of the significand and the largest possible value for the exponent.

In addition, there are representable values strictly between −UFL and UFL. Namely, positive and negative zeros, as well as subnormal numbers.
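
The counting argument and the two levels can be checked by brute force on a small system and then compared with double precision. The sketch below assumes the normalized convention used in this article (one non-zero leading digit, radix point right after it). The formulas coded here are derived from the definitions of B, P, L, and U: two signs, B − 1 leading digits, B^(P−1) choices for the remaining digits, and U − L + 1 exponents. All function names are invented for the example.

```python
import sys
from fractions import Fraction

def normal_count(B, P, L, U):
    """Signs * leading digits * remaining digits * exponents."""
    return 2 * (B - 1) * B ** (P - 1) * (U - L + 1)

def ufl(B, L):
    """Smallest positive normal number: 1.00...0 * B**L."""
    return Fraction(B) ** L

def ofl(B, P, U):
    """Largest number: every digit B-1, exponent U."""
    return (1 - Fraction(B) ** -P) * Fraction(B) ** (U + 1)

def enumerate_normals(B, P, L, U):
    """All normal values of a (small) system, as exact fractions."""
    values = set()
    for e in range(L, U + 1):
        for s in range(B ** (P - 1), B ** P):       # leading digit is non-zero
            v = Fraction(s, B ** (P - 1)) * Fraction(B) ** e
            values.update((v, -v))
    return values

# Toy system: base 2, 3 significant bits, exponents -1..2.
toy = enumerate_normals(2, 3, -1, 2)
print(len(toy), normal_count(2, 3, -1, 2))        # 32 32
print(min(v for v in toy if v > 0), max(toy))     # 1/2 7

# binary64: B = 2, P = 53, L = -1022, U = 1023.
print(ufl(2, -1022) == Fraction(sys.float_info.min))     # True
print(ofl(2, 53, 1023) == Fraction(sys.float_info.max))  # True
```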

IEEE 754: floating point in modern computers

The IEEE standardized the computer representation for binary floating-point numbers in IEEE 754 (a.k.a. IEC 60559) in 1985. This first standard is followed by almost all modern machines. It was revised in 2008. IBM mainframes support IBM's own hexadecimal floating point format and IEEE 754-2008 decimal floating point in addition to the IEEE 754 binary format. The Cray T90 series had an IEEE version, but the SV1 still uses Cray floating-point format.[citation needed]

The standard provides for many closely related formats, differing in only a few details. Five of these formats are called basic formats, and others are termed extended precision formats and extendable precision format. Three formats are especially widely used in computer hardware and languages:[citation needed]

  • Single precision (binary32), usually used to represent the "float" type in the C language family. This is a binary format that occupies 32 bits (4 bytes) and its significand has a precision of 24 bits (about 7 decimal digits).
  • Double precision (binary64), usually used to represent the "double" type in the C language family. This is a binary format that occupies 64 bits (8 bytes) and its significand has a precision of 53 bits (about 16 decimal digits).
  • Double extended, also ambiguously called "extended precision" format. This is a binary format that occupies at least 79 bits (80 if the hidden/implicit bit rule is not used) and its significand has a precision of at least 64 bits (about 19 decimal digits). The C99 and C11 standards of the C language family, in their annex F ("IEC 60559 floating-point arithmetic"), recommend such an extended format to be provided as "long double".[19] A format satisfying the minimal requirements (64-bit significand precision, 15-bit exponent, thus fitting on 80 bits) is provided by the x86 architecture. Often on such processors, this format can be used with "long double", though extended precision is not available with MSVC.[20] For alignment purposes, many tools store this 80-bit value in a 96-bit or 128-bit space.[21][22] On other processors, "long double" may stand for a larger format, such as quadruple precision,[23] or just double precision, if any form of extended precision is not available.[24]

Increasing the precision of the floating-point representation generally reduces the amount of accumulated round-off error caused by intermediate calculations.[25] Other IEEE formats include:

  • Decimal64 and decimal128 floating-point formats. These formats (especially decimal128) are pervasive in financial transactions because, along with the decimal32 format, they allow correct decimal rounding.
  • Quadruple precision (binary128). This is a binary format that occupies 128 bits (16 bytes) and its significand has a precision of 113 bits (about 34 decimal digits).
  • Half precision, also called binary16, a 16-bit floating-point value. It is used in the NVIDIA Cg graphics language, and in the OpenEXR standard (where it actually predates its introduction in the IEEE 754 standard).[26][27]

Any integer with absolute value less than 2²⁴ can be exactly represented in the single-precision format, and any integer with absolute value less than 2⁵³ can be exactly represented in the double-precision format. Furthermore, a wide range of powers of 2 times such a number can be represented. These properties are sometimes used for purely integer data, to get 53-bit integers on platforms that have double-precision floats but only 32-bit integers.
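
These integer limits are easy to observe. In the sketch below, Python's `float` is assumed to be binary64 (true on mainstream platforms), and binary32 is emulated by round-tripping through `struct`; the helper name `to_f32` is made up for the example.

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# binary64: integers below 2**53 survive the conversion; 2**53 + 1 does not.
print(float(2**53 - 1) == 2**53 - 1)           # True
print(float(2**53 + 1) == 2**53 + 1)           # False

# binary32: the same boundary sits at 2**24.
print(to_f32(float(2**24 - 1)) == 2**24 - 1)   # True
print(to_f32(float(2**24 + 1)) == 2**24 + 1)   # False
```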

The standard specifies some special values, and their representation: positive infinity (+∞), negative infinity (−∞), a negative zero (−0) distinct from ordinary ("positive") zero, and "not a number" values (NaNs).

Comparison of floating-point numbers, as defined by the IEEE standard, is a bit different from usual integer comparison. Negative and positive zero compare equal, and every NaN compares unequal to every value, including itself. All finite floating-point numbers are strictly smaller than +∞ and strictly greater than −∞, and they are ordered in the same way as their values (in the set of real numbers).
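
The comparison rules can be seen directly in any language whose floats follow IEEE 754; the sketch below uses Python, whose `float` is assumed to be binary64.

```python
import math

nan = float("nan")
inf = float("inf")

print(0.0 == -0.0)                   # True: the two zeros compare equal
print(math.copysign(1.0, -0.0))      # -1.0: the sign of -0 is still stored
print(nan == nan, nan != nan)        # False True: NaN is unequal even to itself
print(nan < 1.0, nan > 1.0)          # False False: ordered comparisons with NaN fail
print(-inf < -1e308 < 1e308 < inf)   # True: finite values lie strictly between the infinities
print(math.isnan(nan))               # True: the reliable way to detect a NaN
```

Because a NaN never compares equal to anything, a test such as `x == x` being false is sometimes used to detect it, although `math.isnan` states the intent more clearly.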

Internal representation

Floating-point numbers are typically packed into a computer datum as the sign bit, the exponent field, and a field for the significand, from left to right. For the IEEE 754 binary formats (basic and extended) that have extant hardware implementations, they are apportioned as follows:

| Format | Sign bits | Exponent bits | Significand bits | Total bits | Exponent bias | Bits of precision | Number of decimal digits |
|---|---|---|---|---|---|---|---|
| Half (binary16) | 1 | 5 | 10 | 16 | 15 | 11 | ~3.3 |
| Single (binary32) | 1 | 8 | 23 | 32 | 127 | 24 | ~7.2 |
| Double (binary64) | 1 | 11 | 52 | 64 | 1023 | 53 | ~15.9 |
| x86 extended | 1 | 15 | 64 | 80 | 16383 | 64 | ~19.2 |
| Quadruple (binary128) | 1 | 15 | 112 | 128 | 16383 | 113 | ~34.0 |
| Octuple (binary256) | 1 | 19 | 236 | 256 | 262143 | 237 | ~71.3 |

While the exponent can be positive or negative, in binary formats it is stored as an unsigned number that has a fixed "bias" added to it. Values of all 0s in this field are reserved for the zeros and subnormal numbers; values of all 1s are reserved for the infinities and NaNs. The exponent range for normal numbers is [−126, 127] for single precision, [−1022, 1023] for double, or [−16382, 16383] for quad. Normal numbers exclude subnormal values, zeros, infinities, and NaNs.

In the IEEE binary interchange formats the leading bit of a normalized significand is not actually stored in the computer datum, since it is always 1. It is called the "hidden" or "implicit" bit. Because of this, the single-precision format actually has a significand with 24 bits of precision, the double-precision format has 53, quad has 113, and octuple has 237.

For example, it was shown above that π, rounded to 24 bits of precision, has:

  • sign = 0 ; e = 1 ; s = 110010010000111111011011 (including the hidden bit)

The sum of the exponent bias (127) and the exponent (1) is 128, so this is represented in the single-precision format as

  • 0 10000000 10010010000111111011011 (excluding the hidden bit) = 40490FDB[28] as a hexadecimal number.
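
The encoding of π can be verified by packing it as a binary32 value and splitting the resulting 32 bits into the three fields. This is a sketch using Python's `struct` module; the function name `decode_binary32` is chosen for the example.

```python
import math
import struct

def decode_binary32(x: float):
    """Return (bits, sign, exponent field, fraction field) of x as binary32."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                  # 1 bit
    exponent_field = (bits >> 23) & 0xFF   # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF         # 23 bits, hidden bit not stored
    return bits, sign, exponent_field, fraction

bits, sign, exponent_field, fraction = decode_binary32(math.pi)

print(f"{bits:08X}")                             # 40490FDB
print(sign, exponent_field, f"{fraction:023b}")  # 0 128 10010010000111111011011

# Rebuild the value: restore the hidden bit and remove the bias.
s = (1 << 23) | fraction
e = exponent_field - 127
print(s / 2**23 * 2**e)                          # 3.1415927410125732
```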

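In the 32-bit layout, the sign occupies bit 31 (the most significant bit), the exponent field occupies bits 30 to 23, and the significand (fraction) field occupies bits 22 to 0; the 64-bit ("double") layout is similar.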

Other notable floating-point formats

In addition to the widely used IEEE 754 standard formats, other floating-point formats are used, or have been used, in certain domain-specific areas.

  • The Microsoft Binary Format (MBF) was developed for the Microsoft BASIC language products, including Microsoft's first-ever product, Altair BASIC (1975), as well as TRS-80 LEVEL II, CP/M's MBASIC, the IBM PC 5150's BASICA, MS-DOS's GW-BASIC, and QuickBASIC prior to version 4.00. QuickBASIC versions 4.00 and 4.50 switched to the IEEE 754-1985 format but can revert to the MBF format using the /MBF command option. MBF was designed and developed on a simulated Intel 8080 by Monte Davidoff, a dormmate of Bill Gates, during the spring of 1975 for the MITS Altair 8800. The initial release of July 1975 supported a single-precision (32-bit) format because of the cost of the MITS Altair 8800's 4-kilobyte memory. In December 1975, the 8-kilobyte version added a double-precision (64-bit) format. A 40-bit variant of the single-precision format was adopted for other CPUs, notably the MOS 6502 (Apple II, Commodore PET, Atari), the Motorola 6800 (MITS Altair 680), and the Motorola 6809 (TRS-80 Color Computer). All Microsoft language products from 1975 through 1987 used the Microsoft Binary Format; starting in 1988, Microsoft adopted the IEEE 754 standard format in all its products, up to their current releases. MBF consists of the MBF single-precision format (32 bits, "6-digit BASIC"),[29][30] the MBF extended-precision format (40 bits, "9-digit BASIC"),[30] and the MBF double-precision format (64 bits);[29][31] each of them is represented with an 8-bit exponent, followed by a sign bit, followed by a significand of respectively 23, 31, and 55 bits.
  • The bfloat16 format requires the same amount of memory (16 bits) as the IEEE 754 half-precision format, but allocates 8 bits to the exponent instead of 5, thus providing the same range as an IEEE 754 single-precision number. The tradeoff is a reduced precision, as the trailing significand field is reduced from 10 to 7 bits. This format is mainly used in the training of machine learning models, where range is more valuable than precision. Many machine learning accelerators provide hardware support for this format.
  • The TensorFloat-32[32] format combines the 8 bits of exponent of the bfloat16 with the 10 bits of trailing significand field of half-precision formats, resulting in a size of 19 bits. This format was introduced by Nvidia, which provides hardware support for it in the Tensor Cores of its GPUs based on the Nvidia Ampere architecture. The drawback of this format is its size, which is not a power of 2. However, according to Nvidia, this format should only be used internally by hardware to speed up computations, while inputs and outputs should be stored in the 32-bit single-precision IEEE 754 format.[32]
  • The Hopper and CDNA 3 architecture GPUs provide two FP8 formats: one with the same numerical range as half-precision (E5M2) and one with higher precision, but less range (E4M3).[33][34]
  • The Blackwell and CDNA 4 GPU architectures include support for FP6 (E3M2 and E2M3) and FP4 (E2M1) formats. FP4 is the smallest floating-point format that allows for all IEEE 754 principles (see minifloat).

Comparison of common floating-point formats

| Type | Sign | Exponent | Significand | Total bits |
|---|---|---|---|---|
| FP4 | 1 | 2 | 1 | 4 |
| FP6 (E2M3) | 1 | 2 | 3 | 6 |
| FP6 (E3M2) | 1 | 3 | 2 | 6 |
| FP8 (E4M3) | 1 | 4 | 3 | 8 |
| FP8 (E5M2) | 1 | 5 | 2 | 8 |
| Half-precision | 1 | 5 | 10 | 16 |
| bfloat16 | 1 | 8 | 7 | 16 |
| TensorFloat-32 | 1 | 8 | 10 | 19 |
| Single-precision | 1 | 8 | 23 | 32 |
| Double-precision | 1 | 11 | 52 | 64 |
| Quadruple-precision | 1 | 15 | 112 | 128 |
| Octuple-precision | 1 | 19 | 236 | 256 |
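
Because bfloat16 keeps the sign and the full 8-bit exponent of single precision and shortens only the trailing significand, a bfloat16 value has the same bit pattern as the upper 16 bits of a binary32 value. The sketch below relies on that layout to convert in both directions. It is an illustration rather than any library's conversion routine: the function names are invented, round-to-nearest with ties to even is an assumed rounding choice, and NaN inputs are not special-cased.

```python
import math
import struct

def f32_bits(x: float) -> int:
    """Bit pattern of x rounded to binary32."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def to_bfloat16_bits(x: float) -> int:
    """binary32 -> bfloat16 bit pattern (nearest, ties to even; NaN not handled)."""
    bits = f32_bits(x)
    bits += 0x7FFF + ((bits >> 16) & 1)   # add half of the discarded part, biased to even
    return (bits >> 16) & 0xFFFF          # keep sign, 8 exponent bits, 7 fraction bits

def from_bfloat16_bits(b: int) -> float:
    """Widen a bfloat16 bit pattern by appending 16 zero bits."""
    return struct.unpack(">f", struct.pack(">I", b << 16))[0]

pi_bf16 = to_bfloat16_bits(math.pi)
print(f"{pi_bf16:04X}")               # 4049
print(from_bfloat16_bits(pi_bf16))    # 3.140625

# The exponent range of binary32 is preserved, unlike in half precision.
print(math.isfinite(from_bfloat16_bits(to_bfloat16_bits(3e38))))  # True
```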

Representable numbers, conversion and rounding

By their nature, all numbers expressed in floating-point format are rational numbers with a terminating expansion in the relevant base (for example, a terminating decimal expansion in base-10, or a terminating binary expansion in base-2). Irrational numbers, such as π or √2, or non-terminating rational numbers, must be approximated. The number of digits (or bits) of precision also limits the set of rational numbers that can be represented exactly. For example, the decimal number 123456789 cannot be exactly represented if only eight decimal digits of precision are available (it would be rounded to one of the two straddling representable values, 12345678 × 10¹ or 12345679 × 10¹). The same applies to non-terminating digits (0.555… would be rounded to either .55555555 or .55555556).

When a number is represented in some format (such as a character string) which is not a native floating-point representation supported in a computer implementation, then it will require a conversion before it can be used in that implementation. If the number can be represented exactly in the floating-point format then the conversion is exact. If there is not an exact representation then the conversion requires a choice of which floating-point number to use to represent the original value. The representation chosen will have a different value from the original, and the value thus adjusted is called the rounded value.

Whether or not a rational number has a terminating expansion depends on the base. For example, in base-10 the number 1/2 has a terminating expansion (0.5) while the number 1/3 does not (0.333...). In base-2 only rationals with denominators that are powers of 2 (such as 1/2 or 3/16) are terminating. Any rational with a denominator that has a prime factor other than 2 will have an infinite binary expansion. This means that numbers that appear to be short and exact when written in decimal format may need to be approximated when converted to binary floating-point. For example, the decimal number 0.1 is not representable in binary floating-point of any finite precision; the exact binary representation would have a "1100" sequence continuing endlessly:

e = −4; s = 1100110011001100110011001100110011...,

where, as previously, s is the significand and e is the exponent.

When rounded to 24 bits this becomes

e = −4; s = 110011001100110011001101,

which is actually 0.100000001490116119384765625 in decimal.
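The rounded significand can be inspected directly. The following is a minimal sketch, assuming the platform's `float` is IEEE 754 binary32; the helper names `significand24` and `exponent2` are hypothetical and only handle positive normal values.

```c
#include <math.h>

/* Returns the 24 significant bits of a positive normal float as an integer. */
long significand24(float x)
{
    int exp2;
    float m = frexpf(x, &exp2);   /* x = m × 2^exp2 with 0.5 <= m < 1 */
    return (long)ldexpf(m, 24);   /* scale the 24 bits up to an integer */
}

/* Returns e such that x = 1.xxx… × 2^e (the convention used in the text). */
int exponent2(float x)
{
    int exp2;
    (void)frexpf(x, &exp2);
    return exp2 - 1;              /* frexpf uses 0.1xxx… × 2^exp2 */
}
```

For `0.1f` the first helper returns 0xCCCCCD, which is 110011001100110011001101 in binary, and the second returns −4.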

As a further example, the real number π, represented in binary as an infinite sequence of bits is

11.0010010000111111011010101000100010000101101000110000100011010011...

but is

11.0010010000111111011011

when approximated by rounding to a precision of 24 bits.

In binary single-precision floating-point, this is represented as s = 1.10010010000111111011011 with e = 1. This has a decimal value of

3.1415927410125732421875,

whereas a more accurate approximation of the true value of π is

3.14159265358979323846264338327950...

The result of rounding differs from the true value by about 0.03 parts per million, and matches the decimal representation of π in the first 7 digits. The difference is the discretization error and is limited by the machine epsilon.

The arithmetical difference between two consecutive representable floating-point numbers which have the same exponent is called a unit in the last place (ULP). For example, if there is no representable number lying between the representable numbers 1.45A70C22₁₆ and 1.45A70C24₁₆, the ULP is 2×16^−8, or 2^−31. For numbers with a base-2 exponent part of 0, i.e. numbers with an absolute value higher than or equal to 1 but lower than 2, a ULP is exactly 2^−23 or about 10^−7 in single precision, and exactly 2^−52 or about 2×10^−16 in double precision. The mandated behavior of IEEE-compliant hardware is that the result of a basic operation be within one-half of a ULP of the exact result (in the default round-to-nearest mode).

Rounding modes


Rounding is used when the exact result of a floating-point operation (or a conversion to floating-point format) would need more digits than there are digits in the significand. IEEE 754 requires correct rounding: that is, the rounded result is as if infinitely precise arithmetic were used to compute the value and then rounded (although in implementation only three extra bits are needed to ensure this). There are several different rounding schemes (or rounding modes). Historically, truncation was the typical approach. Since the introduction of IEEE 754, the default method (round to nearest, ties to even, sometimes called banker's rounding) is more commonly used. This method rounds the ideal (infinitely precise) result of an arithmetic operation to the nearest representable value, and gives that representation as the result.[nb 8] In the case of a tie, the value that would make the significand end in an even digit is chosen. The IEEE 754 standard requires the same rounding to be applied to all fundamental algebraic operations, including square root and conversions, when there is a numeric (non-NaN) result. This means that the results of IEEE 754 operations are completely determined in all bits of the result, except for the representation of NaNs. ("Library" functions such as cosine and log are not mandated.)

Alternative rounding options are also available. IEEE 754 specifies the following rounding modes:

  • round to nearest, where ties round to the nearest even digit in the required position (the default and by far the most common mode)
  • round to nearest, where ties round away from zero (optional for binary floating-point and commonly used in decimal)
  • round up (toward +∞; negative results thus round toward zero)
  • round down (toward −∞; negative results thus round away from zero)
  • round toward zero (truncation; it is similar to the common behavior of float-to-integer conversions, which convert −3.9 to −3 and 3.9 to 3)

Alternative modes are useful when the amount of error being introduced must be bounded. Applications that require a bounded error include multi-precision floating-point arithmetic and interval arithmetic. The alternative rounding modes are also useful in diagnosing numerical instability: if the results of a subroutine vary substantially between rounding toward +∞ and rounding toward −∞, then it is likely numerically unstable and affected by round-off error.[35]

Binary-to-decimal conversion with minimal number of digits


Converting a double-precision binary floating-point number to a decimal string is a common operation, but an algorithm producing results that are both accurate and minimal did not appear in print until 1990, with Steele and White's Dragon4. Some of the improvements since then include:

  • David M. Gay's dtoa.c, a practical open-source implementation of many ideas in Dragon4.[36]
  • Grisu3, which achieves a 4× speedup by removing the use of bignums. It must be used with a fallback, as it fails for about 0.5% of cases.[37]
  • Errol3, an always-succeeding algorithm similar to, but slower than, Grisu3. It is apparently not as good as an early-terminating Grisu with fallback.[38]
  • Ryū, an always-succeeding algorithm that is faster and simpler than Grisu3.[39]
  • Schubfach, an always-succeeding algorithm that is based on a similar idea to Ryū, developed almost simultaneously and independently.[40] It performs better than Ryū and Grisu3 in certain benchmarks.[41]

Many modern language runtimes use Grisu3 with a Dragon4 fallback.[42]
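What these algorithms compute can be illustrated by a brute-force search. The sketch below is not Dragon4, Grisu or Ryū: it simply tries increasing digit counts and relies on the C library's `snprintf` and `strtod` being correctly rounded, which is an assumption about the platform. The function name `shortest_round_trip` is hypothetical, and only finite values are handled.

```c
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>

/* Tries 1 to 17 significant digits and keeps the first string that
   strtod() converts back to exactly the same double.
   Returns the number of significant digits used. */
int shortest_round_trip(double x, char *buf, size_t size)
{
    for (int digits = 1; digits <= 17; digits++) {
        snprintf(buf, size, "%.*g", digits, x);
        if (strtod(buf, NULL) == x)
            return digits;
    }
    return 17;   /* 17 significant digits are always enough for a double */
}
```

With this search, 0.1 needs a single digit, while the double closest to 0.1 + 0.2 needs all 17. The dedicated algorithms reach the same kind of answer without repeated formatting and parsing.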

Decimal-to-binary conversion


The problem of parsing a decimal string into a binary FP representation is complex, with an accurate parser not appearing until Clinger's 1990 work (implemented in dtoa.c).[36] Further work has likewise progressed in the direction of faster parsing.[43]

Floating-point operations


For ease of presentation and understanding, decimal radix with 7-digit precision will be used in the examples, as in the IEEE 754 decimal32 format. The fundamental principles are the same in any radix or precision, except that normalization is optional (it does not affect the numerical value of the result). Here, s denotes the significand and e denotes the exponent.

Addition and subtraction


A simple method to add floating-point numbers is to first represent them with the same exponent. In the example below, the second number (with the smaller exponent) is shifted right by three digits, and one then proceeds with the usual addition method:

  123456.7 = 1.234567 × 10^5
  101.7654 = 1.017654 × 10^2 = 0.001017654 × 10^5
  Hence:
  123456.7 + 101.7654 = (1.234567 × 10^5) + (1.017654 × 10^2)
                      = (1.234567 × 10^5) + (0.001017654 × 10^5)
                      = (1.234567 + 0.001017654) × 10^5
                      =  1.235584654 × 10^5

In detail:

  e=5;  s=1.234567     (123456.7)
+ e=2;  s=1.017654     (101.7654)
  e=5;  s=1.234567
+ e=5;  s=0.001017654  (after shifting)
--------------------
  e=5;  s=1.235584654  (true sum: 123558.4654)

This is the true result, the exact sum of the operands. It will be rounded to seven digits and then normalized if necessary. The final result is

  e=5;  s=1.235585    (final sum: 123558.5)

The lowest three digits of the second operand (654) are essentially lost. This is round-off error. In extreme cases, the sum of two non-zero numbers may be equal to one of them:

  e=5;  s=1.234567
+ e=−3; s=9.876543
  e=5;  s=1.234567
+ e=5;  s=0.00000009876543 (after shifting)
----------------------
  e=5;  s=1.23456709876543 (true sum)
  e=5;  s=1.234567         (after rounding and normalization)

In the above conceptual examples it would appear that a large number of extra digits would need to be provided by the adder to ensure correct rounding; however, for binary addition or subtraction using careful implementation techniques only a guard bit, a rounding bit and one extra sticky bit need to be carried beyond the precision of the operands.[18][44]: 218–220 
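The extreme case, in which the smaller operand disappears entirely, is easy to reproduce in binary double precision. This is a minimal sketch assuming IEEE 754 binary64 with round to nearest; the function name `is_absorbed` is hypothetical.

```c
/* Returns 1 when adding `small` leaves `big` unchanged after rounding. */
int is_absorbed(double big, double small)
{
    volatile double sum = big + small;   /* volatile forces rounding to double */
    return sum == big;
}
```

Near 10^16 consecutive doubles are 2 apart, so adding 1.0 to 1e16 returns 1e16 again, whereas 1e15 + 1.0 is still exact.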

Another problem of loss of significance occurs when approximations to two nearly equal numbers are subtracted. In the following example e = 5; s = 1.234571 and e = 5; s = 1.234567 are approximations to the rationals 123457.1467 and 123456.659.

  e=5;  s=1.234571
− e=5;  s=1.234567
----------------
  e=5;  s=0.000004
  e=−1; s=4.000000 (after rounding and normalization)

The floating-point difference is computed exactly because the numbers are close: the Sterbenz lemma guarantees this, even in case of underflow when gradual underflow is supported. Despite this, the difference of the original numbers is e = −1; s = 4.877000, which differs by more than 20% from the difference e = −1; s = 4.000000 of the approximations. In extreme cases, all significant digits of precision can be lost.[18][45] This cancellation illustrates the danger in assuming that all of the digits of a computed result are meaningful. Dealing with the consequences of these errors is a topic in numerical analysis; see also Accuracy problems.
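The same effect appears in binary single precision with the two rationals used above. The sketch assumes IEEE 754 binary32; the function name `float_difference` is hypothetical.

```c
/* Rounds both operands to single precision, then subtracts them. */
float float_difference(double a, double b)
{
    volatile float fa = (float)a;   /* rounded to the nearest float */
    volatile float fb = (float)b;
    return fa - fb;                 /* exact for operands this close */
}
```

For 123457.1467 and 123456.659 the result is 0.4921875 rather than 0.4877. Each operand was rounded with a relative error of only a few parts in 10^8, yet the difference is off by roughly 1%, even though the subtraction itself introduced no error.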

Multiplication and division


To multiply, the significands are multiplied while the exponents are added, and the result is rounded and normalized.

  e=3;  s=4.734612
× e=5;  s=5.417242
-----------------------
  e=8;  s=25.648538980104 (true product)
  e=8;  s=25.64854        (after rounding)
  e=9;  s=2.564854        (after normalization)

Similarly, division is accomplished by subtracting the divisor's exponent from the dividend's exponent, and dividing the dividend's significand by the divisor's significand.

There are no cancellation or absorption problems with multiplication or division, though small errors may accumulate as operations are performed in succession.[18] In practice, the way these operations are carried out in digital logic can be quite complex (see Booth's multiplication algorithm and Division algorithm).[nb 9]

Literal syntax


The syntax of floating-point literals varies by language. Literals typically use e or E to denote scientific notation. The C programming language and the IEEE 754 standard also define a hexadecimal literal syntax with a base-2 exponent instead of a base-10 one. In languages like C, when the decimal exponent is omitted, a decimal point is needed to differentiate floating-point literals from integers. Other languages do not have an integer type (such as JavaScript), or allow overloading of numeric types (such as Haskell). In these cases, digit strings such as 123 may also be floating-point literals.

Examples of floating-point literals are:

  • 99.9
  • -5000.12
  • 6.02e23
  • -3e-45
  • 0x1.fffffep+127 in C and IEEE 754
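In C (C99 and later) the hexadecimal form can be written directly in source code. The constant names below are hypothetical and chosen only for illustration.

```c
#include <float.h>

/* Hexadecimal literals give the significand in base 16 and the exponent
   as a power of two, so they denote binary values exactly. */
static const float  largest_float = 0x1.fffffep+127f;  /* equals FLT_MAX */
static const double one_eighth    = 0x1p-3;            /* 1 × 2^-3 = 0.125 */
static const double large_decimal = 6.02e23;           /* 6.02 × 10^23, rounded to binary */
```

Because a hexadecimal literal maps directly onto the bits of the significand, it is a convenient way to write a value that must not change when converted from decimal.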

Dealing with exceptional cases


Floating-point computation in a computer can run into three kinds of problems:

  • An operation can be mathematically undefined, such as ∞/∞, or division by zero.
  • An operation can be legal in principle, but not supported by the specific format, for example, calculating the square root of −1 or the inverse sine of 2 (both of which result in complex numbers).
  • An operation can be legal in principle, but the result can be impossible to represent in the specified format, because the exponent is too large or too small to encode in the exponent field. Such an event is called an overflow (exponent too large), underflow (exponent too small) or denormalization (precision loss).

Prior to the IEEE standard, such conditions usually caused the program to terminate, or triggered some kind of trap that the programmer might be able to catch. How this worked was system-dependent, meaning that floating-point programs were not portable. (The term "exception" as used in IEEE 754 is a general term meaning an exceptional condition, which is not necessarily an error, and is a different usage from that typically defined in programming languages such as C++ or Java, in which an "exception" is an alternative flow of control, closer to what is termed a "trap" in IEEE 754 terminology.)

Here, the required default method of handling exceptions according to IEEE 754 is discussed (the IEEE 754 optional trapping and other "alternate exception handling" modes are not discussed). Arithmetic exceptions are (by default) required to be recorded in "sticky" status flag bits. That they are "sticky" means that they are not reset by the next (arithmetic) operation, but stay set until explicitly reset. The use of "sticky" flags thus allows for testing of exceptional conditions to be delayed until after a full floating-point expression or subroutine: without them exceptional conditions that could not be otherwise ignored would require explicit testing immediately after every floating-point operation. By default, an operation always returns a result according to specification without interrupting computation. For instance, 1/0 returns +∞, while also setting the divide-by-zero flag bit (this default of ∞ is designed to often return a finite result when used in subsequent operations and so be safely ignored).

The original IEEE 754 standard, however, failed to recommend operations to handle such sets of arithmetic exception flag bits. So while these were implemented in hardware, initially programming language implementations typically did not provide a means to access them (apart from assembler). Over time some programming language standards (e.g., C99/C11 and Fortran) have been updated to specify methods to access and change status flag bits. The 2008 version of the IEEE 754 standard now specifies a few operations for accessing and handling the arithmetic flag bits. The programming model is based on a single thread of execution and use of them by multiple threads has to be handled by a means outside of the standard (e.g. C11 specifies that the flags have thread-local storage).

IEEE 754 specifies five arithmetic exceptions that are to be recorded in the status flags ("sticky bits"):

  • inexact, set if the rounded (and returned) value is different from the mathematically exact result of the operation.
  • underflow, set if the rounded value is tiny (as specified in IEEE 754) and inexact (or maybe limited to if it has denormalization loss, as per the 1985 version of IEEE 754), returning a subnormal value including the zeros.
  • overflow, set if the absolute value of the rounded value is too large to be represented. An infinity or maximal finite value is returned, depending on which rounding is used.
  • divide-by-zero, set if the result is infinite given finite operands, returning an infinity, either +∞ or −∞.
  • invalid, set if a finite or infinite result cannot be returned e.g. sqrt(−1) or 0/0, returning a quiet NaN.
Fig. 1: resistances in parallel, with total resistance $R_\text{tot}$

The default return value for each of the exceptions is designed to give the correct result in the majority of cases such that the exceptions can be ignored in the majority of codes. inexact returns a correctly rounded result, and underflow returns a value less than or equal to the smallest positive normal number in magnitude and can almost always be ignored.[46] divide-by-zero returns infinity exactly, which will typically then divide a finite number and so give zero, or else will give an invalid exception subsequently if not, and so can also typically be ignored. For example, the effective resistance of n resistors in parallel (see fig. 1) is given by $R_\text{tot} = 1/(1/R_1 + 1/R_2 + \cdots + 1/R_n)$. If a short-circuit develops with $R_1$ set to 0, $1/R_1$ will return +infinity which will give a final $R_\text{tot}$ of 0, as expected[47] (see the continued fraction example of IEEE 754 design rationale for another example).

Overflow and invalid exceptions can typically not be ignored, but do not necessarily represent errors: for example, a root-finding routine, as part of its normal operation, may evaluate a passed-in function at values outside of its domain, returning NaN and an invalid exception flag to be ignored until finding a useful start point.[46]

Accuracy problems


The fact that floating-point numbers cannot accurately represent all real numbers, and that floating-point operations cannot accurately represent true arithmetic operations, leads to many surprising situations. This is related to the finite precision with which computers generally represent numbers.

For example, the decimal numbers 0.1 and 0.01 cannot be represented exactly as binary floating-point numbers. In the IEEE 754 binary32 format with its 24-bit significand, the result of attempting to square the approximation to 0.1 is neither 0.01 nor the representable number closest to it. The decimal number 0.1 is represented in binary as e = −4; s = 110011001100110011001101, which is

0.100000001490116119384765625 exactly.

Squaring this number gives

0.010000000298023226097399174250313080847263336181640625 exactly.

Squaring it with rounding to the 24-bit precision gives

0.010000000707805156707763671875 exactly.

But the representable number closest to 0.01 is

0.009999999776482582092285156250 exactly.
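The comparison can be checked in C. The sketch assumes that `float` is IEEE 754 binary32; the function name `square_f` is hypothetical.

```c
/* Squares x in single precision. The volatile store forces the product
   to be rounded to float even if the compiler evaluates it more widely. */
float square_f(float x)
{
    volatile float result = x * x;
    return result;
}
```

`square_f(0.1f)` compares unequal to the literal `0.01f` and is slightly larger than it, matching the values listed above.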

Also, the non-representability of π (and π/2) means that an attempted computation of tan(π/2) will not yield a result of infinity, nor will it even overflow in the usual floating-point formats (assuming an accurate implementation of tan). It is simply not possible for standard floating-point hardware to attempt to compute tan(π/2), because π/2 cannot be represented exactly. This computation in C:

// Enough digits to be sure we get the correct approximation.
const double pi = 3.1415926535897932384626433832795;
double z = tan(pi / 2.0);

will give a result of 16331239353195370.0. In single precision (using the tanf function), the result will be −22877332.0.

By the same token, an attempted computation of sin(π) will not yield zero. The result will be (approximately) 0.1225×10^−15 in double precision, or −0.8742×10^−7 in single precision.[nb 10]

While floating-point addition and multiplication are both commutative (a + b = b + a and a × b = b × a), they are not necessarily associative. That is, (a + b) + c is not necessarily equal to a + (b + c). Using 7-digit significand decimal arithmetic:

 a = 1234.567, b = 45.67834, c = 0.0004
 (a + b) + c:
     1234.567   (a)
   +   45.67834 (b)
   ____________
     1280.24534   rounds to   1280.245
    1280.245  (a + b)
   +   0.0004 (c)
   ____________
    1280.2454   rounds to   1280.245  ← (a + b) + c
 a + (b + c):
   45.67834 (b)
 +  0.0004  (c)
 ____________
   45.67874
   1234.567   (a)
 +   45.67874   (b + c)
 ____________
   1280.24574   rounds to   1280.246 ← a + (b + c)
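The same loss of associativity shows up in binary double precision with very ordinary operands. This is a minimal sketch assuming IEEE 754 binary64 with round to nearest; the function names are hypothetical.

```c
/* (a + b) + c, with the intermediate sum rounded to double. */
double sum_left(double a, double b, double c)
{
    volatile double t = a + b;
    return t + c;
}

/* a + (b + c), with the intermediate sum rounded to double. */
double sum_right(double a, double b, double c)
{
    volatile double t = b + c;
    return a + t;
}
```

With a = 0.1, b = 0.2 and c = 0.3, `sum_right` returns the double closest to 0.6 while `sum_left` returns the next double above it.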

They are also not necessarily distributive. That is, (a + b) × c may not be the same as a × c + b × c:

 1234.567 × 3.333333 = 4115.223
 1.234567 × 3.333333 = 4.115223
                       4115.223 + 4.115223 = 4119.338
 but
 1234.567 + 1.234567 = 1235.802
                       1235.802 × 3.333333 = 4119.340

In addition to loss of significance, inability to represent numbers such as π and 0.1 exactly, and other slight inaccuracies, the following phenomena may occur:

  • Cancellation: subtraction of nearly equal operands may cause extreme loss of accuracy.[48][45] When we subtract two almost equal numbers we set the most significant digits to zero, leaving ourselves with just the insignificant, and most erroneous, digits.[1]: 124  For example, when determining a derivative of a function the following formula is used:

    $$Q(h) = \frac{f(a + h) - f(a)}{h}.$$

    Intuitively one would want an h very close to zero; however, when using floating-point operations, the smallest h will not give the best approximation of a derivative. As h grows smaller, the difference between f(a + h) and f(a) grows smaller, cancelling out the most significant and least erroneous digits and making the most erroneous digits more important. As a result the smallest possible value of h will give a more erroneous approximation of a derivative than a somewhat larger value. This is perhaps the most common and serious accuracy problem.
  • Conversions to integer are not intuitive: converting (63.0/9.0) to integer yields 7, but converting (0.63/0.09) may yield 6. This is because conversions generally truncate rather than round. Floor and ceiling functions may produce answers which are off by one from the intuitively expected value.
  • Limited exponent range: results might overflow yielding infinity, or underflow yielding a subnormal number or zero. In these cases precision will be lost.
  • Testing for safe division is problematic: Checking that the divisor is not zero does not guarantee that a division will not overflow.
  • Testing for equality is problematic. Two computational sequences that are mathematically equal may well produce different floating-point values.[49]

Incidents

  • On 25 February 1991, a loss of significance in a MIM-104 Patriot missile battery prevented it from intercepting an incoming Scud missile in Dhahran, Saudi Arabia, contributing to the death of 28 soldiers from the U.S. Army's 14th Quartermaster Detachment.[50] The weapons control computer counted time in an integer number of tenths of a second since boot. For conversion to a floating-point number of seconds in velocity and position calculations, the software originally multiplied this number by a 24-bit fixed-point binary approximation to 0.1. Some parts of the software were later adapted to use a more accurate conversion to floating-point, but some parts were not updated and still used the 24-bit approximation.[51] These parts of the software drifted from one another by about 3.43 milliseconds per hour. After 20 hours, the discrepancy of about 68.7 ms was enough for the radar tracking system to lose track of Scuds; the control system in the Dhahran missile battery had been running for about 100 hours when it failed to track and intercept an incoming Scud.[50] The failure to intercept arose not from using floating point specifically, but from subtracting two different approximations to unit conversion with different errors when representing time, so the unit conversion error in the difference did not cancel out but rather grew indefinitely with uptime.[51]
  • Salami slicing is the practice of removing the 'invisible' part of a transaction into a separate account.

Machine precision and backward error analysis


Machine precision is a quantity that characterizes the accuracy of a floating-point system, and is used in backward error analysis of floating-point algorithms. It is also known as unit roundoff or machine epsilon. Usually denoted Εmach, its value depends on the particular rounding being used.

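With rounding to zero, $\mathrm{E}_\text{mach} = B^{1-P}$, whereas with rounding to nearest, $\mathrm{E}_\text{mach} = \tfrac{1}{2} B^{1-P}$, where B is the base of the system and P is the precision of the significand (in base B).

This is important since it bounds the relative error in representing any non-zero real number x within the normalized range of a floating-point system:

$$\left| \frac{\operatorname{fl}(x) - x}{x} \right| \le \mathrm{E}_\text{mach}.$$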

Backward error analysis, the theory of which was developed and popularized by James H. Wilkinson, can be used to establish that an algorithm implementing a numerical function is numerically stable.[52] The basic approach is to show that although the calculated result, due to roundoff errors, will not be exactly correct, it is the exact solution to a nearby problem with slightly perturbed input data. If the perturbation required is small, on the order of the uncertainty in the input data, then the results are in some sense as accurate as the data "deserves". The algorithm is then defined as backward stable. Stability is a measure of the sensitivity to rounding errors of a given numerical procedure; by contrast, the condition number of a function for a given problem indicates the inherent sensitivity of the function to small perturbations in its input and is independent of the implementation used to solve the problem.[53]

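As a trivial example, consider a simple expression giving the inner product of (length two) vectors $x$ and $y$. Then

$$\begin{aligned}
\operatorname{fl}(x \cdot y) &= \operatorname{fl}\big(\operatorname{fl}(x_1 y_1) + \operatorname{fl}(x_2 y_2)\big), && \text{where } \operatorname{fl}() \text{ indicates correctly rounded floating-point arithmetic} \\
&= \operatorname{fl}\big((x_1 y_1)(1 + \delta_1) + (x_2 y_2)(1 + \delta_2)\big), && \text{where } |\delta_n| \le \mathrm{E}_\text{mach} \text{, from above} \\
&= \big((x_1 y_1)(1 + \delta_1) + (x_2 y_2)(1 + \delta_2)\big)(1 + \delta_3) \\
&= (x_1 y_1)(1 + \delta_1)(1 + \delta_3) + (x_2 y_2)(1 + \delta_2)(1 + \delta_3),
\end{aligned}$$

and so

$$\operatorname{fl}(x \cdot y) = \hat{x} \cdot \hat{y},$$

where

$$\hat{x}_1 = x_1 (1 + \delta_1), \quad \hat{x}_2 = x_2 (1 + \delta_2), \quad \hat{y}_1 = y_1 (1 + \delta_3), \quad \hat{y}_2 = y_2 (1 + \delta_3),$$

where

$$|\delta_n| \le \mathrm{E}_\text{mach}$$

by definition, which is the sum of two slightly perturbed (on the order of Εmach) input data, and so is backward stable. For more realistic examples in numerical linear algebra, see Higham 2002[54] and other references below.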
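The relative-error bound for a single rounding can be checked by treating a double as the "exact" value and rounding it to single precision. For binary32 with round to nearest (B = 2, P = 24) the machine precision is 2^−24. The function name `relative_rounding_error` is hypothetical, and the sketch only applies to values in the normal range of `float`.

```c
#include <math.h>

/* Relative error committed when the double x is rounded to single precision. */
double relative_rounding_error(double x)
{
    volatile float rounded = (float)x;   /* fl(x) in binary32 */
    return fabs((double)rounded - x) / fabs(x);
}
```

For x = 0.1 the error is nonzero but stays below 2^−24, and for exactly representable inputs such as 0.5 it is zero.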

Minimizing the effect of accuracy problems


Although individual arithmetic operations of IEEE 754 are guaranteed accurate to within half a ULP, more complicated formulae can suffer from larger errors for a variety of reasons. The loss of accuracy can be substantial if a problem or its data are ill-conditioned, meaning that the correct result is hypersensitive to tiny perturbations in its data. However, even functions that are well-conditioned can suffer from large loss of accuracy if an algorithm numerically unstable for that data is used: apparently equivalent formulations of expressions in a programming language can differ markedly in their numerical stability. One approach to remove the risk of such loss of accuracy is the design and analysis of numerically stable algorithms, which is an aim of the branch of mathematics known as numerical analysis. Another approach that can protect against the risk of numerical instabilities is the computation of intermediate (scratch) values in an algorithm at a higher precision than the final result requires,[55] which can remove, or reduce by orders of magnitude,[56] such risk: IEEE 754 quadruple precision and extended precision are designed for this purpose when computing at double precision.[57][nb 11]

For example, the following algorithm is a direct implementation to compute the function $A(x) = (x - 1)/(\exp(x - 1) - 1)$, which is well-conditioned at 1.0.[nb 12] However, it can be shown to be numerically unstable and lose up to half the significant digits carried by the arithmetic when computed near 1.0.[58]

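#include <math.h>

// Direct evaluation of (x - 1) / (exp(x - 1) - 1).
double f(double x)
{
    double y = x - 1.0;
    double z = exp(y);
    if (z != 1.0) {
        // z - 1.0 cancels badly when y is close to 0.
        z = y / (z - 1.0);
    }
    // If exp(y) rounded to 1.0, z is returned as 1.0, the limiting value.
    return z;
}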

A numerical analysis of the algorithm reveals that if the following non-obvious change to the line z = y / (z - 1.0); is made:

z = log(z) / (z - 1.0);

then the algorithm becomes numerically stable and can compute to full double precision.[58]

To maintain the properties of such carefully constructed numerically stable programs, careful handling by the compiler is required. Certain "optimizations" that compilers might make (for example, reordering operations) can work against the goals of well-behaved software. There is some controversy about the failings of compilers and language designs in this area: C99 is an example of a language where such optimizations are carefully specified to maintain numerical precision. See the external references at the bottom of this article.

A detailed treatment of the techniques for writing high-quality floating-point software is beyond the scope of this article, and the reader is referred to the literature[54][59] and the other references at the bottom of this article. Kahan suggests several rules of thumb that can decrease the risk of numerical anomalies by orders of magnitude,[59] in addition to, or in lieu of, a more careful numerical analysis. These include: as noted above, computing all expressions and intermediate results in the highest precision supported in hardware (a common rule of thumb is to carry twice the precision of the desired result, i.e. compute in double precision for a final single-precision result, or in double extended or quad precision for up to double-precision results[60]); and rounding input data and results to only the precision required and supported by the input data (carrying excess precision in the final result beyond that required and supported by the input data can be misleading, increases storage cost and decreases speed, and the excess bits can affect convergence of numerical procedures:[61] notably, the first form of the iterative example given below converges correctly when using this rule of thumb). Brief descriptions of several additional issues and techniques follow.

As decimal fractions can often not be exactly represented in binary floating-point, such arithmetic is at its best when it is simply being used to measure real-world quantities over a wide range of scales (such as the orbital period of a moon around Saturn or the mass of a proton), and at its worst when it is expected to model the interactions of quantities expressed as decimal strings that are expected to be exact.[56][59] An example of the latter case is financial calculations. For this reason, financial software tends not to use a binary floating-point number representation.[62] The "decimal" data type of the C# and Python programming languages, and the decimal formats of the IEEE 754-2008 standard, are designed to avoid the problems of binary floating-point representations when applied to human-entered exact decimal values, and make the arithmetic always behave as expected when numbers are printed in decimal.

Expectations from mathematics may not be realized in the field of floating-point computation. For example, it is known that $(x + y)(x - y) = x^2 - y^2$, and that $\sin^2\theta + \cos^2\theta = 1$. However, these facts cannot be relied on when the quantities involved are the result of floating-point computation.

The use of the equality test (if (x==y) ...) requires care when dealing with floating-point numbers. Even simple expressions like 0.6 / 0.2 - 3 == 0 will, on most computers, fail to be true[63] (in IEEE 754 double precision, for example, 0.6 / 0.2 - 3 is approximately equal to −4.44089209850063×10^−16). Consequently, such tests are sometimes replaced with "fuzzy" comparisons (if (abs(x-y) < epsilon) ..., where epsilon is sufficiently small and tailored to the application, such as 1.0E−13). The wisdom of doing this varies greatly, and can require numerical analysis to bound epsilon.[54] Values derived from the primary data representation and their comparisons should be performed in a wider, extended, precision to minimize the risk of such inconsistencies due to round-off errors.[59] It is often better to organize the code in such a way that such tests are unnecessary. For example, in computational geometry, exact tests of whether a point lies off or on a line or plane defined by other points can be performed using adaptive precision or exact arithmetic methods.[64]
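A fuzzy comparison of this kind might be written as follows. The function names are hypothetical, and the tolerance is an application-specific choice rather than a universal constant.

```c
#include <math.h>

/* "Fuzzy" equality: true when x and y differ by less than epsilon. */
int nearly_equal(double x, double y, double epsilon)
{
    return fabs(x - y) < epsilon;
}

/* Evaluates 0.6 / 0.2 - 3 at run time (volatile prevents constant folding). */
double six_tenths_example(void)
{
    volatile double a = 0.6, b = 0.2;
    return a / b - 3.0;
}
```

In IEEE 754 double precision `six_tenths_example()` returns −2^−51 rather than 0, so the exact test fails while `nearly_equal(0.6 / 0.2, 3.0, 1e-13)` succeeds. An absolute tolerance only makes sense when the magnitude of the operands is known; for values of widely varying scale a relative tolerance is usually needed instead.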

Small errors in floating-point arithmetic can grow when mathematical algorithms perform operations an enormous number of times. A few examples are matrix inversion, eigenvector computation, and differential equation solving. These algorithms must be very carefully designed, using numerical approaches such as iterative refinement, if they are to work well.[65]

Summation of a vector of floating-point values is a basic algorithm in scientific computing, and so an awareness of when loss of significance can occur is essential. For example, if one is adding a very large number of numbers, the individual addends are very small compared with the sum. This can lead to loss of significance. A typical addition would then be something like

3253.671
+  3.141276
-----------
3256.812

The low 3 digits of the addends are effectively lost. Suppose, for example, that one needs to add many numbers, all approximately equal to 3. After 1000 of them have been added, the running sum is about 3000; the lost digits are not regained. The Kahan summation algorithm may be used to reduce the errors.[54]
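Kahan summation keeps a running compensation term for the digits lost in each addition. The following is a minimal sketch with hypothetical function names; it assumes the compiler does not reassociate floating-point expressions (for example, that options such as `-ffast-math` are not used), since that would optimize the compensation away.

```c
#include <stddef.h>

/* Plain left-to-right summation. */
double naive_sum(const double *x, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return sum;
}

/* Kahan (compensated) summation. */
double kahan_sum(const double *x, size_t n)
{
    double sum = 0.0;
    double c = 0.0;            /* minus the low-order part lost so far */
    for (size_t i = 0; i < n; i++) {
        double y = x[i] - c;   /* re-inject what the previous step lost */
        double t = sum + y;    /* low-order digits of y may be lost here */
        c = (t - sum) - y;     /* what was actually added, minus y */
        sum = t;
    }
    return sum;
}
```

For the four values 1e16, 1.0, 1.0 and −1e16 the naive loop returns 0.0, because each 1.0 is absorbed by the large running sum, whereas the compensated loop returns the exact result 2.0.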

Round-off error can affect the convergence and accuracy of iterative numerical procedures. As an example, Archimedes approximated π by calculating the perimeters of polygons inscribing and circumscribing a circle, starting with hexagons, and successively doubling the number of sides. As noted above, computations may be rearranged in a way that is mathematically equivalent but less prone to error (numerical analysis). Two forms of the recurrence formula for the circumscribed polygon are:

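  • Starting value: $t_0 = \frac{1}{\sqrt{3}}$
  • First form: $t_{i+1} = \frac{\sqrt{t_i^2 + 1} - 1}{t_i}$
  • Second form: $t_{i+1} = \frac{t_i}{\sqrt{t_i^2 + 1} + 1}$
  • $\pi \approx 6 \times 2^i \times t_i$, converging as $i \to \infty$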

Here is a computation using IEEE "double" (a significand with 53 bits of precision) arithmetic:

 i   6 × 2^i × t_i, first form   6 × 2^i × t_i, second form
---------------------------------------------------------
 0   3.4641016151377543863      3.4641016151377543863
 1   3.2153903091734710173      3.2153903091734723496
 2   3.1596599420974940120      3.1596599420975006733
 3   3.1460862151314012979      3.1460862151314352708
 4   3.1427145996453136334      3.1427145996453689225
 5   3.1418730499801259536      3.1418730499798241950
 6   3.1416627470548084133      3.1416627470568494473
 7   3.1416101765997805905      3.1416101766046906629
 8   3.1415970343230776862      3.1415970343215275928
 9   3.1415937488171150615      3.1415937487713536668
10   3.1415929278733740748      3.1415929273850979885
11   3.1415927256228504127      3.1415927220386148377
12   3.1415926717412858693      3.1415926707019992125
13   3.1415926189011456060      3.1415926578678454728
14   3.1415926717412858693      3.1415926546593073709
15   3.1415919358822321783      3.1415926538571730119
16   3.1415926717412858693      3.1415926536566394222
17   3.1415810075796233302      3.1415926536065061913
18   3.1415926717412858693      3.1415926535939728836
19   3.1414061547378810956      3.1415926535908393901
20   3.1405434924008406305      3.1415926535900560168
21   3.1400068646912273617      3.1415926535898608396
22   3.1349453756585929919      3.1415926535898122118
23   3.1400068646912273617      3.1415926535897995552
24   3.2245152435345525443      3.1415926535897968907
25                              3.1415926535897962246
26                              3.1415926535897962246
27                              3.1415926535897962246
28                              3.1415926535897962246
              The true value is 3.14159265358979323846264338327...

While the two forms of the recurrence formula are clearly mathematically equivalent,[nb 13] the first subtracts 1 from a number extremely close to 1, leading to an increasingly problematic loss of significant digits. As the recurrence is applied repeatedly, the accuracy improves at first, but then it deteriorates. It never gets better than about 8 digits, even though 53-bit arithmetic should be capable of about 16 digits of precision. When the second form of the recurrence is used, the value converges to 15 digits of precision.
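The comparison can be sketched in Python as follows. The function name `archimedes` and the choice of 24 steps are assumptions made for illustration. The starting value $t_0 = 1/\sqrt{3}$ is taken from the first row of the table, where $6 \times t_0 \approx 3.4641$.

```python
import math

def archimedes(form, steps=24):
    """Return the estimates 6 * 2**i * t_i for i = 0..steps."""
    t = 1.0 / math.sqrt(3.0)             # t_0 for the circumscribed hexagon
    estimates = []
    for i in range(steps + 1):
        estimates.append(6.0 * 2.0**i * t)
        if i == steps:
            break
        root = math.sqrt(t * t + 1.0)
        if form == "first":
            t = (root - 1.0) / t         # subtracts nearly equal numbers (cancellation)
        else:
            t = t / (root + 1.0)         # algebraically identical, no cancellation
    return estimates

first = archimedes("first")
second = archimedes("second")
for i, (a, b) in enumerate(zip(first, second)):
    print(f"{i:2d}  {a:.16f}  {b:.16f}")

print(abs(first[-1] - math.pi))          # large error after 24 steps
print(abs(second[-1] - math.pi))         # close to the limit of double precision
```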

"Fast math" optimization

The aforementioned lack of associativity of floating-point operations in general means that compilers cannot as effectively reorder arithmetic expressions as they could with integer and fixed-point arithmetic, presenting a roadblock in optimizations such as common subexpression elimination and auto-vectorization.[66] The "fast math" option on many compilers (ICC, GCC, Clang, MSVC...) turns on reassociation along with unsafe assumptions such as a lack of NaN and infinite numbers in IEEE 754. Some compilers also offer more granular options to only turn on reassociation. In either case, the programmer is exposed to many of the precision pitfalls mentioned above for the portion of the program using "fast" math.[67]
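Why reassociation is unsafe can be seen in a few lines of Python, which evaluates expressions strictly as written in IEEE 754 double precision. The values are arbitrary illustrations.

```python
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c
right = a + (b + c)
print(left)            # 0.6000000000000001
print(right)           # 0.6
print(left == right)   # False: addition is not associative

big, small = 1e16, 1.0
print((small + big) - big)   # 0.0, because small is absorbed when added to big
print(small + (big - big))   # 1.0
```

A compiler that is allowed to reassociate may pick either grouping, so the result can differ between builds or between two occurrences of the same expression.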

In some compilers (GCC and Clang), turning on "fast" math may cause the program to disable subnormal floats at startup, affecting the floating-point behavior of not only the generated code, but also any program using such code as a library.[68]

In most Fortran compilers, as allowed by the ISO/IEC 1539-1:2004 Fortran standard, reassociation is the default, with breakage largely prevented by the "protect parens" setting (also on by default). This setting stops the compiler from reassociating beyond the boundaries of parentheses.[69] Intel Fortran Compiler is a notable outlier.[70]

A common problem in "fast" math is that subexpressions may not be optimized identically from place to place, leading to unexpected differences. One interpretation of the issue is that "fast" math as implemented currently has a poorly defined semantics. One attempt at formalizing "fast" math optimizations is seen in Icing, a verified compiler.[71]

from Grokipedia
Floating-point arithmetic is a computational method for representing and performing operations on real numbers in computers using a finite number of bits, typically in a format consisting of a sign bit, an exponent, and a significand (or mantissa), which allows for the approximation of a wide dynamic range of values with varying precision. This approach, essential for scientific, engineering, and numerical applications, contrasts with fixed-point arithmetic by enabling efficient handling of both very large and very small numbers, though it introduces inherent approximation errors due to limited precision. The predominant standard governing floating-point arithmetic is IEEE 754, first established in 1985 by the Institute of Electrical and Electronics Engineers (IEEE) to promote portability and consistency across computing systems, defining binary and decimal formats, arithmetic operations (such as addition, subtraction, multiplication, division, and square root), rounding modes, and exception handling for cases like overflow, underflow, and invalid operations. Updated periodically to address evolving computational needs—such as the addition of lower-precision formats like binary16 in the 2008 revision and enhancements for reliable scientific computing in machine learning and autonomous systems—the 2019 revision (IEEE 754-2019) introduced features like augmented arithmetic operations with novel rounding behaviors and improved NaN (Not a Number) payload handling to mitigate inconsistencies in exception propagation. Prior to IEEE 754, implementations varied widely among vendors (e.g., IBM's System/360 used a hexadecimal base), leading to non-portable code and debugging challenges, but the standard's adoption has since become nearly universal in modern processors and programming languages like C, Fortran, and Java. 
Key aspects include the binary representation in IEEE formats, where single precision uses 32 bits (1 sign, 8 exponent, 23 significand) for about 7 decimal digits of precision, and double precision employs 64 bits (1 sign, 11 exponent, 52 significand) for roughly 16 digits, with special values like ±infinity and NaNs to represent undefined results. However, floating-point arithmetic is not exact; operations can produce rounding errors bounded by 0.5 units in the last place (ulp), potentially leading to catastrophic cancellation in subtractions of close values or accumulation of inaccuracies in iterative algorithms, necessitating careful numerical analysis and techniques like guard digits or compensated summation for mitigation. Despite these limitations, its balance of range, speed, and hardware support makes it indispensable for most real-number computations in digital systems.

Introduction

Definition and motivation

Floating-point arithmetic provides a method for representing and performing computations on real numbers in computer systems using an approximate, finite-precision format. A floating-point number is expressed in the general form $x = m \times b^e$, where $m$ is the significand (also known as the mantissa), representing the significant digits of the number; $b$ is the base, typically 2 for binary systems or 10 for decimal; and $e$ is the exponent, an integer that scales the significand to place the radix point appropriately. This structure allows the radix point to "float" to accommodate varying magnitudes, enabling the storage of numbers ranging from very small values near zero to extremely large ones within a fixed number of bits. The primary motivation for floating-point arithmetic stems from the need to approximate the infinite set of real numbers using finite computational resources, particularly in applications requiring a broad dynamic range where exact representations like fractions are impractical. In scientific and engineering computations, such as solving differential equations or simulating physical systems, values can span many orders of magnitude, from subatomic scales to astronomical distances, making fixed-point or integer formats insufficient due to their limited range and precision trade-offs. By prioritizing relative precision over absolute precision, floating-point formats facilitate efficient hardware and software implementations for these tasks, supporting graphical rendering, signal processing, and numerical simulations while managing the rounding errors inherent to approximation. This approach originated in the demands of early scientific computing, where the ability to handle scaled approximations was essential for iterative algorithms like those used in differential equation solvers, paving the way for standardized formats that ensure portability and reliability across systems.
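The decomposition $x = m \times b^e$ can be inspected in Python for base 2. This is a small sketch using the standard `math.frexp` function, which normalizes the significand to the interval [0.5, 1) rather than the [1, 2) convention of IEEE 754; the value 6.5 is an arbitrary example.

```python
import math

x = 6.5
m, e = math.frexp(x)           # x == m * 2**e with 0.5 <= |m| < 1
print(m, e)                    # 0.8125 3
print(math.ldexp(m, e) == x)   # True: the decomposition is exact

# Hexadecimal notation shows the IEEE-style form 1.625 * 2**2 directly.
print(x.hex())                 # 0x1.a000000000000p+2
```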

Comparison to integer and fixed-point arithmetic

Integer arithmetic operates exactly on whole numbers within the limits of its fixed bit width, providing precise results for counting, indexing, and discrete operations without fractional support, though it suffers from abrupt overflows that wrap around or trigger errors when values exceed the representable range, such as $-2^{31}$ to $2^{31}-1$ in 32-bit signed integers. Fixed-point arithmetic builds on integer methods by allocating a fixed number of bits for the fractional part, enabling precise representation of scaled fractional values (such as in Qm.n formats, where m bits are for the integer part and n for the fraction), but it offers limited dynamic range and requires explicit scaling or reformatting to handle very large or small values, with overflows behaving similarly to integers. Floating-point arithmetic addresses these limitations through an exponent that dynamically scales the significand, allowing a tremendous range of magnitudes from approximately $10^{-308}$ to $10^{308}$ in IEEE 754 double precision, which supports scientific and engineering computations far beyond the static scales of integer or fixed-point systems. However, this flexibility comes at the cost of inexact representation for many rational numbers, as binary floating-point cannot precisely encode decimals like 0.1, which is stored as 0.1000000000000000055511151231257827021181583404541015625 in double precision because 1/10 has an infinite binary expansion.
In practice, integer arithmetic suits applications demanding exactness for discrete values, such as array indexing or tallying counts; fixed-point is preferred in embedded systems and digital signal processing where computational efficiency and predictable precision outweigh the need for wide range, including scenarios like audio filtering or scaled financial computations to avoid rounding discrepancies; floating-point dominates in fields requiring broad dynamic range, such as physical simulations, machine learning models, and 3D graphics rendering, where handling both minuscule probabilities and enormous scales is essential.
| Aspect | Integer Arithmetic | Fixed-Point Arithmetic | Floating-Point Arithmetic (Double Precision) |
| --- | --- | --- | --- |
| Representation | Fixed-width whole numbers (e.g., 32-bit signed) | Scaled integers with fixed fractional bits (e.g., Q15.16) | Significand with variable exponent (53-bit significand, 11-bit exponent) |
| Precision | Exact for integers up to bit limit | Exact within scale (e.g., $2^{-16}$ resolution) | Approximate, ~15 decimal digits; relative precision ~$2^{-53}$ |
| Range | Limited (e.g., $-2^{31}$ to $2^{31}-1$) | Moderate, fixed by bit allocation (e.g., $\pm 2^{15}$ × scale) | Vast (≈ $10^{-308}$ to $10^{308}$) |
| Overflow behavior | Abrupt wraparound or exception | Similar to integer; requires manual handling | Gradual underflow to zero; overflow to infinity |
| Example: 0.1 | Not representable (truncates to 0) | Approximate (e.g., 6554/65536 = 0.100006103515625) | Approximate (0.10000000000000000555...) |
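The three treatments of 0.1 can be checked with a short Python sketch. The 16 fractional bits (`FRACTION_BITS`) are an illustrative assumption for the fixed-point case.

```python
from decimal import Decimal
from fractions import Fraction

# Integer arithmetic: the fractional part is simply dropped.
print(int(0.1))                          # 0

# Fixed point with 16 fractional bits (an assumed format for illustration).
FRACTION_BITS = 16
raw = round(0.1 * 2**FRACTION_BITS)      # stored integer
fixed_value = raw / 2**FRACTION_BITS
print(raw, Fraction(raw, 2**FRACTION_BITS), fixed_value)
# 6554 3277/32768 0.100006103515625

# Binary floating point: exact value of the double nearest to 0.1.
exact_double = Decimal(0.1)
print(exact_double)
# 0.1000000000000000055511151231257827021181583404541015625
```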

Historical Development

Early formats and implementations

The origins of floating-point arithmetic trace back to the early 1940s with Konrad Zuse's Z3 computer, completed in 1941, which was the first programmable digital computer to implement binary floating-point operations in hardware using relay-based electromechanical technology. The Z3 employed a 22-bit word format consisting of a sign bit, a 7-bit exponent (two's complement, base 2), and a 14-bit mantissa, enabling automatic normalization and supporting a range suitable for engineering calculations like aerodynamics. This design allowed the Z3 to perform addition, subtraction, multiplication, and division at a clock rate of about 5–10 Hz, marking a significant advance over manual computation; the original machine was destroyed during World War II and later reconstructed. In contrast, the ENIAC, completed in 1945 as the first general-purpose electronic digital computer, relied on decimal fixed-point arithmetic implemented via about 18,000 vacuum tubes and ring counters, with programmers manually scaling numbers to simulate floating-point behavior. This approach supported roughly 357 ten-digit multiplications per second but lacked dedicated floating-point hardware, reflecting the era's focus on decimal representation for ease of human verification in scientific and military applications like ballistic trajectory computations. John von Neumann, who consulted on ENIAC and authored the influential EDVAC report in 1945, advocated for stored-program architectures that laid the groundwork for efficient numerical processing, though he personally favored fixed-point over floating-point to avoid perceived risks of numerical instability. The 1950s saw the transition to dedicated hardware floating-point units, exemplified by the IBM 704 in 1954, the first mass-produced commercial computer with such capabilities, using a 36-bit binary format: 1 sign bit, 8-bit excess-128 exponent (base 2), and 27-bit normalized mantissa for single-precision operations.
This enabled about 40,000 instructions per second, including floating-point add, multiply, and divide, and facilitated the development of FORTRAN in 1957 for scientific computing. Early systems often preferred decimal formats for business compatibility and readable output, as seen in machines like the IBM 650 (1953), but scientific applications increasingly adopted binary for speed; variations emerged in bases like 8 (octal in some Burroughs machines) and 16 (hexadecimal in the IBM System/360 of 1964, with 32-bit single and 64-bit double precision formats). The base-16 exponent gave a wide range from a short exponent field and simplified normalization, at the cost of precision that varies by up to three bits depending on the leading hexadecimal digit. The System/360 served business needs separately through packed decimal instructions, while its hexadecimal floating point targeted scientific work. By the 1970s, supercomputers advanced these formats further, as in the Cray-1 (1976), which used a 64-bit binary format with 1 sign bit, a 15-bit excess-16384 exponent (base 2), and a 48-bit mantissa, achieving up to 160 megaflops through vectorized floating-point units for simulations in physics and weather modeling. These ad hoc variations in word lengths, bases, and normalization across machines, such as binary in the Z3 and IBM 704 versus hexadecimal in System/360, highlighted portability challenges that later drove standardization efforts.

Path to IEEE 754 standardization

In the 1960s, floating-point arithmetic suffered from severe portability problems due to vendor-specific formats that varied in precision, range, rounding behaviors, and exception handling, making software development across machines like IBM's System/360 and DEC's PDP series expensive and error-prone. William Kahan, while teaching at the University of Toronto, encountered these issues firsthand with the IBM 7090 and began advocating for standardized error analysis and portable floating-point computations, earning him recognition as the "father of floating point." During the 1970s, efforts to address these inconsistencies led to the formation of committees under the Association for Computing Machinery (ACM) and the IEEE, with Kahan playing a pivotal role in pushing for a universal standard. In 1977, the IEEE formed the P754 working group, chaired initially by Richard Delp and later by David Stevenson, which included representatives from Intel, DEC, IBM, and academia; Kahan collaborated with Jerome Coonen and Harold Stone to draft the influential K-C-S proposal. These groups debated formats, with DEC advocating for its VAX heritage from the PDP-11, but the need for microprocessor compatibility, particularly with Intel's forthcoming 8087 coprocessor, drove consensus toward binary formats with gradual underflow, multiple rounding modes, and exception handling. The IEEE 754-1985 standard emerged from this process, officially approved and published in 1985, defining binary floating-point formats (single and double precision, with optional extended variants), four rounding modes, and five exception types to ensure predictable behavior across systems. Motivated by the rise of affordable microprocessors like the Intel 8087 released in 1980, which Kahan consulted on, the standard aimed to eliminate the "Tower of Babel" of incompatible arithmetics.
Subsequent revisions refined the standard: IEEE 754-2008 incorporated decimal floating-point formats for financial applications, fused multiply-add operations for improved accuracy, and enhanced exception handling, while maintaining backward compatibility with the 1985 version. The 2019 update (IEEE 754-2019) focused on clarifications for reproducibility, such as stricter rules for operations like remainder and sorting, and support for additional formats without major overhauls. By the 1990s, IEEE 754 achieved near-universal adoption in central processing units, with x86 architectures compliant from the Intel 8087 onward, SPARC processors implementing it starting in 1987, and ARM architectures integrating full support by the mid-1990s through extensions like the Floating-Point Unit. This widespread implementation, reconfirmed by IEEE in 1998, transformed floating-point arithmetic into a reliable foundation for scientific computing and software portability.

Floating-Point Representations

IEEE 754 binary formats

The IEEE 754-2008 standard, also known as IEEE 754r, defines several binary floating-point formats for representing real numbers in computing systems, with the core interchange formats being binary16 (half precision, 16 bits), binary32 (single precision, 32 bits), and binary64 (double precision, 64 bits). These formats enable a balance between precision, range, and storage efficiency, supporting normalized numbers, subnormal numbers, infinities, and NaNs (not-a-number values). The standard was revised to include binary16 for applications requiring reduced memory, such as graphics and machine learning, while maintaining backward compatibility with earlier formats. An optional binary128 (quad precision, 128 bits) format is also specified for higher precision needs. Each format consists of three components: a sign bit (1 bit, 0 for positive and 1 for negative), a biased exponent field, and a fraction (mantissa) field representing the significand. The exponent is stored as an unsigned integer with a bias added to the true exponent to allow representation of both positive and negative exponents using only positive values; for example, the bias is 127 for binary32. Normalized numbers have an implicit leading 1 in the significand, so the mantissa is interpreted as 1.f, where f is the stored fraction bits, providing a precision equal to one more than the number of fraction bits. Subnormal numbers, used for values near zero, have an explicit leading 0 in the significand to extend the range downward without gaps. The parameters for these formats are summarized in the following table:
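| Format | Total Bits | Sign Bits | Exponent Bits | Bias | Fraction Bits | Total Significand Bits (incl. implicit 1) |
| --- | --- | --- | --- | --- | --- | --- |
| binary16 | 16 | 1 | 5 | 15 | 10 | 11 |
| binary32 | 32 | 1 | 8 | 127 | 23 | 24 |
| binary64 | 64 | 1 | 11 | 1023 | 52 | 53 |
| binary128 | 128 | 1 | 15 | 16383 | 112 | 113 |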
These yield approximate decimal precisions of 3–4 digits for binary16, 7–8 digits for binary32, 15–16 digits for binary64, and 34 digits for binary128. The representable range varies by format, encompassing both normal and subnormal numbers. For binary32, the maximum finite value is approximately $3.4 \times 10^{38}$ and the minimum positive subnormal is approximately $1.4 \times 10^{-45}$, with 24 bits of precision. For binary64, the maximum is approximately $1.8 \times 10^{308}$ and the minimum positive subnormal is approximately $4.9 \times 10^{-324}$, with 53 bits of precision. Binary16 offers a smaller range, with a maximum of about $6.55 \times 10^{4}$ and a minimum subnormal around $5.96 \times 10^{-8}$. Binary128 extends this dramatically, up to roughly $1.19 \times 10^{4932}$. These ranges ensure coverage for most scientific and engineering computations while highlighting trade-offs in overflow and underflow risks. The value of a normalized finite number in these formats is given by the formula $(-1)^s \times (1.f) \times 2^{e - \text{bias}}$, where $s$ is the sign bit, $f$ is the fraction interpreted as a binary fraction, $e$ is the stored (biased) exponent, so that $e - \text{bias}$ is the true exponent, and the leading 1 is implicit. For binary32 specifically, this is $(-1)^s \times 1.f \times 2^{e-127}$, where the stored exponent $e$ ranges from 1 to 254 for normal numbers, giving true exponents from -126 to 127. For instance, the binary32 representation of 1.0 has $s=0$, stored exponent 127, and fraction 0, yielding $1 \times 2^{0} = 1$. Subnormals use $(-1)^s \times 0.f \times 2^{\text{emin}}$ instead, with emin = 1 - emax (e.g., -126 for binary32). The binary128 format follows the same structure but is not required for conforming implementations, allowing flexibility for high-performance computing.
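The value formulas can be applied by hand to a binary32 bit pattern. The following Python sketch uses a hypothetical helper `decode_binary32` that extracts the three fields and evaluates the normal, subnormal, and special cases; it is cross-checked against the standard `struct` module rather than being a reference decoder.

```python
import struct

def decode_binary32(bits):
    """Decode a 32-bit pattern into a float by applying the field formulas."""
    s = bits >> 31                    # sign bit
    e = (bits >> 23) & 0xFF           # stored (biased) exponent
    f = bits & 0x7FFFFF               # 23 fraction bits
    sign = -1.0 if s else 1.0
    if e == 0xFF:                     # all ones: infinity or NaN
        return sign * float("inf") if f == 0 else float("nan")
    if e == 0:                        # zero or subnormal: 0.f * 2**emin, emin = -126
        return sign * (f / 2**23) * 2.0**-126
    return sign * (1 + f / 2**23) * 2.0**(e - 127)   # normal: 1.f * 2**(e - bias)

def hardware_decode(bits):
    """Let struct reinterpret the same bits as an IEEE 754 binary32 value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

for pattern in (0x3F800000, 0x40600000, 0x00000001, 0x7F7FFFFF, 0xC0000000):
    print(hex(pattern), decode_binary32(pattern), hardware_decode(pattern))
```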

Internal encoding details

In the IEEE 754 binary formats, floating-point numbers are encoded using a fixed allocation of bits to represent the sign, exponent, and fraction (significand). For the binary32 format, the 32-bit value is structured with bit 31 as the sign bit (0 for positive, 1 for negative), bits 30 through 23 as the 8-bit exponent field, and bits 22 through 0 as the 23-bit fraction field. This allocation provides a balance between range (via the exponent) and precision (via the fraction), with the sign bit determining the overall polarity. The exponent field uses biasing to represent both positive and negative exponents with an unsigned binary value, avoiding the need for a sign bit in the exponent itself. For binary32, the bias is 127, so the stored exponent value is the true exponent $e$ plus 127 (i.e., $e + 127$), allowing exponents from -126 to +127 for normalized numbers. An all-zero exponent field (stored value 0) is reserved for subnormal numbers and zero, while an all-one field (stored value 255) indicates infinities or NaNs, ensuring special values can be distinguished without ambiguity. Normalized numbers maintain the significand in the half-open interval [1, 2) by shifting the binary point so that the leading bit is always 1; this "hidden bit" is not explicitly stored, which frees one more bit for precision. For binary32, this implicit leading 1 combined with the 23-bit fraction yields 24 bits of significand precision. Subnormal (denormalized) numbers, used when the exponent field is zero and the fraction is nonzero, have a leading 0 in the significand (0.f), enabling gradual underflow toward zero and preserving some precision for very small values below the minimum normalized magnitude. To illustrate, the decimal value 3.5 equals $1.11_2 \times 2^1$ in normalized binary form. The sign bit is 0 (positive), the true exponent is 1 (biased to 128, or $10000000_2$), and the fraction is $11_2$ padded with 21 zeros ($11000000000000000000000_2$, representing 0.75 in decimal). The full 32-bit encoding is thus:

0 10000000 11000000000000000000000

This binary string corresponds to the hexadecimal value 0x40600000. For multi-byte formats like binary64 (64 bits), the internal bit layout follows the same sign-exponent-fraction structure (1 sign bit, 11 exponent bits with bias 1023, 52 fraction bits), but storage in memory across multiple bytes depends on the system's endianness. In big-endian systems, the most significant byte (containing the sign and part of the exponent) is stored at the lowest memory address, while in little-endian systems, the least significant byte is first; this affects portability when transferring binary64 values as two 32-bit words between architectures. The IEEE 754 standard defines the logical bit ordering from most to least significant but does not mandate byte order, leaving it to the host system's conventions.
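The encoding of 3.5 and the effect of byte order can be checked with Python's standard `struct` module. This is a small sketch; the format codes `">f"` and `"<f"` select big-endian and little-endian binary32 respectively.

```python
import struct

big = struct.pack(">f", 3.5)       # big-endian binary32
little = struct.pack("<f", 3.5)    # little-endian binary32
print(big.hex())                   # 40600000
print(little.hex())                # 00006040

# Split the 32 bits into sign, exponent, and fraction fields.
bits = int.from_bytes(big, "big")
fields = f"{bits >> 31:01b} {(bits >> 23) & 0xFF:08b} {bits & 0x7FFFFF:023b}"
print(fields)                      # 0 10000000 11000000000000000000000
```

The logical bit pattern is the same in both cases; only the order of the bytes in memory differs.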

Alternative and extended formats

Decimal floating-point formats, introduced in the IEEE 754-2008 standard and refined in the 2019 revision, provide base-10 representations to ensure exact storage of decimal fractions such as 0.1, which are common in financial and commercial applications where binary formats introduce rounding errors. These formats include decimal32 (32 bits, up to 7 decimal digits of precision), decimal64 (64 bits, up to 16 decimal digits), and decimal128 (128 bits, up to 34 decimal digits), offering a dynamic range from approximately 10^{-6143} to 10^{6144} for decimal128. The base-10 radix allows direct representation of human-readable decimals without approximation, making them suitable for applications like banking and accounting software. In Java, the BigDecimal class implements arbitrary-precision decimal arithmetic inspired by these principles, enabling precise financial calculations by avoiding the inexactness of binary floating-point. Historical alternatives to binary floating-point include IBM's hexadecimal format, which uses base-16 and has been supported in z/Architecture since the System/360 era in the 1960s. This format features single (32 bits), double (64 bits), and extended (128 bits) precisions, with a characteristic (exponent) and fraction where the mantissa is normalized to lie between 1/16 and 1, providing approximately 7 decimal digits for single precision, 16 for double, and 34 for extended precision. It persists in legacy mainframe environments for compatibility with existing data but offers less uniform precision distribution compared to binary formats due to the larger radix. Minifloats, reduced-precision formats with as few as 8 bits, emerged for resource-constrained embedded systems to balance computational efficiency and accuracy, often sacrificing range for lower memory and power usage in applications like sensor processing. 
Modern extensions to standard binary formats include brain floating-point (bfloat16), developed by Google in 2018 for machine learning workloads on tensor processing units. Bfloat16 uses 16 bits: 1 sign bit, 8 exponent bits (matching single precision, for a range up to about $3.4 \times 10^{38}$), and 7 mantissa bits, trading precision (about 2-3 decimal digits) for the full dynamic range of 32-bit floats to avoid overflow during neural network training. Empirical studies show bfloat16 achieves comparable accuracy to single precision in image classification and speech recognition tasks without requiring loss scaling techniques. Half-precision variants, such as ARM's FP16 introduced in the Armv8.2-A architecture in the late 2010s, extend the IEEE 754 binary16 format (16 bits: 1 sign, 5 exponent, 10 mantissa) with hardware support for arithmetic operations, enabling efficient storage and computation in graphics and embedded machine learning. An alternative representation, posits, was proposed by John Gustafson in 2017 as a drop-in replacement for IEEE 754 floats, featuring a sign bit followed by a variable-length regime field for tapered precision, an optional fixed-length exponent, and a fraction field. Unlike the fixed exponent widths in IEEE formats, the regime uses a unary-like encoding to dynamically allocate bits, providing higher accuracy near unity (e.g., a 32-bit posit with es=2 offers up to 28 effective significand bits versus 24 in binary32) over a dynamic range of roughly $7.5 \times 10^{-37}$ to $1.3 \times 10^{36}$. Posits eliminate NaNs and subnormals, using all-zero bits for zero and a leading 1 followed by zeros for infinity, which simplifies hardware and improves accuracy in operations like reciprocals and polynomial evaluations, with error reductions up to factors of hundreds in some examples per the source benchmarks.
| Format | Bits | Radix | Precision (decimal digits) | Typical Use |
| --- | --- | --- | --- | --- |
| decimal32 | 32 | 10 | 7 | Financial computations |
| decimal64 | 64 | 10 | 16 | Commercial data processing |
| decimal128 | 128 | 10 | 34 | High-precision decimals |
| IBM Hex Single | 32 | 16 | ~7 | Legacy mainframes |
| bfloat16 | 16 | 2 | ~2-3 | ML training |
| ARM FP16 | 16 | 2 | ~3-4 | Embedded graphics/ML |
| 32-bit Posit (es=2) | 32 | 2 | ~3-8 (tapered) | Numerical algorithms |
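Because bfloat16 shares its sign and exponent layout with binary32, a bfloat16 value can be sketched as the upper 16 bits of a binary32 encoding. The helpers below (`to_bfloat16_bits`, `from_bfloat16_bits`) are hypothetical names, and the conversion truncates the low bits for simplicity; real hardware and libraries typically round to nearest instead.

```python
import math
import struct

def to_bfloat16_bits(x):
    """Return a 16-bit bfloat16 pattern by truncating the binary32 encoding of x."""
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    return bits32 >> 16               # keep sign, 8 exponent bits, top 7 fraction bits

def from_bfloat16_bits(bits16):
    """Widen a bfloat16 pattern back to a Python float (this direction is exact)."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits16 << 16))
    return value

b = to_bfloat16_bits(math.pi)
print(hex(b))                         # 0x4049
print(from_bfloat16_bits(b))          # 3.140625 (only 2-3 correct decimal digits)

# The 8-bit exponent keeps large binary32 magnitudes finite.
large = from_bfloat16_bits(to_bfloat16_bits(1e38))
print(large, math.isfinite(large))
```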

Properties and Range

Exponent and mantissa structure

In floating-point arithmetic, the exponent determines the magnitude of the represented number through the scaling factor $b^e$, where $b$ is the radix (typically 2 in binary formats) and $e$ is the unbiased exponent value. To facilitate efficient storage and comparison in binary hardware, the exponent is stored in a biased form: a positive constant, known as the bias, is added to the true exponent, allowing both positive and negative exponents to be represented using unsigned integer encoding without dedicating a separate sign bit. The bias is chosen such that the smallest normal exponent maps to the stored value 1, leaving the all-zero field for zero and subnormal numbers, and the largest normal exponent maps to one below the all-ones value, which is reserved for infinities and NaNs. The mantissa, also called the significand or fraction, encodes the significant digits that provide the precision of the number, typically in a normalized form to maximize representational efficiency. Normalization ensures that the leading digit (or bit, in binary) is nonzero, eliminating leading zeros and allowing an implicit leading 1 to be assumed in binary formats, which effectively adds one extra bit of precision without explicit storage. In binary formats the value is thus represented as $(-1)^s \times (1 + f) \times 2^e$, where $s$ is the sign bit, $f$ is the fractional part stored in the mantissa field, and the precision is $p = 1 + t$ bits, with $t$ denoting the number of explicit fraction bits. This structure prioritizes dense packing of precision within limited bits. The allocation of bits between the exponent and mantissa involves inherent trade-offs: increasing exponent bits expands the dynamic range (larger maximum and smaller minimum magnitudes) but reduces the mantissa bits available for precision, potentially leading to larger rounding errors in computations.
For instance, in the double-precision format, 11 bits are devoted to the exponent for a wide range, contrasted with 52 bits for the mantissa to maintain high precision suitable for scientific applications. Subnormal numbers, or denormals, address abrupt underflow by permitting a zero leading bit in the significand when the biased exponent field is zero; they are interpreted with the same true exponent as the smallest normal numbers, which fills the gap between zero and the smallest normal value with evenly spaced numbers and gives a gradual transition to zero instead of an abrupt jump.

Representable values and gaps

In binary floating-point systems conforming to the IEEE 754 standard, only a specific subset of real numbers can be exactly represented: the signed dyadic rationals, which are finite sums of distinct powers of 12\frac{1}{2} (i.e., numbers of the form ±k2n\pm \frac{k}{2^n} for integers kk and n0n \geq 0). These arise from the normalized form ±(1+f)×2e\pm (1 + f) \times 2^e, where ff is the fractional part with a finite binary expansion limited by the mantissa bits, and ee is the exponent. For instance, 0.5 = 12=1×21\frac{1}{2} = 1 \times 2^{-1} is exactly representable in any IEEE 754 binary format, as its binary representation terminates. In contrast, 0.1 = 110\frac{1}{10} has a repeating binary expansion (0.0001100110011...₂) and cannot be exactly represented, requiring approximation to the nearest representable value. The gaps between consecutive representable values, known as the unit in the last place (ulp), vary with the magnitude of the numbers due to the fixed precision of the mantissa. In normalized representation, the ulp at a number xx with exponent ee (where 2ex<2e+12^e \leq |x| < 2^{e+1}) equals 2ep+12^{e - p + 1}, with pp being the precision (mantissa bits including the implicit leading 1). For IEEE 754 double precision (p=53p = 53), the ulp between 1 and 2 (where e=0e = 0) is 2522.22×10162^{-52} \approx 2.22 \times 10^{-16}, meaning the spacing between representables in this range is approximately 2.22×10162.22 \times 10^{-16}. These gaps widen exponentially as x|x| increases, since each increment in ee doubles the ulp. The set of representable values in any IEEE 754 binary format is finite and discrete, encompassing all signed dyadic rationals within the supported range, plus special values like zero, infinities, and NaNs. 
For double precision, the total number of distinct finite representable values (treating $+0$ and $-0$ as distinct bit patterns) is $2^{64} - 2^{53} \approx 1.84 \times 10^{19}$, accounting for normalized numbers, subnormals (denormals), and signed zero. Because the spacing scales with magnitude, representable values cluster near zero, where subnormals are spaced $2^{-1074}$ apart, and become progressively sparser toward the extremes, as illustrated conceptually on the number line:
  • Near 0: Dense packing, with an ulp as small as $2^{-1074}$ for the subnormals.
  • Around 1: Moderate spacing at $2^{-52}$.
  • At large scales (e.g., near $2^{1023}$): Vast gaps up to $2^{971} \approx 2 \times 10^{292}$, where neighboring representable values are separated by distances exceeding the observable universe's diameter in Planck units.
A key implication of these gaps is the progressive loss of precision for large integers beyond the mantissa width. In double precision, all integers from $-2^{53}$ to $2^{53}$ are exactly representable, as they fit within the 53-bit precision. Larger integers cannot all be distinguished: the ulp between $2^{53}$ and $2^{54}$ is 2, so only even integers are representable in that range. For example, $2^{53} + 1$ lies exactly halfway between $2^{53}$ and $2^{53} + 2$ and rounds to $2^{53}$ under the default round-to-nearest, ties-to-even mode, whereas $2^{53} + 2$ is itself representable. This limitation underscores the trade-off between range and accuracy inherent in floating-point formats.
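The spacing formula and the integer limit can be checked directly. The following is a minimal sketch in Python, assuming an interpreter whose `float` is IEEE 754 binary64 and Python 3.9 or later for `math.ulp` and `math.nextafter`; the helper name `ulp_from_formula` is introduced here for illustration only.

```python
import math

def ulp_from_formula(x: float, p: int = 53) -> float:
    """Spacing 2**(e - p + 1) for a normal x with 2**e <= |x| < 2**(e + 1)."""
    # frexp returns x == m * 2**exp with 0.5 <= |m| < 1, so e == exp - 1
    _, exp = math.frexp(x)
    return math.ldexp(1.0, (exp - 1) - p + 1)

# The formula agrees with the library's ulp for normal doubles
for x in (1.0, 1.5, 1024.0, 2.0 ** 1023):
    assert ulp_from_formula(x) == math.ulp(x)

gap_at_one = math.nextafter(1.0, 2.0) - 1.0   # spacing just above 1: 2**-52
smallest_subnormal = math.ulp(0.0)            # smallest positive double: 2**-1074
above_2_53 = float(2 ** 53 + 1)               # halfway case, rounds to 2**53
```

The formula is only meant for normal values; in the subnormal range the spacing is constant at $2^{-1074}$, which is why `math.ulp(0.0)` is used for that case instead.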

Dynamic range and precision limits

The dynamic range of floating-point numbers in IEEE 754 binary formats is the span from the smallest positive normal number to the largest finite number representable, and it is determined primarily by the exponent field. In the binary32 (single-precision) format, the smallest positive normal number is $2^{-126} \approx 1.17549435 \times 10^{-38}$, while the largest finite number is $(2 - 2^{-23}) \times 2^{127} \approx 3.40282347 \times 10^{38}$. For the binary64 (double-precision) format, the range expands significantly, with the smallest positive normal at $2^{-1022} \approx 2.22507386 \times 10^{-308}$ and the largest finite value at $(2 - 2^{-52}) \times 2^{1023} \approx 1.79769313 \times 10^{308}$.

Precision in these systems is characterized by relative accuracy, quantified by the machine epsilon $\epsilon$, defined here as the gap between 1 and the next larger representable number; it equals $2^{-(p-1)}$, where $p$ is the significand precision (24 for binary32, 53 for binary64). Thus $\epsilon \approx 1.192 \times 10^{-7}$ for single precision and $\epsilon \approx 2.220 \times 10^{-16}$ for double precision. Under round-to-nearest, the relative error of representing a value in the normal range, or of a single correctly rounded operation, is at most $\epsilon / 2$. Absolute precision, however, decreases for larger magnitudes due to the fixed mantissa length, as the spacing between representable numbers scales with $2^{e}$, where $e$ is the unbiased exponent.

When operations exceed these bounds, overflow occurs if the result's magnitude surpasses the maximum finite value, producing positive or negative infinity depending on the sign (under the default rounding mode). Underflow happens for nonzero results smaller in magnitude than the minimum normal number; such results are either represented as subnormal numbers (gradual underflow, the IEEE 754 default) or flushed to zero on systems that enable a flush-to-zero mode.
In gradual underflow, subnormals extend the range below the minimum normal but at reduced precision, as the leading 1 in the mantissa is suppressed, effectively shortening the significand and increasing relative error. The following table summarizes key dynamic range and precision parameters for common IEEE 754 binary formats:
| Format | Significand bits ($p$) | Min. positive normal | Max. finite value | Machine epsilon ($\epsilon$) |
|---|---|---|---|---|
| binary32 | 24 | $2^{-126} \approx 1.18 \times 10^{-38}$ | $(2 - 2^{-23}) \times 2^{127} \approx 3.40 \times 10^{38}$ | $2^{-23} \approx 1.19 \times 10^{-7}$ |
| binary64 | 53 | $2^{-1022} \approx 2.23 \times 10^{-308}$ | $(2 - 2^{-52}) \times 2^{1023} \approx 1.80 \times 10^{308}$ | $2^{-52} \approx 2.22 \times 10^{-16}$ |
Double precision provides a vastly wider dynamic range, spanning approximately 616 orders of magnitude compared to about 76 for single precision, making it suitable for applications requiring high dynamic range, such as scientific simulations, at the cost of doubled storage.
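The table entries follow from just two parameters, the precision $p$ and the largest unbiased exponent. As a hedged illustration, the sketch below recomputes them with exact rational arithmetic; the function name `format_parameters` and its argument names are assumptions made for this example, and the comparison against `sys.float_info` assumes that Python's `float` is binary64.

```python
import sys
from fractions import Fraction

def format_parameters(p: int, emax: int):
    """Return (min normal, max finite, epsilon) as exact fractions.

    p is the precision in bits (including the implicit bit) and emax the
    largest unbiased exponent; emin is 1 - emax in IEEE 754 binary formats.
    """
    emin = 1 - emax
    min_normal = Fraction(2) ** emin
    max_finite = (2 - Fraction(2) ** (1 - p)) * Fraction(2) ** emax
    epsilon = Fraction(2) ** (1 - p)
    return min_normal, max_finite, epsilon

b32 = format_parameters(24, 127)     # binary32
b64 = format_parameters(53, 1023)    # binary64

# On common platforms Python's float is binary64, so these should agree
assert b64 == (Fraction(sys.float_info.min),
               Fraction(sys.float_info.max),
               Fraction(sys.float_info.epsilon))
```

Using `Fraction` keeps every quantity exact, so the comparison does not itself depend on floating-point rounding.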

Conversion and Rounding

Binary-to-decimal and decimal-to-binary processes

Converting a binary floating-point number to its decimal representation involves extracting the sign, exponent, and mantissa from the encoded format, then scaling the mantissa appropriately to produce the integer and fractional parts in base 10. For IEEE 754 binary formats, the significand has 24 bits for single precision or 53 bits for double precision (in both cases counting the implicit leading 1), while the exponent field determines the scaling factor $2^{e - \text{bias}}$, where $e$ is the stored (biased) exponent. To generate the decimal digits, the scaled value is first split into integer and fractional parts, scaling by powers of 10 as needed for very large or very small magnitudes; subsequent fractional digits are obtained by repeatedly multiplying the fractional remainder by 10 and taking the integer part of the result.

A key challenge in binary-to-decimal conversion is ensuring the output decimal string is the shortest possible representation that allows round-trip accuracy, meaning that re-parsing the decimal back to binary yields the exact original value. For instance, the IEEE 754 double-precision encoding of 0.1 should print as "0.1" rather than as a needlessly long expansion like "0.10000000000000000555", which naive digit generation produces because the stored value is not exactly one tenth. The Dragon4 algorithm, developed by Steele and White, addresses this by computing decimal digits while tracking upper and lower bounds on the interval of real numbers that round to the given value; it stops generating digits as soon as the digits produced so far identify a decimal inside that interval, guaranteeing correctness and minimality for all finite inputs. This method evolved from earlier techniques like Dragon2 and supports various output modes, such as fixed or scientific notation, while minimizing the number of digits; at most 17 significant digits are ever needed for double precision to ensure round-trip fidelity.
As an example, consider the IEEE 754 double-precision approximation of π, whose exact stored value is 3.141592653589793115997963468544185161590576171875. Its significand is 1.1001001000011111101101010100010001000010110100011000 in binary and its unbiased exponent is 1. Scaling yields the integer part 3, and successive multiplications of the fraction by 10 generate the fractional digits, resulting in the 16-digit decimal "3.141592653589793" under shortest-representation rules. A basic binary-to-decimal conversion for a double-precision input, given as the sign bit, the biased exponent field, and the 52-bit stored mantissa, can be outlined as follows:

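```python
from fractions import Fraction

BIAS = 1023  # exponent bias for binary64

def binary_to_decimal(sign, exponent, mantissa, max_digits=20):
    """Expand binary64 fields (sign bit, biased exponent, 52-bit fraction) into decimal digits."""
    if exponent == 0:
        # Subnormal: no implicit leading 1, fixed exponent 1 - BIAS
        value = Fraction(mantissa, 2**52) * Fraction(2) ** (1 - BIAS)
    else:
        # Normal: implicit leading 1 (exponent 2047, i.e. Inf/NaN, is not handled)
        value = (1 + Fraction(mantissa, 2**52)) * Fraction(2) ** (exponent - BIAS)

    # Split the magnitude into integer and fractional parts (exact arithmetic)
    integer_part = value.numerator // value.denominator
    fraction = value - integer_part

    # Emit the sign, then generate fractional digits by repeated multiplication by 10
    decimal_str = ("-" if sign == 1 else "") + str(integer_part) + "."
    for _ in range(max_digits):
        fraction *= 10
        digit = fraction.numerator // fraction.denominator
        decimal_str += str(digit)
        fraction -= digit
        if fraction == 0:
            break  # Terminate early if the expansion is exact
    return decimal_str
```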

This simplified version does not enforce shortest representation or error bounds, unlike Dragon4, and may require post-processing for accuracy.

The inverse process, decimal-to-binary conversion, begins by parsing the input decimal string into its sign, integer part, and fractional part, then converting each to binary while adjusting the overall exponent. The integer part is converted using repeated division by 2, which yields binary digits from least to most significant; the fractional part is handled by repeatedly multiplying by 2 and recording the integer part (0 or 1) as the next binary digit after the point, continuing until the fraction terminates or a precision limit is reached. Since many decimals (like 0.1) have non-terminating binary expansions, the algorithm must round the result to the nearest representable binary floating-point value. In the other direction, 17 significant decimal digits are enough to identify any double-precision value uniquely.

David Gay's algorithm for correctly rounded decimal-to-binary conversion employs table lookups for powers of 10 and 5 to scale the decimal significand into a form amenable to binary normalization, followed by exact multiplication and division using arbitrary-precision integers to compute the closest binary representation. This approach handles non-terminating fractions by carrying enough precision to decide the rounding correctly, then applying the rounding rules to select the final mantissa and exponent. For example, parsing "0.1" yields a binary fraction starting $0.0001100110011\ldots_2$ (repeating), which rounds to the double-precision value 0x3FB999999999999A after normalization to exponent $-4$.
Non-terminating decimals pose a particular challenge, as infinite binary expansions must be truncated or rounded without introducing bias; Gay's method mitigates this by computing both over- and under-estimates of the value and selecting the representable binary number that minimizes the distance, ensuring round-trip conversions are exact when sufficient decimal digits are provided. Rounding during these conversions follows the specified mode (e.g., round-to-nearest) to resolve ties, as detailed in related standards.
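The repeated-doubling step and the final encoding of 0.1 can be illustrated with a short Python sketch. It relies on the interpreter's own correctly rounded `float()` parser rather than reimplementing Gay's algorithm; the helper names `fraction_to_binary_digits` and `double_bits` are hypothetical, and binary64 `float` is assumed.

```python
import struct
from fractions import Fraction

def fraction_to_binary_digits(frac: Fraction, count: int) -> str:
    """First `count` binary digits after the point of a fraction in [0, 1)."""
    digits = []
    for _ in range(count):
        frac *= 2
        bit = int(frac >= 1)    # integer part after doubling is the next bit
        digits.append(str(bit))
        frac -= bit
    return "".join(digits)

def double_bits(x: float) -> str:
    """Hexadecimal encoding of x as an IEEE 754 binary64 value."""
    return struct.pack(">d", x).hex().upper()

tenth_digits = fraction_to_binary_digits(Fraction(1, 10), 16)  # "0001100110011001"
tenth_bits = double_bits(float("0.1"))                         # "3FB999999999999A"

# Conversion error of the stored double, as an exact rational
conversion_error = abs(Fraction(float("0.1")) - Fraction(1, 10))
```

Because the parsed value is the nearest double, `conversion_error` is at most half an ulp at that magnitude, that is, at most $2^{-57}$.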

Rounding modes and rules

In floating-point arithmetic, computations often yield results that cannot be exactly represented within the finite precision of the format, necessitating a systematic approach to select the closest representable value. The IEEE 754 standard defines rounding modes to resolve such inexactness consistently across operations like addition, multiplication, and format conversions. These modes ensure reproducibility and control over error direction, which is crucial for numerical stability and analysis. The standard specifies four primary rounding modes: round to nearest (with ties to even), round toward zero, round toward positive infinity, and round toward negative infinity.
| Mode | Description | Behavior on positive inexact result |
|---|---|---|
| Round to nearest, ties to even | Selects the representable value closest to the exact result; ties (exactly midway) are resolved by choosing the value with an even least significant bit in the significand. | Rounds up if fractional part > 0.5 ulp; down if < 0.5 ulp; to even if = 0.5 ulp. |
| Round toward zero | Selects the representable value closest to zero (truncates fractional part). | Always rounds down (toward 0). |
| Round toward +∞ | Selects the smallest representable value no less than the exact result. | Always rounds up. |
| Round toward -∞ | Selects the largest representable value no greater than the exact result. | Always rounds down. |
Here, ulp denotes the unit in the last place, the spacing between consecutive representable values at the result's magnitude, and the tie threshold for round to nearest is ulp/2. The round to nearest, ties to even mode is the default for all binary floating-point formats; it avoids the systematic bias that always rounding ties in one direction would introduce over repeated operations. Conceptually, each operation computes an exact (infinitely precise) intermediate result and then rounds it to a representable value according to the mode; an inexact result that rounds to zero keeps the sign of the exact result. For instance, in double-precision arithmetic, the sum 0.1 + 0.2 yields 0.30000000000000004 under round to nearest, ties to even: the stored operands are each slightly larger than 0.1 and 0.2, their exact sum falls precisely halfway between two adjacent doubles, and the tie is resolved toward the neighbor whose least significant mantissa bit is even, which is the double just above the one closest to 0.3. This default mode affects both arithmetic operations and conversions, such as binary-to-decimal, ensuring consistent handling of inexactness. Directed rounding modes (toward +∞ or -∞) are particularly valuable in applications requiring error bounds, such as interval arithmetic, where outward rounding expands intervals to enclose all possible results from rounding uncertainty.
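The four modes can be modelled on exact rationals. The sketch below is illustrative only: `round_to_precision` and its mode strings are names invented for this example, the exponent range is ignored (no overflow or subnormals), and it does not change the hardware rounding mode, which standard Python does not expose.

```python
import math
from fractions import Fraction

def round_to_precision(x: Fraction, p: int, mode: str) -> Fraction:
    """Round an exact rational to p significant bits (exponent range ignored)."""
    if x == 0:
        return x
    mag = abs(x)
    e = mag.numerator.bit_length() - mag.denominator.bit_length()
    if Fraction(2) ** e > mag:        # ensure 2**e <= |x| < 2**(e + 1)
        e -= 1
    ulp = Fraction(2) ** (e - p + 1)
    q = x / ulp                       # exact position measured in ulps
    lo = math.floor(q)                # neighbor toward -infinity
    hi = lo if q == lo else lo + 1    # neighbor toward +infinity
    if mode == "toward_negative":
        n = lo
    elif mode == "toward_positive":
        n = hi
    elif mode == "toward_zero":
        n = lo if x > 0 else hi
    elif mode == "nearest_even":
        diff = q - lo
        if diff != Fraction(1, 2):
            n = lo if diff < Fraction(1, 2) else hi
        else:
            n = lo if lo % 2 == 0 else hi   # tie: pick the even significand
    else:
        raise ValueError(f"unknown rounding mode: {mode}")
    return n * ulp

# Exact sum of the two stored doubles; it is a tie between adjacent doubles
exact_sum = Fraction(0.1) + Fraction(0.2)
```

Rounding `exact_sum` with `"nearest_even"` reproduces the hardware result of `0.1 + 0.2`, while `"toward_zero"` gives the double that the literal `0.3` denotes, which shows that the familiar discrepancy comes from a tie being broken upward.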

Exact decimal representation strategies

The challenge in exact decimal representation of binary floating-point numbers lies in producing the shortest decimal string that, when parsed back into binary floating-point, yields the original value exactly, a property known as round-trip conversion. Over-specifying the decimal output, such as rendering the exact value 1.0 as "1.0000000000000000", wastes space and suggests precision that is not meaningful. For instance, the binary64 representation of 0.3 is not exactly 0.3 but approximately 0.2999999999999999888977697537484345957637, yet the shortest decimal string that round-trips to this value is simply "0.3". This approach identifies the value uniquely without excess digits, distinguishing it from its nearest binary neighbors.

To determine the number of decimal digits required for unique representation across all possible values, algorithms establish tight bounds based on the floating-point format's precision and exponent range. For the binary64 format (double precision with 53-bit significand), 17 significant decimal digits always suffice to guarantee a round-trip conversion for any representable value, and many values need far fewer. This bound arises from the logarithmic relationship between binary precision and decimal digits: decimal numbers with 17 significant digits are spaced more finely than adjacent binary64 values, so no two distinct binary64 values map to the same correctly rounded 17-digit string, and each such string converts back to the value it came from.

The IEEE 754-2008 standard requires correctly rounded conversions between binary floating-point formats and decimal character strings, and requires that converting a binary64 value to a string with 17 significant digits and back recovers the original value under round-to-nearest. It does not itself mandate the shortest string; producing the shortest round-tripping output is a choice made by languages and libraries.
These conversion requirements support interoperability and accuracy in applications like scientific computing and financial systems, where binary internals must interface reliably with human-readable decimal outputs.

Influential implementations address these requirements through efficient algorithms for binary-to-decimal conversion. David Gay's dtoa library, introduced in 1990 and subsequently updated, provides correctly rounded decimal strings using a combination of scaling, multiplication by powers of 10, and careful digit extraction to achieve the shortest representation. The library supports modes for fixed, scientific, and shortest output, generating at most 17 significant digits for binary64 in shortest mode while preserving round-trip accuracy. Modern language standards build on such work; for example, C++17's std::to_chars overloads for floating-point types produce the shortest decimal representation that round-trips exactly, without dynamic memory allocation. These tools are designed for speed as well as correctness.
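The round-trip criterion can be demonstrated by brute force: try increasing digit counts until the string parses back to the same double. This is a simple sketch of the criterion, not the method used by dtoa or std::to_chars; the function name `shortest_roundtrip` is an assumption for this example, and binary64 `float` is assumed.

```python
from decimal import Decimal

def shortest_roundtrip(x: float) -> str:
    """Shortest %g-style string (up to 17 significant digits) that parses back to x."""
    for digits in range(1, 18):
        text = format(x, f".{digits}g")
        if float(text) == x:
            return text
    raise ValueError("no round-tripping string found (NaN input?)")

short_tenth = shortest_roundtrip(0.1)          # "0.1"
short_sum = shortest_roundtrip(0.1 + 0.2)      # "0.30000000000000004"
exact_three_tenths = Decimal(0.3)              # exact value of the stored double
```

Python's built-in `repr` for floats already returns a shortest round-tripping string, so `float(repr(x)) == x` holds for every finite `x`; the loop above only makes the underlying test explicit.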

Arithmetic Operations

Addition and subtraction algorithms

Floating-point addition and subtraction in the IEEE 754 standard follow a structured algorithm to ensure correct rounding and precision preservation. The process begins with alignment of the operands, where the mantissa (significand) of the number with the smaller exponent is shifted right by the difference in exponents to match the larger exponent, effectively aligning the binary points. This shift may push bits off the least significant end, so implementations typically keep guard bits, extra bits beyond the mantissa length, to retain the shifted-out information for the subsequent operation. Once aligned, the mantissas are added (for addition) or subtracted (for subtraction, after flipping the sign bit of the subtrahend), treating them as fixed-point integers with an implicit leading 1 in normalized form. The result may require normalization: if a carry creates an extra leading bit (the sum of the significands reaches 2 or more), the mantissa is shifted right and the exponent incremented; conversely, leading zeros from cancellation in subtraction necessitate left shifts and exponent decrements. Finally, the normalized result is rounded to the destination format using the specified rounding mode, incorporating any guard, round, and sticky bits to achieve correctly rounded results as mandated by IEEE 754.

Subtraction poses unique challenges due to potential catastrophic cancellation, where operands of similar magnitude and the same sign produce a near-zero difference whose leading digits cancel, so that rounding errors already present in the operands dominate the result. For instance, if $1 + \epsilon$ was computed earlier and rounded, subtracting 1 from it can return a value with a large relative error even though the subtraction itself is exact. Guard bits keep the rounding error of a single addition or subtraction within the bound required for a correctly rounded result (a relative error of at most $\epsilon / 2$ under round-to-nearest, with $\epsilon$ the machine epsilon), but they cannot restore information lost in earlier operations, so severe cancellation still requires careful algorithmic handling in sensitive computations.
The core algorithm can be outlined in pseudocode as follows:

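```
function float_add(a, b):
    // Special values (infinities, NaNs, zeros) are assumed to be handled beforehand
    if exponent(a) > exponent(b):
        swap a and b                          // Ensure b has the larger exponent
    delta_exp = exponent(b) - exponent(a)
    // Align a to b's exponent, keeping guard, round, and sticky bits
    mant_a = shift_right(mantissa(a), delta_exp, guard_bits)
    mant_b = mantissa(b)
    if sign(a) != sign(b):                    // Effective subtraction
        if mant_b >= mant_a:
            result_mant = mant_b - mant_a
            sign_result = sign(b)
        else:                                 // Only possible when delta_exp == 0
            result_mant = mant_a - mant_b
            sign_result = sign(a)
    else:                                     // Effective addition
        result_mant = mant_b + mant_a
        sign_result = sign(b)
    exponent_result = exponent(b)
    normalize(result_mant, exponent_result)   // Shift to restore leading 1, adjust exponent
    round_to_format(result_mant, exponent_result, rounding_mode)  // Uses guard/round/sticky bits
    return pack(sign_result, exponent_result, result_mant)
```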

This pseudocode assumes binary formats and handles special cases (e.g., infinities, NaNs) separately as per IEEE 754. A subtraction $a - b$ is performed as the addition $a + (-b)$, with the sign of $b$ flipped beforehand. IEEE 754-2008 also specifies a fused multiply-add operation, which computes $(x \times y) + z$ with a single rounding step, reducing error accumulation compared to separate multiplication and addition, though it is distinct from basic add/subtract.
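The align, add, normalize, and round sequence can be tried out on integer significands. The following Python sketch is restricted to positive finite doubles and round-to-nearest, ties-to-even; instead of shifting the smaller operand right and tracking guard bits, it shifts the larger one left using arbitrary-precision integers, which gives the same correctly rounded result. The names `decompose` and `add_positive` are hypothetical, and overflow is not handled.

```python
import math

P = 53  # binary64 significand precision in bits

def decompose(x: float):
    """Return (m, e) with x == m * 2**e and m an integer of P bits (x > 0, finite)."""
    frac, exp = math.frexp(x)              # x == frac * 2**exp, 0.5 <= frac < 1
    return int(frac * (1 << P)), exp - P   # scaling by 2**P is exact

def add_positive(a: float, b: float) -> float:
    ma, ea = decompose(a)
    mb, eb = decompose(b)
    if ea > eb:                            # make b the operand with the larger exponent
        ma, ea, mb, eb = mb, eb, ma, ea
    total = ma + (mb << (eb - ea))         # exact aligned sum at exponent ea
    exp = ea
    excess = total.bit_length() - P        # bits beyond the target precision
    if excess > 0:
        q, r = divmod(total, 1 << excess)
        half = 1 << (excess - 1)
        if r > half or (r == half and q & 1):
            q += 1                         # round to nearest, ties to even
        total, exp = q, exp + excess
    return math.ldexp(total, exp)

# Cancellation exposing an earlier rounding error: 1 + 1e-15 was rounded to 1 + 5 * 2**-52
cancelled = (1.0 + 1e-15) - 1.0
```

For inputs in its domain, `add_positive(a, b)` should agree with the hardware result `a + b`. The last line shows cancellation: the subtraction is exact, but the result is about 11% away from `1e-15` because of the rounding in the preceding addition.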

Multiplication and division methods

In floating-point multiplication, the process begins by multiplying the significands (mantissas) of the two operands, which are treated as fixed-point numbers with an implicit leading 1 for normalized values, resulting in a product that may require up to twice the precision of a single mantissa. The exponents are then added; because each stored exponent includes the bias, the bias is subtracted once from the sum so that the result exponent is biased correctly. Since the product of two significands in $[1, 2)$ lies in $[1, 4)$, normalization requires at most a one-position right shift with a matching exponent increment (subnormal operands may instead require left shifts), after which the result is rounded to the target format's precision using the specified rounding mode.

For floating-point division, the significands are divided to compute the quotient, again treating them as fixed-point values, while the exponents are subtracted (with bias adjustment: the divisor's stored exponent is subtracted from the dividend's and the bias is added back). The result is normalized by shifting if necessary and rounded. Special cases, such as a zero divisor or a result that overflows or underflows, are handled by the exception mechanisms. To enhance performance, division is sometimes implemented through reciprocal approximation: an approximate reciprocal of the divisor is computed using table lookup or Newton-Raphson iteration and then multiplied by the dividend, which leverages faster multiplication hardware but needs a final correction step to guarantee a correctly rounded quotient.

In hardware implementations, multiplication of the significands commonly uses a Wallace tree, a parallel reduction structure that efficiently sums partial products via carry-save adders, reducing latency compared to serial methods and enabling high-throughput designs in processors. For division, the SRT (Sweeney-Robertson-Tocher) algorithm is prevalent, an iterative digit-recurrence method that selects quotient digits based on redundant representations to avoid trial divisions, achieving balanced speed and correctness in floating-point units.
Software fallbacks, such as libraries for precisions without hardware support, emulate these operations using integer arithmetic on the significands. Results are exact when the significand product or quotient fits within the available mantissa bits after normalization, so that no rounding is required. For example, multiplying 1.5 (binary $1.1 \times 2^{0}$) by 2.0 (binary $1.0 \times 2^{1}$) yields 3.0 (binary $1.1 \times 2^{1}$) exactly. Similarly, dividing 3.0 by 1.5 produces 2.0 without error, as the significand division ($1.1 / 1.1 = 1.0$ in binary) is exact and the exponents subtract to $1 - 0 = 1$. These cases illustrate that operations stay exact whenever the true result is itself representable, as with scaling by powers of two.
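A software emulation of multiplication along these lines might look as follows. This is a minimal sketch for positive, finite, normal doubles whose product is also normal (no overflow, underflow, or special values); `decompose` and `mul_positive` are names chosen for this illustration, and Python's arbitrary-precision integers stand in for the double-width product register.

```python
import math

P = 53  # binary64 significand precision in bits

def decompose(x: float):
    """Return (m, e) with x == m * 2**e and m an integer of P bits (x > 0, finite)."""
    frac, exp = math.frexp(x)
    return int(frac * (1 << P)), exp - P

def mul_positive(a: float, b: float) -> float:
    ma, ea = decompose(a)
    mb, eb = decompose(b)
    product = ma * mb                      # exact, 2*P - 1 or 2*P bits wide
    exp = ea + eb                          # unbiased exponents simply add
    excess = product.bit_length() - P      # bits to drop during normalization
    q, r = divmod(product, 1 << excess)
    half = 1 << (excess - 1)
    if r > half or (r == half and q & 1):
        q += 1                             # round to nearest, ties to even
    return math.ldexp(q, exp + excess)

exact_product = mul_positive(1.5, 2.0)     # 3.0, no rounding needed
exact_quotient = 3.0 / 1.5                 # 2.0, also exact
```

Within the stated domain the sketch should match the hardware product `a * b`, since both round the same exact product once.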

Square root methods

The square root operation, required by IEEE 754 to produce a correctly rounded result for supported formats, computes the principal (non-negative) square root of a non-negative operand. It is typically implemented using iterative refinement methods, such as the Newton-Raphson iteration, which starts with an initial approximation (often from a table lookup or hardware estimator) and converges quadratically by repeatedly applying the formula $x_{n+1} = \frac{1}{2}\left(x_n + \frac{a}{x_n}\right)$, where $a$ is the input, until the result stabilizes within the precision limits; a final correction step is then needed to guarantee correct rounding. Hardware implementations may use digit-by-digit calculation similar to SRT division or functional iteration for efficiency. Special cases include: the square root of $+0$ is $+0$ and of $-0$ is $-0$; the square root of positive infinity is positive infinity; the square root of any negative nonzero value (including $-\infty$) signals an invalid operation exception and returns a quiet NaN; and the square root of a NaN propagates the NaN.
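The iteration and the special cases can be sketched in Python as below. The function `newton_sqrt` is a hypothetical helper, not a library routine; run in plain double arithmetic it typically lands within an ulp or so of the true root but is not guaranteed to be correctly rounded, and it returns NaN for negative inputs to mirror the IEEE 754 default result, whereas Python's own `math.sqrt` raises `ValueError` in that case.

```python
import math

def newton_sqrt(a: float, max_iter: int = 60) -> float:
    """Approximate sqrt(a) with x_{n+1} = (x_n + a / x_n) / 2 in double arithmetic."""
    if math.isnan(a) or a < 0.0:
        return math.nan                  # invalid operation: quiet NaN result
    if a == 0.0 or math.isinf(a):
        return a                         # +0 -> +0, -0 -> -0, +inf -> +inf
    _, e = math.frexp(a)                 # a is in [2**(e-1), 2**e)
    x = math.ldexp(1.0, e // 2)          # initial guess: power of two near sqrt(a)
    for _ in range(max_iter):
        nxt = 0.5 * (x + a / x)
        if nxt == x:                     # stabilized within double precision
            break
        x = nxt
    return x

root_two = newton_sqrt(2.0)
```

The power-of-two starting guess is within roughly a factor of two of the root, so quadratic convergence reaches full double precision after a handful of iterations.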

Special literal notations and constants

In programming languages that adhere to standards like ISO C99 and C++17, floating-point literals are typically expressed in decimal notation, such as 3.14 for a double-precision value or 3.14f to specify single-precision (float). Scientific notation is also supported, using e or E to denote the exponent, as in 1.0e-10 for a very small value. Suffixes like f or F indicate float type, while l or L denote long double; without a suffix, the literal defaults to double. Hexadecimal floating-point literals, introduced in C99 and later adopted in C++17, provide a way to represent values exactly in binary without decimal-to-binary conversion errors, using the syntax 0x or 0X followed by hexadecimal digits, an optional radix point, and a binary exponent prefixed by p or P. For example, 0x1.0p0 equals 1.0, and the exponent indicates powers of 2 rather than 10, ensuring precise mantissa specification. This format is particularly useful for embedding exact IEEE 754 binary representations directly in source code. IEEE 754 special values like infinity and Not-a-Number (NaN) are often represented through language-specific constants or expressions rather than direct literals. In C and C++, macros such as INFINITY from <math.h> or std::numeric_limits<double>::infinity() yield positive infinity (Inf), with negative infinity obtainable via negation; NaN is similarly provided by NAN. NaNs include payloads in their mantissa bits, distinguishing quiet NaNs (which propagate silently during operations) from signaling NaNs (which trigger exceptions); the leading mantissa bit is 1 for quiet and 0 for signaling per IEEE 754-2008. Standards define additional constants for precision and limits. In JavaScript, adhering to ECMAScript and IEEE 754 double-precision, Number.EPSILON is the difference between 1 and the next representable value greater than 1 (approximately 2.220446049250313e-16), useful for comparing approximate equality. 
A common pitfall arises with decimal literals like 0.1, which cannot be exactly represented in binary floating-point due to its infinite repeating binary expansion (0.0001100110011...), resulting in an approximation such as 0.1000000000000000055511151231257827021181583404541015625 in double-precision. To represent values that are exactly representable in binary, programmers may use hexadecimal floating-point literals. In Rust, the f64 type provides associated constants like f64::INFINITY for infinity and f64::NAN for a quiet NaN; legacy module-level constants (e.g., std::f64::NAN) are planned for deprecation in favor of type-associated ones for better clarity and consistency.
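Python has no hexadecimal floating-point literal syntax, but its `float.fromhex` and `float.hex` methods accept and produce the same C99-style notation, so they can illustrate the ideas above. The snippet assumes binary64 `float`; the variable names are for illustration only.

```python
import math
import sys
from decimal import Decimal

one = float.fromhex("0x1.0p0")        # exactly 1.0
three = float.fromhex("0x1.8p1")      # 1.5 * 2**1 == 3.0
tenth_hex = (0.1).hex()               # '0x1.999999999999ap-4'
tenth_exact = Decimal(0.1)            # exact decimal value of the stored double

pos_inf, neg_inf, nan = math.inf, -math.inf, math.nan
epsilon = sys.float_info.epsilon      # 2**-52, the same value as Number.EPSILON
```

Printing `tenth_exact` shows the full expansion quoted above, while the hexadecimal form makes the rounded last digit (`a`) of the repeating pattern visible.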

Exception Handling

Overflow, underflow, and denormalized numbers

In floating-point arithmetic, overflow occurs when the magnitude of an intermediate or final result exceeds the largest finite value representable in the given format, such as $(1 - 2^{-53}) \times 2^{1024}$ for binary64. According to IEEE 754, the default handling replaces the result with positive or negative infinity matching the sign of the intermediate value, while setting the overflow and inexact status flags; the exact outcome depends on the rounding mode, for example delivering the largest finite number in roundTowardZero mode. This behavior ensures that operations involving overflow propagate infinities consistently in subsequent computations. For instance, in binary64, multiplying $1 \times 10^{308}$ by 10 yields $+\infty$.

Underflow arises when a nonzero result has a magnitude smaller than the smallest normalized number, which is $2^{-1022}$ for binary64. The IEEE 754 standard lets implementations detect tininess either before or after rounding; under default handling the result is a subnormal number or zero, and the underflow flag is raised when the tiny result is also inexact. Gradual underflow, which preserves small values through subnormals for better numerical stability, is mandatory in IEEE 754. Many processors additionally offer a non-standard flush-to-zero mode that replaces tiny results with zero to avoid performance penalties, since operations involving subnormals may be handled by slow microcode or software traps on hardware without dedicated support.

Denormalized numbers, or subnormals, extend the representable range below the minimum normalized value by using an implicit leading zero in the significand together with the minimum biased exponent (zero in binary formats). This allows gradual underflow, filling the gap near zero and ensuring that two distinct finite values always have a nonzero difference, unlike abrupt flushing to zero.
In binary64, subnormals range from the smallest positive value $2^{-1074} \approx 4.94 \times 10^{-324}$ up to just below $2^{-1022}$, though they sacrifice precision since the significand lacks the implicit leading 1. Operations on subnormals follow the same arithmetic rules as normals but may signal inexact exceptions due to limited precision.
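These thresholds can be observed from Python, whose float multiplication and division return infinities and subnormals silently on platforms with IEEE 754 binary64 arithmetic (an assumption of this sketch). The helper `is_subnormal` is introduced here for illustration.

```python
import math
import sys

smallest_normal = sys.float_info.min      # 2**-1022
smallest_subnormal = 5e-324               # 2**-1074

def is_subnormal(x: float) -> bool:
    """True for nonzero values below the smallest normal magnitude."""
    return x != 0.0 and abs(x) < sys.float_info.min

overflowed = 1e308 * 10.0                 # exceeds the largest finite double
gradual = smallest_normal / 4.0           # 2**-1024, representable as a subnormal
underflowed = smallest_subnormal / 2.0    # halfway to zero, rounds to 0.0

# Gradual underflow keeps differences of distinct values nonzero
x = 1.25 * smallest_normal
y = smallest_normal
difference = x - y                        # 2**-1024, a subnormal rather than 0
```

With flush-to-zero enabled, `difference` would instead be zero even though `x != y`, which is the failure mode that gradual underflow is designed to prevent.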

Not-a-Number (NaN) and infinities

In the IEEE 754 standard for binary floating-point arithmetic, infinities are represented by an exponent field of all ones (e.g., 2047 in double precision) and a zero mantissa, with the sign bit determining positive (+∞) or negative (-∞) infinity. These values propagate through arithmetic operations in a manner consistent with limits; for instance, adding a finite number to an infinity yields an infinity of the same sign, such as $\infty + 5 = \infty$.

Not-a-Number (NaN) values, used to represent indeterminate or invalid results, are encoded with an all-ones exponent and a non-zero mantissa. There are two subtypes: quiet NaNs, identified by a leading 1 in the mantissa's most significant bit, which propagate silently through computations without raising exceptions; and signaling NaNs, with a leading 0, which are intended to trigger an invalid operation exception upon use. The remaining bits of the NaN mantissa form a payload that can store diagnostic information, such as error codes or identifiers for tracing computational faults, enabling applications to embed metadata like sequence numbers or fault types. The IEEE 754-2019 standard introduces recommended operations for getting and setting NaN payloads to enhance consistency in payload propagation and diagnostics. NaN values arise from operations producing indeterminate forms, such as $0 / 0 = \text{NaN}$ or $\sqrt{-1} = \text{NaN}$.
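The encodings described above can be decoded from raw bit patterns. The sketch below assumes the binary64 layout (1 sign bit, 11 exponent bits, 52 mantissa bits); `classify_binary64`, `bits_of`, and `nan_payload` are hypothetical helper names. Note that Python itself raises `ZeroDivisionError` for `0.0 / 0.0` rather than returning NaN, so the NaN here is produced from `inf - inf` instead.

```python
import math
import struct

def classify_binary64(bits: int) -> str:
    """Classify a 64-bit pattern as finite, an infinity, or a quiet/signaling NaN."""
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF
    mantissa = bits & ((1 << 52) - 1)
    if exponent != 0x7FF:
        return "finite"
    if mantissa == 0:
        return "-inf" if sign else "+inf"
    quiet = (mantissa >> 51) & 1           # most significant mantissa bit
    return "quiet NaN" if quiet else "signaling NaN"

def bits_of(x: float) -> int:
    """Raw binary64 encoding of x as an unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def nan_payload(bits: int) -> int:
    """Mantissa bits below the quiet/signaling bit."""
    return bits & ((1 << 51) - 1)

inf_plus_five = math.inf + 5               # still +inf
indeterminate = math.inf - math.inf        # NaN
```

Working on integer bit patterns avoids relying on how a particular platform propagates signaling NaNs through floating-point registers.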