Machine epsilon
from Wikipedia

Machine epsilon or machine precision is an upper bound on the relative approximation error due to rounding in floating-point number systems. This value characterizes computer arithmetic in the field of numerical analysis, and by extension in the subject of computational science. The quantity is also called macheps, and it is denoted by the Greek letter epsilon (ε).

There are two prevailing definitions, denoted here as the rounding machine epsilon (the formal definition) and the interval machine epsilon (the mainstream definition).

In the mainstream definition, machine epsilon is independent of rounding method, and is defined simply as the difference between 1 and the next larger floating point number.

In the formal definition, machine epsilon is dependent on the type of rounding used and is also called unit roundoff, which has the symbol bold Roman u.

The two terms can generally be considered to differ by simply a factor of two, with the formal definition yielding an epsilon half the size of the mainstream definition, as summarized in the tables in the next section.

Values for standard hardware arithmetics


The following table lists machine epsilon values for standard floating-point formats.

| IEEE 754-2008 | Common name | C++ data type | Base | Precision | Rounding machine epsilon[a] | Interval machine epsilon[b] |
|---|---|---|---|---|---|---|
| binary16 | half precision | N/A | 2 | 11 (one bit is implicit) | 2^−11 ≈ 4.88e-04 | 2^−10 ≈ 9.77e-04 |
| binary32 | single precision | float | 2 | 24 (one bit is implicit) | 2^−24 ≈ 5.96e-08 | 2^−23 ≈ 1.19e-07 |
| binary64 | double precision | double | 2 | 53 (one bit is implicit) | 2^−53 ≈ 1.11e-16 | 2^−52 ≈ 2.22e-16 |
| N/A | extended precision | long double (__float80)[1] | 2 | 64 | 2^−64 ≈ 5.42e-20 | 2^−63 ≈ 1.08e-19 |
| binary128 | quadruple precision | __float128[1] | 2 | 113 (one bit is implicit) | 2^−113 ≈ 9.63e-35 | 2^−112 ≈ 1.93e-34 |
| decimal32 | single precision decimal | _Decimal32[2] | 10 | 7 | 5 × 10^−7 | 10^−6 |
| decimal64 | double precision decimal | _Decimal64[2] | 10 | 16 | 5 × 10^−16 | 10^−15 |
| decimal128 | quadruple precision decimal | _Decimal128[2] | 10 | 34 | 5 × 10^−34 | 10^−33 |
  1. ^ According to the formal definition — used by Prof. Demmel, LAPACK and Scilab. It represents the largest relative rounding error in round-to-nearest mode. The rationale is that the rounding error is half the interval upwards to the next representable number in finite precision. Thus, the relative rounding error for a number x is ULP(x)/(2x). In this context, the largest relative error occurs at x = 1 and is equal to ULP(1)/2, because real numbers in the lower half of the interval 1.0 ~ 1.0+ULP(1) are rounded down to 1.0, and numbers in the upper half of the interval are rounded up to 1.0+ULP(1). Here ULP(1) (unit in the last place) is defined as the positive difference between 1.0 (which can be represented exactly in finite precision) and the next greater number representable in finite precision.
  2. ^ According to the mainstream definition — used by Prof. Higham; applied in language constants in Ada, C, C++, Fortran, MATLAB, Mathematica, Octave, Pascal, Python and Rust etc., and defined in textbooks like «Numerical Recipes» by Press et al. It represents the largest relative interval between two nearest numbers in finite precision, or the largest rounding error in round-by-chop mode. The rationale is that the relative interval for a number x is ULP(x)/x, where ULP(x) is the distance upwards to the next representable number in finite precision. In this context, the largest relative interval occurs at x = 1 and is the interval between 1.0 (which can be represented exactly in finite precision) and the next larger representable floating-point number. This interval is equal to ULP(1).

Alternative definitions for epsilon


The IEEE standard does not define the terms machine epsilon and unit roundoff, so differing definitions of these terms are in use, which can cause some confusion.

The two terms differ by simply a factor of two. The more widely used term (referred to as the mainstream definition in this article) is used in most modern programming languages and defines machine epsilon as the difference between 1 and the next larger floating-point number. The formal definition can generally be considered to yield an epsilon half the size of the mainstream definition, although its exact value varies with the form of rounding used.

The two terms are described at length in the next two subsections.

Formal definition (Rounding machine epsilon)


The formal definition for machine epsilon is the one used by Prof. James Demmel in lecture scripts,[3] the LAPACK linear algebra package,[4] numerics research papers[5] and some scientific computing software.[6] Most numerical analysts use the words machine epsilon and unit roundoff interchangeably with this meaning, which is explored in depth throughout this subsection.

Rounding is a procedure for choosing the representation of a real number in a floating point number system. For a number system and a rounding procedure, machine epsilon is the maximum relative error of the chosen rounding procedure.

Some background is needed to determine a value from this definition. A floating-point number system is characterized by a radix (also called the base) b, and by the precision p, i.e. the number of radix digits of the significand (including any leading implicit bit). All the numbers with the same exponent e have the spacing b^(e−(p−1)). The spacing changes at the numbers that are perfect powers of b; the spacing on the side of larger magnitude is b times larger than the spacing on the side of smaller magnitude.

Since machine epsilon is a bound for relative error, it suffices to consider numbers with exponent 0. It also suffices to consider positive numbers. For the usual round-to-nearest kind of rounding, the absolute rounding error is at most half the spacing, or b^(−(p−1))/2. This value is the biggest possible numerator for the relative error. The denominator in the relative error is the number being rounded, which should be as small as possible to make the relative error large. The worst relative error therefore happens when rounding is applied to numbers of the form 1 + a, where a is between 0 and b^(−(p−1))/2. All these numbers round to 1 with relative error a/(1 + a). The maximum occurs when a is at the upper end of its range. The 1 + a in the denominator is negligible compared to the numerator, so it is left off for expediency, and just b^(−(p−1))/2 is taken as machine epsilon. As has been shown here, the relative error is worst for numbers that round to 1, so machine epsilon also is called unit roundoff, meaning roughly "the maximum error that can occur when rounding to the unit value".

Thus, the maximum spacing between a normalised floating-point number x and an adjacent normalised number is 2ε|x|.[7]
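This spacing bound can be spot-checked numerically; a minimal Python sketch, assuming binary64 (so the rounding machine epsilon is 2^−53) and using the standard library's math.ulp:

```python
import math
import random

u = 2.0 ** -53  # rounding machine epsilon (unit roundoff) for binary64
random.seed(0)
for _ in range(1000):
    x = random.uniform(1e-300, 1e300)
    # math.ulp(x) is the gap from x to the adjacent representable number;
    # for normalized doubles it never exceeds 2 * u * |x|.
    assert math.ulp(x) <= 2 * u * abs(x)
```

Equality is attained exactly at powers of two, the lower end of each binade.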

Arithmetic model


Numerical analysis uses machine epsilon to study the effects of rounding error. The actual errors of machine arithmetic are far too complicated to be studied directly, so instead, the following simple model is used. The IEEE arithmetic standard says all floating-point operations are done as if it were possible to perform the infinite-precision operation, and then, the result is rounded to a floating-point number. Suppose (1) x, y are floating-point numbers, (2) ⊕ is an arithmetic operation on floating-point numbers such as addition or multiplication, and (3) ∘ is the corresponding infinite-precision operation. According to the standard, the computer calculates:

x ⊕ y = round(x ∘ y)

By the meaning of machine epsilon, the relative error of the rounding is at most machine epsilon in magnitude, so:

x ⊕ y = (x ∘ y)(1 + z)

where z in absolute magnitude is at most ε or u. The books by Demmel and Higham in the references can be consulted to see how this model is used to analyze the errors of, say, Gaussian elimination.
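The model can be checked exactly in Python with the fractions module, which carries out rational arithmetic without rounding (a sketch, assuming binary64, so u = 2^−53):

```python
from fractions import Fraction

u = Fraction(2) ** -53             # unit roundoff for binary64
x, y = 0.1, 0.2
exact = Fraction(x) + Fraction(y)  # exact sum of the stored operands
computed = Fraction(x + y)         # the floating-point sum, exactly
# x (+) y = (x + y)(1 + z) with |z| <= u:
z = (computed - exact) / exact
assert abs(z) <= u
```

Since IEEE addition is correctly rounded, the computed sum is the double nearest the exact sum, so the perturbation z always stays within the unit roundoff.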

Mainstream definition (Interval machine epsilon)


This alternative definition is significantly more widespread: machine epsilon is the difference between 1 and the next larger floating point number. This definition is used in language constants in Ada, C, C++, Fortran, MATLAB, Mathematica, Octave, Pascal, Python and Rust etc., and defined in textbooks like «Numerical Recipes» by Press et al.

By this definition, ε equals the value of the unit in the last place relative to 1, i.e. ε = b^(−(p−1)) (where b is the base of the floating-point system and p is the precision), and the unit roundoff is u = ε / 2, assuming round-to-nearest mode, and u = ε, assuming round-by-chop.
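In Python, for example, this constant is exposed directly, and (on Python 3.9+) its identity with the ulp at 1 can be verified from the standard library alone:

```python
import math
import sys

eps = sys.float_info.epsilon  # interval machine epsilon for binary64
assert eps == 2.0 ** -52
# It is exactly the gap between 1.0 and the next larger double:
assert eps == math.ulp(1.0)
assert eps == math.nextafter(1.0, math.inf) - 1.0
u = eps / 2                   # unit roundoff under round-to-nearest
```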

The prevalence of this definition is rooted in its use in the ISO C Standard for constants relating to floating-point types[8][9] and corresponding constants in other programming languages.[10][11][12] It is also widely used in scientific computing software[13][14][15] and in the numerics and computing literature.[16][17][18][19]

How to determine machine epsilon


Where standard libraries do not provide precomputed values (as <float.h> does with FLT_EPSILON, DBL_EPSILON and LDBL_EPSILON for C and <limits> does with std::numeric_limits<T>::epsilon() in C++), the best way to determine machine epsilon is to refer to the table, above, and use the appropriate power formula. Computing machine epsilon is often given as a textbook exercise. The following examples compute interval machine epsilon in the sense of the spacing of the floating point numbers at 1 rather than in the sense of the unit roundoff.

Note that results depend on the particular floating-point format used, such as float, double, long double, or similar as supported by the programming language, the compiler, and the runtime library for the actual platform.

Some formats supported by the processor might not be supported by the chosen compiler and operating system. Other formats might be emulated by the runtime library, including arbitrary-precision arithmetic available in some languages and libraries.

In a strict sense the term machine epsilon means the accuracy directly supported by the processor (or coprocessor), not some accuracy supported by a specific compiler for a specific operating system, unless it's known to use the best format.

IEEE 754 floating-point formats have the property that, when reinterpreted as a two's complement integer of the same width, they monotonically increase over positive values and monotonically decrease over negative values (see the binary representation of 32-bit floats). They also have the property that incrementing this integer reinterpretation by one moves to the adjacent representable value: if i(x) denotes the integer reinterpretation of a positive float x, then the next larger float has integer representation i(x) + 1. In languages that allow type punning and always use IEEE 754–1985, we can exploit this to compute a machine epsilon in constant time. For example, in C:

typedef union {
  long long i64;
  double d64;
} dbl_64;

/* Return the distance from value to the adjacent representable double,
   by nudging the integer reinterpretation of its bits by one. */
double machine_eps (double value)
{
    dbl_64 s;
    s.d64 = value;
    s.i64++;
    return s.d64 - value;
}

This will give a result of the same sign as value. If a positive result is always desired, the return statement of machine_eps can be replaced with:

    return (s.i64 < 0 ? value - s.d64 : s.d64 - value);

Example in Python:

def machineEpsilon(func=float):
    # Halve the candidate until 1 + eps becomes indistinguishable
    # from 1, then return the last distinguishable value.
    machine_epsilon = machine_epsilon_last = func(1)
    while func(1) + machine_epsilon != func(1):
        machine_epsilon_last = machine_epsilon
        machine_epsilon = func(machine_epsilon) / func(2)
    return machine_epsilon_last

64-bit doubles give 2.220446e-16, which is 2−52 as expected.

Approximation


The following simple algorithm can be used to approximate the machine epsilon, to within a factor of two of its true value, using a linear search.

epsilon = 1.0

while (1.0 + 0.5 * epsilon) != 1.0:
    epsilon = 0.5 * epsilon

The machine epsilon can also simply be calculated as two to the negative power of the number of bits used for the mantissa.

Relationship to absolute relative error


If y is the machine representation of a number x, then the absolute relative error in the representation is |x − y| / |x| ≤ ε.[20]

Proof


The following proof is limited to positive numbers and machine representations using round-by-chop.

If x is a positive number we want to represent, it will lie between a machine number x_b below x and a machine number x_u above x.

If x = (1.b_1 b_2 … b_p b_{p+1} …)_2 × 2^k, where p is the number of bits used for the magnitude of the significand, then x_b = (1.b_1 b_2 … b_p)_2 × 2^k and x_u = x_b + 2^(−p) × 2^k, so the gap between them is x_u − x_b = 2^(−p) × 2^k.

Since the representation y of x will be either x_b or x_u, |x − y| ≤ x_u − x_b = 2^(−p) × 2^k, and hence, since |x| ≥ 2^k, |x − y| / |x| ≤ (2^(−p) × 2^k) / 2^k = 2^(−p) = ε.

Although this proof is limited to positive numbers and round-by-chop, the same method can be used to prove the inequality in relation to negative numbers and round-to-nearest machine representations.

from Grokipedia
Machine epsilon, denoted ε_mach or simply ε, is the smallest positive floating-point number such that 1 + ε is distinguishable from 1 in the arithmetic of a given computing system, serving as a measure of the relative precision inherent in floating-point representations. It quantifies the maximum relative error introduced when representing real numbers as floating-point approximations, typically arising from the finite number of bits allocated to the significand in formats like IEEE 754. This value is fundamental in numerical analysis, as it bounds the error in arithmetic operations and informs the design of stable algorithms. In the IEEE 754 standard, which defines the most widely used binary floating-point formats, machine epsilon for single precision (32 bits, 24-bit significand) is 2^−23 ≈ 1.19 × 10^−7, while for double precision (64 bits, 53-bit significand) it is 2^−52 ≈ 2.22 × 10^−16. These values correspond to the distance from 1 to the next representable floating-point number greater than 1, also known as the unit in the last place (ulp) at 1. The unit roundoff u, often u = ε/2, represents the maximum relative error for rounding to nearest, and machine epsilon is twice this quantity in symmetric rounding modes. Machine epsilon can be computed programmatically by iteratively halving a value δ starting from 1 until 1 + δ equals 1, highlighting its dependence on the system's arithmetic implementation.

The concept is crucial for assessing the accuracy and stability of numerical computations, particularly in error propagation analysis, where floating-point operations introduce perturbations bounded by multiples of ε. For instance, the computed result fl(x + y) of an addition satisfies |fl(x + y) − (x + y)| ≤ u |x + y|, ensuring predictable error growth in algorithms. While machine epsilon provides an upper bound on representation errors near 1, the relative precision varies across the representable range due to normalization, being ulp(x)/|x| for numbers away from powers of the base. Its role extends to validating software implementations of floating-point standards and guiding the choice of tolerance settings for iterative solvers in scientific computing.

Floating-Point Arithmetic Basics

Binary Floating-Point Representation

Binary floating-point numbers represent real numbers in computer systems using a fixed number of bits divided into three components: a sign bit, an exponent field, and a significand field (also known as the mantissa). The sign, a single bit, indicates whether the number is positive (0) or negative (1). The exponent field consists of several bits that encode the scale of the number as a power of 2, while the significand field stores the significant digits of the number in binary. In normalized binary floating-point representation, as defined by the IEEE 754 standard, the value x of a finite non-zero number is given by the formula x = (−1)^s × m × 2^(e − bias), where s is the sign bit (0 or 1), m is the significand interpreted as a value between 1 (inclusive) and 2 (exclusive) with an implied leading 1 followed by the stored fraction bits, e is the stored biased exponent, and bias is a constant specific to the format (e.g., 127 for single precision). This normalization ensures that the significand always starts with a leading 1 in binary, maximizing precision by avoiding leading zeros. The unit in the last place (ulp) refers to the spacing between two consecutive representable floating-point numbers in the same binade (range between powers of 2), equivalent to the value of the least significant bit of the significand for a given exponent. It provides a measure of the precision limits inherent in the representation, as the ulp varies with the magnitude of the number: larger exponents result in larger ulps, reflecting coarser spacing for bigger values. For example, the number 1.0 in single-precision binary floating-point (32 bits total) is represented with sign bit 0, biased exponent 127 (actual exponent 0), and fraction bits all 0 (implied leading 1, so m = 1.0). In binary, this is 0 01111111 00000000000000000000000, or 0x3F800000 in hexadecimal. The ulp at 1.0 is 2^−23 ≈ 1.192 × 10^−7, the distance to the next representable number 1 + ulp.
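The encoding of 1.0 described above can be inspected with a short Python sketch (struct reinterprets the bits; the helper name f32_bits is illustrative):

```python
import struct

def f32_bits(x: float) -> int:
    # Reinterpret x as its IEEE 754 binary32 bit pattern (big-endian).
    return struct.unpack(">I", struct.pack(">f", x))[0]

bits = f32_bits(1.0)
sign = bits >> 31               # sign bit
exponent = (bits >> 23) & 0xFF  # biased exponent
fraction = bits & 0x7FFFFF      # stored fraction (implicit leading 1)
assert (bits, sign, exponent, fraction) == (0x3F800000, 0, 127, 0)
```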

Rounding in Arithmetic Operations

In floating-point arithmetic, basic operations such as addition, subtraction, multiplication, and division are performed by first computing an exact intermediate result, which is then rounded to the nearest representable value in the floating-point format to fit within the available precision. This rounding step is necessary because most real numbers and operation outcomes cannot be exactly represented due to the finite number of bits allocated for the significand. The IEEE 754 standard mandates correct rounding for these operations, ensuring the computed result is as close as possible to the mathematically exact value under the selected mode. The standard specifies five rounding modes to control how inexact results are approximated, with round-to-nearest (ties to even) as the default for most implementations. In round-to-nearest, the exact result is rounded to the closest representable floating-point number; if it lies exactly midway between two candidates, the tie is broken by rounding to the one with an even least significant bit in the significand. Alternative directed modes include round toward zero (truncation, which discards excess magnitude), round toward positive infinity (rounding up for positive excesses), and round toward negative infinity (rounding down for negative excesses); a fifth mode, round to nearest with ties away from zero, was added in the 2019 revision for specific use cases like statistical computations. These modes allow programmers to influence behavior, such as in interval arithmetic or when preserving sign conventions in financial calculations. Rounding introduces an error when the exact result falls between two consecutive representable numbers, whose spacing at the given magnitude is the unit in the last place (ulp). In round-to-nearest mode, this rounding error is bounded by 0.5 ulp of the result, meaning the computed value deviates from the exact value by at most half the local spacing between representable points.

The ulp itself varies with the exponent, being larger for bigger numbers, since the significand's fixed width scales the precision dynamically across magnitudes. A clear illustration of rounding error occurs in addition, where a small value added to a larger one may be effectively ignored if it falls below the resolution threshold. For instance, adding a tiny positive number ε to 1.0 yields exactly 1.0 + ε, but if ε is smaller than 0.5 ulp at 1.0, the sum rounds back to 1.0 in round-to-nearest mode, losing the small contribution entirely. This effect arises because the binary floating-point representation around 1.0 has a fixed ulp determined by the significand's precision, as covered in the structure of binary floating-point numbers. Such losses highlight the limitations of finite precision and can propagate in chained operations, emphasizing the need for awareness in numerical algorithms. The unit roundoff quantifies the maximum relative rounding error possible in these operations, providing a bound on how much the computed result can deviate from the exact one relative to the result's scale. In the context of round-to-nearest, this relative error is at most 0.5 ulp normalized by the result's magnitude, serving as a fundamental measure of the arithmetic's faithfulness to real-number mathematics.
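The absorption effect is easy to reproduce in Python (binary64, default round-to-nearest):

```python
ulp1 = 2.0 ** -52               # ulp at 1.0 for binary64
assert 1.0 + ulp1 > 1.0         # a full ulp survives the addition
assert 1.0 + ulp1 / 4 == 1.0    # below half an ulp: absorbed entirely
# Exactly half an ulp is a tie; ties-to-even rounds back down to 1.0:
assert 1.0 + ulp1 / 2 == 1.0
```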

Core Definitions

Rounding Machine Epsilon

The rounding machine epsilon, also known as the unit roundoff and denoted u, represents the maximum relative error introduced when representing a real number in a floating-point system or performing arithmetic operations that round to the nearest representable value. In binary floating-point arithmetic, it is formally defined as u = 2^−p, where p is the precision, corresponding to the number of bits in the significand (including the implicit leading bit). This is half the machine epsilon ε = 2^(1−p). This definition arises within the standard model of floating-point arithmetic, which assumes that every operation rounds the exact result to the nearest representable number, with ties broken in a specified manner (such as round-to-nearest-even). Under this model, the computed result fl(x ⊕ y) of an operation ⊕ satisfies fl(x ⊕ y) = (x ⊕ y)(1 + δ), where |δ| ≤ u, bounding the relative error. The value of u can be derived directly from the unit in the last place (ulp) at 1, which is ε = 2^(1−p); thus u = ε / 2. In binary floating-point representation, the representable numbers near 1 have a spacing of 2^(1−p), so the maximum rounding error is half this spacing relative to the number. The concept of the unit roundoff originated in early numerical analysis literature on bounding errors in computational processes, notably in Wilkinson's seminal work on rounding error analysis. Note that terminology varies; while some sources use "rounding machine epsilon" for u, the more common term is "unit roundoff", with "machine epsilon" referring to the full spacing ε.
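The worst case in this model occurs for real numbers just above 1; an exact check with Python's fractions module (a sketch, assuming binary64 with p = 53):

```python
from fractions import Fraction

u = Fraction(2) ** -53         # unit roundoff for binary64
x = 1 + u                      # real number halfway to the next double
fl_x = float(x)                # round to nearest, ties to even -> 1.0
assert fl_x == 1.0
rel_err = abs(Fraction(fl_x) - x) / x
assert rel_err == u / (1 + u)  # just under u, as the bound predicts
```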

Interval Machine Epsilon

The interval machine epsilon, denoted ε_mach, is defined as the smallest positive floating-point number ε > 0 such that the floating-point representation satisfies fl(1 + ε) > 1, where fl denotes the floating-point mapping function. This value captures the smallest distinguishable addition to 1 in the floating-point system, independent of the specific rounding mode, and is also known as the standard machine epsilon. In binary floating-point formats with a p-bit mantissa (including the implicit leading 1), the interval machine epsilon equals the unit in the last place (ulp) at 1.0, expressed as ε = 2^(1−p). This spacing arises because representable numbers near 1 are separated by increments of 2^−(p−1), reflecting the granularity of the mantissa in the normalized binary representation. It differs from the rounding machine epsilon (unit roundoff u), which bounds the maximum relative error in rounding operations and equals half the interval machine epsilon in round-to-nearest mode (u = ε / 2). The interval variant emphasizes the representable gap for detection purposes, while the rounding variant focuses on error propagation in computations. Note that "machine epsilon" typically refers to this interval value in most literature and programming languages (e.g., DBL_EPSILON in C). One practical method to approximate the interval machine epsilon empirically involves iteratively halving a trial value until its addition to 1 becomes indistinguishable from 1:

eps = 1.0
while (1.0 + eps != 1.0):
    eps = eps / 2.0
eps = eps * 2.0  # adjust back to the last distinguishable value

This loop exploits the floating-point rounding at 1 to identify the threshold where 1 + ε = 1 in representation.

Computation Techniques

Theoretical Derivation

In binary floating-point arithmetic, a normalized number is represented as x = ±(1.f) × 2^e, where f is the fraction consisting of p − 1 bits, e is the exponent, and the leading bit is implicitly 1, making the significand p bits in total. For numbers in the interval [1, 2), the exponent is e = 0, and the significand ranges from 1.000…₂ (which is 1) to 1.111…₂ = 2 − 2^−(p−1). The representable values are thus 1 + k × 2^−(p−1) for integers k = 0, 1, …, 2^(p−1) − 1. The spacing between consecutive representable numbers in this binade, known as the unit in the last place (ulp), is therefore 2^−(p−1). The interval machine epsilon, denoted ε, is defined as this ulp value at 1, representing the smallest positive floating-point number such that 1 + ε > 1 in the arithmetic. Thus, ε = 2^−(p−1). This derivation assumes an IEEE 754-like model with binary base (b = 2), normalized representations (denormals affect only subnormal numbers far from 1), and no overflow or underflow near 1.0. The formula holds specifically for binary floating-point but generalizes to ε = b^(1−p) for base-b systems with p-digit significands, such as decimal formats, where the spacing differs accordingly.
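The uniform spacing within [1, 2) can be verified directly (Python 3.9+ for math.nextafter; binary64, so p = 53):

```python
import math

p = 53
eps = 2.0 ** (1 - p)       # interval machine epsilon = ulp at 1.0
assert eps == math.ulp(1.0)
x = 1.0
for _ in range(5):
    nxt = math.nextafter(x, 2.0)
    assert nxt - x == eps  # consecutive doubles in [1, 2) differ by eps
    x = nxt
```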

Practical Approximation

A practical method for approximating machine epsilon in software involves an iterative algorithm that empirically determines the smallest positive floating-point number ε such that 1 + ε > 1 in the given arithmetic environment. This approach begins with an initial value of ε = 1.0 and repeatedly halves it until the condition 1 + ε = 1 holds in floating-point evaluation, at which point ε is doubled to recover the boundary value. The algorithm can be expressed in pseudocode as follows:

eps = 1.0
while (1.0 + eps > 1.0):
    eps = eps / 2.0
eps = eps * 2.0

This yields the machine epsilon for the prevailing floating-point format. The method avoids underflow by starting from a value well above the smallest representable number and converging toward machine epsilon through division, ensuring the loop terminates long before reaching denormalized or zero values. It is portable and effective across common programming languages, including C, Python, and MATLAB, where it can be implemented directly without reliance on predefined constants. However, results may vary when compiler optimizations rearrange expressions or apply algebraic simplifications, potentially treating 1 + ε > 1 as ε > 0 and yielding an incorrect (too small) value. Additionally, on architectures with extended-precision registers, such as the x87 unit on x86, intermediate computations might be carried out in higher precision, leading to a higher apparent precision unless strict floating-point modes or volatile qualifiers are employed to enforce the target precision.

Standard Values and Formats

IEEE 754 Single and Double Precision

The IEEE 754 standard for binary floating-point arithmetic, initially published in 1985 and subsequently revised in 2008 and 2019, specifies formats that dominate modern computing hardware, including the x86 and ARM processor architectures. These formats ensure consistent representation and operations across systems, with single and double precision being the most prevalent for general-purpose computations. In single precision (binary32), the 32-bit format allocates 1 bit for the sign, 8 bits for the biased exponent, and 23 bits for the fraction, yielding an effective precision of 24 bits due to the implicit leading 1 in normalized numbers. The machine epsilon, defined as the smallest positive value ε such that 1 + ε > 1 in floating-point representation, is ε = 2^−23 ≈ 1.192 × 10^−7. This value represents the unit in the last place (ulp) at 1.0, bounding the spacing between representable numbers in the vicinity of 1. Double precision (binary64) employs 64 bits: 1 sign bit, 11 exponent bits, and 52 fraction bits, for an effective precision of 53 bits including the implicit 1. Here, the machine epsilon is ε = 2^−52 ≈ 2.220 × 10^−16, again corresponding to the ulp at 1.0. The unit roundoff u, which quantifies the maximum relative error in round-to-nearest mode, is half the machine epsilon, or u = ε / 2. The following table summarizes these key parameters for comparison:
| Format | Total Bits | Precision p | Machine Epsilon ε | Unit Roundoff u | ULP at 1.0 |
|---|---|---|---|---|---|
| Single (binary32) | 32 | 24 | 2^−23 ≈ 1.192 × 10^−7 | 2^−24 ≈ 5.961 × 10^−8 | 2^−23 |
| Double (binary64) | 64 | 53 | 2^−52 ≈ 2.220 × 10^−16 | 2^−53 ≈ 1.110 × 10^−16 | 2^−52 |
These values are derived directly from the significand structure in the standard and are fundamental to assessing errors in numerical algorithms.
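For the double format, the table entries follow directly from the precision p = 53, which Python exposes alongside the epsilon constant:

```python
import sys

p = sys.float_info.mant_dig   # significand bits for binary64
assert p == 53
eps = sys.float_info.epsilon
assert eps == 2.0 ** (1 - p)  # machine epsilon 2**-52
u = eps / 2                   # unit roundoff 2**-53
assert u == 2.0 ** -p
```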

Other Common Formats

Half-precision floating-point, also known as binary16 in the IEEE 754-2008 standard, utilizes 11 bits for the significand (including an implicit leading 1), resulting in a machine epsilon of 2^−10 ≈ 9.765 × 10^−4. This format is commonly employed in graphics processing and machine learning applications where reduced storage is beneficial, though it offers lower precision compared to the single or double formats. Quadruple-precision floating-point, or binary128, extends the significand to 113 bits under IEEE 754-2008, yielding a machine epsilon of 2^−112 ≈ 1.925 × 10^−34. It provides exceptionally high accuracy for scientific simulations requiring minimal rounding errors. Decimal floating-point formats, introduced in IEEE 754-2008, operate in base 10 rather than base 2, altering the nature of spacing and rounding. For decimal32, which supports 7 digits of precision, the machine epsilon is approximately 10^−6, representing the unit in the last place around 1.0. This base-10 structure avoids some binary-to-decimal conversion issues in financial computations, though it generally results in larger epsilons than equivalent binary formats due to the coarser radix. Historical hexadecimal floating-point formats, used in IBM System/360 and later mainframes, employ base 16 with a significand of 6 hexadecimal digits (24 bits) for single precision. The hexadecimal base introduces uneven spacing, with the ulp at 1.0 being 16^−5 = 2^−20 ≈ 9.537 × 10^−7. This format's variable precision across binades, arising from normalization to a non-zero leading hexadecimal digit, complicates error analysis compared to binary standards, though it was optimized for the hardware efficiency of its era.
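A decimal32-like system can be sketched with Python's decimal module (setting prec = 7 is an assumption standing in for the format's 7 significant digits, not the hardware format itself):

```python
from decimal import Decimal, getcontext

getcontext().prec = 7                # 7 significant digits, as in decimal32
one = Decimal(1)
eps = Decimal("1e-6")                # interval epsilon: 10**(1-7)
assert one + eps > one               # a full decimal ulp at 1 survives
assert one + Decimal("1e-7") == one  # smaller increments round away
```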

Error Analysis Connections

The relative rounding error in floating-point arithmetic quantifies the deviation between the exact result of an operation and its computed approximation, providing a measure of precision loss due to finite representation. For a generic arithmetic operation ⊕, the computed result satisfies |fl(x ⊕ y) − (x ⊕ y)| / |x ⊕ y| ≤ ε / 2, where fl(·) denotes the floating-point evaluation and ε is the machine epsilon. This bound arises from the standard's requirement for correctly rounded operations, ensuring the result is as if the exact value were computed and then rounded to the nearest representable number. Machine epsilon thus serves as an upper limit on the worst-case relative accuracy of basic arithmetic operations, with ε / 2 acting as the unit roundoff that scales the potential error. In practice, this connection implies that each operation introduces a small, controlled perturbation proportional to the operand magnitudes, allowing analysts to predict and bound error propagation in more complex computations. An illustrative example occurs in the summation of n positive numbers, where naive recursive summation accumulates rounding errors that grow proportionally to nε, potentially magnifying inaccuracies for large n. This highlights how machine epsilon informs error growth in sequential operations, as each addition contributes an error bounded by ε / 2 times the current partial sum. Understanding this link guides the selection of numerical algorithms, favoring those, like compensated summation or pairwise summation, that mitigate error accumulation and keep the total relative error close to ε rather than allowing it to scale with problem size.
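The growth and its mitigation can be seen in a short sketch comparing naive summation with Kahan (compensated) summation:

```python
def naive_sum(xs):
    s = 0.0
    for x in xs:
        s += x                # each addition rounds; errors accumulate
    return s

def kahan_sum(xs):
    s, c = 0.0, 0.0
    for x in xs:
        y = x - c             # apply the carried correction
        t = s + y
        c = (t - s) - y       # recover the part of y lost to rounding
        s = t
    return s

xs = [0.1] * 10**6
err_naive = abs(naive_sum(xs) - 1e5)
err_kahan = abs(kahan_sum(xs) - 1e5)
assert err_kahan <= err_naive  # compensation keeps the error near eps
```

Here the reference value 1e5 is the ideal real-number sum; the remaining discrepancy in both cases also includes the representation error of 0.1 itself.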

Mathematical Proof

To rigorously establish the connection between machine epsilon and the maximum relative rounding error in floating-point arithmetic, consider the standard model defined by the IEEE 754 standard, which uses binary representation with base b = 2, a significand of p bits (including the implicit leading 1 for normalized numbers), and rounding to nearest (ties to even).

Theorem. For any x in the normalized range of the floating-point system (i.e., x_min ≤ |x| ≤ x_max, excluding subnormals for simplicity), the relative error incurred by rounding x to the nearest floating-point number fl(x) satisfies |fl(x) − x| / |x| ≤ u = 2^−p, where u is the unit roundoff. The machine epsilon ε is defined as ε = 2u = 2^(1−p), which thus bounds the maximum relative error by ε/2.

Proof. Assume rounding to nearest, so the absolute error satisfies |δ| = |fl(x) − x| ≤ (1/2) · ulp(x), where ulp(x) is the unit in the last place of x, representing the spacing between consecutive floating-point numbers at the magnitude of x. For a normalized x with exponent e (such that 2^e ≤ |x| < 2^(e+1)), ulp(x) = 2^(e−(p−1)). Thus |δ| ≤ 2^(e−p). The relative error is then |δ| / |x| ≤ 2^(e−p) / 2^e = 2^−p = u, since |x| ≥ 2^e. This bound holds uniformly because the relative spacing ulp(x)/|x| is maximized when |x| is just above a power of two (i.e., at the lower end of a binade), where it equals 2^(1−p), while the rounding error is at most half that spacing, yielding the factor of 1/2. To derive ε, note that near 1 (where e = 0), ulp(1) = 2^−(p−1) = 2^(1−p), so the smallest ε > 0 such that fl(1 + ε) > 1 is this ulp value, confirming ε = 2^(1−p) = 2u.

The maximum relative error occurs at points midway between representable numbers, achieving equality with u in the limit. This completes the proof, as the bound is tight and independent of the magnitude of x within the normalized range.
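The theorem's bound can be checked empirically on random rationals, exploiting the fact that CPython converts a Fraction to float by correctly rounded division:

```python
import random
from fractions import Fraction

u = Fraction(2) ** -53             # unit roundoff for binary64 (p = 53)
random.seed(1)
for _ in range(1000):
    x = Fraction(random.randint(1, 10**18), random.randint(1, 10**18))
    fl_x = Fraction(float(x))      # fl(x): round x to the nearest double
    assert abs(fl_x - x) / x <= u  # |fl(x) - x| / |x| <= u
```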
