Round-off error
from Wikipedia

In computing, a roundoff error,[1] also called rounding error,[2] is the difference between the result produced by a given algorithm using exact arithmetic and the result produced by the same algorithm using finite-precision, rounded arithmetic.[3] Rounding errors are due to inexactness in the representation of real numbers and the arithmetic operations done with them. This is a form of quantization error.[4] When using approximation equations or algorithms, especially when using finitely many digits to represent real numbers (which in theory have infinitely many digits), one of the goals of numerical analysis is to estimate computation errors.[5] Computation errors, also called numerical errors, include both truncation errors and roundoff errors.

When a sequence of calculations with an input involving roundoff error is made, errors may accumulate, sometimes dominating the calculation. In ill-conditioned problems, significant error may accumulate.[6]

In short, there are two major facets of roundoff errors involved in numerical calculations:[7]

  1. The ability of computers to represent both magnitude and precision of numbers is inherently limited.
  2. Certain numerical manipulations are highly sensitive to roundoff errors. This can result from both mathematical considerations as well as from the way in which computers perform arithmetic operations.

Representation error


The error introduced by attempting to represent a number using a finite string of digits is a form of roundoff error called representation error.[8] Here are some examples of representation error in decimal representations:

Notation  Representation                  Approximation          Error
1/7       0.142 857 142 857…              0.142 857              0.000 000 142 857…
ln 2      0.693 147 180 559 945 309 41…   0.693 147              0.000 000 180 559 945 309 41…
log₁₀ 2   0.301 029 995 663 981 195 21…   0.3010                 0.000 029 995 663 981 195 21…
∛2        1.259 921 049 894 873 164 76…   1.25992                0.000 001 049 894 873 164 76…
√2        1.414 213 562 373 095 048 80…   1.41421                0.000 003 562 373 095 048 80…
e         2.718 281 828 459 045 235 36…   2.718 281 828 459 045  0.000 000 000 000 000 235 36…
π         3.141 592 653 589 793 238 46…   3.141 592 653 589 793  0.000 000 000 000 000 238 46…

Increasing the number of digits allowed in a representation reduces the magnitude of possible roundoff errors, but any representation limited to finitely many digits will still cause some degree of roundoff error for uncountably many real numbers. Additional digits used for intermediary steps of a calculation are known as guard digits.[9]
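Representation errors like those in the table above can be checked exactly with rational arithmetic; a minimal Python sketch using the standard fractions module, for the 1/7 row:

```python
from fractions import Fraction

# Exact value of 1/7 and its six-digit decimal approximation 0.142857.
exact = Fraction(1, 7)
approx = Fraction(142857, 10**6)

# The representation error is exactly 1/7,000,000 = 0.000 000 142 857...
error = exact - approx
print(error)         # 1/7000000
print(float(error))  # ~1.42857e-07
```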

Rounding multiple times can cause error to accumulate.[10] For example, if 9.945309 is rounded to two decimal places (9.95), then rounded again to one decimal place (10.0), the total error is 0.054691. Rounding 9.945309 to one decimal place (9.9) in a single step introduces less error (0.045309). This can occur, for example, when software performs arithmetic in x86 80-bit floating-point and then rounds the result to IEEE 754 binary64 floating-point.
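The double-rounding effect can be reproduced with Python's decimal module (used here rather than binary floats, which cannot store 9.95 exactly):

```python
from decimal import Decimal, ROUND_HALF_EVEN

x = Decimal("9.945309")

# Round once to one decimal place: 9.9 (total error 0.045309).
once = x.quantize(Decimal("0.1"), rounding=ROUND_HALF_EVEN)

# Round twice: first to two places (9.95), then to one place (10.0),
# for a larger total error of 0.054691.
twice = x.quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN) \
         .quantize(Decimal("0.1"), rounding=ROUND_HALF_EVEN)

print(once, twice)  # 9.9 10.0
```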

Floating-point number system


Compared with the fixed-point number system, the floating-point number system is more efficient at representing real numbers, so it is widely used in modern computers. While the real numbers are infinite and continuous, a floating-point number system is finite and discrete. Thus, representation error, which leads to roundoff error, occurs under the floating-point number system.

Notation of floating-point number system


A floating-point number system F is characterized by four integers:

  • β: base or radix
  • p: precision
  • [L, U]: exponent range, where L is the lower bound and U is the upper bound

Any x ∈ F has the form x = ±(d₀.d₁d₂…d_{p−1})_β × β^E, where each dᵢ is an integer such that 0 ≤ dᵢ ≤ β − 1 for i = 0, 1, …, p − 1, and E is an integer such that L ≤ E ≤ U.

Normalized floating-point number system

  • A floating-point number system is normalized if the leading digit d₀ is always nonzero unless the number is zero.[3] Since the significand is d₀.d₁d₂…d_{p−1}, the significand of a nonzero number in a normalized system satisfies 1 ≤ significand < β. Thus, the normalized form of a nonzero IEEE binary floating-point number is ±1.bb…b × 2^E, where b ∈ {0, 1}. In binary, the leading digit is always 1, so it is not written out and is called the implicit bit. This gives an extra bit of precision so that the roundoff error caused by representation error is reduced.
  • Since a floating-point number system is finite and discrete, it cannot represent all real numbers, which means infinitely many real numbers can only be approximated by some finite numbers through rounding rules. The floating-point approximation of a given real number x by rounding can be denoted fl(x).
    • The total number of normalized floating-point numbers is 2(β − 1)β^(p−1)(U − L + 1) + 1, where
      • 2 counts the choice of sign, positive or negative
      • (β − 1) counts the choice of the leading digit
      • β^(p−1) counts the remaining significand digits
      • (U − L + 1) counts the choice of exponents
      • 1 counts the case when the number is 0.

IEEE standard


In the IEEE standard the base is binary, i.e. β = 2, and normalization is used. The IEEE standard stores the sign, exponent, and significand in separate fields of a floating-point word, each of which has a fixed width (number of bits). The two most commonly used levels of precision for floating-point numbers are single precision and double precision.

Precision Sign (bits) Exponent (bits) Trailing Significand field (bits)
Single 1 8 23
Double 1 11 52

Machine epsilon


Machine epsilon can be used to measure the level of roundoff error in the floating-point number system. Here are two different definitions.[3]

  • The machine epsilon, denoted ε_mach, is the maximum possible relative error in representing a nonzero real number x in a floating-point number system: ε_mach = max |x − fl(x)| / |x|.
  • The machine epsilon, denoted ε_mach, is the smallest number ε such that fl(1 + ε) > 1. Thus, fl(1 + δ) = 1 whenever |δ| < ε_mach.
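The second definition can be probed directly in IEEE double precision; a small Python sketch that halves a candidate ε until adding it no longer changes 1.0:

```python
import sys

# Find the smallest power of two that still changes 1.0 when added.
eps = 1.0
while 1.0 + eps / 2 > 1.0:
    eps /= 2

# For binary64 the search stops at 2**-52, which matches the
# machine epsilon reported by the runtime.
print(eps)                            # 2.220446049250313e-16
print(eps == sys.float_info.epsilon)  # True
```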

Roundoff error under different rounding rules


There are two common rounding rules, round-by-chop and round-to-nearest. The IEEE standard uses round-to-nearest.

  • Round-by-chop: The base-β expansion of x is truncated after the (p − 1)-th digit.
    • This rounding rule is biased because it always moves the result toward zero.
  • Round-to-nearest: fl(x) is set to the nearest floating-point number to x. When there is a tie, the floating-point number whose last stored digit is even (in binary, whose last digit is equal to 0) is used.
    • For the IEEE standard, where the base β is 2, this means that when there is a tie the number is rounded so that the last digit is equal to 0.
    • This rounding rule is more accurate but more computationally expensive.
    • Rounding so that the last stored digit is even when there is a tie ensures that results are not rounded up or down systematically, to avoid the possibility of an unwanted slow drift in long calculations due simply to a biased rounding.
  • The following example illustrates the level of roundoff error under the two rounding rules.[3] The round-to-nearest rule leads to less roundoff error in general.
x Round-by-chop Roundoff Error Round-to-nearest Roundoff Error
1.649 1.6 0.049 1.6 0.049
1.650 1.6 0.050 1.6 0.050
1.651 1.6 0.051 1.7 −0.049
1.699 1.6 0.099 1.7 −0.001
1.749 1.7 0.049 1.7 0.049
1.750 1.7 0.050 1.8 −0.050
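The table can be reproduced with Python's decimal module, which exposes both truncation and round-half-even as rounding modes:

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_EVEN

def chop(x):
    """Round-by-chop: truncate to one decimal place."""
    return Decimal(x).quantize(Decimal("0.1"), rounding=ROUND_DOWN)

def nearest(x):
    """Round-to-nearest, ties to even, one decimal place."""
    return Decimal(x).quantize(Decimal("0.1"), rounding=ROUND_HALF_EVEN)

# Ties such as 1.650 and 1.750 go to the even last digit:
print(chop("1.650"), nearest("1.650"))  # 1.6 1.6
print(chop("1.651"), nearest("1.651"))  # 1.6 1.7
print(chop("1.750"), nearest("1.750"))  # 1.7 1.8
```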

Calculating roundoff error in IEEE standard


Suppose round-to-nearest and IEEE double precision are used.

  • Example: the decimal number 9.4 = (1001.0110 0110 0110…)₂ can be rearranged into

    +1.0010 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 110…₂ × 2³

Since the 53rd bit to the right of the binary point is a 1 and is followed by other nonzero bits, the round-to-nearest rule requires rounding up, that is, adding 1 to the 52nd bit. Thus, the normalized floating-point representation in IEEE standard of 9.4 is

    fl(9.4) = 1.0010 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 1101₂ × 2³

  • Now the roundoff error can be calculated when representing 9.4 with fl(9.4).

This representation is derived by discarding the infinite tail 0.1100 1100…₂ × 2⁻⁵² × 2³ = 0.4 × 2⁻⁴⁸ from the right and then adding 2⁻⁵² × 2³ = 2⁻⁴⁹ in the rounding step.

Then fl(9.4) = 9.4 − 0.4 × 2⁻⁴⁸ + 2⁻⁴⁹ = 9.4 + 0.2 × 2⁻⁴⁹.
Thus, the roundoff error is fl(9.4) − 9.4 = 0.2 × 2⁻⁴⁹ ≈ 3.55 × 10⁻¹⁶.
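This roundoff error can be verified exactly in Python, since Fraction applied to a float recovers the exact rational value the hardware stores:

```python
from fractions import Fraction

# Fraction(9.4) recovers the exact binary64 value of the literal 9.4.
stored = Fraction(9.4)
exact = Fraction(94, 10)

# The roundoff error is exactly 0.2 * 2**-49 = 1 / (5 * 2**49).
error = stored - exact
print(error == Fraction(1, 5 * 2**49))  # True
print(float(error))                     # ~3.55e-16
```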


Measuring roundoff error by using machine epsilon


The machine epsilon can be used to measure the level of roundoff error when using the two rounding rules above. Below are the formulas and corresponding proof.[3] The first definition of machine epsilon is used here.

Theorem

  1. Round-by-chop: ε_mach = β^(1−p)
  2. Round-to-nearest: ε_mach = ½ β^(1−p)

Proof


Let x = d₀.d₁d₂…d_{p−1}d_p d_{p+1}… × β^n ∈ ℝ, where d₀ ≠ 0, and let fl(x) be the floating-point representation of x. Since round-by-chop is being used, the relative error is

|x − fl(x)| / |x| = (0.d_p d_{p+1}… × β^(n−p+1)) / (d₀.d₁d₂… × β^n).

In order to determine the maximum of this quantity, one needs the maximum of the numerator and the minimum of the denominator. Since d₀ ≠ 0 (normalized system), the minimum value of the denominator is β^n. The numerator is bounded above by β^(n−p+1). Thus, |x − fl(x)| / |x| ≤ β^(n−p+1) / β^n = β^(1−p). Therefore, ε_mach = β^(1−p) for round-by-chop. The proof for round-to-nearest is similar, with the discarded part bounded by half the digit spacing, giving ε_mach = ½ β^(1−p).

  • Note that the first definition of machine epsilon is not quite equivalent to the second definition when using the round-to-nearest rule but it is equivalent for round-by-chop.

Roundoff error caused by floating-point arithmetic


Even if some numbers can be represented exactly by floating-point numbers (such numbers are called machine numbers), performing floating-point arithmetic on them may lead to roundoff error in the final result.

Addition


Machine addition consists of lining up the decimal points of the two numbers to be added, adding them, and then storing the result again as a floating-point number. The addition itself can be done in higher precision but the result must be rounded back to the specified precision, which may lead to roundoff error.[3]

  • For example, adding 1 to 2⁻⁵³ in IEEE double precision as follows:

    1.00…0 × 2⁰ + 1.00…0 × 2⁻⁵³ = 1.00…0 1₂ × 2⁰, with a 1 in the 53rd bit after the binary point.

    This is saved as 1.00…0 × 2⁰ since round-to-nearest is used in the IEEE standard. Therefore, 1 + 2⁻⁵³ is equal to 1 in IEEE double precision, and the roundoff error is 2⁻⁵³.

This example shows that roundoff error can be introduced when adding a large number and a small number. The shifting of the decimal points in the significands to make the exponents match causes the loss of some of the less significant digits. The loss of precision may be described as absorption.[11]
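Absorption is directly observable in any IEEE 754 environment; a minimal Python sketch:

```python
# 2**-53 is half an ulp of 1.0, so the sum 1 + 2**-53 is a tie and
# round-to-nearest (ties to even) returns exactly 1.0: the small
# addend is absorbed entirely.
absorbed = 1.0 + 2**-53
print(absorbed == 1.0)       # True

# A slightly larger addend (one full ulp) does survive:
not_absorbed = 1.0 + 2**-52
print(not_absorbed == 1.0)   # False
```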

Note that the addition of two floating-point numbers can produce roundoff error when their sum is an order of magnitude greater than that of the larger of the two.

  • For example, consider a normalized floating-point number system with base β = 10 and precision p = 2. Then fl(62) = 6.2 × 10¹ and fl(41) = 4.1 × 10¹. Note that 62 + 41 = 103, but fl(103) = 1.0 × 10². There is a roundoff error of 103 − fl(103) = 3.

This kind of error can occur alongside an absorption error in a single operation.
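Examples like this can be reproduced with Python's decimal module by limiting the context to two significant digits (62 + 41 = 103 rounds to 1.0 × 10²):

```python
from decimal import Decimal, getcontext

# Two significant decimal digits, as in the example above.
getcontext().prec = 2

s = Decimal(62) + Decimal(41)
print(s)         # 1.0E+2 -- three digits of 103 do not fit
print(s == 100)  # True -> roundoff error of 3
```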

Multiplication


In general, the product of two p-digit significands contains up to 2p digits, so the result might not fit in the significand.[3] Thus roundoff error will be involved in the result.

  • For example, consider a normalized floating-point number system with base β = 10 and at most p = 2 significand digits. Then fl(77) = 7.7 × 10¹ and fl(88) = 8.8 × 10¹. Note that 77 × 88 = 6776, but fl(6776) = 6.7 × 10³ since there are at most 2 significand digits. The roundoff error would be 6776 − 6.7 × 10³ = 76.
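The same two-digit product can be checked with the decimal module, using truncation (round-by-chop) as the rounding mode:

```python
from decimal import Decimal, getcontext, ROUND_DOWN

# Two significant digits with round-by-chop (truncation toward zero).
getcontext().prec = 2
getcontext().rounding = ROUND_DOWN

p = Decimal(77) * Decimal(88)
print(p)              # 6.7E+3, although 77 * 88 = 6776
print(6776 - int(p))  # 76
```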

Division


In general, the quotient of two p-digit significands may contain more than p digits. Thus roundoff error will be involved in the result.

  • For example, if the normalized floating-point number system above is still being used, then 1/3 = 0.333…, but fl(1/3) = 3.3 × 10⁻¹. So the tail 1/3 − 3.3 × 10⁻¹ = 0.00333… is cut off.

Subtraction


Absorption also applies to subtraction.

  • For example, subtracting 2⁻⁶⁰ from 1 in IEEE double precision: 1.00…0 × 2⁰ − 1.00…0 × 2⁻⁶⁰ is saved as 1.00…0 × 2⁰ since round-to-nearest is used in the IEEE standard. Therefore, 1 − 2⁻⁶⁰ is equal to 1 in IEEE double precision, and the roundoff error is 2⁻⁶⁰.

The subtraction of two nearly equal numbers is called subtractive cancellation.[3] When the leading digits are cancelled, the surviving difference may not be representable and will just be represented as 0.

  • For example, let ε be the machine epsilon of the second definition, and consider the computation (1 + ε/4) − (1 − ε/4). (The perturbation ε/4 is chosen here for illustration.)
    The operands 1 + ε/4 and 1 − ε/4 are nearly equal numbers, and the exact difference is ε/2. However, in the floating-point number system, fl(1 + ε/4) = 1 and fl(1 − ε/4) = 1. Although ε/2 is easily big enough to be represented, both instances of ε/4 have been rounded away, giving a computed result of 1 − 1 = 0.

Even with a somewhat larger difference, the result is still significantly unreliable in typical cases. There is not much faith in the accuracy of the value because the most uncertainty in any floating-point number is in the digits on the far right.

  • For example, in a ten-digit decimal system, 0.123456790 − 0.123456789 = 1 × 10⁻⁹. The result is clearly representable, but there is not much faith in it: it consists entirely of what were the least-certain digits of the operands.

This is closely related to the phenomenon of catastrophic cancellation, in which the two numbers are known to be approximations.
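Cancellation of this kind is easy to trigger in IEEE double precision; a small Python sketch (the ε/4 perturbations are chosen here for illustration):

```python
import sys

eps = sys.float_info.epsilon  # 2**-52 in binary64

# The exact difference is eps/2, which is easily representable,
# but both eps/4 perturbations are rounded away on storage, so the
# computed difference collapses to zero.
diff = (1.0 + eps / 4) - (1.0 - eps / 4)
print(diff)     # 0.0
print(eps / 2)  # the representable true answer, lost to cancellation
```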

Accumulation of roundoff error


Errors can be magnified or accumulated when a sequence of calculations is applied on an initial input with roundoff error due to inexact representation.

Unstable algorithms


An algorithm or numerical process is called stable if small changes in the input only produce small changes in the output, and unstable if large changes in the output are produced.[12] For example, the computation of f(x) = √(1 + x) − 1 using the "obvious" method is unstable near x = 0 due to the large error introduced in subtracting two similar quantities, whereas the equivalent expression f(x) = x / (√(1 + x) + 1) is stable.[12]
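The contrast between an unstable formula and a stable rearrangement of √(1 + x) − 1 can be sketched in Python (x = 1e-16 is chosen here because it is below half an ulp of 1.0):

```python
import math

x = 1e-16

# "Obvious" form: 1 + x rounds to 1.0, so the subtraction
# cancels everything and returns 0.
naive = math.sqrt(1.0 + x) - 1.0

# Equivalent rearrangement avoids subtracting similar quantities.
stable = x / (math.sqrt(1.0 + x) + 1.0)

print(naive)   # 0.0 -- all information lost
print(stable)  # 5e-17 -- close to the true value x/2
```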

Ill-conditioned problems


Even if a stable algorithm is used, the solution to a problem may still be inaccurate due to the accumulation of roundoff error when the problem itself is ill-conditioned.

The condition number of a problem is the ratio of the relative change in the solution to the relative change in the input.[3] A problem is well-conditioned if small relative changes in input result in small relative changes in the solution. Otherwise, the problem is ill-conditioned.[3] In other words, a problem is ill-conditioned if its condition number is "much larger" than 1.

The condition number is introduced as a measure of the roundoff errors that can result when solving ill-conditioned problems.[7]

from Grokipedia
Round-off error refers to the discrepancy between the exact mathematical value of a quantity and its computed approximation in finite-precision arithmetic, arising from the finite precision available in computer representations of real numbers. This error is inherent to digital computation, where real numbers are encoded using a fixed number of bits, leading to inexact representations and the need for rounding during arithmetic operations. In systems adhering to the IEEE 754 standard, floating-point numbers typically use formats such as single precision (32 bits) or double precision (64 bits), with the latter providing about 15 decimal digits of accuracy. The primary causes of round-off error include the inability to represent most real numbers exactly in binary—such as 0.1, which has a repeating binary expansion—and the rounding that occurs when the result of an operation exceeds the available precision. For instance, basic operations like addition or multiplication introduce an error bounded by half a unit in the last place (ulp), often quantified relative to the machine epsilon (ε), the smallest positive floating-point number such that 1 + ε > 1, approximately 2^−52 (or 2.22 × 10^−16) for double precision. These errors can accumulate over multiple operations, potentially magnifying in processes like iterative algorithms or summations, where naive sequential summation of n terms may incur up to nε relative error. A notable implication of round-off error is catastrophic cancellation, where subtraction of two nearly equal numbers leads to significant loss of precision; for example, in solving the quadratic equation ax² + bx + c = 0, the discriminant term b² − 4ac can result in substantial relative error if b² is much larger than 4ac. Mitigation strategies include using higher-precision arithmetic, compensatory algorithms like Kahan summation (which bounds summation error to roughly 2ε regardless of n), and careful ordering of operations to minimize cancellation. The IEEE 754 standard addresses consistency by mandating rounding modes (e.g., round to nearest) and features like guard digits to reduce subtraction errors.
Overall, understanding and managing round-off error is crucial in numerical analysis, scientific computing, and engineering to ensure reliable results.
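The compensated (Kahan) summation mentioned above can be sketched in Python; the specific input values here are chosen for illustration (ten tiny terms that naive left-to-right addition absorbs entirely):

```python
import math

def kahan_sum(values):
    """Compensated (Kahan) summation: carry lost low-order bits along."""
    total = 0.0
    c = 0.0  # running compensation for digits lost to rounding
    for v in values:
        y = v - c            # apply the correction to the next term
        t = total + y        # low-order digits of y may be lost here
        c = (t - total) - y  # recover exactly what was lost
        total = t
    return total

# Ten tiny terms after a large one: naive addition absorbs them all.
values = [1.0] + [1e-16] * 10

naive = sum(values)               # 1.0 -- every 1e-16 absorbed
compensated = kahan_sum(values)   # ~1.000000000000001
print(naive, compensated, math.fsum(values))
```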

Numerical Representation Basics

Representation Error

Representation error arises in numerical computing when a real number x cannot be exactly stored in a finite-precision representation, such as the floating-point format used in computers. This error is defined as the difference between the exact value x and its approximated floating-point representation fl(x), where fl(x) is the closest representable number in the system's finite set of values. In finite-precision systems, only a countable subset of the real numbers can be represented exactly, leading to inherent discrepancies for most irrational or non-dyadic rational numbers. A classic example occurs in decimal systems, where the fraction 1/3 = 0.333…₁₀ (repeating infinitely) must be truncated or rounded to a finite number of digits, such as 0.333, resulting in a representation error. This mirrors the challenge in binary floating-point systems, where decimal fractions like 0.1 cannot be expressed exactly because 0.1 requires an infinite repeating binary expansion (approximately 0.0001100110011…₂). In binary floating-point with 24-bit precision (as in single precision), 0.1 is approximated as 1.10011001100110011001101₂ × 2⁻⁴, introducing a persistent small error that affects subsequent computations. The magnitude of representation error is quantified using absolute and relative measures. The absolute representation error is |x − fl(x)|, which depends on the scale of x and the system's precision. The relative representation error, |x − fl(x)| / |x|, normalizes this difference and is typically bounded by half a unit in the last place (ulp) in well-designed systems, providing a scale-invariant assessment of accuracy. These measures are foundational to round-off issues, as even exact mathematical values become inexact upon storage.
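The representation error of 0.1 in binary64 can be computed exactly in Python, again using Fraction to expose the stored rational value:

```python
from fractions import Fraction

# The double nearest to 0.1 is exactly 3602879701896397 / 2**55.
stored = Fraction(0.1)
print(stored == Fraction(3602879701896397, 2**55))  # True

# Representation error of the literal 0.1:
error = stored - Fraction(1, 10)
print(float(error))  # ~5.55e-18, well under half an ulp of 0.1
```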

Floating-Point Number System

Floating-point number systems in computers approximate real numbers through a structured format consisting of three primary components: a sign, an exponent, and a significand (also referred to as the mantissa). The sign, typically a single bit, determines the polarity of the number, with 0 indicating positive and 1 indicating negative. The exponent, represented by a fixed number of bits, encodes the scale or magnitude of the number. The significand captures the significant digits, providing the precision of the representation. The general form of a floating-point number fl(x) for a real number x is fl(x) = ±(1 + m) × b^e, where b is the radix or base of the system (commonly 2 or 10), m is the fractional part of the significand with 0 ≤ m < 1, and e is the integer exponent. This formulation assumes a normalized representation, where the significand is scaled such that its leading digit is nonzero, maximizing the use of available digits for precision. In practice, the significand is stored as a fixed-length sequence of digits in base b, and the exponent adjusts the position of the radix point. Normalized floating-point numbers require the leading digit of the significand to be nonzero, ensuring a canonical form that avoids redundant representations and optimizes precision within the allocated bits. For instance, a number like 0.d₁d₂… × b^e is shifted to d₁.d₂… × b^(e−k), where d₁ ≠ 0 and k is the number of positions shifted. Denormalized forms, conversely, occur when the exponent reaches its minimum value and the leading digit is zero, allowing gradual underflow by representing subnormal numbers with reduced precision near zero. This distinction helps mitigate abrupt transitions in representable values around the smallest normalized magnitude. The representable range is bounded by the minimum and maximum exponents, e_min and e_max, limiting normalized numbers to approximately [b^(e_min), b^(e_max)] in magnitude.
Precision is constrained by the length of the significand, typically measured in digits or bits, which determines the smallest distinguishable relative difference between numbers of similar magnitude. Overflow arises when a value exceeds b^(e_max), often resulting in a special infinity representation, while underflow occurs for values below b^(e_min), potentially flushing tiny results to zero or using denormalized forms for gradual precision loss. These bounds impose inherent limitations on the system's ability to capture arbitrary real numbers exactly. Binary floating-point systems with base b = 2 predominate in modern computers owing to their hardware efficiency; binary representations facilitate simple shifts for exponent adjustments and align seamlessly with binary logic gates, reducing complexity in arithmetic units compared to higher-radix systems.

Floating-Point Standards and Notation

Notation of Floating-Point Systems

Floating-point numbers are typically represented in a standardized mathematical form to approximate real numbers within a finite precision system. The general notation for a floating-point number x is x = ±(d₀.d₁d₂…d_{p−1})_β × β^e, where β is the base (radix, often 2 or 10), p denotes the precision (number of digits in the significand), each dᵢ (for i = 0, 1, …, p−1) is a digit satisfying 0 ≤ dᵢ < β, and e is the exponent, an integer within a defined range. This form mirrors scientific notation but is constrained to discrete values, leading to round-off errors when exact representation is impossible. In normalized floating-point systems, the significand (or mantissa) is adjusted such that the leading digit d₀ is nonzero, ensuring 1 ≤ d₀.d₁d₂…d_{p−1} < β. This normalization eliminates leading zeros, providing a unique representation for nonzero numbers and maximizing precision. The exponent e ranges from e_min to e_max, which bound the smallest and largest representable magnitudes; for example, underflow occurs for exponents below e_min, and overflow above e_max. The unit in the last place, denoted ulp(x), quantifies the spacing between consecutive representable floating-point numbers near x. For a normalized number x with exponent e, ulp(x) = β^(e−p+1), representing the value of the least significant digit in the significand. This measure is crucial for assessing representation granularity and potential rounding discrepancies. Under rounding to nearest, the unit roundoff u defines the maximum relative rounding error, expressed as u = ½ β^(1−p).
This bounds the error in approximating any real number by the nearest floating-point representation: |fl(y) − y| ≤ u|y| for a real y and its floating-point approximation fl(y). This notation underpins standards like IEEE 754, which specify concrete parameters for β, p, e_min, and e_max.
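The ulp formula β^(e−p+1) can be checked against the runtime in Python 3.9+, which exposes math.ulp:

```python
import math

# For binary64 (beta = 2, p = 53): ulp(x) = 2**(e - 52), so the
# spacing halves or doubles at each power of two (each binade).
print(math.ulp(1.0) == 2**-52)  # True  (e = 0)
print(math.ulp(2.0) == 2**-51)  # True  (e = 1)
print(math.ulp(0.5) == 2**-53)  # True  (e = -1)
```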

IEEE 754 Standard

The IEEE 754 standard, originally published in 1985 as IEEE Std 754-1985, defines a technical framework for binary floating-point arithmetic to promote consistent representation and computation across diverse computer systems. Its primary purpose is to ensure portability of floating-point data and reproducibility of arithmetic results, addressing inconsistencies in earlier proprietary formats that hindered software development and scientific computing. The standard was revised in 2008 (IEEE Std 754-2008) to incorporate decimal floating-point formats, refine operations, and clarify exception handling, while maintaining backward compatibility with the original binary specifications. The latest revision, IEEE Std 754-2019, further expanded support by adding new decimal formats and updating exception handling mechanisms to better accommodate modern hardware and application needs. Central to the standard are its binary interchange formats, which encode floating-point numbers using a sign bit, an exponent field, and a significand (also called the fraction or mantissa). The single-precision format (binary32) occupies 32 bits total: 1 sign bit, 8 exponent bits, and 23 fraction bits, providing an effective significand precision of 24 bits including an implicit leading 1 for normalized numbers. The double-precision format (binary64) uses 64 bits: 1 sign bit, 11 exponent bits, and 52 fraction bits, yielding 53 bits of significand precision. For higher precision, the quad-precision format (binary128) employs 128 bits: 1 sign bit, 15 exponent bits, and 112 fraction bits, resulting in 113 bits of significand precision. Exponents in these formats are biased to allow representation of both positive and negative values; for example, in double precision, the 11-bit exponent field uses a bias of 1023, where the stored exponent value e represents the true exponent as e − 1023. The standard also defines special values to handle exceptional conditions in computations.
Infinities are represented by setting the exponent to all ones (e.g., 2047 for double precision) and the significand to zero, with the sign bit indicating positive or negative infinity. Not a Number (NaN) values use the same all-ones exponent but with a non-zero significand, allowing distinction between quiet NaNs (propagating without signaling) and signaling NaNs (triggering exceptions); the sign bit of a NaN carries no arithmetic meaning. Signed zeros are supported, where +0 and -0 are distinct representations (exponent and significand all zeros, differing only in the sign bit), preserving the sign of zero results from operations like subtraction near zero. These features, introduced in the 1985 standard and refined in subsequent revisions, enable robust error detection and consistent behavior in floating-point environments.
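The binary64 field layout can be inspected directly by reinterpreting a float's bits; a small Python sketch using struct:

```python
import struct

def binary64_fields(x: float):
    """Unpack sign, biased exponent, and fraction of an IEEE binary64."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF    # 11-bit biased exponent
    fraction = bits & ((1 << 52) - 1)  # 52-bit trailing significand
    return sign, exponent, fraction

# 1.0 is stored as sign 0, biased exponent 1023 (true exponent 0),
# fraction 0 (the leading 1 is implicit).
print(binary64_fields(1.0))           # (0, 1023, 0)

# +infinity: exponent all ones (2047), fraction 0.
print(binary64_fields(float("inf")))  # (0, 2047, 0)
```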

Quantifying Round-off Error

Machine Epsilon

Machine epsilon, denoted ε, is defined as the smallest positive real number such that the floating-point representation fl(1 + ε) > 1, meaning it is the smallest ε > 0 distinguishable from zero when added to 1 in floating-point arithmetic. This value quantifies the precision limit of the floating-point system, representing the gap between 1 and the next larger representable number. In a floating-point system with base b and precision p (the number of digits in the significand), machine epsilon derives from the spacing of representable numbers near 1, given by ε = b^(1−p). For the binary64 double-precision format, where b = 2 and p = 53, this yields ε ≈ 2.22 × 10⁻¹⁶. The unit roundoff u, which bounds the maximum relative error, relates to machine epsilon by u = ε/2. Machine epsilon plays a central role in error analysis by providing an upper bound on relative errors in floating-point representations and basic operations. Specifically, for any x in the normal range of the floating-point system, the representation error satisfies |fl(x) − x| ≤ u|x|, or equivalently |fl(x) − x| ≤ (ε/2)|x|. This bound ensures that the relative error in representing x is at most u, facilitating the analysis of error propagation in more complex computations.

Round-off Error Under Rounding Rules

In floating-point arithmetic, rounding rules determine how a non-representable real number is approximated by the closest representable value, thereby introducing round-off error. The IEEE 754 standard specifies four primary rounding modes: round to nearest, where the result is the representable value closest to the exact result (with ties resolved to the even mantissa); round toward zero (truncation), which discards excess bits beyond the precision; round toward positive infinity, which rounds up for positive numbers and down for negative; and round toward negative infinity, which rounds down for positive and up for negative. Round to nearest serves as the default mode, promoting minimal error magnitude, while directed modes (toward zero or infinity) are used in applications like interval arithmetic for bounding computations. An additional common rule, round away from zero, directs rounding toward the infinity matching the sign (up for positive, down for negative), though it is not a distinct IEEE binary mode and can be emulated by sign-dependent selection of directed rounding. Under these rules, round-off error is quantified by the difference between the exact value x and its floating-point representation fl(x). For round to nearest, the absolute error is bounded by half the unit in the last place (ulp) of x: |fl(x) − x| ≤ ½ ulp(x). This bound arises because the representable numbers are spaced by ulp(x) in the binade containing x, and rounding selects the nearest point, ensuring the maximum deviation is halfway between points. In directed modes like toward zero or infinity, the error can reach a full ulp, leading to larger potential discrepancies: |fl(x) − x| ≤ ulp(x). The relative round-off error provides a scale-invariant measure, especially useful for normalized floating-point systems.
Machine epsilon ε, defined as the smallest positive floating-point number such that 1 + ε > 1, captures the spacing around unity and relates to relative precision. For operations on values within the normal range (away from underflow), the relative error under rounding satisfies |fl(x) − x| / |x| ≤ ε/2 in binary systems, or more generally ≤ u, where u = ½ β^(1−p) is the unit roundoff, with base β and precision p. A key result in normalized floating-point systems with round to nearest is the theorem that the relative round-off error is bounded by the unit roundoff: |fl(x) − x| / |x| ≤ u. The proof relies on the structure of the mantissa and normalization. For a normalized x = m · β^e with 1 ≤ m < β, the ulp is β^(e−p+1), so ½ ulp(x) = ½ β^(e−p+1). Dividing by |x| ≥ β^e yields |fl(x) − x| / |x| ≤ (½ β^(e−p+1)) / β^e = ½ β^(1−p) = u, confirming that fl(x) is the closest representable number and the error does not exceed this relative threshold. This bound holds assuming no overflow or underflow, emphasizing the role of normalization in maintaining computational stability.

Errors in Floating-Point Arithmetic

Addition and Subtraction

In floating-point addition, the operands are first aligned by shifting the significand of the number with the smaller exponent to match the larger one, which may introduce errors if bits are shifted beyond the precision limit. The aligned significands are then added, producing a sum that may exceed the representable precision. This sum undergoes normalization, shifting the significand left or right to restore the normalized form while adjusting the exponent accordingly. Finally, the result is rounded to the nearest representable floating-point number according to the system's rounding mode, such as round-to-nearest in IEEE 754. The round-off error introduced by this process satisfies the model fl(a + b) = (a + b)(1 + δ), where fl denotes the floating-point result and the relative error satisfies |δ| ≤ u, with u being the unit roundoff (half the machine epsilon). This bound holds under the standard's exact rounding requirement, assuming no overflow or underflow. The same error model applies to subtraction, fl(a − b) = (a − b)(1 + δ) with |δ| ≤ u, as subtraction is implemented by negating one operand and performing addition. Subtraction introduces a particular vulnerability known as catastrophic cancellation when the operands are nearly equal in magnitude (a ≈ b), causing leading significant digits to cancel and resulting in a small difference with potentially amplified relative error. In such cases, the absolute round-off error remains bounded by u|a − b|, but the relative error in the result can greatly exceed u because the true difference is much smaller than the operands, effectively magnifying any prior errors in a or b or those from the subtraction itself. For instance, in a floating-point system with six significant digits, subtracting 9.99995 from 10.0000 yields an exact difference of 0.00005, but if the operands are rounded approximations, the result might be 0.00000 due to cancellation of all significant digits, leading to complete loss of precision in the difference.

Multiplication and Division

In floating-point arithmetic, multiplication of two numbers a and b, represented as a = m_a \beta^{e_a} and b = m_b \beta^{e_b} where m_a, m_b are significands and \beta is the base, involves multiplying the significands to form an intermediate product m_a m_b (which may require up to 2p digits for p-digit precision) and adding the exponents e_a + e_b. The result is then normalized if necessary and rounded to fit the destination format. The computed result satisfies \mathrm{fl}(a \times b) = ab(1 + \delta), where |\delta| \le u and u = \beta^{1-p}/2 is the unit roundoff, ensuring a relative error bounded by half a unit in the last place (ulp). This rounding error arises because the exact product may not be representable exactly, particularly when the product exceeds p digits before normalization. To achieve correctly rounded results as required by IEEE 754, implementations use extra bits beyond the working precision: a guard bit to hold the most significant discarded bit, a round bit for the next, and a sticky bit that is set if any further bits are nonzero, allowing precise decisions for modes like round-to-nearest (ties to even). These mechanisms keep the effective rounding error at most 0.5 ulp. For example, multiplying a large number near the overflow threshold, such as 10^{308} by 2 in double precision, produces an intermediate exponent that triggers overflow, and the result is an infinity because the rounded value would exceed the maximum representable finite number. Division follows a similar process: the significands are divided to yield m_a / m_b, the exponents are subtracted as e_a - e_b, and the quotient is normalized and rounded. The relative error is likewise bounded: \mathrm{fl}(a / b) = (a / b)(1 + \delta) with |\delta| \le u.
However, division introduces potential underflow for small quotients, where the result's magnitude falls below the smallest normalized number, leading to a subnormal or zero after rounding; the IEEE 754 standard mandates signaling underflow in such cases and provides gradual underflow via subnormals to minimize information loss. The absolute error in the quotient is |(a/b)\delta| \le u \cdot |a/b|, which depends on the quotient's magnitude and remains small relative to the result even for tiny values. Guard, round, and sticky bits are again employed during significand division (often via iterative algorithms like SRT) to ensure correct rounding.
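The overflow and gradual-underflow behavior described above can be checked in a few lines of Python (a sketch assuming IEEE 754 binary64, the format behind CPython floats):

```python
import math
import sys

# Multiplication near the overflow threshold: 1e308 * 2 exceeds the
# largest finite double (about 1.798e308), so the rounded result is +inf.
assert math.isinf(1e308 * 2)

# Division producing a tiny quotient underflows gradually: the result is
# a nonzero subnormal double rather than being flushed to zero.
tiny = sys.float_info.min        # smallest normalized double, 2**-1022
quot = tiny / 4                  # 2**-1024, below the normal range
assert 0.0 < quot < tiny         # nonzero and subnormal
assert quot == 2.0 ** -1024      # still exactly representable here
```

Dividing further, tiny / 2**53 would fall below the smallest subnormal (2**-1074) and round to zero, the information loss that gradual underflow postpones.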

Propagation and Accumulation of Errors

Error Accumulation in Computations

In floating-point computations involving multiple operations, the individual rounding errors from basic arithmetic can propagate and accumulate, leading to a total error that grows with the number of steps. The forward error represents the overall discrepancy between the exact mathematical result and the computed floating-point result, which includes both the initial representation errors in the input data and the propagated round-off errors from subsequent operations. This accumulation arises because each operation introduces a relative error bounded by the unit roundoff u, and these errors can compound through multiplication by subsequent factors in the computation. Backward error analysis provides a complementary perspective by interpreting the computed result as the exact solution to a slightly perturbed problem, where the input is modified by a small backward error \epsilon such that the algorithm applied to the perturbed input yields the observed output. In this framework, the computed result \hat{y} satisfies \hat{y} = f(x + \delta x) for some small \delta x with \|\delta x\| \le \epsilon \|x\|, allowing the forward error to be bounded by approximately \epsilon times the condition number of the problem. This approach is particularly useful for assessing stability in algorithms where round-off errors mimic small input perturbations; the total forward error then combines representation errors with the propagated effects of these perturbations.

A key example of error accumulation occurs in the summation of n floating-point numbers using recursive (naive) addition, where the error bound without significant cancellation is O(nu) times a measure of the input magnitude, such as \sum |x_i|. Specifically, for recursive summation s_1 = x_1, s_k = \mathrm{fl}(s_{k-1} + x_k) for k = 2, \dots, n, the absolute error satisfies |s_n - \sum_{i=1}^n x_i| \le \gamma_n \sum_{i=1}^n |x_i|, where \gamma_n = nu/(1 - nu) (assuming nu < 1), reflecting the linear growth in error due to the sequential addition of bounded relative errors at each step. This bound arises from the first-order model of floating-point addition, where each operation incurs a relative error of at most u, and the errors propagate multiplicatively through the partial sums. Under the random sign errors model, which assumes that the rounding errors at each step behave like independent random variables with random signs (mean zero and bounded by u), the accumulation resembles a random walk, leading to a probabilistic error growth of order \sqrt{n}\,u.
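The O(nu) growth of naive summation and its \gamma_n bound can be illustrated with a Python sketch; math.fsum serves as a correctly rounded reference, and the choice of 0.1 as summand is arbitrary:

```python
import math
from fractions import Fraction

xs = [0.1] * 100_000                # n doubles, each slightly above 1/10

# Recursive (naive) summation: each fl(+) adds a relative error <= u.
naive = 0.0
for x in xs:
    naive += x

exact = Fraction(0.1) * len(xs)     # exact sum of the stored doubles
err_naive = abs(Fraction(naive) - exact)

# gamma_n bound from the text: |s_n - sum x_i| <= gamma_n * sum |x_i|.
n, u = len(xs), 2.0 ** -53
gamma_n = n * u / (1 - n * u)
assert err_naive <= gamma_n * exact

# math.fsum tracks the intermediate rounding errors and returns the
# correctly rounded sum, so its error is at most half an ulp.
err_fsum = abs(Fraction(math.fsum(xs)) - exact)
assert err_fsum <= err_naive
```

Compensated schemes such as Kahan summation achieve a similar effect with a constant error bound independent of n, at the cost of a few extra operations per term.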