Decimal floating point
Decimal floating-point (DFP) arithmetic refers to both a representation and operations on decimal floating-point numbers. Working directly with decimal (base-10) fractions can avoid the rounding errors that otherwise typically occur when converting between decimal fractions (common in human-entered data, such as measurements or financial information) and binary (base-2) fractions.
The advantage of decimal floating-point representation over decimal fixed-point and integer representation is that it supports a much wider range of values. For example, while a fixed-point representation that allocates 8 decimal digits and 2 decimal places can represent the numbers 123456.78, 8765.43, 123.00, and so on, a floating-point representation with 8 decimal digits could also represent 1.2345678, 1234567.8, 0.000012345678, 12345678000000000, and so on. This wider range can dramatically slow the accumulation of rounding errors during successive calculations; for example, the Kahan summation algorithm can be used in floating point to add many numbers with no asymptotic accumulation of rounding error.
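For illustration, Python's decimal module (mentioned under Implementations below) can emulate such an 8-digit decimal floating-point context; all of the values above survive the context rounding with their values unchanged. A minimal sketch:

```python
from decimal import Decimal, getcontext

getcontext().prec = 8                     # 8 significant digits, floating decimal point
for text in ("123456.78", "1.2345678", "0.000012345678", "12345678000000000"):
    d = +Decimal(text)                    # unary + rounds to the current context
    print(d)                              # each value is preserved exactly
                                          # (the last one prints as 1.2345678E+16)
```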
Implementations
Early mechanical uses of decimal floating point are evident in the abacus, slide rule, the Smallwood calculator, and some other calculators that support entries in scientific notation. In the case of mechanical calculators, the exponent is often treated as side information that is accounted for separately.
The IBM 650 computer supported an 8-digit decimal floating-point format in 1953.[1] The otherwise binary Wang VS machine supported a 64-bit decimal floating-point format in 1977.[2] The Motorola 68881 supported a format with 17 digits of mantissa and 3 of exponent in 1984, with the floating-point support library for the Motorola 68040 processor providing a compatible 96-bit decimal floating-point storage format in 1990.[2]
Some computer languages have implementations of decimal floating-point arithmetic, including PL/I, .NET,[3] Emacs with calc, and Python's decimal module.[4] In 1987, the IEEE released IEEE 854, a standard for computing with decimal floating point, which lacked a specification for how floating-point data should be encoded for interchange with other systems. This was subsequently addressed in IEEE 754-2008, which standardized the encoding of decimal floating-point data, albeit with two different alternative methods.
IBM POWER6 and newer POWER processors include DFP in hardware, as does the IBM System z9[5] (and later zSeries machines). SilMinds offers SilAx, a configurable vector DFP coprocessor.[6] IEEE 754-2008 specifies these formats in detail. Fujitsu also has 64-bit SPARC processors with DFP in hardware.[7][2]
IEEE 754-2008 encoding
The IEEE 754-2008 standard defines 32-, 64- and 128-bit decimal floating-point representations. Like the binary floating-point formats, the number is divided into a sign, an exponent, and a significand. Unlike binary floating-point, numbers are not necessarily normalized; values with few significant digits have multiple possible representations: 1×10^2 = 0.1×10^3 = 0.01×10^4, etc. When the significand is zero, the exponent can be any value at all.
| Format | decimal32 | decimal64 | decimal128 | decimal(32k) |
|---|---|---|---|---|
| Sign field (bits) | 1 | 1 | 1 | 1 |
| Combination field (bits) | 5 | 5 | 5 | 5 |
| Exponent continuation field (bits) | 6 | 8 | 12 | w = 2×k + 4 |
| Coefficient continuation field (bits) | 20 | 50 | 110 | t = 30×k − 10 |
| Total size (bits) | 32 | 64 | 128 | 32×k |
| Coefficient size (decimal digits) | 7 | 16 | 34 | p = 3×t/10 + 1 = 9×k − 2 |
| Exponent range | 192 | 768 | 12288 | 3×2^w = 48×4^k |
| Emax (largest value is 9.99...×10^Emax) | 96 | 384 | 6144 | 3×2^(w−1) |
| Emin (smallest normalized value is 1.00...×10^Emin) | −95 | −383 | −6143 | 1 − Emax |
| Etiny (smallest non-zero value is 1×10^Etiny) | −101 | −398 | −6176 | 2 − p − Emax |
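The parameterized decimal(32k) column can be sanity-checked with a short script that simply restates the table's formulas (the function name here is ours, not a standard API):

```python
def decimal_format_params(k):
    """Derived parameters of the IEEE 754-2008 decimal(32k) interchange formats."""
    w = 2 * k + 4                    # exponent continuation field (bits)
    t = 30 * k - 10                  # coefficient continuation field (bits)
    p = 9 * k - 2                    # coefficient size (decimal digits)
    emax = 3 * 2 ** (w - 1)
    return {"total bits": 32 * k, "w": w, "t": t, "p": p,
            "Emax": emax, "Emin": 1 - emax, "Etiny": 2 - p - emax}

for k in (1, 2, 4):                  # decimal32, decimal64, decimal128
    print(decimal_format_params(k))
```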
The exponent ranges were chosen so that the range available to normalized values is approximately symmetrical. Since this cannot be done exactly with an even number of possible exponent values, the extra value was given to Emax.
Two different representations are defined:
- One with a binary integer significand field encodes the significand as a large binary integer between 0 and 10^p − 1. This is expected to be more convenient for software implementations using a binary ALU.
- Another with a densely packed decimal significand field encodes decimal digits more directly. This makes conversion to and from decimal string form faster, but requires specialized hardware to manipulate efficiently. This is expected to be more convenient for hardware implementations.
Both alternatives provide exactly the same range of representable values.
The most significant two bits of the exponent are limited to the range 0–2, and the most significant 4 bits of the significand are limited to the range 0–9. The 30 possible combinations are encoded in a 5-bit field, along with special forms for infinity and NaN.
If the most significant 4 bits of the significand are between 0 and 7, the encoded value begins as follows:
s 00mmm xxx   Exponent begins with 00, significand with 0mmm
s 01mmm xxx   Exponent begins with 01, significand with 0mmm
s 10mmm xxx   Exponent begins with 10, significand with 0mmm
If the leading 4 bits of the significand are binary 1000 or 1001 (decimal 8 or 9), the number begins as follows:
s 1100m xxx   Exponent begins with 00, significand with 100m
s 1101m xxx   Exponent begins with 01, significand with 100m
s 1110m xxx   Exponent begins with 10, significand with 100m
The leading bit (s in the above) is a sign bit, and the following bits (xxx in the above) encode the additional exponent bits and the remainder of the most significant digit, but the details vary depending on the encoding alternative used.
The final combinations are used for infinities and NaNs, and are the same for both alternative encodings:
s 11110 x   ±Infinity (see Extended real number line)
s 11111 0   quiet NaN (sign bit ignored)
s 11111 1   signaling NaN (sign bit ignored)
In the latter cases, all other bits of the encoding are ignored. Thus, it is possible to initialize an array to NaNs by filling it with a single byte value.
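A small decoding helper makes the combination-field rules concrete. This Python sketch (the function name and return convention are ours) classifies a 5-bit combination field and recovers the two leading exponent bits and the four leading significand bits:

```python
def decode_combination(bits5):
    """Classify a 5-bit IEEE 754-2008 decimal combination field."""
    if bits5 == 0b11110:
        return ("infinity", None, None)
    if bits5 == 0b11111:
        return ("nan", None, None)
    if (bits5 >> 3) != 0b11:                     # forms 00mmm, 01mmm, 10mmm
        return ("finite", bits5 >> 3, bits5 & 0b111)              # digit bits 0mmm
    return ("finite", (bits5 >> 1) & 0b11, 0b1000 | (bits5 & 1))  # forms 11xxm: digit bits 100m

print(decode_combination(0b01101))   # ('finite', 1, 5): exponent begins 01, digit 5
print(decode_combination(0b11010))   # ('finite', 1, 8): exponent begins 01, digit 8
print(decode_combination(0b11110))   # ('infinity', None, None)
```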
Binary integer significand field
This format uses a binary significand from 0 to 10^p − 1. For example, the Decimal32 significand can be up to 10^7 − 1 = 9999999 = 98967F₁₆ = 100110001001011001111111₂. While the encoding can represent larger significands, they are illegal and the standard requires implementations to treat them as 0 if encountered on input.
As described above, the encoding varies depending on whether the most significant 4 bits of the significand are in the range 0 to 7 (0000₂ to 0111₂), or higher (1000₂ or 1001₂).
If the 2 bits after the sign bit are "00", "01", or "10", then the exponent field consists of the 8 bits following the sign bit (the 2 bits mentioned plus 6 bits of "exponent continuation field"), and the significand is the remaining 23 bits, with an implicit leading 0 bit, shown here in parentheses:
s 00eeeeee   (0)ttt tttttttttt tttttttttt
s 01eeeeee   (0)ttt tttttttttt tttttttttt
s 10eeeeee   (0)ttt tttttttttt tttttttttt
This includes subnormal numbers where the leading significand digit is 0.
If the 2 bits after the sign bit are "11", then the 8-bit exponent field is shifted 2 bits to the right (after both the sign bit and the "11" bits thereafter), and the represented significand is in the remaining 21 bits. In this case there is an implicit (that is, not stored) leading 3-bit sequence "100" in the true significand:
s 1100eeeeee   (100)t tttttttttt tttttttttt
s 1101eeeeee   (100)t tttttttttt tttttttttt
s 1110eeeeee   (100)t tttttttttt tttttttttt
The "11" 2-bit sequence after the sign bit indicates that there is an implicit "100" 3-bit prefix to the significand.
Note that the leading bits of the significand field do not encode the most significant decimal digit; they are simply part of a larger pure-binary number. For example, a significand of 8000000 is encoded as binary 011110100001001000000000, with the leading 4 bits encoding 7; the first significand that requires a 24th bit (and thus the second encoding form) is 2^23 = 8388608.
In the above cases, the value represented is:
- (−1)^sign × 10^(exponent−101) × significand
Decimal64 and Decimal128 operate analogously, but with larger exponent continuation and significand fields. For Decimal128, the second encoding form is actually never used; the largest valid significand of 10^34 − 1 = 1ED09BEAD87C0378D8E63FFFFFFFF₁₆ can be represented in 113 bits.
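As a concrete illustration, here is a minimal Python sketch of the decimal32 decode path for this encoding (the function name is ours; infinities, NaNs, and over-large significands are deliberately not handled):

```python
def decode_decimal32_bid(word):
    """Decode a 32-bit binary-integer-significand decimal32 into (sign, exponent, significand)."""
    sign = word >> 31
    if ((word >> 29) & 0b11) != 0b11:               # first form: exponent follows the sign bit
        exponent = ((word >> 23) & 0xFF) - 101      # 8 exponent bits, bias 101
        significand = word & 0x7FFFFF               # 23 stored bits, implicit leading 0
    else:                                           # second form: field shifted right by 2
        exponent = ((word >> 21) & 0xFF) - 101
        significand = (0b100 << 21) | (word & 0x1FFFFF)  # implicit '100' prefix, 21 stored bits
    return sign, exponent, significand

# 0x32000041: sign 0, biased exponent 100 (true exponent −1), significand 65
print(decode_decimal32_bid(0x32000041))   # (0, -1, 65), i.e. 65 × 10^−1 = 6.5
```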
Densely packed decimal significand field
In this version, the significand is stored as a series of decimal digits. The leading digit is between 0 and 9 (3 or 4 binary bits), and the rest of the significand uses the densely packed decimal (DPD) encoding.
The leading 2 bits of the exponent and the leading digit (3 or 4 bits) of the significand are combined into the five bits that follow the sign bit. This is followed by a fixed-offset exponent continuation field.
Finally, the significand continuation field is made of 2, 5, or 11 ten-bit declets, each encoding 3 decimal digits.[8]
If the first two bits after the sign bit are "00", "01", or "10", then those are the leading bits of the exponent, and the three bits after that are interpreted as the leading decimal digit (0 to 7):[9]
Comb.      Exponent    Significand
s 00 TTT   (00)eeeeee  (0TTT)[tttttttttt][tttttttttt]
s 01 TTT   (01)eeeeee  (0TTT)[tttttttttt][tttttttttt]
s 10 TTT   (10)eeeeee  (0TTT)[tttttttttt][tttttttttt]
If the first two bits after the sign bit are "11", then the second two bits are the leading bits of the exponent, and the last bit is prefixed with "100" to form the leading decimal digit (8 or 9):
Comb.      Exponent    Significand
s 1100 T   (00)eeeeee  (100T)[tttttttttt][tttttttttt]
s 1101 T   (01)eeeeee  (100T)[tttttttttt][tttttttttt]
s 1110 T   (10)eeeeee  (100T)[tttttttttt][tttttttttt]
The remaining two combinations (11110 and 11111) of the 5-bit field are used to represent ±infinity and NaNs, respectively.
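The declet packing itself can be written straight from Cowlishaw's published DPD encoding table. In this Python sketch (the function name is ours), digits 0 to 7 keep their three low bits, while indicator bits flag any digit that is 8 or 9:

```python
def dpd_encode(d1, d2, d3):
    """Pack three decimal digits into one 10-bit densely-packed-decimal declet."""
    a, b, c, d = (d1 >> 3) & 1, (d1 >> 2) & 1, (d1 >> 1) & 1, d1 & 1
    e, f, g, h = (d2 >> 3) & 1, (d2 >> 2) & 1, (d2 >> 1) & 1, d2 & 1
    i, j, k, m = (d3 >> 3) & 1, (d3 >> 2) & 1, (d3 >> 1) & 1, d3 & 1
    aei = (a << 2) | (e << 1) | i        # which digits are large (8 or 9)?
    if aei == 0b000:                     # all digits small
        bits = (b, c, d, f, g, h, 0, j, k, m)
    elif aei == 0b001:                   # d3 large
        bits = (b, c, d, f, g, h, 1, 0, 0, m)
    elif aei == 0b010:                   # d2 large
        bits = (b, c, d, j, k, h, 1, 0, 1, m)
    elif aei == 0b011:                   # d2 and d3 large
        bits = (b, c, d, 1, 0, h, 1, 1, 1, m)
    elif aei == 0b100:                   # d1 large
        bits = (j, k, d, f, g, h, 1, 1, 0, m)
    elif aei == 0b101:                   # d1 and d3 large
        bits = (f, g, d, 0, 1, h, 1, 1, 1, m)
    elif aei == 0b110:                   # d1 and d2 large
        bits = (j, k, d, 0, 0, h, 1, 1, 1, m)
    else:                                # all three large
        bits = (0, 0, d, 1, 1, h, 1, 1, 1, m)
    declet = 0
    for bit in bits:
        declet = (declet << 1) | bit
    return declet

print(bin(dpd_encode(0, 0, 5)))   # 0b101       -- small digits pass straight through
print(bin(dpd_encode(9, 9, 9)))   # 0b11111111  -- the all-large case
```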
Floating-point arithmetic operations
The usual rule for performing floating-point arithmetic is that the exact mathematical value is calculated,[10] and the result is then rounded to the nearest representable value in the specified precision. This is in fact the behavior mandated for IEEE-compliant computer hardware, under normal rounding behavior and in the absence of exceptional conditions.
For ease of presentation and understanding, 7-digit precision will be used in the examples. The fundamental principles are the same in any precision.
Addition
A simple method to add floating-point numbers is to first represent them with the same exponent. In the example below, the second number is shifted right by 3 digits. We proceed with the usual addition method:
The following example is decimal, which simply means the base is 10.
123456.7 = 1.234567 × 10^5
101.7654 = 1.017654 × 10^2 = 0.001017654 × 10^5
Hence:
123456.7 + 101.7654 = (1.234567 × 10^5) + (1.017654 × 10^2)
= (1.234567 × 10^5) + (0.001017654 × 10^5)
= 10^5 × (1.234567 + 0.001017654)
= 10^5 × 1.235584654
This is nothing other than converting to scientific notation. In detail:
  e=5;  s=1.234567     (123456.7)
+ e=2;  s=1.017654     (101.7654)

  e=5;  s=1.234567
+ e=5;  s=0.001017654  (after shifting)
--------------------
  e=5;  s=1.235584654  (true sum: 123558.4654)
This is the true result, the exact sum of the operands. It will be rounded to 7 digits and then normalized if necessary. The final result is:
e=5; s=1.235585 (final sum: 123558.5)
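The same worked example can be reproduced with Python's decimal module by limiting the context to 7 significant digits (a sketch of the arithmetic above, not of any particular storage format):

```python
from decimal import Decimal, getcontext

getcontext().prec = 7        # 7 significant digits, as in these examples
a = Decimal("123456.7")
b = Decimal("101.7654")
print(a + b)                 # 123558.5 -- the exact sum 123558.4654 rounded to 7 digits
```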
Note that the low 3 digits of the second operand (654) are essentially lost. This is round-off error. In extreme cases, the sum of two non-zero numbers may be equal to one of them:
  e=5;  s=1.234567
+ e=−3; s=9.876543

  e=5;  s=1.234567
+ e=5;  s=0.00000009876543  (after shifting)
----------------------
  e=5;  s=1.23456709876543  (true sum)
  e=5;  s=1.234567          (after rounding/normalization)
Another problem of loss of significance occurs when approximations to two nearly equal numbers are subtracted. In the following example e = 5; s = 1.234571 and e = 5; s = 1.234567 are approximations to the rationals 123457.1467 and 123456.659.
  e=5;  s=1.234571
− e=5;  s=1.234567
----------------
  e=5;  s=0.000004
  e=−1; s=4.000000  (after rounding and normalization)
The floating-point difference is computed exactly because the numbers are close; the Sterbenz lemma guarantees this, even in the case of underflow when gradual underflow is supported. Despite this, the difference of the original numbers is e = −1; s = 4.877000, which differs by more than 20% from the difference e = −1; s = 4.000000 of the approximations. In extreme cases, all significant digits of precision can be lost.[11][12] This cancellation illustrates the danger in assuming that all of the digits of a computed result are meaningful. Dealing with the consequences of these errors is a topic in numerical analysis; see also Accuracy problems.
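The cancellation is equally easy to reproduce in the same 7-digit context (variable names are ours):

```python
from decimal import Decimal, getcontext

getcontext().prec = 7
x, y = Decimal("123457.1467"), Decimal("123456.659")   # the original numbers
a, b = +x, +y               # unary + rounds each to 7 digits: 123457.1 and 123456.7
print(a - b)                # 0.4    -- only one meaningful digit survives
print(x - y)                # 0.4877 -- the exact difference fits in 7 digits
```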
Multiplication
To multiply, the significands are multiplied, while the exponents are added, and the result is rounded and normalized.
  e=3;  s=4.734612
× e=5;  s=5.417242
-----------------------
  e=8;  s=25.648538980104  (true product)
  e=8;  s=25.64854         (after rounding)
  e=9;  s=2.564854         (after normalization)
Division is done similarly, but is more complicated.
There are no cancellation or absorption problems with multiplication or division, though small errors may accumulate as operations are performed repeatedly. In practice, the way these operations are carried out in digital logic can be quite complex.
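The worked product above, reproduced in a 7-digit decimal context:

```python
from decimal import Decimal, getcontext

getcontext().prec = 7
print(Decimal("4734.612") * Decimal("541724.2"))   # 2.564854E+9, rounded and normalized
```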
See also
- Binary-coded decimal (BCD)
References
- ^ Beebe, Nelson H. F. (2017-08-22). "Chapter H. Historical floating-point architectures". The Mathematical-Function Computation Handbook - Programming Using the MathCW Portable Software Library (1 ed.). Salt Lake City, UT, USA: Springer International Publishing AG. p. 948. doi:10.1007/978-3-319-64110-2. ISBN 978-3-319-64109-6. LCCN 2017947446. S2CID 30244721.
- ^ a b c Savard, John J. G. (2018) [2007]. "The Decimal Floating-Point Standard". quadibloc. Archived from the original on 2018-07-03. Retrieved 2018-07-16.
- ^ ".NET API Documentation for System.Decimal". learn.microsoft.com. Retrieved 2024-07-07.
- ^ "Python Documentation for decimal". docs.python.org. Retrieved 2024-07-07.
- ^ "IBM z9 EC and z9 BC — Delivering greater value for everyone" (PDF). 306.ibm.com. Retrieved 2018-07-07.
- ^ "Arithmetic IPs for Financial Applications - SilMinds". Silminds.com.
- ^ "Chapter 4. Data Formats". Sparc64 X/X+ Specification. Nakahara-ku, Kawasaki, Japan. January 2015. p. 13.
- ^ Muller, Jean-Michel; Brisebarre, Nicolas; de Dinechin, Florent; Jeannerod, Claude-Pierre; Lefèvre, Vincent; Melquiond, Guillaume; Revol, Nathalie; Stehlé, Damien; Torres, Serge (2010). Handbook of Floating-Point Arithmetic (1 ed.). Birkhäuser. doi:10.1007/978-0-8176-4705-6. ISBN 978-0-8176-4704-9. LCCN 2009939668.
- ^ Decimal Encoding Specification, version 1.00, from IBM
- ^ Computer hardware doesn't necessarily compute the exact value; it simply has to produce the equivalent rounded result as though it had computed the infinitely precise result.
- ^ Goldberg, David (March 1991). "What Every Computer Scientist Should Know About Floating-Point Arithmetic" (PDF). ACM Computing Surveys. 23 (1): 5–48. doi:10.1145/103162.103163. S2CID 222008826. Retrieved 2016-01-20. ([1], [2], [3])
- ^ US patent 3037701A, Huberto M Sierra, "Floating decimal point arithmetic control means for calculator", issued 1962-06-05
Further reading
- Decimal Floating-Point: Algorism for Computers, Proceedings of the 16th IEEE Symposium on Computer Arithmetic (Cowlishaw, Mike F., 2003)
Fundamentals
Definition and Purpose
Decimal floating point is a computer arithmetic format that represents real numbers using a significand encoded in base-10 digits, multiplied by a power of 10, enabling the exact storage of decimal fractions such as 0.1 without the rounding errors inherent in binary representations. This format consists of three primary components: a sign bit to indicate positive or negative values, a biased exponent to specify the power of 10, and a significand comprising a fixed number of decimal digits that form the significant portion of the number. Unlike binary floating point, which approximates many decimal values due to the limitations of base-2 encoding, decimal floating point preserves the exact decimal nature of inputs, making it suitable for applications where fidelity to base-10 data is essential.[5]

The primary purpose of decimal floating point is to address precision issues in computations involving decimal-based data, particularly in financial and commercial systems where even minor rounding discrepancies can lead to significant errors, such as in currency calculations like USD amounts (e.g., ensuring 0.10 + 0.20 equals exactly 0.30).[6] It provides greater accuracy for input/output operations that are inherently decimal, reducing the need for post-processing adjustments and supporting compliant arithmetic in standards like IEEE 754-2008. This format is especially valuable in business computing, where it facilitates reliable handling of monetary values and other decimal-centric quantities without introducing unintended approximations.[7]

Historically, decimal floating point emerged in the 1950s alongside early computers designed for commercial applications, with implementations in systems like the IBM 650 (introduced in 1954), which included a decimal floating-point unit to support business-oriented calculations. Although it largely faded by the mid-1960s in favor of binary formats, interest revived in the late 20th century due to demands for accuracy in financial software, leading to its formalization in the IEEE 754-2008 standard under the influence of researchers like Mike Cowlishaw.[8] This modern specification, building on earlier proposals from the 1980s, integrated decimal floating point into mainstream computing to meet the needs of high-precision decimal arithmetic in enterprise environments.[5]

Advantages Over Binary Floating Point
Decimal floating point provides exact representations for common decimal fractions such as 0.1 (which is 1/10) and 0.2 (which is 1/5), whereas binary floating point approximates these as infinite recurring series in base 2, leading to representation errors. For instance, in IEEE 754 binary floating point, 0.1 is stored as approximately 0.1000000000000000055511151231257827021181583404541015625, and adding 0.1 and 0.2 yields 0.30000000000000004 instead of exactly 0.3.[9] In contrast, decimal floating point encodes these values precisely using a base-10 significand, ensuring that decimal inputs produce exact decimal outputs without such approximation artifacts.[2]

This exactness is particularly advantageous in domains requiring precise decimal arithmetic, such as financial computations where even minor rounding discrepancies can accumulate and lead to significant errors. A classic example is calculating 5% sales tax on $0.70: binary floating point might yield approximately 0.73499999999999999 (rounding to $0.73), while decimal floating point computes exactly 0.735 (rounding to $0.74), aligning with legal and manual calculation expectations.[10] Similarly, operations like 0.70 × 1.05 highlight how binary approximations distort results, whereas decimal formats preserve fidelity for base-10 inputs.[2] These properties make decimal floating point essential for applications like currency conversions, billing systems, and commercial databases, where surveys indicate up to 55% of columns involve decimal data and financial workloads can spend 90% of processing time on decimal operations.[2]

Decimal floating point is widely adopted in legacy systems like COBOL, which natively supports decimal fixed-point arithmetic for commercial processing to ensure accuracy in monetary calculations.[10] It also underpins standards such as the XML Schema decimal type, defined to represent numbers with an exact fractional part for precise data interchange in web services and documents. However, decimal formats incur performance trade-offs compared to binary floating point, as base-10 operations lack native hardware acceleration on most processors optimized for base-2 arithmetic, resulting in software implementations that can be 100–1000 times slower in isolation.[10] In practice, for financial applications, optimized decimal arithmetic achieves acceptable overhead, often under 5% of total runtime, prioritizing "decimal exactness" over raw speed.[10]
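Both contrasts are easy to demonstrate with Python's binary float against its decimal module (the output formatting is chosen for illustration):

```python
from decimal import Decimal

print(0.1 + 0.2)                          # 0.30000000000000004 (binary64)
print(Decimal("0.1") + Decimal("0.2"))    # 0.3 (exact)

price, rate = 0.70, 1.05                  # the sales-tax example
print(f"{price * rate:.17f}")             # 0.73499999999999999 (binary64)
print(Decimal("0.70") * Decimal("1.05"))  # 0.7350 (exact)
```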
Representation Formats

General Structure
Decimal floating-point numbers are represented using a sign, an exponent, and a significand, which together encode a value in base 10. The sign is a single bit indicating whether the number is positive (0) or negative (1). The exponent is a biased integer that determines the scale of the number, while the significand is a sequence of decimal digits representing the significant portion of the value.[11]

The significand is normalized such that its leading digit is non-zero for normal numbers, ensuring a unique representation and efficient use of precision; the length of the significand is fixed for a given precision level. For example, the IEEE 754 standard specifies significands of 7, 16, or 34 digits for its decimal formats. Subnormal numbers, which have a leading zero in the significand, are used to represent values smaller than the smallest normal number, enabling gradual underflow and a smoother transition toward zero without abrupt loss of precision.[11]

The exponent bias is chosen to allow both positive and negative scaling while using an unsigned integer representation for the encoded exponent; typical values include 101 for 32-bit formats, 398 for 64-bit formats, and 6176 for 128-bit formats, as defined in the IEEE 754 standard. The numerical value of a decimal floating-point number is given by the formula

value = (−1)^s × c × 10^(q − bias)

where s is the sign bit, c is the significand (interpreted as an integer of at most p digits, with p the precision in digits), q is the encoded exponent, and bias is the value listed above. This structure supports exact representation of many decimal fractions that are problematic in binary floating-point systems.[11]
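A quick check of the value formula with Python's decimal module, using made-up decimal64 fields (sign s, biased exponent q, integer significand c) in place of an actual decoded encoding:

```python
from decimal import Decimal

s, q, c = 0, 395, 2345678901234567               # illustrative field values
bias = 398                                       # decimal64
digits = tuple(int(ch) for ch in str(c))
value = Decimal((s, digits, q - bias))           # (−1)^s × c × 10^(q − bias)
print(value)                                     # 2345678901234.567
```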
IEEE 754 Decimal Formats

The IEEE 754-2008 standard defines three interchange formats for decimal floating-point numbers: decimal32, decimal64, and decimal128. These formats are designed to provide exact representation of decimal fractions commonly used in financial and commercial applications, with precisions of 7, 16, and 34 decimal digits, respectively. They correspond roughly to the precision levels of binary16 (half), binary32 (single), and binary64 (double) formats in the same standard, but operate in base 10 to avoid rounding errors in decimal-to-binary conversions.[12][1]

Each format consists of a sign bit, a 5-bit combination field (shared between leading significand and exponent encoding), exponent continuation bits, and significand continuation bits encoding the coefficient, consisting of exactly the specified number of decimal digits (interpretation varies by encoding scheme). For decimal32, the allocation is 1 sign bit, 5 combination bits, 6 exponent continuation bits (biased by 101, ranging from −95 to 96), and 20 significand continuation bits, totaling 32 bits. The decimal64 format scales this to 1 sign bit, 5 combination bits, 8 exponent continuation bits (biased by 398, ranging from −383 to 384), and 50 significand continuation bits, totaling 64 bits. Similarly, decimal128 uses 1 sign bit, 5 combination bits, 12 exponent continuation bits (biased by 6176, ranging from −6143 to 6144), and 110 significand continuation bits, totaling 128 bits. These allocations ensure sufficient bit width to encode the required decimal precision without loss.[12][1]

The formats support special values including signed zero (±0), signed infinity (±∞), and not-a-number (NaN) variants. Infinities are represented using a dedicated combination field value with zero continuation fields, preserving the sign bit to distinguish positive and negative infinity. NaNs include both quiet NaNs (qNaN), which propagate without signaling exceptions, and signaling NaNs (sNaN), which trigger exceptions when used in operations; both types include a payload in the significand field for diagnostic information or implementation-specific data, allowing up to nearly the full significand width for this purpose. Subnormal numbers are also supported to fill gaps in the representable range near zero, using a reduced exponent and explicit leading zeros in the significand.[12][1]

These decimal formats are backward-compatible with earlier standards, such as IEEE 854-1987 and the draft IEEE 754r, by retaining core concepts like biased exponents and special value encodings while extending precision and range for modern computing needs.[12][1]

| Format | Total Bits | Precision (Decimal Digits) | Sign Bits | Combination Bits | Exponent Cont. Bits (Bias) | Significand Cont. Bits | Exponent Range |
|---|---|---|---|---|---|---|---|
| Decimal32 | 32 | 7 | 1 | 5 | 6 (101) | 20 | -95 to 96 |
| Decimal64 | 64 | 16 | 1 | 5 | 8 (398) | 50 | -383 to 384 |
| Decimal128 | 128 | 34 | 1 | 5 | 12 (6176) | 110 | -6143 to 6144 |