Fixed-point arithmetic
from Wikipedia

In computing, fixed-point is a method of representing fractional (non-integer) numbers by storing a fixed number of digits of their fractional part. Dollar amounts, for example, are often stored with exactly two fractional digits, representing the cents (1/100 of dollar). More generally, the term may refer to representing fractional values as integer multiples of some fixed small unit, e.g. a fractional amount of hours as an integer multiple of ten-minute intervals. Fixed-point number representation is often contrasted to the more complicated and computationally demanding floating-point representation.

In the fixed-point representation, the fraction is often expressed in the same number base as the integer part, but using negative powers of the base b. The most common variants are decimal (base 10) and binary (base 2). The latter is commonly known also as binary scaling. Thus, if n fraction digits are stored, the value will always be an integer multiple of b^−n. Fixed-point representation can also be used to omit the low-order digits of integer values, e.g. when representing large dollar values as multiples of $1000.

When decimal fixed-point numbers are displayed for human reading, the fraction digits are usually separated from those of the integer part by a radix character (usually "." in English, but "," or some other symbol in many other languages). Internally, however, there is no separation, and the distinction between the two groups of digits is defined only by the programs that handle such numbers.

Fixed-point representation was the norm in mechanical calculators. Since most modern processors have a fast floating-point unit (FPU), fixed-point representations in processor-based implementations are now used only in special situations, such as in low-cost embedded microprocessors and microcontrollers; in applications that demand high speed or low power consumption or small chip area, like image, video, and digital signal processing; or when their use is more natural for the problem. Examples of the latter are accounting of dollar amounts, when fractions of cents must be rounded to whole cents in strictly prescribed ways; and the evaluation of functions by table lookup, or any application where rational numbers need to be represented without rounding errors (which fixed-point does but floating-point cannot). Fixed-point representation is still the norm for field-programmable gate array (FPGA) implementations, as floating-point support in an FPGA requires significantly more resources than fixed-point support.[1]

Representation

Fixed-point representation with scaling 1/100

  Value represented    Internal representation
      0.00                       0
      0.5                       50
      0.99                      99
      2                        200
    −14.1                    −1410
    314.160                  31416

A fixed-point representation of a fractional number is essentially an integer that is to be implicitly multiplied by a fixed scaling factor. For example, the value 1.23 can be stored in a variable as the integer value 1230 with implicit scaling factor of 1/1000 (meaning that the last 3 decimal digits are implicitly assumed to be a decimal fraction), and the value 1 230 000 can be represented as 1230 with an implicit scaling factor of 1000 (with "minus 3" implied decimal fraction digits, that is, with 3 implicit zero digits at right). This representation allows standard integer arithmetic logic units to perform rational number calculations.
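As a minimal illustration in C (the scale constant and helper names below are arbitrary choices, not part of any standard library), a value can be carried as an ordinary integer together with an implicit scaling factor:

    #include <stdint.h>
    #include <stdio.h>

    /* Values are stored as integers with an implicit scaling factor of 1/1000. */
    #define SCALE 1000

    typedef int32_t fixed_t;              /* value represented = stored_integer / SCALE */

    static fixed_t from_double(double x) {
        return (fixed_t)(x * SCALE + (x >= 0 ? 0.5 : -0.5));   /* round to nearest */
    }

    static double to_double(fixed_t f) {
        return (double)f / SCALE;
    }

    int main(void) {
        fixed_t a = from_double(1.23);    /* stored as the integer 1230 */
        printf("stored %d, value %.3f\n", (int)a, to_double(a));
        return 0;
    }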

Negative values are usually represented in binary fixed-point format as a signed integer in two's complement representation with an implicit scaling factor as above. The sign of the value will always be indicated by the first stored bit (1 = negative, 0 = non-negative), even if the number of fraction bits is greater than or equal to the total number of bits. For example, the 8-bit signed binary integer (11110101)₂ = −11, taken with −3, +5, and +12 implied fraction bits, would represent the values −11/2^−3 = −88, −11/2^5 = −0.343 75, and −11/2^12 = −0.002 685 546 875, respectively.

Alternatively, negative values can be represented by an integer in the sign-magnitude format, in which case the sign is never included in the number of implied fraction bits. This variant is more commonly used in decimal fixed-point arithmetic. Thus the signed 5-digit decimal integer (−00025)₁₀, taken with −3, +5, and +12 implied decimal fraction digits, would represent the values −25/10^−3 = −25000, −25/10^5 = −0.00025, and −25/10^12 = −0.000 000 000 025, respectively.

A program will usually assume that all fixed-point values that will be stored into a given variable, or will be produced by a given instruction, will have the same scaling factor. This parameter can usually be chosen by the programmer depending on the precision needed and range of values to be stored.

The scaling factor of a variable or formula may not appear explicitly in the program. Good programming practice then requires that it be provided in the documentation, at least as a comment in the source code.

Choice of scaling factors


For greater efficiency, scaling factors are often chosen to be powers (positive or negative) of the base b used to represent the integers internally. However, often the best scaling factor is dictated by the application. Thus one often uses scaling factors that are powers of 10 (e.g. 1/100 for dollar values), for human convenience, even when the integers are represented internally in binary. Decimal scaling factors also mesh well with the metric (SI) system, since the choice of the fixed-point scaling factor is often equivalent to the choice of a unit of measure (like centimeters or microns instead of meters).

However, other scaling factors may be used occasionally, e.g. a fractional amount of hours may be represented as an integer number of seconds; that is, as a fixed-point number with scale factor of 1/3600.

Even with the most careful rounding, fixed-point values represented with a scaling factor S may have an error of up to ±0.5 in the stored integer, that is, ±0.5 S in the value. Therefore, smaller scaling factors generally produce more accurate results.

On the other hand, a smaller scaling factor means a smaller range of the values that can be stored in a given program variable. The maximum fixed-point value that can be stored into a variable is the largest integer value that can be stored into it, multiplied by the scaling factor; and similarly for the minimum value. For example, the table below gives the implied scaling factor S, the minimum and maximum representable values Vmin and Vmax, and the accuracy δ = S/2 of values that could be represented in 16-bit signed binary fixed point format, depending on the number f of implied fraction bits.

Parameters of some 16-bit signed binary fixed-point formats

   f   S                       δ              Vmin                            Vmax
  −3   1/2^−3 = 8              4              −262 144                        +262 136
   0   1/2^0 = 1               0.5            −32 768                         +32 767
   5   1/2^5 = 1/32            < 0.016        −1024.000 00                    +1023.968 75
  14   1/2^14 = 1/16 384       < 0.000 031    −2.000 000 000 000 00           +1.999 938 964 843 75
  15   1/2^15 = 1/32 768       < 0.000 016    −1.000 000 000 000 000          +0.999 969 482 421 875
  16   1/2^16 = 1/65 536       < 0.000 008    −0.500 000 000 000 000 0        +0.499 984 741 210 937 5
  20   1/2^20 = 1/1 048 576    < 0.000 000 5  −0.031 250 000 000 000 000 00   +0.031 249 046 325 683 593 75

Fixed-point formats with scaling factors of the form 2^n − 1 (namely 1, 3, 7, 15, 31, etc.) have been said to be appropriate for image processing and other digital signal processing tasks. They are supposed to provide more consistent conversions between fixed- and floating-point values than the usual 2^n scaling. The Julia programming language implements both versions.[2]

Exact values


Any binary fraction a/2^m, such as 1/16 or 17/32, can be exactly represented in fixed-point, with a power-of-two scaling factor 1/2^n for any n ≥ m. However, most decimal fractions like 0.1 or 0.123 are infinite repeating fractions in base 2, and hence cannot be represented that way.

Similarly, any decimal fraction a/10^m, such as 1/100 or 37/1000, can be exactly represented in fixed point with a power-of-ten scaling factor 1/10^n for any n ≥ m. This decimal format can also represent any binary fraction a/2^m, such as 1/8 (0.125) or 17/32 (0.53125).

More generally, a rational number a/b, with a and b relatively prime and b positive, can be exactly represented in binary fixed point only if b is a power of 2; and in decimal fixed point only if b has no prime factors other than 2 and/or 5.

Comparison with floating-point


Fixed-point computations can be faster and/or use less hardware than floating-point ones. If the range of the values to be represented is known in advance and is sufficiently limited, fixed point can make better use of the available bits. For example, if 32 bits are available to represent a number between 0 and 1, a fixed-point representation can have error less than 1.2 × 10^−10, whereas the standard floating-point representation may have error up to 596 × 10^−10, because 9 of the bits are wasted on the sign and exponent of the dynamic scaling factor. Specifically, comparing 32-bit fixed-point to floating-point audio, a recording requiring less than 40 dB of headroom has a higher signal-to-noise ratio using 32-bit fixed point.

Programs using fixed-point computations are usually more portable than those using floating-point, since they do not depend on the availability of an FPU. This advantage was particularly strong before the IEEE Floating Point Standard was widely adopted, when floating-point computations with the same data would yield different results depending on the manufacturer, and often on the computer model.

Many embedded processors lack an FPU, because integer arithmetic units require substantially fewer logic gates and consume much smaller chip area than an FPU; and software emulation of floating-point on low-speed devices would be too slow for most applications. CPU chips for the earlier personal computers and game consoles, like the Intel 386 and 486SX, also lacked an FPU.

The absolute resolution (difference between successive values) of any fixed-point format is constant over the whole range, namely the scaling factor S. In contrast, the relative resolution of a floating-point format is approximately constant over its whole range, varying within a factor of the base b; whereas its absolute resolution varies by many orders of magnitude, like the values themselves.

In many cases, the rounding and truncation errors of fixed-point computations are easier to analyze than those of the equivalent floating-point computations. Applying linearization techniques to truncation, such as dithering and/or noise shaping, is more straightforward within fixed-point arithmetic. On the other hand, the use of fixed point requires greater care by the programmer. Avoidance of overflow requires much tighter estimates for the ranges of variables and all intermediate values in the computation, and often also extra code to adjust their scaling factors. Fixed-point programming normally requires the use of integer types of different widths. Fixed-point applications can make use of block floating point, which is a fixed-point environment in which each array (block) of fixed-point data is scaled with a common exponent in a single word.

Applications


A common use of decimal fixed-point is for storing monetary values, for which the complicated rounding rules of floating-point numbers are often a liability. For example, the open-source money management application GnuCash, written in C, switched from floating-point to fixed-point as of version 1.6, for this reason.[3]

Binary fixed-point (binary scaling) was widely used from the late 1960s to the 1980s for real-time computing that was mathematically intensive, such as flight simulation and in nuclear power plant control algorithms. It is still used in many DSP applications and custom-made microprocessors. Computations involving angles would use binary angular measurement.

Binary fixed point is used in the STM32G4 series CORDIC co-processors and in the discrete cosine transform algorithms used to compress JPEG images.

Electronic instruments such as electricity meters and digital clocks often use polynomials to compensate for introduced errors, e.g. from temperature or power supply voltage. The coefficients are produced by polynomial regression. Binary fixed-point polynomials can utilize more bits of precision than floating-point and do so in fast code using inexpensive CPUs. Accuracy, crucial for instruments, compares well to equivalent-bit floating-point calculations, if the fixed-point polynomials are evaluated using Horner's method (e.g. y = ((ax + b)x + c)x + d) to reduce the number of times that rounding occurs, and the fixed-point multiplications utilize rounding addends.
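A hedged sketch of this idea in C: the Q16.16 format, the rounding addend, and the helper names below are illustrative assumptions, not taken from the text.

    #include <stdint.h>

    #define FRAC_BITS 16                           /* assume Q16.16 operands */
    #define HALF (INT64_C(1) << (FRAC_BITS - 1))   /* rounding addend */

    /* Multiply two Q16.16 values, rounding the discarded fraction bits to nearest
       (arithmetic right shift of negative values is assumed). */
    static int32_t qmul(int32_t x, int32_t y) {
        return (int32_t)(((int64_t)x * y + HALF) >> FRAC_BITS);
    }

    /* Evaluate y = ((a*x + b)*x + c)*x + d with Horner's method, so that
       rounding happens only once per multiplication. */
    static int32_t horner3(int32_t a, int32_t b, int32_t c, int32_t d, int32_t x) {
        int32_t y = qmul(a, x) + b;
        y = qmul(y, x) + c;
        return qmul(y, x) + d;
    }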

Operations


Addition and subtraction


To add or subtract two values with the same implicit scaling factor, it is sufficient to add or subtract the underlying integers; the result will have their common implicit scaling factor and can thus be stored in the same program variables as the operands. These operations yield the exact mathematical result, as long as no overflow occurs, that is, as long as the resulting integer can be stored in the receiving program variable. If overflow happens, it behaves like ordinary integer overflow of the same signedness: in the unsigned and two's-complement signed cases the result simply wraps around modulo 2^n, where n is the word size.

If the operands have different scaling factors, then they must be converted to a common scaling factor before the operation.
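A tiny C sketch of this conversion step, using the decimal scaling factors 1/100 and 1/1000 purely for illustration:

    #include <stdint.h>

    /* a is an integer scaled by 1/100, b is an integer scaled by 1/1000.
       Before adding, rescale a to the finer 1/1000 scale (exact, since 1000/100 = 10). */
    static int32_t add_mixed_scale(int32_t a_scale100, int32_t b_scale1000) {
        int32_t a_scale1000 = a_scale100 * 10;
        return a_scale1000 + b_scale1000;       /* result is scaled by 1/1000 */
    }
    /* Example: add_mixed_scale(123, 4500) == 5730, i.e. 1.23 + 4.500 = 5.730 */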

Multiplication


To multiply two fixed-point numbers, it suffices to multiply the two underlying integers, and assume that the scaling factor of the result is the product of their scaling factors.

(p/q) * (r/s) = pr/qs

The result will be exact, with no rounding, provided that it does not overflow the receiving variable. (Specifically, with integer multiplication, the product is up to twice the width of the two factors.)

For example, multiplying the numbers 123 scaled by 1/1000 (0.123) and 25 scaled by 1/10 (2.5) yields the integer 123×25 = 3075 scaled by (1/1000)×(1/10) = 1/10000, that is 3075/10000 = 0.3075. As another example, multiplying the first number by 155 implicitly scaled by 1/32 (155/32 = 4.84375) yields the integer 123×155 = 19065 with implicit scaling factor (1/1000)×(1/32) = 1/32000, that is 19065/32000 = 0.59578125.

In binary, it is common to use a scaling factor that is a power of two. After the multiplication, the scaling factor can be divided away by shifting right. Shifting is simple and fast in most computers.

When a plain right shift is used, the result is equivalent to a flooring division, floor(x/y). A typical integer-division instruction (such as C integer division or x86 idiv) instead truncates toward zero, which coincides with flooring for non-negative operands. A method with rounding can be used to reduce the error introduced. Three variations are possible based on the choice of tie-breaking:

  • Round-half-up is possible by adding a 'rounding addend' of half of the scaling factor before shifting. The proof: round-half-up(x/y) = floor(x/y + 0.5) = floor((x + y/2)/y). If y = 2^n, this is equivalent to (x + 2^(n−1)) >> n (where >> represents right shift).
  • Round-half-down is, by analogy, ceil((x − y/2)/y), which for y = 2^n is equivalent to (x + 2^(n−1) − 1) >> n.
  • Round-half-to-even basically entails making an extra decision on top of round-half-up. It is slightly more complicated but still requires no branching on a CPU.[4]

These rounding methods are usable in any rescaling performed through integer division; for example, they apply equally to the scaling conversions discussed below, as illustrated in the sketch that follows.
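As an illustration only, the following C sketch implements the three tie-breaking rules for the post-multiplication shift (the function names are arbitrary, and arithmetic right shifts of negative values are assumed to behave as on common two's-complement targets, which C leaves implementation-defined):

    #include <stdint.h>

    /* Drop n fraction bits from x (divide by 2^n) with different tie-breaking rules. */

    static int64_t shr_floor(int64_t x, unsigned n) {
        return x >> n;                                    /* plain flooring shift */
    }

    static int64_t shr_half_up(int64_t x, unsigned n) {
        return (x + (INT64_C(1) << (n - 1))) >> n;        /* ties round up */
    }

    static int64_t shr_half_down(int64_t x, unsigned n) {
        return (x + (INT64_C(1) << (n - 1)) - 1) >> n;    /* ties round down */
    }

    static int64_t shr_half_even(int64_t x, unsigned n) {
        /* Bias by (half - 1) plus the parity of the truncated quotient:
           exact halves then land on the even neighbour, with no branches. */
        int64_t bias = (INT64_C(1) << (n - 1)) - 1 + ((x >> n) & 1);
        return (x + bias) >> n;
    }

For example, shr_half_up(19065, 5) rescales the 19065/32000 product above to the scaling factor 1/1000, yielding 596, i.e. 0.596.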

Division


The division of fixed-point numbers can be understood as the division of two fractions with potentially different denominators (scaling factors). With operands p/q and r/s (where p, q, r, and s are all integers), the naive approach is to divide the underlying integers and let the quotient of the scaling factors become the new scaling factor (s/q):

(p/q) / (r/s) = (p ÷ r) × (s/q)

For example, division of 3456 scaled by 1/100 (34.56) and 1234 scaled by 1/1000 (1.234) yields the integer 3456÷1234 = 3 (rounded) with scale factor (1/100)/(1/1000) = 10, that is, 30. As another example, the division of the first number by 155 implicitly scaled by 1/32 (155/32 = 4.84375) yields the integer 3456÷155 = 22 (rounded) with implicit scaling factor (1/100)/(1/32) = 32/100 = 8/25, that is 22×32/100 = 7.04.

With very similar s and q, the above algorithm results in an overly coarse scaling factor. This can be improved by first converting the dividend to a finer scaling factor. If the dividend's scaling factor is reduced by a factor of n, we instead calculate:

(p/q) / (r/s) = (np/nq) / (r/s) = (np ÷ r) × (s/(nq))

For example, if a = 1.23 is represented as 123 with scaling 1/100, and b = 6.25 is represented as 6250 with scaling 1/1000, then simple division of the integers yields 123÷6250 = 0 (rounded) with scaling factor (1/100)/(1/1000) = 10. If a is first converted to 1,230,000 with scaling factor 1/1000000, the result will be 1,230,000÷6250 = 197 (rounded) with scale factor 1/1000 (0.197). The exact value 1.23/6.25 is 0.1968.

A different way to think about the scaling is to consider division the inverse operation of multiplication. If multiplication leads to a finer scaling factor, it is reasonable that the dividend needs to have a finer scaling factor as well to recover the original value given.
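A minimal C sketch of the 1.23/6.25 example above, assuming the hard-coded decimal scales shown in the comments (the variable names are illustrative):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        int32_t a = 123;      /* 1.23, scaling factor 1/100  */
        int32_t b = 6250;     /* 6.25, scaling factor 1/1000 */

        /* Naive division: quotient 0 with scaling factor (1/100)/(1/1000) = 10. */
        int32_t coarse = a / b;

        /* Pre-scale the dividend to 1/1000000 so the quotient carries
           scaling factor (1/1000000)/(1/1000) = 1/1000. */
        int64_t a_fine = (int64_t)a * 10000;               /* 1,230,000          */
        int32_t fine   = (int32_t)((a_fine + b / 2) / b);  /* rounded: 197       */

        printf("coarse=%d (scale 10), fine=%d (scale 1/1000)\n", coarse, fine);
        return 0;
    }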

Scaling conversion


In fixed-point computing it is often necessary to convert a value to a different scaling factor. This operation is necessary, for example:

  • To store a value into a program variable that has a different implicit scaling factor;
  • To convert two values to the same scaling factor, so that they can be added or subtracted;
  • To restore the original scaling factor of a value after multiplying or dividing it by another;
  • To improve the accuracy of the result of a division;
  • To ensure that the scaling factor of a product or quotient is a simple power like 10n or 2n;
  • To ensure that the result of an operation can be stored into a program variable without overflow;
  • To reduce the cost of hardware that processes fixed-point data.

To convert a number from a fixed point type with scaling factor R to another type with scaling factor S, the underlying integer must be multiplied by the ratio R/S. Thus, for example, to convert the value 1.23 = 123/100 from scaling factor R=1/100 to one with scaling factor S=1/1000, the integer 123 must be multiplied by (1/100)/(1/1000) = 10, yielding the representation 1230/1000.

If the scaling factor is a power of the base used internally to represent the integer, changing the scaling factor requires only dropping low-order digits of the integer, or appending zero digits. However, this operation must preserve the sign of the number. In two's complement representation, that means extending the sign bit as in arithmetic shift operations.

If S does not divide R (in particular, if the new scaling factor S is greater than the original R), the new integer may have to be rounded.

In particular, if r and s are fixed-point variables with implicit scaling factors R and S, the operation r ← r×s requires multiplying the respective integers and then explicitly rescaling the product by S, that is, dividing the integer product by 1/S (for a fractional S such as 1/100, an integer division by 100). The result may have to be rounded, and overflow may occur.

For example, if the common scaling factor is 1/100, multiplying 1.23 by 0.25 entails multiplying 123 by 25 to yield 3075 with an intermediate scaling factor of 1/10000. In order to return to the original scaling factor 1/100, the integer 3075 then must be multiplied by 1/100, that is, divided by 100, to yield either 31 (0.31) or 30 (0.30), depending on the rounding policy used.

Similarly, the operation r ← r/s will require dividing the integers and then explicitly rescaling the quotient by 1/S (for S = 1/100, multiplying the integer quotient by 100). Rounding and/or overflow may occur here too.
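A small C sketch of the r ← r×s case just described, with both variables held at the scaling factor 1/100 of the example above (the function name is illustrative):

    #include <stdint.h>

    /* Both operands and the result use scaling factor 1/100. */
    static int32_t mul_scale100(int32_t r, int32_t s) {
        int64_t product = (int64_t)r * s;          /* intermediate scale 1/10000          */
        return (int32_t)((product + 50) / 100);    /* back to 1/100, rounding half up
                                                      (for non-negative operands)         */
    }
    /* Example: mul_scale100(123, 25) == 31, i.e. 1.23 * 0.25 rounds to 0.31 */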

Conversion to and from floating-point


To convert a number from floating point to fixed point, one may multiply it by the scaling factor S, then round the result to the nearest integer. Care must be taken to ensure that the result fits in the destination variable or register. Depending on the scaling factor and storage size, and on the range of the input numbers, the conversion may not entail any rounding.

To convert a fixed-point number to floating-point, one may convert the integer to floating-point and then divide it by the scaling factor S. This conversion may entail rounding if the integer's absolute value is greater than 2^24 (for binary single-precision IEEE floating point) or 2^53 (for double precision). Overflow or underflow may occur if |S| is very large or very small, respectively.
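A hedged C sketch of both directions, assuming a signed Q16.16 format (so the scaling factor is 2^16) and ignoring overflow checks; the names are illustrative:

    #include <stdint.h>
    #include <math.h>

    #define FRAC_BITS 16   /* example format: signed Q16.16 */

    /* Floating-point -> fixed: scale by 2^FRAC_BITS and round to nearest.
       A real implementation should also clamp to the representable range. */
    static int32_t to_fixed(double x) {
        return (int32_t)lround(x * (double)(1 << FRAC_BITS));
    }

    /* Fixed -> floating-point: convert the integer and divide by the scale. */
    static double to_float(int32_t f) {
        return (double)f / (double)(1 << FRAC_BITS);
    }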

Hardware support


Scaling and renormalization


Typical processors do not have specific support for fixed-point arithmetic. However, most computers with binary arithmetic have fast bit shift instructions that can multiply or divide an integer by any power of 2; in particular, an arithmetic shift instruction. These instructions can be used to quickly change scaling factors that are powers of 2, while preserving the sign of the number.

Early computers like the IBM 1620 and the Burroughs B3500 used a binary-coded decimal (BCD) representation for integers, namely base 10 where each decimal digit was independently encoded with 4 bits. Some processors, such as microcontrollers, may still use it. In such machines, conversion of decimal scaling factors can be performed by bit shifts and/or by memory address manipulation.

Some DSP architectures offer native support for specific fixed-point formats, for example, signed n-bit numbers with n−1 fraction bits (whose values may range between −1 and almost +1). The support may include a multiply instruction that includes renormalization—the scaling conversion of the product from 2n−2 to n−1 fraction bits.[citation needed] If the CPU does not provide that feature, the programmer must save the product in a large enough register or temporary variable, and code the renormalization explicitly.

Overflow


Overflow happens when the result of an arithmetic operation is too large to be stored in the designated destination area. In addition and subtraction, the result may require one bit more than the operands. In multiplication of two unsigned integers with m and n bits, the result may have m+n bits.

In case of overflow, the high-order bits are usually lost, as the un-scaled integer gets reduced modulo 2^n where n is the size of the storage area. The sign bit, in particular, is lost, which may radically change the sign and the magnitude of the value.

Some processors can set a hardware overflow flag and/or generate an exception on the occurrence of an overflow. Some processors may instead provide saturation arithmetic: if the result of an addition or subtraction were to overflow, they store instead the value with the largest magnitude that can fit in the receiving area and has the correct sign.[citation needed]

However, these features are not very useful in practice; it is generally easier and safer to select scaling factors and word sizes so as to exclude the possibility of overflow, or to check the operands for excessive values before executing the operation.

Computer language support


Explicit support for fixed-point numbers is provided by a few programming languages, notably PL/I, COBOL, Ada, JOVIAL, and Coral 66. They provide fixed-point data types, with a binary or decimal scaling factor. The compiler automatically generates code to do the appropriate scaling conversions when doing operations on these data types, when reading or writing variables, or when converting the values to other data types such as floating-point.

Most of those languages were designed between 1955 and 1990. More modern languages usually do not offer any fixed-point data types or support for scaling factor conversion. That is also the case for several older languages that are still very popular, like FORTRAN, C and C++. The wide availability of fast floating-point processors, with strictly standardized behavior, has greatly reduced the demand for binary fixed-point support.[citation needed] Similarly, the support for decimal floating point in some programming languages, like C# and Python, has removed most of the need for decimal fixed-point support. In the few situations that call for fixed-point operations, they can be implemented by the programmer, with explicit scaling conversion, in any programming language.

On the other hand, all relational databases and the SQL notation support fixed-point decimal arithmetic and storage of numbers. PostgreSQL has a special numeric type for exact storage of numbers with up to 1000 digits.[5]

Moreover, in 2008 the International Organization for Standardization (ISO) published a draft technical report to extend the C programming language with fixed-point data types, for the benefit of programs running on embedded DSP processors. Two main kinds of data types are proposed, _Fract (fractional part with a minimum 7-bit precision) and _Accum (_Fract with at least 4 bits of integer part).[6] The GNU Compiler Collection (GCC) supports this draft.[7][8]

Detailed examples


Decimal fixed point multiplication


Suppose there is the following multiplication of two fixed-point numbers with 3 decimal places each:

10.500 × 1.050 = 11.025000

Note how, since there are 3 decimal places, we show the trailing zeros. To re-characterize this as an integer multiplication we must first multiply each factor by 1000 (10^3), moving all the decimal places into integer places, and then multiply the result by 10^−3 twice to put them back. The equation now looks like

(10.500 × 10^3) × (1.050 × 10^3) × 10^−3 × 10^−3 = 10500 × 1050 × 10^−6 = 11 025 000 × 10^−6 = 11.025000

This works equivalently if we choose a different base, notably base 2 for computing, since a bit shift is the same as a multiplication or division by a power of 2. Three decimal digits of fraction are equivalent to about 10 binary digits, so we should round 0.05 to 10 bits after the binary point. The closest approximation is then 0.0000110011₂ = 0.0498046875.

Thus our multiplication becomes

10.500 × 1.0498046875 = 11.02294921875

This rounds to 11.023 with three digits after the decimal point.

Binary fixed-point multiplication


Consider the task of computing the product of 1.2 and 5.6 with binary fixed point using 16 fraction bits. To represent the two numbers, one multiplies them by 2^16, obtaining 78 643.2 and 367 001.6, and rounds these values to the nearest integers, obtaining 78 643 and 367 002. These numbers will fit comfortably into a 32-bit word with two's complement signed format.

Multiplying these integers together gives the 35-bit integer 28 862 138 286 with 32 fraction bits, without any rounding. Note that storing this value directly into a 32-bit integer variable would result in overflow and loss of the most significant bits. In practice, it would probably be stored in a signed 64-bit integer variable or register.

If the result is to be stored in the same format as the data, with 16 fraction bits, that integer should be divided by 2^16, which gives approximately 440 401.28, and then rounded to the nearest integer. This effect can be achieved by adding 2^15 and then shifting the result right by 16 bits. The result is 440 401, which represents the value 6.719 985 961 914 062 5. Taking into account the precision of the format, that value is better expressed as 6.719 986 ± 0.000 008 (not counting the error that comes from the operand approximations). The correct result would be 1.2 × 5.6 = 6.72.
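A small C program that reproduces this computation (the rounding constant and shift are exactly those described above; the code itself is only an illustrative sketch):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        int32_t a = 78643;           /* round(1.2 * 2^16) */
        int32_t b = 367002;          /* round(5.6 * 2^16) */

        int64_t wide = (int64_t)a * b;                     /* 28 862 138 286, 32 fraction bits */
        int32_t c = (int32_t)((wide + (1 << 15)) >> 16);   /* back to 16 fraction bits, rounded */

        printf("%d -> %.9f\n", c, c / 65536.0);            /* 440401 -> 6.719985962 */
        return 0;
    }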

For a more complicated example, suppose that the two numbers 1.2 and 5.6 are represented in 32-bit fixed point format with 30 and 20 fraction bits, respectively. Scaling by 2^30 and 2^20 gives 1 288 490 188.8 and 5 872 025.6, that round to 1 288 490 189 and 5 872 026, respectively. Both numbers still fit in a 32-bit signed integer variable, and represent the fractions

1.200 000 000 186 264 514 923 095 703 125 and
5.600 000 381 469 726 562 50

Their product is (exactly) the 53-bit integer 7 566 047 890 552 914, which has 30+20 = 50 implied fraction bits and therefore represents the fraction

6.720 000 458 806 753 229 623 609 513 510

If we choose to represent this value in signed 16-bit fixed format with 8 fraction bits, we must divide the integer product by 2^(50−8) = 2^42 and round the result; this can be achieved by adding 2^41 and shifting right by 42 bits. The result is 1720, representing the value 1720/2^8 = 6.718 75, or rather the interval between 3439/2^9 and 3441/2^9 (approximately 6.719 ± 0.002).

Notations


Various notations have been used to concisely specify the parameters of a fixed-point format. In the following list, f represents the number of fractional bits, m the number of magnitude or integer bits, s the number of sign bits (0/1 or some other alternative representation), and b the total number of bits.

  • The Q notation was defined by Texas Instruments.[9] One writes Qf to specify a signed binary fixed-point value with f fraction bits; for example, Q15 specifies a signed integer in two's complement notation with a scaling factor 1/2^15. The code Qm.f specifies additionally that the number has m bits in the integer part of the value, not counting the sign bit. Thus Q1.30 would describe a binary fixed-point format with 1 integer bit and 30 fractional bits, which could be stored as a 32-bit 2's complement integer with scaling factor 1/2^30.[9][10]
    • A similar notation has been used by ARM, except that they count the sign bit in the value of m; so the same format above would be specified as Q2.30.[11][12]
    • The Embedded C proposal uses .f for unsigned fraction, s.f for signed fraction, m.f for unsigned accumulator, and sm.f for signed accumulator. This would translate the above to s1.30, though this is not a valid type for either fraction or accumulator: in valid versions, m is at least 4 and, depending on the underlying type, f is at least 7, 15, or 23. Note the non-italicized s: it is simply prepended as a letter.
  • The COBOL programming language originally supported decimal fixed-precision with arbitrary size and decimal scaling, whose format was specified "graphically" with the PIC directive. For example, PIC S9999V99 specified a sign-magnitude 6-digit decimal integer with two decimal fraction digits.[13]
  • The construct REAL FIXED BINARY (p,f) is used in the PL/I programming language, to specify a fixed-point signed binary data type with p total bits (not including sign) with f bits in the fraction part; that is, a p+1 bit signed integer with a scaling factor of 1/2^f. The latter could be positive or negative. One could specify COMPLEX instead of REAL, and DECIMAL instead of BINARY for base 10.
  • In the Ada programming language, a numeric data type can be specified by, for example, type F is delta 0.005 range -50.0 .. 50.0. The decimal bounds are translated to the next power of two, hence it means a fixed-point representation consisting of a signed binary integer in two's complement format with at least 8 fraction bits (providing a scaling factor 1/256) and 7 sign-and-magnitude bits (ensuring an actual range from −64.00 to almost +64.00): a minimum total of 15 bits. On a 16-bit computer, the spare bit is assigned to the fractional part. Asymmetrical range constraints are also allowed,[14] though the underlying implementation remains symmetric about 0.[15] Newer versions of Ada allow specifying an exact (including non-power-of-two) scaling factor using 'Small => 0.005 (aspect specification), or, if the factor is a power of 10, through a decimal fixed point.
  • The notation Bm has been used to mean a fixed binary format with m bits in the integer part; the rest of the word (typically 32 bits) being fraction bits. For example, the maximum and minimum values that can be stored in a signed B16 number are ≈32767.9999847 and −32768.0, respectively.
  • The VisSim company used fxm.b to denote a binary fixed-point value with b total bits and m bits in the integer part; that is, a b-bit integer with scaling factor 1/2^(b−m). Thus fx1.16 would mean a 16-bit number with 1 bit in the integer part and 15 in the fraction.[16]
  • The PS2 GS ("Graphics Synthesizer") User's Guide uses the notation s:m:f, where s specifies the presence (0 or 1) of a sign bit.[17] For example, 0:5:3 represents an unsigned 8-bit integer with a scaling factor of 1/2^3.
  • The LabVIEW programming language uses the notation <s,b,m> to specify the parameters of an 'FXP' fixed-point number. The s component can be either '+' or '±', signifying either an unsigned or 2's complement signed number, respectively. The b component is the total number of bits, and m is the number of bits in the integer part.

Software application examples

  • The popular TrueType font format uses 32-bit signed binary fixed-point with 26 bits to the left of the decimal for some numeric values in its instructions.[18] This format was chosen to provide the minimal amount of precision required for hinting and for performance reasons.[19]
  • With the exception of the Nintendo 64, all 3D games for the fifth generation of video game consoles, including the 3DO, PlayStation, Sega Saturn, and Atari Jaguar[20] use fixed-point arithmetic, as the systems lack hardware floating-point units. The PlayStation transformation coprocessor supports 16-bit fixed point with 12 fraction bits, whereas the Sega Saturn VDP coprocessors used a 32-bit fixed point format reserving the lower 16 bits for the fractional part.
  • The TeX typesetting software, widely used by scientists and mathematicians, uses 32-bit signed binary fixed point with 16 fraction bits for all position calculations. The values are interpreted as fractions of a typographer's point. TeX font metric files use 32-bit signed fixed-point numbers, with 12 fraction bits.
  • Tremor, Toast and MAD are software libraries which decode the Ogg Vorbis, GSM Full Rate and MP3 audio formats respectively. These codecs use fixed-point arithmetic because many audio decoding hardware devices do not have an FPU.
  • The WavPack lossless audio compressor uses fixed point arithmetic. The choice was justified by, among other things, the worry that different floating-point rounding rules in different hardware could corrupt the lossless nature of the compression.[21]
  • The Nest Labs Utilities library[22] provides a limited set of macros and functions for fixed point numbers, particularly when dealing with those numbers in the context of sensor sampling and sensor outputs.
  • The OpenGL ES 1.x specification includes a fixed point profile, as it is an API aimed for embedded systems, which do not always have an FPU.
  • The dc and bc programs are arbitrary precision calculators, but only keep track of a (user-specified) fixed number of fractional digits.
  • Fractint represents numbers as Q2.29 fixed-point numbers,[23] to speed up drawing on old PCs with 386 or 486SX processors, which lacked an FPU.
  • Doom was the last first-person shooter game by id Software to use a 16.16 fixed point representation for all of its non-integer computations, including map system, geometry, rendering, and player movement. This representation is still used in modern Doom source ports.
  • The Q# programming language for the Azure quantum computers, which implement quantum logic gates, contains a standard numeric library for performing fixed-point arithmetic on registers of qubits.[24]

from Grokipedia
Fixed-point arithmetic is a numerical representation for rational numbers in which the radix (or binary) point is located at a fixed position within the representation, allowing computations to be performed using integer arithmetic scaled by a power of the radix, typically base 2 in binary systems. This approach contrasts with floating-point arithmetic by maintaining a constant scaling factor across all numbers, which simplifies hardware but limits the dynamic range and precision compared to formats that adjust the point position dynamically. Commonly used in embedded systems, digital signal processing (DSP), and real-time applications, fixed-point arithmetic enables efficient operations on resource-constrained devices like microcontrollers.

In binary fixed-point representations, numbers are stored as integers with an implied binary point at a predetermined position, often denoted in Qm.n format where m specifies the number of integer bits (including the sign bit) and n the number of fractional bits. For example, in a signed Q1.15 format using 16 bits, the value is interpreted as the stored integer divided by 2^15, providing a range from −1 to nearly 1 with a resolution of 2^−15 ≈ 3.05 × 10^−5. Arithmetic operations such as addition and subtraction require alignment of the binary points, which is straightforward since the position is fixed, but multiplication typically doubles the bit width of the result, necessitating right-shifting to fit the output format and potentially introducing quantization errors. Scaling during conversion from floating-point values (by multiplying by 2^n) must account for overflow risks and precision loss.

The primary advantages of fixed-point arithmetic include significantly faster execution times and lower power consumption relative to floating-point, as it avoids the complexity of exponent handling and normalization; for instance, on microcontrollers like the PIC32, a fixed-point operation may take 21 cycles versus 55 for its floating-point counterpart. It is particularly suited for applications requiring predictable performance, such as audio processing, control systems, and rendering, where the fixed precision suffices and overflow can be managed through saturation or wrapping. However, its drawbacks include a restricted range (for a 32-bit signed format with 16 fractional bits, the maximum value is approximately 32767.9999) and susceptibility to accumulation of rounding errors in iterative computations, making it less ideal for high-precision scientific calculations. Modern compilers and libraries, such as those supporting the _Accum type in C, provide built-in fixed-point data types to facilitate its use in software.

Fundamentals

Definition and Basic Representation

Fixed-point arithmetic is a method for representing rational numbers in computing by storing them as integers scaled by a fixed radix point position, which eliminates the need for explicit decimal or binary points during storage. This approach treats the number as an integer multiple of a power of the radix (typically base 2 for binary systems), where the scaling factor corresponds to the position of the implied point. In essence, a fixed-point number x can be expressed as x = I + F, where I is the integer part and F is the fractional part, but both are combined into a single integer value interpreted relative to the fixed point.

In basic storage, fixed-point numbers are represented using standard integer formats, typically in binary with a fixed word length, and signed values often employ two's complement to handle negative numbers efficiently. The entire value is stored as a contiguous bit string without any explicit indicator of the radix point; instead, the position is predetermined by the format, allowing arithmetic operations to proceed as if on integers while the interpretation yields fractional results. For example, in an 8-bit signed representation, the bits might allocate space for both integer and fractional components, with the sign handled using two's complement to extend the range to negative values.

The placement of the radix point defines the division between integer and fractional parts, commonly denoted in Qm.n format, where m represents the number of bits for the integer portion (excluding the sign bit) and n the number of bits for the fractional portion. In this notation, the total bit width is typically m + n + 1 (including the sign bit for signed formats), and the value is interpreted as the bit pattern divided by 2^n. For instance, in Q3.4 format with 8 bits, the first 3 bits (after the sign) hold the integer part, and the remaining 4 bits the fraction, enabling representation of values from −8 to nearly 8 with a resolution of 1/16.

This representation offers advantages over pure integer arithmetic by allowing the encoding of fractional values through the implicit scaling, thus supporting applications requiring sub-unit precision without the overhead of dedicated floating-point units. By leveraging integer hardware, fixed-point formats facilitate efficient computation of rational numbers in resource-constrained environments, such as embedded systems.

Scaling Factors and Precision Choices

In fixed-point arithmetic, the scaling factor determines the position of the radix point relative to the stored bits, effectively defining the representation as a scaled integer value. For binary fixed-point systems, the scaling is typically a power of 2, denoted as 2^−f where f is the number of fractional bits, allowing the value to be interpreted as v = i × 2^−f for an integer i. Selection of this scaling involves analyzing the application's required range and resolution; for instance, allocating more bits to the integer part expands the representable range (e.g., [−2^(w−1), 2^(w−1)) for w total bits in signed two's complement) at the expense of fractional precision, while prioritizing fractional bits enhances accuracy for small values but risks overflow for larger ones.

The trade-offs in scaling choices are critical for optimizing performance in resource-constrained environments like digital signal processing (DSP). Increasing the total word length (e.g., from 16 to 24 bits) improves both range and precision but raises hardware costs and power consumption; conversely, shorter words favor efficiency but necessitate frequent rescaling to prevent saturation. In practice, application-specific profiling guides this balance, such as using 16-bit formats for audio processing where moderate precision suffices, versus 32-bit for scientific computations demanding higher fidelity.

Certain real numbers can be represented exactly in binary fixed-point arithmetic if they are dyadic rationals, i.e., fractions with denominators that are powers of 2. For example, 0.5 equals 1/2 and is exactly representable with one fractional bit (the integer 1 scaled by 2^−1), while 0.1 requires an infinite binary expansion and cannot be exact, leading to rounding error. Exactness holds only for values whose expansion terminates in binary, limiting precise representation to sums of distinct powers of 1/2.

Precision loss in fixed-point arises primarily from quantization, where non-representable values are approximated via rounding or truncation, introducing error. The maximum quantization error for rounding to nearest with n fractional bits is ε = 1/2^(n+1), as the error is bounded by half the value of the least significant bit. Truncation yields a maximum error of 1/2^n, often modeled as uniform noise with variance σ² = (2^−n)²/12. These errors accumulate in operations, necessitating scaling adjustments to maintain overall accuracy.

The choice of radix significantly influences fixed-point efficiency, with binary (radix 2) predominant due to its alignment with hardware logic gates and shift-based scaling, enabling simple multiplications by powers of 2 via bit shifts. Decimal fixed-point (radix 10), while intuitive for human-readable outputs like financial applications, requires more complex circuitry for multiplication and division, increasing area and latency by factors of 3-5 compared to binary implementations. Binary thus offers superior hardware efficiency for most computational tasks, though decimal persists where exact decimal fractions (e.g., 0.1) are essential.

Comparison with Floating-Point Arithmetic

Fixed-point arithmetic represents numbers using a fixed binary point position within a word, without an explicit exponent, resulting in a uniform scaling factor applied across the entire range of values. This contrasts with floating-point arithmetic, which employs a mantissa (significand) and an exponent to normalize the number, allowing the binary point to "float" and adapt to the magnitude, as defined in standards like IEEE 754. In fixed-point, all numbers share the same precision determined by the position of the binary point relative to the stored bits, enabling straightforward integer-like operations but restricting adaptability to varying scales.

Regarding precision and dynamic range, fixed-point formats provide constant absolute precision throughout their representable range, while their relative precision varies with the magnitude of the value. In contrast, floating-point arithmetic provides approximately constant relative precision across its range due to the fixed number of mantissa bits (e.g., 23 bits for single-precision IEEE 754, leading to about 7 decimal digits), though absolute precision varies, being finer for smaller exponents. However, fixed-point's range is inherently limited by the word length and scaling choice, often requiring careful selection to balance integer and fractional parts without overflow, whereas floating-point supports a vastly wider range (spanning from approximately 10^−38 to 10^38 in single precision), though at the cost of precision degradation near the extremes. This makes floating-point suitable for scientific computing with disparate magnitudes, while fixed-point excels in applications needing uniform accuracy within a bounded domain.

In terms of performance, fixed-point arithmetic is generally faster and more resource-efficient, as it leverages simple integer operations without the need for exponent alignment, normalization, or complex rounding hardware, reducing latency and power consumption in embedded systems. For instance, on processors lacking dedicated floating-point units, fixed-point can approach the speed of native integer arithmetic, whereas floating-point emulation introduces significant overhead due to these additional steps. This efficiency advantage is particularly pronounced in low-power devices like embedded processors, where fixed-point implementations can achieve several times the throughput of floating-point equivalents for the same bit width.

Error accumulation in fixed-point arithmetic tends to be more predictable and additive, stemming primarily from quantization and potential overflows that can be mitigated through scaling, leading to a uniform error distribution across operations. In contrast, floating-point errors arise from rounding during normalization and exponent handling, which can propagate non-uniformly and introduce inconsistencies, such as in subtractions of close values, though standards like IEEE 754 bound the relative error to less than 1 unit in the last place (ulp) per operation. Fixed-point's deterministic error behavior facilitates easier analysis in control systems, while floating-point's variability requires compensatory techniques like guard digits.

Operations

Addition and Subtraction

In fixed-point arithmetic, addition and subtraction require that the operands share the same scaling factor to ensure their radix points align, preventing errors in the fractional representation. If the scaling factors differ, one operand must be adjusted by shifting its binary representation (typically multiplying or dividing by a power of 2, e.g., left-shifting by k bits to increase the number of fractional bits from m to m+k) to match the other's scale before performing the operation. This alignment treats the fixed-point numbers as scaled integers, where the scaling factor s = 2^f reflects the position of the radix point after f fractional bits.

Once aligned, addition proceeds by performing standard integer addition on the scaled values, followed by applying the common scaling factor to obtain the result. For two fixed-point numbers a and b with the same scale s, the sum is computed as

c = ((a·s) + (b·s)) / s

where a·s and b·s are the integer representations. Subtraction follows analogously:

c = ((a·s) − (b·s)) / s

For signed fixed-point numbers, these operations employ two's-complement representation, in which subtraction is implemented by adding the two's complement of the subtrahend to the minuend using the same adder circuit, ensuring consistent handling of negative values.

Overflow in fixed-point addition and subtraction occurs if the result exceeds the representable range defined by the word length and scaling, such as surpassing the maximum positive or minimum negative value in two's complement (e.g., 2^i − 2^−f for an i-bit integer part and f-bit fraction). Detection typically involves checking the sign bits of the operands and result: for signed addition, overflow is flagged if both operands have the same sign but the result has the opposite sign; similar checks apply after subtraction. To mitigate overflow, implementations may use saturation arithmetic, clamping the result to the nearest representable extreme rather than wrapping around.
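A hedged C sketch of saturating addition for signed 16-bit (Q15) values as described above; the function name is illustrative and not taken from any particular library:

    #include <stdint.h>

    /* Saturating addition of two signed 16-bit (Q15) fixed-point values:
       on overflow the result is clamped to the largest or smallest
       representable value instead of wrapping around. */
    static int16_t sat_add_q15(int16_t a, int16_t b) {
        int32_t sum = (int32_t)a + (int32_t)b;   /* cannot overflow in 32 bits */
        if (sum > INT16_MAX) return INT16_MAX;   /* 0x7FFF */
        if (sum < INT16_MIN) return INT16_MIN;   /* 0x8000 */
        return (int16_t)sum;
    }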

Multiplication

In fixed-point arithmetic, multiplication of two numbers represented in formats Qm.n and Qp.q begins with treating them as integers and performing standard integer multiplication, yielding a product in Q(m+p).(n+q) format, where the integer part spans m+p bits and the fractional part spans n+q bits. This growth of the scaling factor arises because each operand's fractional bits contribute to the overall precision of the product. To restore the desired scaling, the product is shifted right so that the surplus fraction bits are discarded; for example, a right shift by q bits returns the result to n fraction bits (a shift by n+q bits would leave a pure integer). This adjustment may involve truncation of the least significant bits or rounding to mitigate precision loss, depending on the implementation. For a result kept at n fraction bits, the operation can be expressed as

result = (a × b) >> q

where >> denotes an arithmetic right shift to handle sign extension in signed representations.

For signed fixed-point numbers in two's complement format, multiplication preserves the sign through sign extension of partial products during accumulation, ensuring the result remains in two's complement. Hardware implementations often employ Booth's algorithm to enhance efficiency, which recodes the multiplier by examining bit pairs to replace strings of ones with fewer additions and subtractions, reducing the number of operations especially for numbers with long runs of identical bits. This technique, originally proposed by Andrew D. Booth in 1951, is particularly suited for signed binary multiplication in fixed-point systems.

Division and Scaling Conversions

In fixed-point arithmetic, division of two numbers with potentially different scaling factors requires careful adjustment to maintain precision and avoid overflow. Consider a dividend represented as an integer D scaled by a factor s_d (so its real value is D·s_d) and a divisor represented as an integer V scaled by s_v (real value V·s_v). The quotient in real terms is (D·s_d)/(V·s_v), which simplifies to (D/V)·(s_d/s_v); dividing the integers directly therefore yields a quotient whose scaling factor is s_d/s_v. To obtain a result in the dividend's own scale, a common procedure first multiplies the dividend by the reciprocal of the divisor's scaling factor (e.g., by 2^f when s_v = 2^−f) and then performs the integer division. This approach leverages integer hardware for the scaling step, but it demands sufficient bit width in intermediate results to prevent overflow, often requiring double-width intermediates (e.g., 64 bits for 32-bit operands).

For efficiency, especially in software implementations on DSP processors, division often employs reciprocal approximation followed by multiplication. A prominent method uses the Newton-Raphson iteration to approximate the reciprocal 1/V: starting with an initial guess x_0 (e.g., from a lookup table), iterate x_{n+1} = x_n·(2 − V·x_n) until convergence, typically in 2-3 steps for fixed-point precision. The quotient is then obtained as D·x_n, with a final scaling step accounting for s_d/s_v. This quadratic convergence makes it faster than digit-recurrence division for repeated operations, though it requires an initial approximation accurate to a few bits.

Scaling conversions between different fixed-point formats involve multiplying the integer representation by the ratio of the source scale to the target scale: to convert an integer I from scale s_1 to scale s_2, compute the new integer as I·(s_1/s_2), which then carries the scale s_2. If the scales are powers of two (common for binary efficiency, e.g., s = 2^−f), this reduces to a bit shift: a left shift by k bits when s_1/s_2 = 2^k (increasing fractional precision), or a right shift in the opposite direction, potentially with rounding to mitigate precision loss. Virtual shifts reinterpret the bit positions without altering the stored value, adjusting only the implied scale (e.g., X(a, b) >> n = X(a − n, b + n)), avoiding hardware shifts but requiring software tracking of the scale. Physical shifts modify the bits directly but risk overflow or truncation if the word length is fixed.

In hardware implementations, fixed-point division frequently uses digit-recurrence algorithms like restoring or non-restoring division on resource-constrained devices such as FPGAs. The restoring algorithm proceeds iteratively: for each bit, subtract a shifted divisor from the partial remainder; if the result is negative, restore it by adding the divisor back and set the quotient bit to 0, otherwise set it to 1 and proceed without restoration. This requires an extra addition step per iteration. The non-restoring variant optimizes by alternating subtraction and addition based on the remainder's sign, eliminating the restore step: if the partial remainder is positive, subtract and set the bit to 1; if negative, add and set it to 0, with a final correction if needed. Both generate one quotient bit per cycle and are suitable for fixed-point operands by aligning the binary point, offering low latency (n cycles for an n-bit quotient) but higher area than multiplication-based methods.
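A sketch of the Newton-Raphson reciprocal approach in C, under explicit assumptions: a Q16.16 format, a divisor restricted to the interval (1.0, 2.0), a crude linear initial guess, and a fixed iteration count; the names qmul, qrecip, and qdiv are invented for this example.

    #include <stdint.h>

    typedef int32_t q16_16;                      /* 16 integer, 16 fraction bits */
    #define Q_ONE (1 << 16)

    static q16_16 qmul(q16_16 a, q16_16 b) {     /* Q16.16 multiply, truncating */
        return (q16_16)(((int64_t)a * b) >> 16);
    }

    /* Approximate 1/v for v in (1.0, 2.0) with Newton-Raphson: x <- x*(2 - v*x).
       The initial guess and iteration count are illustrative; production code
       would normalize v and seed the guess from a lookup table. */
    static q16_16 qrecip(q16_16 v) {
        q16_16 x = Q_ONE - (v - Q_ONE) / 2;      /* rough linear guess for 1/v */
        for (int i = 0; i < 4; i++)
            x = qmul(x, 2 * Q_ONE - qmul(v, x));
        return x;
    }

    /* Division a/b for b in (1.0, 2.0): multiply by the approximate reciprocal. */
    static q16_16 qdiv(q16_16 a, q16_16 b) {
        return qmul(a, qrecip(b));
    }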

Implementation

Hardware Support and Overflow Handling

Hardware implementations of fixed-point arithmetic are prevalent in digital signal processors (DSPs) and microcontrollers, featuring dedicated arithmetic logic units (ALUs) optimized for integer and fractional operations. For instance, the ARM Cortex-M4 and Cortex-M7 processors include DSP extensions to the Thumb instruction set, providing a 32-bit ALU capable of single-cycle execution for most fixed-point arithmetic instructions, such as saturating additions and multiplications on 16-bit or 8-bit data via SIMD processing. Similarly, Texas Instruments' TMS320C55x DSP family incorporates a 40-bit ALU, a 16-bit ALU, and dual multiply-accumulate (MAC) units, enabling two fixed-point MAC operations per cycle with 40-bit accumulators that include 8 guard bits to absorb intermediate overflow. These units support common Q formats, such as Q15 (16-bit with 15 fractional bits) and Q31 (32-bit with 31 fractional bits), facilitating efficient handling of fractional data in signal-processing tasks. Barrel shifters are integral to these designs for scaling; the C55x employs a 40-bit barrel shifter to perform rapid left or right shifts, essential for adjusting fixed-point representations during operations such as multiplication, where the result may double in bit width.

Overflow handling in fixed-point hardware typically involves detection mechanisms and response strategies to prevent or manage loss of accuracy. Detection often relies on status flags or comparisons; in the ARM Cortex-M processors, overflow is identified through dedicated flags set by saturating instructions, while the C55x uses accumulator overflow flags (e.g., ACOV0) and a carry bit (CARRY) that monitors borrow or overflow at the 31st or 39th bit position, depending on the M40 mode bit. Common responses include wraparound, where the result wraps modulo the representable range, as in non-saturating additions like ARM's SADD16; saturation, which clamps the output to the maximum (e.g., 0x7FFF for 16-bit signed) or minimum (e.g., 0x8000) value; and sticky flags that preserve the overflow status for software intervention. The C55x implements saturation via the SATD and SATA mode bits, clipping results to 7FFFh or 8000h when enabled, with intrinsics like _sadd enforcing this behavior. ARM's QADD and SSAT instructions similarly saturate on overflow, capping values to avoid wraparound artifacts in audio or image processing.

Renormalization follows operations prone to scaling shifts, such as multiplication, to restore the fixed-point format and avert overflow. In the C55x, the FRCT bit automates a left shift by one bit after multiplication to eliminate the extra sign bit in fractional results, while intrinsics like _norm and _lnorm normalize 16-bit or 32-bit values by shifting them left until the most significant bit is set. Post-multiplication renormalization often involves right-shifting the double-width product to fit the target Q format, using the barrel shifter for efficiency; for example, in Q15 multiplication the result is shifted right by 15 bits before accumulation.

Field-programmable gate arrays (FPGAs) extend fixed-point support via configurable logic blocks implementing Q formats, often with dedicated DSP slices providing ALUs and multipliers. Intel's high-level synthesis (HLS) compiler supports fixed-point design through ac_fixed types, in which arbitrary Q formats (defined by a total width N and integer width I) allow custom overflow modes such as saturation during scaling operations. Intel Agilex FPGAs, for instance, feature variable-precision DSP blocks optimized for fixed-point multiplication and accumulation, supporting saturation and wraparound modes configurable in HLS via ac_fixed's overflow parameter (O), which handles excess bits by clamping or modular reduction. These implementations allow renormalization through explicit shifts in the design, ensuring scalability for custom Q formats in applications such as digital filters and transforms.
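To make the saturation and renormalization steps concrete, the following is a minimal C sketch of Q15 arithmetic, assuming 16-bit two's-complement operands and an arithmetic right shift on signed values; the function names are illustrative and are not the actual ARM or TI intrinsics.

```c
#include <stdint.h>

/* Q15: 16-bit signed fixed point with 15 fractional bits, range [-1.0, 1.0 - 2^-15]. */

/* Saturating addition: clamp on overflow instead of wrapping,
   mirroring the behavior of saturating DSP add instructions. */
static int16_t q15_add_sat(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;    /* widen so overflow is observable */
    if (sum > INT16_MAX) return INT16_MAX;    /* clamp to 0x7FFF */
    if (sum < INT16_MIN) return INT16_MIN;    /* clamp to 0x8000 */
    return (int16_t)sum;
}

/* Multiplication with renormalization: the 32-bit product of two Q15 values
   carries 30 fractional bits, so it is rounded and shifted right by 15
   to restore the Q15 format. */
static int16_t q15_mul_sat(int16_t a, int16_t b)
{
    int32_t prod = (int32_t)a * (int32_t)b;   /* Q30 intermediate */
    prod = (prod + (1 << 14)) >> 15;          /* round to nearest, renormalize to Q15 */
    if (prod > INT16_MAX) prod = INT16_MAX;   /* only (-1.0) * (-1.0) can overflow */
    return (int16_t)prod;
}
```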

Software and Language Support

Several programming languages provide built-in support for fixed-point arithmetic to ensure precise control over decimal precision in applications requiring exact fractional representations. In Ada, fixed-point types are a core language feature, categorized as either ordinary fixed-point types or decimal fixed-point types, where the error bound is defined by a delta parameter specifying the smallest representable difference between values. Similarly, COBOL natively employs fixed-point decimal arithmetic for its numeric computations, commonly storing operands in packed decimal format, which guarantees consistent handling of financial and business data without floating-point approximations.

In languages lacking native fixed-point types, such as C and C++, developers typically emulate fixed-point arithmetic using integer types combined with scaling factors and macros to manage fractional parts. For instance, the libfixmath library implements Q16.16 fixed-point operations in C, providing functions for addition, multiplication, and trigonometric computations while avoiding floating-point dependencies for embedded systems. Emulation techniques often involve defining structs or classes that encapsulate an integer value and a fixed scaling constant, representing a value v as int × 2^−s where s is the scaling exponent, enabling straightforward arithmetic by shifting and scaling results. In C++, operator overloading further enhances usability by allowing custom fixed-point classes to mimic built-in arithmetic operators, such as redefining + and * to perform scaled integer operations internally.

Specialized libraries extend fixed-point support across various ecosystems. Python's decimal module offers arbitrary-precision decimal arithmetic that can emulate fixed-point behavior through context settings for precision and rounding, making it suitable for applications demanding exact decimal fractions, such as financial calculations. In Java, the BigDecimal class provides immutable, arbitrary-precision signed numbers with a configurable scale, effectively supporting fixed-point decimal operations for high-accuracy computations. For digital signal processing on ARM-based microcontrollers, the CMSIS-DSP library includes optimized fixed-point functions in formats like Q15 and Q31, leveraging integer instructions for efficient filtering and transforms on Cortex-M processors.

Portability remains a key consideration in fixed-point implementations, as consistent scaling and precision must be maintained across diverse platforms without relying on floating-point hardware variations. Emulations using pure integer operations ensure bit-exact results regardless of compiler or processor differences, though careful definition of scaling factors is essential to avoid platform-specific overflow behaviors.
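As a sketch of the integer-plus-scaling-factor emulation described above, the following C example stores a Q16.16 value in an int32_t with scaling factor 2^16; the type and function names (fx16_16, fx_mul, and so on) are illustrative and do not correspond to libfixmath's actual API.

```c
#include <stdint.h>
#include <stdio.h>

typedef int32_t fx16_16;                 /* Q16.16: 16 integer bits, 16 fractional bits */
#define FX_ONE (1 << 16)                 /* the scaling factor 2^16 */

static fx16_16 fx_from_double(double x) { return (fx16_16)(x * FX_ONE + (x >= 0 ? 0.5 : -0.5)); }
static double  fx_to_double(fx16_16 x)  { return (double)x / FX_ONE; }

/* Addition needs no rescaling: both operands already share the 2^-16 scale. */
static fx16_16 fx_add(fx16_16 a, fx16_16 b) { return a + b; }

/* Multiplication: widen to 64 bits, then shift right 16 to drop the extra scale factor. */
static fx16_16 fx_mul(fx16_16 a, fx16_16 b) { return (fx16_16)(((int64_t)a * b) >> 16); }

/* Division: pre-shift the dividend left 16 so the quotient keeps Q16.16 scaling. */
static fx16_16 fx_div(fx16_16 a, fx16_16 b) { return (fx16_16)(((int64_t)a << 16) / b); }

int main(void)
{
    fx16_16 a = fx_from_double(1.23);
    fx16_16 b = fx_from_double(4.5);
    printf("%f\n", fx_to_double(fx_add(a, b)));  /* ~5.73 */
    printf("%f\n", fx_to_double(fx_mul(a, b)));  /* ~5.535, exact to within 2^-16 */
    return 0;
}
```

In C++, the same layout is typically wrapped in a class with overloaded + and * operators so that fixed-point expressions read like ordinary arithmetic.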

Conversion Between Fixed- and Floating-Point

Converting a fixed-point number to floating-point involves interpreting the fixed-point integer representation by dividing it by the scaling factor, which typically equals 2^Q, where Q is the number of fractional bits in the fixed-point format. This process scales the value to its true magnitude while preserving the sign for signed formats. For example, in hardware implementations like ARM's VCVT instruction, the fixed-point value is loaded into a register, the fractional bits are accounted for via a specified fbits parameter (ranging from 0 to 32), and the result is rounded to nearest for floating-point output. Normalization may be required if the fixed-point format lacks an implied leading 1, involving a left shift to align the most significant bit and a corresponding adjustment of the exponent. Overflow occurs if the scaled value exceeds the floating-point format's range (e.g., the IEEE 754 single-precision maximum of approximately 3.4 × 10^38), in which case the result is set to infinity or clamped, while very small magnitudes may produce denormalized (subnormal) floating-point numbers if they fall below the minimum normalized value.

The formula for fixed-point to floating-point conversion is fp = fixed_int / 2^Q, where fixed_int is the signed or unsigned integer stored in the fixed-point representation and fp is the resulting floating-point value. For signed formats using two's complement, the sign bit is extended appropriately during the division. In cases where the fixed-point value is exactly representable in floating-point (e.g., powers of 2 within the mantissa precision), the conversion is exact; otherwise, rounding to nearest ensures minimal error, though precision loss can occur if Q exceeds the floating-point mantissa width (e.g., 23 bits for single-precision). Denormals in the output arise when the scaled magnitude is very small, below 2^−126 for single-precision, allowing gradual underflow without abrupt zeroing. This conversion is common in mixed-precision systems, such as embedded processors where fixed-point hardware accelerates certain operations but floating-point is needed for broader dynamic range in intermediate results.

The reverse conversion, from floating-point to fixed-point, multiplies the floating-point value by the scaling factor 2^Q and rounds the result to the nearest integer that fits the fixed-point representation. Rounding modes vary by implementation; ARM's VCVT uses round towards zero for this direction to avoid introducing extraneous precision, while general software algorithms often employ round to nearest for better accuracy. The formula is fixed_int = round(fp × 2^Q), where round denotes the chosen rounding function. Precision loss is inevitable if the floating-point value has more significant digits than the fixed-point format can represent, potentially leading to quantization errors; for instance, values not aligned with the fixed-point grid after scaling are truncated or rounded, with error bounded by 2^−(Q+1). Overflow and underflow are handled by clamping to the fixed-point range (e.g., from −2^(I−1) to 2^(I−1) − 2^−Q for signed formats with I integer bits) or by raising exceptions, as in IEEE 754-compliant systems. In mixed-precision environments, such conversions enable seamless integration between domains, for example converting floating-point inputs to fixed-point for efficient DSP processing while monitoring for inexact results that could propagate errors.
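The following C helpers illustrate both conversion directions for a signed Q format held in an int32_t; this is a sketch that uses round-to-nearest in both directions (unlike ARM's VCVT, which rounds toward zero when converting to fixed-point), and the clamping on overflow implements saturation rather than wraparound.

```c
#include <stdint.h>
#include <math.h>

/* Fixed-point to float: divide the stored integer by the scaling factor 2^frac_bits.
   frac_bits is assumed to be in the range 0..31. */
static float fixed_to_float(int32_t fixed, unsigned frac_bits)
{
    return (float)fixed / (float)(1u << frac_bits);
}

/* Float to fixed-point: multiply by 2^frac_bits, round to nearest integer,
   then clamp to the representable range of a 32-bit signed word. */
static int32_t float_to_fixed(float fp, unsigned frac_bits)
{
    double scaled  = (double)fp * (double)(1u << frac_bits);
    double rounded = nearbyint(scaled);                   /* round to nearest */
    if (rounded > (double)INT32_MAX) return INT32_MAX;    /* saturate on overflow */
    if (rounded < (double)INT32_MIN) return INT32_MIN;
    return (int32_t)rounded;
}
```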

Applications and Examples

Practical Applications

Fixed-point arithmetic is widely employed in embedded systems, particularly in resource-constrained environments like microcontrollers, where its predictability and low computational overhead enable efficient real-time operations. In control systems, for instance, proportional-integral-derivative (PID) controllers often utilize fixed-point representations to manage real-time control tasks, such as stabilizing quad-rotor flying robots, by performing closed-loop calculations with 16-bit fixed-point fractions to balance precision and speed without the variability of floating-point units. This approach is especially valuable in vision-based applications on embedded platforms, where fixed-point scaling avoids the precision issues of floating-point while supporting compact, power-efficient hardware.

In digital signal processing (DSP), fixed-point arithmetic underpins numerous algorithms in audio and video processing, offering superior efficiency over floating-point in scenarios demanding high throughput on limited hardware. Fixed-point implementations are common in finite impulse response (FIR) filters and fast Fourier transforms (FFTs) used in audio codecs, where they reduce power consumption and cost compared to floating-point alternatives, making them ideal for portable devices and real-time streaming. For video codecs, fixed-point DSPs handle tasks like motion compensation and transform coding with deterministic behavior, ensuring consistent performance in high-volume, battery-operated systems without the dynamic-range overhead of floating-point processors.

In recent years, fixed-point arithmetic has gained prominence in tiny machine learning (TinyML) applications for edge devices. Quantized fixed-point representations enable efficient inference of neural networks on microcontrollers and FPGAs, minimizing memory usage and power consumption while maintaining acceptable accuracy for on-device inference tasks. As of 2024, frameworks supporting fixed-point quantization have accelerated the deployment of models in IoT devices.

Financial software frequently adopts fixed-point arithmetic to represent monetary values precisely, mitigating the errors inherent in binary floating-point that could lead to discrepancies in transactions or billing. By scaling amounts to integers (e.g., treating dollars as multiples of cents), banking systems ensure exact decimal handling, avoiding cumulative errors that might otherwise accumulate to significant financial losses, such as millions in annual billing inaccuracies. This method aligns with human-readable decimal norms, providing reliability in applications like billing calculations and account balancing, where even minor precision losses are unacceptable.

In video games and physics simulations, fixed-point arithmetic facilitates consistent, low-latency computations on low-end hardware, such as older consoles or mobile devices, by replacing floating-point operations with integer-based scaling for predictable results. For physics engines simulating gravity, collisions, and related dynamics, fixed-point methods maintain uniform behavior across platforms, reducing resource demands and enabling real-time rendering without floating-point's variability. This is particularly evident in resource-optimized simulations, where fixed-point conversions between rendering and physics modules preserve accuracy while minimizing hardware overhead on embedded GPUs.
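As a small illustration of the monetary scaling described above, the following C snippet keeps amounts as integer cents and rounds a percentage calculation to the nearest cent; the 8.25% tax rate is an arbitrary value chosen only for the example.

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Monetary amounts stored as integer cents (scaling factor 1/100). */
    int64_t price_cents = 1999;                                  /* $19.99 */
    int64_t tax_cents   = (price_cents * 825 + 5000) / 10000;    /* 8.25% tax, rounded to nearest cent */
    int64_t total_cents = price_cents + tax_cents;

    printf("total: $%lld.%02lld\n",
           (long long)(total_cents / 100), (long long)(total_cents % 100));

    /* Every intermediate value is an exact integer, so repeated operations
       cannot accumulate binary rounding error the way floating-point
       dollar amounts can. */
    return 0;
}
```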

Detailed Numerical Examples

To illustrate binary fixed-point multiplication using an unsigned format, consider the numbers 1.5 and 0.75 represented in UQ1.3 format (1 integer bit and 3 fractional bits, total 4 bits, scaling factor 2^−3 = 0.125). The binary representation of 1.5 is 1100 (decimal 12, value 12 × 0.125 = 1.5), and 0.75 is 0110 (decimal 6, value 6 × 0.125 = 0.75). The multiplication proceeds by treating the representations as unsigned integers: 12 × 6 = 72 (binary 1001000). The product of two UQ1.3 numbers yields a UQ2.6 intermediate result (scaling 2^−6 = 0.015625), so the value is 72 × 0.015625 = 1.125. To fit back into UQ1.3, shift the integer product right by 3 bits (equivalent to dividing by 8): 72 ÷ 8 = 9 (binary 1001). The result 1001 in UQ1.3 represents 9 × 0.125 = 1.125, matching the exact product 1.5 × 0.75 = 1.125. In binary, the 7-bit product 1001000 shifted right by 3 becomes 1001 (padded as 00001001 in 8 bits for clarity before truncation to 4 bits).

For a decimal fixed-point multiplication example, take 12.34 and 5.67 with a scaling factor of 100 (two decimal places, so the representations are the integers 1234 and 567). Multiply the scaled integers: 1234 × 567 = 699678. The product scaling is 100 × 100 = 10000, so divide by 10000: 699678 ÷ 10000 = 69.9678. Rounding to two decimal places (common for this scale) gives 69.97, approximating the exact 12.34 × 5.67 = 69.9678. This method avoids floating-point operations while preserving precision up to the scale.

An example of fixed-point addition in binary uses unsigned UQ2.2 format (2 integer bits and 2 fractional bits, total 4 bits, scaling 2^−2 = 0.25) for 3.25 and 1.75. The representation of 3.25 is 1101 (decimal 13, value 13 × 0.25 = 3.25), and 1.75 is 0111 (decimal 7, value 7 × 0.25 = 1.75). Add the integers directly: 13 + 7 = 20 (binary 10100). The sum 10100 in an extended UQ3.2 format (to handle the carry) represents 20 × 0.25 = 5.00, matching 3.25 + 1.75 = 5.00. Note the overflow beyond 2 integer bits (maximum 3.75 in UQ2.2), requiring a wider format for the result.

For fixed-point division using a scaled reciprocal, consider 10 divided by 3 in unsigned UQ4.4 format (4 integer bits and 4 fractional bits, total 8 bits, scaling 2^−4 = 0.0625). Represent 10 as 10100000 (decimal 160, value 160 × 0.0625 = 10) and 3 as 00110000 (decimal 48, value 48 × 0.0625 = 3). To divide, approximate the reciprocal of 3 with scaling 2^8 = 256: 256 ÷ 3 ≈ 85.333, and take 85 (binary 01010101 in UQ0.8, value 85 × 2^−8 ≈ 0.3320). Multiply the dividend by this reciprocal: 10 × 0.3320 ≈ 3.320. In integer terms, scale 10 by 2^4 = 16 to get 160, multiply by 85: 160 × 85 = 13600, then adjust the scaling by dividing by 2^(4+8) = 4096: 13600 ÷ 4096 ≈ 3.3203. To fit the result into UQ4.4, scale 3.3203 by 16 (≈ 53.12) and round to 53 (binary 00110101, value 53 × 0.0625 = 3.3125), providing a result close to 10 ÷ 3 ≈ 3.3333, with error due to the truncated reciprocal; higher reciprocal scaling (e.g., 2^12) improves accuracy.
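The four worked examples above can be reproduced with ordinary integer operations; the short C program below follows the same scaling steps (the decimal case rounds by adding half the divisor before the integer division).

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* UQ1.3 multiplication: 1.5 * 0.75, scaling 2^-3. */
    uint8_t  a = 12, b = 6;                  /* 1.5 -> 1100b, 0.75 -> 0110b */
    uint16_t prod = (uint16_t)(a * b);       /* 72, a UQ2.6 intermediate */
    uint8_t  q = (uint8_t)(prod >> 3);       /* renormalize to UQ1.3: 9 */
    printf("UQ1.3 multiply: %u -> %.3f\n", (unsigned)q, q / 8.0);       /* 1.125 */

    /* Decimal fixed point, scale 100: 12.34 * 5.67. */
    int64_t x = 1234, y = 567;               /* values scaled by 100 */
    int64_t p = x * y;                       /* 699678, scaled by 10000 */
    int64_t r = (p + 50) / 100;              /* rescale to 100, round to nearest: 6997 */
    printf("decimal multiply: %lld -> %.2f\n", (long long)r, r / 100.0); /* 69.97 */

    /* UQ2.2 addition: 3.25 + 1.75, scaling 2^-2 (sum needs a wider UQ3.2). */
    uint8_t s = 13 + 7;                      /* 20 */
    printf("UQ2.2 add: %u -> %.2f\n", (unsigned)s, s / 4.0);            /* 5.00 */

    /* UQ4.4 division by scaled reciprocal: 10 / 3, using 1/3 ~ 85/256. */
    uint32_t dividend = 160;                 /* 10 in UQ4.4 */
    uint32_t recip    = 85;                  /* ~0.332 in UQ0.8 */
    uint32_t quot     = (dividend * recip) >> 8;  /* 13600 >> 8 = 53, i.e. 3.3125 in UQ4.4 */
    printf("UQ4.4 divide: %u -> %.4f\n", (unsigned)quot, quot / 16.0);  /* vs exact 3.3333 */
    return 0;
}
```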

Notations and Representations

Fixed-point arithmetic employs various notations to specify the format of numbers, particularly the position of the binary or decimal point and the allocation of bits for the integer and fractional parts. The most widely used notation for binary fixed-point representations is the Q format, originally popularized by Texas Instruments for describing signed fixed-point numbers in digital signal processing applications. In Q notation, a format denoted Qm.n indicates a signed binary fixed-point number with m bits for the integer portion (excluding the sign bit), n bits for the fractional portion, and an implicit binary point between them, resulting in a total word length of m + n + 1 bits including the sign bit. For instance, the Q15.16 format, common in 32-bit audio processing, allocates 15 bits to the integer part, 16 bits to the fraction, and 1 sign bit, allowing representation of values from −2^15 to 2^15 − 2^−16 with a resolution of 2^−16.

Variations of Q notation distinguish between signed and unsigned formats. Signed formats use the standard Qm.n to denote two's complement representation, where the most significant bit serves as the sign. Unsigned formats are prefixed with U, as in UQm.n, which omits the sign bit and uses m + n bits total, representing non-negative values from 0 to 2^m − 2^−n. Some contexts employ SQ for explicitly signed Q formats and UQ for unsigned, particularly in documentation for fixed-point libraries and hardware descriptions, to clarify the interpretation without altering the core m.n structure.

In decimal fixed-point contexts, notations often resemble programming language specifications, such as fixed(p,q), where p denotes the total number of digits and q the number of digits after the point. This style appears in arithmetic libraries for languages like C++ and in hardware description languages for simulating fixed-point operations, emphasizing precision in financial or legacy systems. Block floating-point serves as a related hybrid scheme, in which a block of numbers shares a common scaling exponent while maintaining fixed-point mantissas within the block, enabling efficient emulation of floating-point behavior on fixed-point hardware without individual exponents. Historically, early computers used fixed-point notations based on packed decimal formats, storing two digits per byte with an implied decimal point position tracked by the program and a sign code in the low-order nibble, supporting commercial applications with up to 16 digits per word.

References
