Q (number format)

from Wikipedia

Q notation is a way to specify the parameters of a binary fixed-point number format: how many bits are allocated for the integer portion, how many for the fractional portion, and whether there is a sign bit.

For example, in Q notation, Q7.8 means that the signed fixed point numbers in this format have 7 bits for the integer part and 8 bits for the fraction part. One extra bit is implicitly added for signed numbers.[1] Therefore, Q7.8 is a 16-bit word, with the most significant bit representing the two's complement sign bit.

Q7.8 bit layout: [ sign bit | 7-bit integer | 8-bit fraction ]

There is an ARM variation of the Q notation that explicitly adds the sign bit to the integer part. In ARM Q notation, the above format would be called Q8.8.

A number of other notations have been used for the same purpose.

Definition


General Format


Texas Instruments version


The Q notation, as defined by Texas Instruments,[1] consists of the letter Q followed by a pair of numbers m.n, where m is the number of bits used for the integer part of the value, and n is the number of fraction bits.

By default, the notation describes signed binary fixed point format, with the unscaled integer being stored in two's complement format, used in most binary processors. As such, the first bit always gives the sign of the value (1 = negative, 0 = non-negative), and it is not counted in the m parameter. Thus, the total number w of bits used is 1 + m + n.

For example, the specification Q3.12 describes a signed binary fixed-point number with word-size w = 16 bits in total, comprising the sign bit, three bits for the integer part, and 12 bits that are the fraction. This can be seen as a 16-bit signed (two's complement) integer, that is implicitly multiplied by the scaling factor 2^−12.

In particular, when n is zero, the numbers are just integers. If m is zero, all bits except the sign bit are fraction bits; then the range of the stored number is from −1.0 (inclusive) to +1.0 (exclusive).

The m and the dot may be omitted, in which case they are inferred from the size of the variable or register where the value is stored. Thus, Q12 means a signed integer with any number of bits, that is implicitly multiplied by 2^−12.

The letter U can be prefixed to the Q to denote an unsigned binary fixed-point format. For example, UQ1.15 describes values represented as unsigned 16-bit integers with an implicit scaling factor of 2^−15, which range from 0.0 to 65535/32768 ≈ 2.0.

ARM version


A variant of the Q notation has been in use by ARM in which the m number also counts the sign bit. For example, a 16-bit signed integer which the TI variant denotes as Q15.0, would be Q16.0 in the ARM variant.[2][3] Unsigned numbers are the same across both variants.

While technically the sign bit belongs just as much to the fractional part as the integer part, ARM's notation has the benefit that there are no implicit bits, so the size of the word is always m + n.

Characteristics


The resolution (difference between successive values) of a Qm.n or UQm.n format is always 2^−n. The range of representable values depends on the notation used:

Range of representable values in Q notation

Format            TI Notation              ARM Notation
Signed Qm.n       −2^m to +2^m − 2^−n      −2^(m−1) to +2^(m−1) − 2^−n
Unsigned UQm.n    0 to 2^m − 2^−n          0 to 2^m − 2^−n

For example, a Q14.1 format number requires 14+1+1 = 16 bits, has resolution 2^−1 = 0.5, and the representable values range from −2^14 = −16384.0 to +2^14 − 2^−1 = +16383.5. In hexadecimal, the negative values range from 0x8000 to 0xFFFF followed by the non-negative ones from 0x0000 to 0x7FFF.

Math operations


Q numbers are a ratio of two integers: the numerator is kept in storage, the denominator is equal to 2^n.

Consider the following example:

  • The Q8 denominator equals 2^8 = 256
  • 1.5 equals 384/256
  • 384 is stored, 256 is inferred because it is a Q8 number.

If the Q number's base is to be maintained (n remains constant), the Q number math operations must keep the denominator constant. With two Q numbers N1/D and N2/D sharing the denominator D = 2^n (in the example above, N1 is 384 and D is 256), the operations are:

  • Addition: (N1 + N2)/D
  • Subtraction: (N1 − N2)/D
  • Multiplication: N1 × N2 yields a denominator of D × D, so the product must be divided by D to restore the base
  • Division: N1/N2 cancels the denominator, so the numerator must first be multiplied by D

Because the denominator is a power of two, the multiplication can be implemented as an arithmetic shift to the left and the division as an arithmetic shift to the right; on many processors shifts are faster than multiplication and division.

To maintain accuracy, the intermediate multiplication and division results must be double precision and care must be taken in rounding the intermediate result before converting back to the desired Q number.

Using C, the operations are (note that here, Q refers to the number of fractional bits):

Addition

int16_t q_add(int16_t a, int16_t b)
{
    return a + b;
}

With saturation

int16_t q_add_sat(int16_t a, int16_t b)
{
    int16_t result;
    int32_t tmp;

    tmp = (int32_t)a + (int32_t)b;
    if (tmp > 0x7FFF)
        tmp = 0x7FFF;
    if (tmp < -1 * 0x8000)
        tmp = -1 * 0x8000;
    result = (int16_t)tmp;

    return result;
}

Unlike floating-point ±Inf, saturated results are not sticky: in the implementation shown, adding a negative value to a positive saturated value (0x7FFF) unsaturates it, and vice versa. In assembly language, the signed overflow flag can be used to avoid the typecasts needed in this C implementation.

Subtraction

int16_t q_sub(int16_t a, int16_t b)
{
    return a - b;
}

Multiplication

// precomputed value:
#define K   (1 << (Q - 1))
 
// saturate to range of int16_t
int16_t sat16(int32_t x)
{
    if (x > 0x7FFF) return 0x7FFF;
    else if (x < -0x8000) return -0x8000;
    else return (int16_t)x;
}

int16_t q_mul(int16_t a, int16_t b)
{
    int16_t result;
    int32_t temp;

    temp = (int32_t)a * (int32_t)b; // cast so the full product is computed in 32 bits
    // Rounding; mid values are rounded up
    temp += K;
    // Correct by dividing by base and saturate result
    result = sat16(temp >> Q);

    return result;
}

Division

int16_t q_div(int16_t a, int16_t b)
{
    /* Pre-multiply by the base (upscale to Q(2n) so that the quotient will be in Qn format) */
    int32_t temp = (int32_t)a << Q;
    /* Rounding: mid values are rounded up (down for negative values). */
    /* OR compare most significant bits i.e. if (((temp >> 31) & 1) == ((b >> 15) & 1)) */
    if ((temp >= 0 && b >= 0) || (temp < 0 && b < 0)) {   
        temp += b / 2;    /* OR shift 1 bit i.e. temp += (b >> 1); */
    } else {
        temp -= b / 2;    /* OR shift 1 bit i.e. temp -= (b >> 1); */
    }
    return (int16_t)(temp / b);
}

from Grokipedia
The Q format, also known as Q notation, is a binary fixed-point representation used to denote rational numbers by specifying the position of an implied binary point within a fixed word length, typically dividing the bits into integer and fractional portions. This format enables efficient arithmetic operations on fractional values using integer hardware, avoiding the overhead of floating-point units. In the standard Qm.n notation for signed numbers, m represents the number of integer bits excluding the sign bit, while n indicates the fractional bits after the binary point, resulting in a total word length of m + n + 1 bits including the sign bit. The value of a number in this format is interpreted as a signed integer scaled by 2^−n, providing a resolution of 2^−n and a range from −2^m to 2^m − 2^−n. For example, in Q3.4 (8-bit signed), the bit pattern 00001001₂ equals 9 × 2^−4 = 0.5625₁₀, while unsigned variants (UQm.n) omit the sign bit for positive ranges starting from 0. Variations in convention exist across vendors; for instance, some notations include the sign bit in m. Shorthand notations like Q15 in 16-bit DSP contexts typically refer to Q0.15 (1 sign bit + 15 fractional bits) to maximize fractional precision. The Q format is particularly prominent in resource-constrained environments like digital signal processors (DSPs), field-programmable gate arrays (FPGAs), and smart-contract programming, where it facilitates deterministic arithmetic and scaling without floating-point support. Arithmetic operations require attention to binary-point alignment, such as right-shifting the double-width product by n bits after a Qm.n multiplication, to maintain format consistency, often followed by saturation or rounding to prevent overflow. Its advantages include deterministic performance, lower power consumption, and simpler hardware implementation compared to floating-point, though it demands careful management of quantization errors and overflow.

Introduction

Definition and Notation

Fixed-point arithmetic provides an efficient method for representing and processing fractional numbers in digital systems, serving as a practical alternative to floating-point arithmetic in resource-constrained environments like embedded processors and field-programmable gate arrays (FPGAs), where it reduces hardware overhead and power usage compared to the more complex floating-point units. The Q format is a widely adopted notation for specifying fixed-point representations, denoted as Qm.n for signed numbers, where m indicates the number of bits dedicated to the integer portion, including the sign bit, and n specifies the bits for the fractional portion, resulting in a total bit width of m + n. This convention uses two's complement encoding, with the binary point implied after the m-th bit from the left (most significant bit). The numerical value is obtained by interpreting the entire bit string as a signed integer and dividing it by the scaling factor 2^n, which aligns the fractional bits correctly. For instance, the Q1.3 format employs 4 bits total: 1 for the integer part (encompassing the sign) and 3 for the fractional part. The bit pattern 0011 represents the signed integer 3, scaled by 2^−3 = 1/8, yielding 3 × (1/8) = 0.375 in decimal; similarly, 1011 (two's complement for −5) scales to −5 × (1/8) = −0.625. Vendor-specific implementations may introduce minor adaptations to this standard notation for particular hardware or software ecosystems.

Historical Development

The Q number format emerged in the late 1970s and early 1980s alongside the development of early digital signal processors (DSPs), driven by the need for efficient numerical representation in resource-constrained environments where floating-point hardware was prohibitively expensive and power-hungry. The notation was defined by Texas Instruments for use in their DSP products. Fixed-point arithmetic, including early forms of Q notation, enabled real-time signal processing by leveraging integer operations to approximate fractional values, thus avoiding the complexity of dedicated floating-point units. A key milestone occurred in 1982 when Texas Instruments introduced the TMS32010, the first commercial fixed-point DSP in the TMS320 series, which popularized fixed-point formats for embedded applications and laid the groundwork for standardized notations like Qm.n. This adoption was motivated by fixed-point's advantages in embedded systems, including lower hardware costs, reduced power consumption compared to floating-point alternatives, and more predictable, deterministic behavior that minimized issues like rounding variability in time-critical tasks. In the early 2000s, ARM extended the use of Q formats through software support in its Cortex-M processor family, beginning with the Cortex-M3 in 2004 and further solidified by the CMSIS-DSP library's inclusion of fixed-point functions (such as Q7, Q15, and Q31) starting around 2010, facilitating broader integration in low-power microcontrollers. Although IEEE 754 focused on floating-point standards without formalizing fixed-point, the Q notation gained traction post-1990s through DSP literature and discussions on portable fixed-point arithmetic, establishing it as a widely recognized convention in the field.

Format Specifications

General Qm.n Notation

The Qm.n notation provides a standardized way to describe fixed-point binary representations, where m denotes the number of bits allocated to the integer portion excluding the sign bit for signed variants and n denotes the number of bits for the fractional portion, resulting in a total of 1 + m + n bits including the sign bit. Note that conventions vary: in some (e.g., certain FPGA implementations), m includes the sign bit, leading to a total of m + n bits; the exclusion convention is common in DSP contexts. In this format, the binary point is fixed between the integer and fractional bits, allowing for efficient hardware arithmetic in resource-constrained environments like embedded processors. For signed Qm.n formats, the MSB serves as the sign bit, followed by m bits representing the integer magnitude, and then n bits for the fraction, with negative values encoded using two's complement arithmetic to facilitate straightforward operations. Unsigned variants, often denoted as UQm.n or similar, omit the sign bit and allocate m + n bits to the unsigned integer and fraction parts, restricting representation to non-negative values. This distinction ensures compatibility with both positive-only and bipolar signal needs, such as in audio or control systems. The flexibility of Qm.n lies in selecting m and n to balance range and precision according to application requirements; for instance, in 16-bit DSP for audio processing with signals in [−1, 1), Q1.15 (1 sign bit, 0 additional integer bits, 15 fractional bits, total 16 bits) provides high fractional resolution with range −1 to almost 1 and resolution 2^−15. Such choices stem from early DSP practices, where fixed-point formats were optimized for 16- or 32-bit architectures. Conversion between integer and Qm.n formats follows simple bit-manipulation rules: to represent an integer k in Qm.n, left-shift its binary form by n bits (effectively multiplying by 2^n) and truncate or round to fit the bit width, preserving the implied scaling.
Conversely, extracting the integer part from a Qm.n value involves right-shifting by n bits, using arithmetic shifts for signed numbers (to maintain the sign) or logical shifts for unsigned, with overflow handled via saturation or wrapping depending on the implementation, though the core rule ensures the fractional bits are discarded.

Vendor-Specific Variants

Texas Instruments implements Q formats in its digital signal processors (DSPs), particularly emphasizing saturation arithmetic to prevent overflow wrap-around by clipping results to the maximum or minimum representable values. In the TMS320C55x family of 16-bit fixed-point DSPs, the Q15 format is commonly used, where arithmetic operations support saturation mode via the SATD bit in the status register, ensuring results stay within the range of −1 to 0.999969482421875 for 16-bit signed fractional representation. This saturation feature is integral to the DSP library (DSPLIB), which optimizes common signal-processing functions for Q15 data, saturating outputs to 0x7FFF or 0x8000 as needed. For higher-performance applications, the C6000 series, including the TMS320C64x and C67x devices, supports Q formats through the IQmath library, which provides scalable fixed-point routines in Q15, Q31, and other variants with optional saturation during operations like multiplication and scaling. These libraries leverage the C6000's 32-bit registers, and the 40-bit extended-precision accumulators of earlier models, for intermediate computations, enhancing precision before final Q-format storage. ARM's implementation of Q formats is standardized in the CMSIS-DSP library for Cortex-M processors, focusing on efficient fixed-point arithmetic for embedded systems. The library defines Q7, Q15, and Q31 as purely fractional formats (Qm.n with m = 0), where the sign bit is implicit in the two's complement encoding and the remaining bits represent the fractional part scaled by 2^−n; for example, Q31 uses 32 bits with 31 fractional bits, mapping integer values from −2^31 to 2^31 − 1 to the real range −1 to almost 1. These formats are optimized for Cortex-M4 and M7 cores with DSP extensions, supporting operations like convolution and filtering with built-in saturation to avoid overflow. Treating 32-bit signed integers as pure fractions, with the binary point implied after the sign bit, facilitates seamless scaling in DSP tasks without explicit integer-to-fractional conversion.
NXP (formerly Freescale) adapts Q formats in its automotive microcontrollers and digital signal controllers, particularly through the Automotive Math and Motor Control Library (AMMCLib) for families like S32K and MPC57xx. This library employs fixed-point fractional formats equivalent to Q15 (16-bit signed fractional) and Q31 (32-bit signed fractional) for high-performance computations in motor control applications, with saturation arithmetic in core math functions to clip overflows. A notable variant in 16-bit contexts is Q2.14 for balanced integer-fractional representation, allowing a range up to approximately ±2 with 14-bit fractional precision for tasks like angle calculations or PID control. Differences in implementation include support for little-endian byte order in ARM-based NXP devices and variable accumulator widths, such as 40-bit extensions in legacy Freescale DSP cores for intermediate results. Vendor-specific Q formats generally align with the general Qm.n notation but introduce compatibility challenges due to implied scaling and overflow behaviors. For instance, ARM's Q15 corresponds to the general Q0.15, requiring care when interfacing with standard integer types, while TI's Q15 in saturation mode may produce clipped results incompatible with wrap-around expectations in non-TI environments. These mappings necessitate careful type conversion and attention to scaling factors during data exchange between vendor ecosystems.

Numerical Properties

Representable Range

In the Qm.n fixed-point format, signed numbers are represented using two's complement arithmetic, where m denotes the number of bits allocated to the integer portion (including the sign bit) and n the fractional portion. The representable range for signed Qm.n is from −2^(m−1) to 2^(m−1) − 2^−n. For example, in Q1.3 format, the range spans from −1 to 0.875. For unsigned variants, denoted as UQm.n, the format omits the sign bit and uses m bits for the integer part and n for the fraction, yielding a total of m + n bits. The representable range is from 0 to 2^m − 2^−n. An example is the unsigned Q0.3 format, which covers 0 to 0.875. The total bit width in Qm.n formats directly influences the dynamic range, particularly in applications where it quantifies the span from the smallest nonzero representable value to the maximum magnitude. For signed formats, this is approximately 6.02(m + n − 1) dB, highlighting how additional bits exponentially expand the achievable magnitude span relative to the resolution. In general Qm.n implementations using two's complement, overflow results in wrap-around behavior, where values exceeding the representable bounds modularly revert to the opposite end of the range (e.g., from maximum positive to minimum negative), distinguishing it from saturation modes employed in some vendor-specific hardware.

Precision and Rounding

In the Qm.n fixed-point number format, the fractional resolution, or the smallest distinguishable positive fractional value, is given by 2^−n, where n is the number of bits allocated to the fractional part. For instance, in the common Q1.15 format with 16 total bits (1 integer bit including the sign and 15 fractional bits), the resolution is 2^−15 = 1/32768 ≈ 0.000030517578125. This resolution determines the granularity with which real numbers can be approximated, directly impacting the accuracy of signal representations in applications like digital signal processing. When converting a real number to its Qm.n representation, quantization error arises due to the finite resolution. For rounding to the nearest representable value, the maximum quantization error is bounded by half the resolution, or 2^−(n+1). Truncation, a simpler mode that discards excess fractional bits, introduces a maximum error of up to 2^−n and a systematic bias, as the discarded bits always pull the result in one direction. To achieve rounding to nearest, a common implementation adds 2^−(n+1) (half the least significant bit) to the value before performing the right shift or truncation, minimizing the average error magnitude. In digital signal processors (DSPs), such as those from Texas Instruments, convergent rounding (also known as round-to-nearest-even) is often employed for values exactly halfway between representable numbers; it rounds to the nearest even value to further reduce long-term bias in iterative computations. These modes balance computational efficiency with error control, though truncation remains prevalent in resource-constrained embedded systems due to its simplicity. A key trade-off in Qm.n design is the allocation of bits between the integer part (m) and fractional part (n) for a fixed total bit width. Increasing n enhances fractional precision and reduces quantization error but correspondingly decreases m, limiting the representable range and increasing overflow risk for larger magnitudes.
This necessitates careful selection based on the expected signal range and required accuracy of the application.

Arithmetic Operations

Addition and Subtraction

Addition and subtraction of numbers in Q format are performed by treating them as scaled integers, leveraging the fixed position of the binary point to simplify operations compared to floating-point arithmetic. When both operands share the same scale (i.e., identical m and n in Qm.n), the binary points are already aligned, allowing direct integer addition or subtraction on the underlying bit representations using two's complement arithmetic. The result retains the same Qm.n format, provided no overflow occurs, as the scaling factor (2^n) applies uniformly to both inputs and the output. For operands with differing scales, alignment is achieved by shifting one number to match the other's binary point position. If the fractional parts differ (different n), the number with the smaller n is left-shifted by the difference in fractional bits, or the number with the larger n may be right-shifted to match, though the latter risks losing least significant bits and precision. The integer parts are aligned by sign-extending the shorter one if necessary. The resulting sum or difference uses the maximum of the input scales, Q max(ma, mb).max(na, nb), potentially requiring an extra bit for the integer part to accommodate carry or borrow. Subtraction follows the same alignment but negates one operand via two's complement before addition. Consider two numbers in Q2.2 format (5 bits total: sign + 2 integer + 2 fractional, scaling by 2^2 = 4). The binary representation 00110 (decimal 6) represents 6/4 = 1.5, and 11011 (two's complement −5) represents −5/4 = −1.25. Adding them: 00110 + 11011 = 00001 (1 in decimal), or 1/4 = 0.25 in Q2.2. For a scale mismatch, take Q2.2 (1.5 as 00110) and Q2.1 (0.5 as 0001 in 4 bits, decimal 1, scaled by 2^1 = 2). Left-shift the Q2.1 value by 1 bit and sign-extend to 5 bits: 00010 (decimal 2, value 2/4 = 0.5). Adding: 00110 + 00010 = 01000 (8 in decimal, value 8/4 = 2.0), which still fits in Q2.2, whose maximum is 3.75; sums nearer the limits would require an extended format like Q3.2 to avoid overflow.
Overflow detection is critical, as the result must stay within the representable range [−2^m, 2^m − 2^−n]. For addition, monitor the sign bits: if both inputs have the same sign but the result differs, overflow has occurred. The analogous condition applies to subtraction. To mitigate, use a wider format (e.g., Q(m+1).n) for intermediate results or saturation arithmetic to clamp extremes.

Multiplication and Scaling

In fixed-point arithmetic with the Qm.n format, multiplication begins by interpreting the operands as integers and performing a standard integer multiplication, yielding a product with doubled bit precision: specifically, up to 2m integer bits (excluding sign) and 2n fractional bits for unsigned representations, or adjusted for the sign bit in signed cases. This intermediate result represents the true product scaled by 2^2n, as each operand is implicitly scaled by 2^n. To restore the Qm.n format and preserve fractional precision, the product is right-shifted by n bits, effectively dividing by 2^n and realigning the scaling factor. For two operands in Qa.b and Qc.d formats, the unscaled product is in Q(a+c).(b+d); shifting right by (b + d − n) bits fits it into the target Qm.n, where the integer bits may expand to up to 2m before potential saturation or clipping. This discards the least significant bits of the fraction, introducing potential precision loss through truncation, though rounding can be applied by adding 2^(n−1) to the integer product prior to shifting for improved accuracy. For example, multiplying two numbers in Q1.3 format (each a 4-bit unsigned integer scaled by 2^3) produces an 8-bit product in Q2.6 format; right-shifting by 3 bits yields a result in Q2.3 format, with the expanded integer bits accommodating the doubled range. Consider operands 1.5 (12₁₀, binary 1100) and 0.875 (7₁₀, binary 0111): their integer product is 84₁₀ (binary 1010100); shifting right by 3 bits gives 10₁₀ (binary 1010), representing 1.25 in Q2.3 (exact value 1.3125, with truncation error). If the product overflows the target bit width before or after scaling, such as exceeding the representable range in the integer part, saturation is applied in vendor-specific implementations to clip the result to the maximum or minimum value, preventing the wrap-around errors of plain modular integer arithmetic.
For instance, Texas Instruments' IQmath library functions like _IQNrsmpy perform multiplication with built-in rounding and saturation to the Q-format limits (e.g., ±32 for Q26), ensuring numerical stability in digital signal processing applications. This approach contrasts with simple truncation, prioritizing bounded errors over unbounded overflow.

Division and Reciprocals

Division in Q-format fixed-point arithmetic is typically approximated by computing the reciprocal of the divisor and then multiplying it by the dividend, leveraging multiplication hardware for efficiency in resource-constrained environments like digital signal processors (DSPs). This reciprocal-multiplication method avoids the complexity of direct long division, which is costly in fixed-point implementations without dedicated divider units. Lookup tables or iterative approximations are used to estimate the reciprocal (1/divisor), with the result scaled to match the Q format's fractional bits. Newton's method, also known as Newton-Raphson iteration, provides a quadratically convergent approach for reciprocal computation in fixed-point arithmetic. Starting with an initial guess x₀ for the reciprocal of the divisor d (normalized to [0.5, 1) in Q1.m format for simplicity), subsequent estimates are refined using the iteration x_{k+1} = x_k (2 − d·x_k), where operations are performed in an extended-precision format (e.g., doubling the bit width to prevent overflow) and then scaled back. The initial guess can be obtained from a small lookup table based on the leading bits of d, providing 4-8 bits of accuracy. Each iteration roughly doubles the number of correct bits, making it suitable for Q formats with 16 or fewer bits. For a practical example in Q1.15 format (16-bit total, 1 integer bit, 15 fractional bits), consider dividing a numerator n by a divisor d (both in Q1.15). Compute the reciprocal r of d such that r ≈ 2^15/d (using Newton-Raphson with normalized d). Multiply n by r in extended precision (e.g., 32-bit, yielding a Q2.30 intermediate), and right-shift the product by 15 bits to obtain the Q1.15 quotient. Using Newton-Raphson, an initial 4-bit guess refined over 3 iterations achieves the full 15-bit fractional precision required.
Error analysis of iterative reciprocal methods in 16-bit formats shows that 3-4 Newton-Raphson steps from a modest initial approximation (e.g., an 8-bit table lookup) suffice to reach full machine precision, with relative errors bounded by 2^−16 after convergence. Quadratic convergence ensures rapid error reduction, though one bit of precision may be lost per iteration due to fixed-point truncation, necessitating a final correction step via scaling. This approach yields higher signal-to-quantization-noise ratios (SQNR > 55 dB) than cruder approximation methods in DSP applications.

Practical Applications

In Digital Signal Processing

In digital signal processing (DSP), Q formats enable efficient implementation of algorithms on fixed-point hardware, supporting real-time operations such as filtering and transforms in resource-constrained environments. Common formats like Q15 (or Q1.15) are widely used for finite impulse response (FIR) and infinite impulse response (IIR) filters, where the single integer bit and 15 fractional bits provide sufficient resolution for signals normalized to the range (−1, 1). These formats are also applied in fast Fourier transform (FFT) computations within audio and video codecs, leveraging the fixed scaling to maintain precision during frequency-domain processing. The primary benefits of Q formats in DSP include predictable execution timing and absence of the subnormal-number issues that plague floating-point arithmetic. Unlike floating-point operations, which can incur performance penalties from handling subnormal numbers near zero, fixed-point Q formats perform uniform integer-like arithmetic, ensuring consistent cycle counts essential for real-time applications. This efficiency supports implementations in audio decoding, where fixed-point processors handle inverse discrete cosine transforms and bit allocation without floating-point overhead, and in acoustic echo cancellation, where adaptive filters converge reliably on low-power devices. A key challenge in applying Q formats to DSP filters is coefficient scaling to prevent overflow during accumulation. Filter taps must often be scaled by dividing by 2^k (where k is chosen based on the maximum expected sum of absolute coefficients) to keep intermediate results within the representable range, such as ensuring the accumulator does not exceed the word length in Q31 operations. This technique, known as headroom scaling, trades some dynamic range for stability but is critical in cascaded IIR structures to avoid wraparound errors. A notable case study is the use of Texas Instruments (TI) DSPs in mobile phones from the 1990s to 2010s, where Q31 formats facilitated 32-bit accumulators for multiply-accumulate operations in voice and audio processing.
TI's TMS320 family, dominant in early digital cell phones, employed Q31 for efficient fixed-point arithmetic in tasks like channel equalization and speech coding, enabling the widespread adoption of DSP in portable devices during that era.

In Embedded and Microcontroller Systems

In embedded and microcontroller systems, the Q number format is widely adopted for its efficiency in resource-constrained environments, particularly on ARM Cortex-M processors. The CMSIS-DSP library provides optimized functions for Q7, Q15, and Q31 formats, enabling signal processing in applications such as motor control and sensor fusion without relying on floating-point hardware. For instance, the library includes PID controller functions like arm_pid_init_q15 and arm_pid_reset_q15, which process Q15 data for real-time feedback loops in brushless motor drives, achieving cycle-accurate performance on Cortex-M4 and M7 cores. In sensor fusion tasks, such as combining accelerometer and gyroscope data for attitude estimation, fixed-point formats facilitate low-latency implementations, reducing computational overhead compared to floating-point alternatives. A key advantage of Q formats in these systems is the elimination of the need for a floating-point unit (FPU), which lowers microcontroller die size and manufacturing costs while minimizing power consumption, a critical concern for battery-powered IoT devices. Cortex-M0 and M3 cores, lacking FPUs, leverage Q formats to maintain precision in control algorithms, with fixed-point hardware occupying significantly less silicon area than equivalent floating-point accelerators. Q formats are commonly used in PID controllers for temperature or position regulation on such microcontrollers, ensuring deterministic execution times essential for real-time systems. Implementation in C code for microcontrollers often involves bit shifting for scaling, such as in Q15 multiplications, to maintain the format and handle mixed integer and fixed-point operations. Saturation and overflow checks help prevent wrapping in control loops, ensuring stability in mixed-precision code. Developments since 2019 have extended quantized formats to microcontrollers for edge AI, where libraries such as PULP-NN optimize INT8 convolutions on clusters like the GAP8 for tasks like keyword spotting, providing energy-efficiency improvements over other fixed-point implementations.
This trend supports scalable deployment in IoT sensors, balancing accuracy and efficiency in post-training quantization workflows. As of 2023, hybrid Q formats have been applied in neuromorphic processors to enable energy-efficient fixed-point multiplication for AI inference.
