X87
from Wikipedia

x87 is a floating-point-related subset of the x86 architecture instruction set. It originated as an extension of the 8086 instruction set in the form of optional floating-point coprocessors (FPUs) that work in tandem with corresponding x86 CPUs. These microchips have names ending in "87". The extension is also known as the NPX (numeric processor extension). Like other extensions to the basic instruction set, x87 instructions are not strictly needed to construct working programs, but they provide hardware and microcode implementations of common numerical tasks, allowing these tasks to be performed much faster than corresponding machine code routines can. The x87 instruction set includes instructions for basic floating-point operations such as addition, subtraction and comparison, but also for more complex numerical operations, such as the computation of the tangent function and its inverse.

Most x86 processors since the Intel 80486 have had these x87 instructions implemented in the main CPU, but the term is sometimes still used to refer to that part of the instruction set. Before x87 instructions were standard in PCs, compilers or programmers had to use rather slow library calls to perform floating-point operations, a method that is still common in (low-cost) embedded systems.

Description


The x87 registers form an eight-level deep non-strict stack structure ranging from ST(0) to ST(7) with registers that can be directly accessed by either operand, using an offset relative to the top, as well as pushed and popped. (This scheme may be compared to how a stack frame may be both pushed/popped and indexed.)

There are instructions to push, calculate, and pop values on top of this stack; unary operations (FSQRT, FPTAN etc.) then implicitly address the topmost ST(0), while binary operations (FADD, FMUL, FCOM, etc.) implicitly address ST(0) and ST(1). The non-strict stack model also allows binary operations to use ST(0) together with a direct memory operand or with an explicitly specified stack register, ST(x), in a role similar to a traditional accumulator (a combined destination and left operand). This can also be reversed on an instruction-by-instruction basis with ST(0) as the unmodified operand and ST(x) as the destination. Furthermore, the contents in ST(0) can be exchanged with another stack register using an instruction called FXCH ST(x).
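The non-strict stack described above can be sketched as a toy model in Python. This is an illustrative simulation only, not how any hardware implements it: the class name `X87Stack` and its method names are hypothetical, values are ordinary Python floats rather than 80-bit registers, and only a handful of instructions are modelled.

```python
class X87Stack:
    """Toy model of the x87 register stack: 8 physical slots plus a TOP index."""

    def __init__(self):
        self.regs = [None] * 8   # None marks an empty register
        self.top = 0             # physical index currently acting as ST(0)

    def _phys(self, i):
        # ST(i) is addressed relative to TOP, modulo 8.
        return (self.top + i) % 8

    def st(self, i=0):
        value = self.regs[self._phys(i)]
        if value is None:
            raise IndexError(f"stack underflow: ST({i}) is empty")
        return value

    def fld(self, value):
        # Push: decrement TOP, then write the new ST(0).
        new_top = (self.top - 1) % 8
        if self.regs[new_top] is not None:
            raise OverflowError("stack overflow: target register is not empty")
        self.top = new_top
        self.regs[new_top] = value

    def fstp(self):
        # Pop: read ST(0), mark it empty, then increment TOP.
        value = self.st(0)
        self.regs[self.top] = None
        self.top = (self.top + 1) % 8
        return value

    def fxch(self, i=1):
        # Exchange ST(0) with ST(i); the stack depth is unchanged.
        self.st(0), self.st(i)   # both registers must hold values
        a, b = self._phys(0), self._phys(i)
        self.regs[a], self.regs[b] = self.regs[b], self.regs[a]

    def faddp(self):
        # ST(1) <- ST(1) + ST(0), then pop.
        self.regs[self._phys(1)] = self.st(1) + self.st(0)
        self.fstp()

    def fmul_st(self, i):
        # ST(0) <- ST(0) * ST(i): ST(0) acts as the accumulator.
        self.regs[self._phys(0)] = self.st(0) * self.st(i)


# Evaluate (2 + 3) * 4 the way hand-written x87 code might.
stack = X87Stack()
stack.fld(4.0)
stack.fld(3.0)
stack.fld(2.0)
stack.faddp()        # ST(0) = 5.0, ST(1) = 4.0
stack.fmul_st(1)     # ST(0) = 20.0, ST(1) still 4.0
result = stack.fstp()
```

After the final pop, `result` holds 20.0 and the untouched operand 4.0 has become ST(0) again, which illustrates how a register can be indexed as ST(x) without being consumed.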

These properties make the x87 stack usable as seven freely addressable registers plus a dedicated accumulator (or as seven independent accumulators). This is especially applicable on superscalar x86 processors (such as the Pentium of 1993 and later), where these exchange instructions (codes D9C8..D9CFh) are optimized down to a zero clock penalty by using one of the integer paths for FXCH ST(x) in parallel with the FPU instruction. Despite being natural and convenient for human assembly language programmers, some compiler writers have found it complicated to construct automatic code generators that schedule x87 code effectively. Such a stack-based interface potentially can minimize the need to save scratch variables in function calls compared with a register-based interface[1] (although, historically, design issues in the 8087 implementation limited that potential.[2][3])

The x87 provides single-precision, double-precision and 80-bit double-extended precision binary floating-point arithmetic as per the IEEE 754-1985 standard. By default, the x87 processors all use 80-bit double-extended precision internally (to allow sustained precision over many calculations, see IEEE 754 design rationale). A given sequence of arithmetic operations may thus behave slightly differently compared to a strict single-precision or double-precision IEEE 754 FPU.[4] Because this can be problematic for some semi-numerical calculations written to assume double precision for correct operation, the x87 can be configured through a special configuration/status register to automatically round to single or double precision after each operation. Since the introduction of SSE2, the x87 instructions are not as essential as they once were, but they remain important as a high-precision scalar unit for numerical calculations that are sensitive to round-off error and require the 64-bit mantissa precision and extended range available in the 80-bit format.
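One way in which extended-precision intermediates can change a result is double rounding: a value rounded first to a 64-bit significand and then to a 53-bit significand may differ from the same value rounded once to 53 bits. The following Python sketch models this with exact rational arithmetic. The helper names `floor_log2` and `round_to_precision` are made up for this illustration, the model handles only positive values, and it ignores exponent range limits.

```python
from fractions import Fraction


def floor_log2(x):
    """Largest integer e with 2**e <= x, for a positive Fraction x."""
    e = x.numerator.bit_length() - x.denominator.bit_length()
    if Fraction(2) ** e > x:
        e -= 1
    return e


def round_to_precision(x, bits):
    """Round positive x to `bits` significand bits, ties to even."""
    scale = Fraction(2) ** (bits - 1 - floor_log2(x))
    # round() on a Fraction rounds half to even and returns an int.
    return round(x * scale) / scale


# A value just above the midpoint between two adjacent doubles.
exact = 1 + Fraction(1, 2**53) + Fraction(1, 2**64)

# Extended first (64-bit significand), then double (53-bit significand).
via_extended = round_to_precision(round_to_precision(exact, 64), 53)

# Rounded once, directly to double precision.
direct = round_to_precision(exact, 53)
```

Here `direct` is 1 + 2^-52, while `via_extended` is exactly 1: the first rounding lands on a tie, and the second rounding resolves that tie downward. Setting the precision control to double avoids this particular effect for the significand, at the cost of the extra precision.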

Performance


Clock cycle counts for examples of typical x87 FPU instructions (only register-register versions shown here).[5]

The A...B notation (minimum to maximum) covers timing variations dependent on transient pipeline status and the arithmetic precision chosen (32, 64 or 80 bits); it also includes variations due to numerical cases (such as the number of set bits, zero, etc.). The L → H notation depicts values corresponding to the lowest (L) and the highest (H) maximal clock frequencies that were available.

| x87 implementation | FADD | FMUL | FDIV | FXCH | FCOM | FSQRT | FPTAN | FPATAN | Max clock (MHz) | Peak FMUL (millions/s) | FMUL§ rel. 5 MHz 8087 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 8087 | 70…100 | 90…145 | 193…203 | 10…15 | 40…50 | 180…186 | 30…540 | 250…800 | 5 → 10 | 0.034…0.055 → 0.100…0.111 | 1 → 2× as fast |
| 80287 (original) | as 8087 | as 8087 | as 8087 | as 8087 | as 8087 | as 8087 | as 8087 | as 8087 | 6 → 12 | 0.041…0.066 → 0.083…0.133 | 1.2 → 2.4× |
| 80387 (and later 287 models) | 23…34 | 29…57 | 88…91 | 18 | 24 | 122…129 | 191…497 | 314…487 | 16 → 33 | 0.280…0.552 → 0.580…1.1 | ~10 → 20× |
| 80486 (or 80487) | 8…20 | 16 | 73 | 4 | 4 | 83…87 | 200…273 | 218…303 | 16 → 50 | 1.0 → 3.1 | ~18 → 56× |
| Cyrix 6x86, Cyrix MII | 4…7 | 4…6 | 24…34 | 2 | 4 | 59…60 | 117…129 | 97…161 | 66 → 300 | 11…16 → 50…75 | ~320 → 1400× |
| AMD K6 (including K6 II/III) | 2 | 2 | 21…41 | 2 | 3 | 21…41 | ? | ? | 166 → 550 | 83 → 275 | ~1500 → 5000× |
| Pentium / Pentium MMX | 1…3 | 1…3 | 39 | 1 (0*) | 1…4 | 70 | 17…173 | 19…134 | 60 → 300 | 20…60 → 100…300 | ~1100 → 5400× |
| Pentium Pro | 1…3 | 2…5 | 16…56 | 1 | | 28…68 | ? | ? | 150 → 200 | 30…75 → 40…100 | ~1400 → 1800× |
| Pentium II / III | 1…3 | 2…5 | 17…38 | 1 | | 27…50 | ? | ? | 233 → 1400 | 47…116 → 280…700 | ~2100 → 13000× |
| Athlon (K7) | 1…4 | 1…4 | 13…24 | 1…2 | | 16…35 | ? | ? | 500 → 2330 | 125…500 → 580…2330 | ~9000 → 42000× |
| Athlon 64 (K8) | as K7 | as K7 | as K7 | as K7 | as K7 | as K7 | as K7 | as K7 | 1000 → 3200 | 250…1000 → 800…3200 | ~18000 → 58000× |
| Pentium 4 | 1…5 | 2…7 | 20…43 | multiple cycles | 1 | 20…43 | ? | ? | 1300 → 3800 | 186…650 → 543…1900 | ~11000 → 34000× |
* An effective zero clock delay is often possible, via superscalar execution.
§ The 5 MHz 8087 was the original x87 processor. Compared to typical software-implemented floating-point routines on an 8086 (without an 8087), the factors would be even larger, perhaps by another factor of 10 (i.e., a correct floating-point addition in assembly language may well consume over 1000 cycles).
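The "Peak FMUL" column follows from dividing the clock frequency by the FMUL cycle count. A minimal Python sketch of that arithmetic, using figures from the table, is shown below; the function name `peak_rate_millions` is an assumption for this illustration, and the model ignores overlap between instructions.

```python
def peak_rate_millions(clock_mhz, cycles_per_op):
    """Operations per second, in millions, if one operation completes
    every `cycles_per_op` clock cycles at `clock_mhz` MHz."""
    return clock_mhz / cycles_per_op


# 8087 at 5 MHz, FMUL taking 90 to 145 cycles.
slow_8087 = peak_rate_millions(5, 145)   # about 0.034 million FMUL/s
fast_8087 = peak_rate_millions(5, 90)    # about 0.056 million FMUL/s

# 80486 with a 16-cycle FMUL at 16 MHz and at 50 MHz.
i486_low = peak_rate_millions(16, 16)    # 1.0 million FMUL/s
i486_high = peak_rate_millions(50, 16)   # 3.125 million FMUL/s

# Speed relative to the best case of the 5 MHz 8087.
i486_low_relative = i486_low / fast_8087  # about 18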

Manufacturers


Companies that have designed or manufactured[a] floating-point units compatible with the Intel 8087 or later models include AMD (287, 387, 486DX, 5x86, K5, K6, K7, K8), Chips and Technologies (the Super MATH coprocessors), Cyrix (the FasMath, Cx87SLC, Cx87DLC, etc., 6x86, Cyrix MII), Fujitsu (early Pentium Mobile etc.), Harris Semiconductor (manufactured 80387 and 486DX processors), IBM (various 387 and 486 designs), IDT (the WinChip, C3, C7, Nano, etc.), IIT (the 2C87, 3C87, etc.), LC Technology (the Green MATH coprocessors), National Semiconductor (the Geode GX1, Geode GXm, etc.), NexGen (the Nx587), Rise Technology (the mP6), ST Microelectronics (manufactured 486DX, 5x86, etc.), Texas Instruments (manufactured 486DX processors etc.), Transmeta (the TM5600 and TM5800), ULSI (the Math·Co coprocessors), VIA (the C3, C7, and Nano, etc.), Weitek (the 1067, 1167, 3167 and 4167), and Xtend (the 83S87SX-25 and other coprocessors).

Architectural generations


8087


The 8087 was the first math coprocessor for 16-bit processors designed by Intel. It was released in 1980 to be paired with the Intel 8088 or 8086 microprocessors. (Intel's earlier 8231 and 8232 floating-point processors, marketed for use with the i8080 CPU, were in fact licensed versions of AMD's Am9511 and Am9512 FPUs from 1977 and 1979.[6])

80C187

16 MHz version of the Intel 80C187

Although the original 1982 datasheet for the (NMOS-based) 80188 and 80186 seems to mention specific math coprocessors,[7] both chips were actually paired with an 8087.

However, in 1987, in order to work with the refreshed CMOS based Intel 80C186 CPU, Intel introduced the 80C187[8] math coprocessor. The 80C187 interface to the main processor is the same as that of the 8087, but its core is essentially that of an 80387SX and is thus fully IEEE 754-compliant and capable of executing all the 80387's extra instructions.[9]

80287


The 80287 (i287), released in 1982, is the math coprocessor for the Intel 80286 series of microprocessors. Intel's models included variants with specified upper frequency limits ranging from 6 up to 12 MHz. The NMOS versions were available at 6, 8 and 10 MHz.[10] The 10 MHz Intel 80287-10 Numerics Coprocessor was available for USD $250 in quantities of 100.[11] Boxed versions of the 80287, 80287-8, and 80287-10 were available for USD $212, $326, and $374 respectively, and a boxed version of the 80C287A was available for USD $457.[12]

Other 287 models with 387-like performance are the Intel 80C287, built using CHMOS III, and the AMD 80EC287 manufactured in AMD's CMOS process, using only fully static gates.

Later followed the i80287XL, which has the 387SX microarchitecture with a 287 pinout,[13] the i80287XLT, a special version intended for laptops, as well as other variants. The 80287XL contains an internal 3/2 multiplier, so that motherboards that ran the coprocessor at 2/3 of the CPU speed could instead run the FPU at the same speed as the CPU. Both the 80287XL and the 80287XLT offered 50% better performance, 83% less power consumption, and additional instructions.[14]

The 80287 works with the 80386 microprocessor and was initially the only coprocessor available for the 80386 until the introduction of the 80387 in 1987. However, the 80387 is strongly preferred for its higher performance and more capable instruction set.

80387

Intel 80387 CPU die image

The 80387 (387 or i387) is the first Intel coprocessor to be fully compliant with the IEEE 754-1985 standard. Released in 1987,[15] two years after the 386 chip, the i387 offers much improved speed over Intel's previous 8087/80287 coprocessors and improved characteristics of its trigonometric functions.

It was made available for USD $500 in quantities of 100.[16] Shortly afterwards, it was made available through Intel's Personal Computer Enhancement Operation for a retail market price of USD $795.[17] The 25 MHz version was available in the retail channel for USD $1395.[18] The 33 MHz version of the 387DX has a performance of 3.4 megawhetstones per second.[20] Boxed versions of the 16-, 20-, 25-, and 33-MHz 387DX math coprocessor were available for USD $570, $647, $814, and $994 respectively.[21]

The Intel M387 math coprocessor met the MIL-STD-883 Rev. C standard. Its testing included temperature cycling between -55 and 125 °C, hermetic sealing, and extended burn-in. This military version operates at 16 MHz and was available in 68-lead PGA and quad flatpack packages; the PGA version was priced at USD $1155 in quantities of 100.[19]

The 8087 and 80287's FPTAN and FPATAN instructions are limited to an argument in the range ±π/4 (±45°), and the 8087 and 80287 have no direct instructions for the SIN and COS functions.[22]

Without a coprocessor, the 386 normally performs floating-point arithmetic through (relatively slow) software routines, implemented at runtime through a software exception handler. When a math coprocessor is paired with the 386, the coprocessor performs the floating-point arithmetic in hardware, returning results much faster than an (emulating) software library call.

The i387 is compatible only with the standard i386 chip, which has a 32-bit processor bus. The later cost-reduced i386SX, which has a narrower 16-bit data bus, cannot interface with the i387's 32-bit bus. The i386SX requires its own coprocessor, the 80387SX, which is compatible with the SX's narrower 16-bit data bus. Intel also released a low-power version of the 387SX coprocessor.[20]

In addition, to pair with the i386SL used in laptops, Intel released the i387SL (N80387SL).[23] Marketed as "Intel387 SL Mobile Math CoProcessor", it included power-management features which allowed it to run without significantly reducing battery life. There are two battery-saving power-down features. The first one stops the coprocessor's clock when the CPU goes into "stop clock" mode; the 387SL consumes about 25 microamperes when its clock is stopped. The second one operates automatically when the CPU is running, putting the 387SL into "idle mode" when it is not executing an instruction. When active, the 387SL typically consumes 30 percent less battery power (about 100 mA) than the 387SX. In idle mode, it consumes 4 mA, a 96 percent power reduction compared to the active mode. It works in the range of 16 to 25 MHz and does not require BIOS or hardware reconfiguration.[24] It was initially available for USD $189.[25]

80487

i487SX

Introduced in 1991, the i487SX (P23N) was marketed as a floating-point unit coprocessor for Intel i486SX machines. It actually contained a full-blown i486DX implementation. When installed into an i486SX system, the i487 disabled the main CPU and took over all CPU operations. The i487 manual claims that the unit would not function without an i486SX in place, but independent testing has revealed otherwise.[26][27]

The i487 used a special 169-pin socket with an unconnected (physical keying) pin to prevent insertion into the regular 168-pin 486 socket. One source claims that the socket is the same as Socket 1, the upgrade socket for i486 OverDrive, a processor replacement in a similar vein.[28]

The FPU instruction set of i486DX/i487SX was not different from the 387, but integration provided a bus utilisation benefit. On-chip algorithms were also improved.

Nx587


NexGen's Nx587 FPU for the Nx586 processor, released in 1995, was the last x87 coprocessor to be manufactured separately from the CPU.

from Grokipedia
The x87 floating-point unit (FPU) is a specialized coprocessor integrated into Intel 64 and IA-32 processors, designed to perform high-precision arithmetic operations on floating-point, integer, and binary-coded decimal (BCD) data types, in compliance with the IEEE Standard 754 for binary floating-point arithmetic. Introduced originally as the separate 8087 math coprocessor in 1980 to extend the capabilities of the 8086 processor, the x87 FPU became an on-chip component starting with the 80486 processor. It remains available across the processor's operating modes, including 64-bit mode, and supports applications such as scientific computing.

Architecturally, the x87 FPU employs a stack-based register file consisting of eight 80-bit data registers (ST0 through ST7), managed by a top-of-stack (TOP) pointer in the 16-bit status word, which also tracks condition codes and exceptions. It supports three primary floating-point formats: single precision (32 bits), double precision (64 bits), and double-extended precision (80 bits with a 64-bit mantissa and 15-bit exponent). Integer and packed BCD types are also supported, enabling automatic conversions and operations like addition, multiplication, division, square roots, and transcendental functions such as sine and logarithm. The unit includes dedicated control, status, and tag registers to configure rounding modes (e.g., round to nearest or toward zero), precision control, and exception handling for conditions like overflow, underflow, and invalid operations, with masking options to prevent interrupts.

Complementing the x87 FPU, modern processors incorporate SIMD extensions like SSE and AVX for vectorized floating-point processing, but the x87 remains relevant for legacy compatibility, high-precision scalar computations, and state management via instructions such as FINIT for initialization and FSAVE/FRSTOR for saving and restoring the full execution environment. Over 70 instructions form the x87 instruction set, categorized into data transfer, arithmetic, comparison, and control operations, and they cover the cases that IEEE 754 compliance requires, including handling of NaNs, infinities, and denormalized numbers.

Overview

Purpose and Evolution

The x87 is the original floating-point unit (FPU) and associated hardware for the x86 family of processors, introduced to provide dedicated support for floating-point operations that were absent in the base integer-focused 8086 and 8088 microprocessors of the late 1970s. Developed by Intel in the late 1970s and early 1980s, the x87 addressed the limitations of early x86 CPUs, which handled only integer computations and relied on software emulation for floating-point tasks, resulting in significant performance penalties for numerical applications. The initial implementation, the 8087 coprocessor, was announced in 1980 as a companion to the 8086 to enable hardware acceleration of mathematical operations essential for scientific and engineering workloads.

The primary motivation for the x87's creation stemmed from the growing demand in personal computing for efficient numerical computation, particularly in scientific fields, where software-based floating-point emulation on integer processors could slow computations by up to two orders of magnitude (roughly 100 times) compared with dedicated hardware. Intel enlisted numerical analyst William Kahan as a consultant in 1976 to design a robust floating-point system, leading to the x87's emphasis on accuracy and standardization. This collaboration influenced the broader IEEE 754 standard for binary floating-point arithmetic, with the x87 providing implementations of single-precision (32-bit), double-precision (64-bit), and an 80-bit extended-precision format to support higher accuracy in intermediate calculations.

The x87's architecture evolved from a discrete coprocessor model, in which the 8087 interfaced with the main CPU via a shared bus and specialized escape instructions (opcodes D8h-DFh) that invoke floating-point operations, synchronized by the WAIT instruction (for which FWAIT is an alternative mnemonic) to ensure completion before proceeding. This design allowed optional integration, boosting adoption in systems like the IBM PC. The coprocessor generations that followed the 8087, the 80287 and 80387, refined this approach. With the introduction of the 80486 microprocessor in 1989, Intel integrated the x87 FPU directly on-chip in the 80486DX variant, eliminating the need for a separate coprocessor and improving latency and efficiency for floating-point tasks. On-die integration then became standard, solidifying the x87's role in x86 evolution.

Role in x86 Computing

The x87 functions as a coprocessor, originally designated the Numeric Processor eXtension (NPX), interfacing with the 8086 and 80286 processors through a shared multiplexed address-data bus and dedicated control lines for synchronization and status signaling. This connection enables the x87 to access the same system memory as the integer unit, facilitating data transfer via common memory locations, while internal status and control words manage operational states such as exception masks and rounding modes. Synchronization between the main CPU and the x87 is achieved primarily through the FWAIT instruction, which halts the CPU until the x87 completes any pending operation and any pending unmasked exceptions have been handled, ensuring sequential execution in mixed integer and floating-point code. Error conditions in the x87, such as overflows or invalid operations, are signaled via interrupts, with interrupt 16 (#MF) triggered for floating-point errors when the numeric error (NE) flag in CR0 is enabled, allowing software to handle exceptions through dedicated handlers.

In software, the x87 is used through assembly instructions such as FLD for loading values onto the register stack and FSTP for storing results and popping the stack, enabling direct manipulation of floating-point data in low-level code. Early high-level support emerged with compilers like Microsoft C 5.0 in 1987, which by default generated inline instructions for 8087 or 80287 coprocessors to handle floating-point operations, with a fallback to a software emulation library for systems lacking the hardware. Later x86 processors maintain backward compatibility with x87 instructions, integrating the FPU on-chip while preserving the original programming interface for legacy code execution. In x86-64 mode, the x87 remains available even though SSE2 support is mandatory there for floating-point operations, so applications relying on extended-precision formats or historical binaries continue to work.

The x87's integration profoundly influenced the x86 ecosystem by enabling efficient floating-point computation in early DOS and Windows applications, such as scientific simulations, that previously depended on slow software emulation. On hardware without a dedicated coprocessor, software floating-point emulators in compiler libraries allowed x87-compatible code to run anyway, broadening accessibility for math-intensive programs.

Core Architecture

Register Stack and Data Handling

The x87 FPU employs eight 80-bit floating-point registers, denoted ST(0) through ST(7), arranged in a stack-based register file that facilitates operand management for floating-point computations. The registers operate on a last-in, first-out (LIFO) basis, with ST(0) serving as the top of the stack. A 3-bit top-of-stack (TOP) pointer, located in bits 11 through 13 of the status word, dynamically indicates which physical register currently occupies the top position, enabling implicit addressing relative to ST(0). This design allows instructions to reference operands without explicit physical register numbers, promoting efficient stack manipulation while limiting direct access to eight physical registers.

Core stack operations include pushing and popping values to handle data flow. The FLD instruction pushes a value onto the stack by decrementing the TOP pointer and loading the value into the new ST(0), so that the previous ST(0) becomes ST(1). Conversely, the FSTP instruction pops the top value by storing the contents of ST(0) to memory or another register, marking the register empty, and incrementing the TOP pointer, so that the previous ST(1) becomes ST(0). For scenarios requiring non-destructive access, such as swapping operands without altering the stack depth, the FXCH instruction exchanges the contents of ST(0) with another stack register, preserving the TOP position. The TOP pointer wraps around modulo 8, so a stack overflow occurs when a push targets a register that is not empty, and a stack underflow occurs when an instruction reads an empty register; both trigger an invalid-operation exception with the stack-fault flag set.

The status word, a 16-bit register, encapsulates critical runtime information about the FPU's operational state. It includes the TOP pointer for stack tracking, condition codes C0 through C3 that reflect comparison outcomes (e.g., greater than, less than, or equal), and flags for floating-point exceptions such as invalid operation, denormal operand, zero divide, overflow, underflow, and precision. Additional bits cover the exception summary (indicating any unmasked pending exception), stack fault (signaling overflow or underflow), and busy flag (denoting ongoing FPU activity). The following table outlines the status word's bit structure:
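| Bit Position | Field | Description |
|---|---|---|
| 15 | B | Busy flag: 1 indicates the FPU is executing an instruction (on later processors it mirrors ES). |
| 14 | C3 | Condition code 3: set when a comparison finds the operands equal or unordered; also used by FXAM and the partial-remainder instructions. |
| 13-11 | TOP | Top-of-stack pointer: 3-bit value (0-7) giving the physical register that is the current ST(0). |
| 10 | C2 | Condition code 2: set for an unordered comparison; also signals an incomplete partial remainder or an out-of-range transcendental operand. |
| 9 | C1 | Condition code 1: distinguishes stack overflow (1) from underflow (0) after a stack fault, and reports rounding direction after an inexact result. |
| 8 | C0 | Condition code 0: set when a comparison finds ST(0) less than the source, or unordered. |
| 7 | ES | Exception summary: 1 if any unmasked exception is pending. |
| 6 | SF | Stack fault: 1 if stack overflow or underflow occurred. |
| 5 | PE | Precision exception: 1 if a result was inexact. |
| 4 | UE | Underflow exception: 1 if underflow occurred. |
| 3 | OE | Overflow exception: 1 if overflow occurred. |
| 2 | ZE | Zero divide exception: 1 if division by zero was attempted. |
| 1 | DE | Denormal operand exception: 1 if a denormal operand was used. |
| 0 | IE | Invalid operation exception: 1 for invalid operations (e.g., signaling NaN operands or stack faults). |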
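A status word can be unpacked with shifts and masks. The following Python sketch assumes the bit layout given in Intel's architecture manuals (B in bit 15, C3 in bit 14, TOP in bits 13-11, C2, C1 and C0 in bits 10-8, ES and SF in bits 7-6, and the six exception flags PE, UE, OE, ZE, DE, IE in bits 5-0). The function name `decode_status_word` and the sample values are illustrative only.

```python
# Flags occupying bits 0..7 of the status word, lowest bit first.
LOW_FLAGS = ["IE", "DE", "ZE", "OE", "UE", "PE", "SF", "ES"]


def decode_status_word(sw):
    """Split a 16-bit x87 status word into named fields."""
    fields = {name: (sw >> bit) & 1 for bit, name in enumerate(LOW_FLAGS)}
    fields["C0"] = (sw >> 8) & 1
    fields["C1"] = (sw >> 9) & 1
    fields["C2"] = (sw >> 10) & 1
    fields["TOP"] = (sw >> 11) & 0b111
    fields["C3"] = (sw >> 14) & 1
    fields["B"] = (sw >> 15) & 1
    return fields


# Hypothetical value: TOP = 7 and C3 = 1, as after an "equal" comparison
# with one value on the stack.
after_compare = decode_status_word(0x7800)

# Hypothetical value: invalid-operation and zero-divide flags both set.
after_faults = decode_status_word(0x0005)
```

In real code the word would come from FNSTSW (or FSTSW); here it is just an integer literal.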
This structure enables software to query and respond to FPU conditions efficiently. Complementing the status word, the 16-bit control word governs the FPU's behavioral parameters, including exception masking to suppress interrupts for specific errors, precision selection (24-bit for single, 53-bit for double, or 64-bit for extended), and rounding modes that determine how inexact results are rounded (to nearest with ties to even, toward zero, toward positive infinity, or toward negative infinity). Masking lets non-critical applications continue with a default result instead of taking an exception, while precision and rounding settings align computations with standards or application needs. The control word's bit layout is as follows:
| Bit Position | Field | Description |
|---|---|---|
| 15-13 | Reserved | Reserved. |
| 12 | X | Infinity control: meaningful only on the 8087 and 80287; ignored by later FPUs. |
| 11-10 | RC | Rounding control: 00=nearest (ties to even), 01=toward -∞, 10=toward +∞, 11=toward 0. |
| 9-8 | PC | Precision control: 00=single (24-bit), 01=reserved, 10=double (53-bit), 11=extended (64-bit). |
| 7-6 | Reserved | Reserved. |
| 5 | PM | Precision exception mask: 1 to mask precision errors. |
| 4 | UM | Underflow exception mask: 1 to mask underflow. |
| 3 | OM | Overflow exception mask: 1 to mask overflow. |
| 2 | ZM | Zero divide exception mask: 1 to mask zero divide. |
| 1 | DM | Denormal operand exception mask: 1 to mask denormals. |
| 0 | IM | Invalid operation exception mask: 1 to mask invalid ops. |
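The control word can be decoded and modified in the same way. The sketch below assumes the layout from Intel's architecture manuals (exception masks in bits 0-5, precision control in bits 8-9, rounding control in bits 10-11) and uses 037Fh, the value loaded by FINIT, as its example. The names `decode_control_word` and `with_precision` are hypothetical helpers, not part of any library.

```python
PRECISION = {0b00: "single (24-bit)", 0b01: "reserved",
             0b10: "double (53-bit)", 0b11: "extended (64-bit)"}
ROUNDING = {0b00: "nearest", 0b01: "toward -inf",
            0b10: "toward +inf", 0b11: "toward zero"}
MASKS = ["IM", "DM", "ZM", "OM", "UM", "PM"]   # bits 0..5


def decode_control_word(cw):
    """Return the masked exceptions, precision and rounding mode."""
    return {
        "masked": [name for bit, name in enumerate(MASKS) if (cw >> bit) & 1],
        "precision": PRECISION[(cw >> 8) & 0b11],
        "rounding": ROUNDING[(cw >> 10) & 0b11],
    }


def with_precision(cw, pc_bits):
    """Return a copy of cw with the 2-bit PC field replaced."""
    return (cw & ~0x0300) | ((pc_bits & 0b11) << 8)


default = decode_control_word(0x037F)        # state after FINIT
double_cw = with_precision(0x037F, 0b10)     # 0x027F: round results to 53 bits
truncating = decode_control_word(0x0F7F)     # RC = 11: round toward zero
```

A value such as `double_cw` is what a program would write to memory and then load with FLDCW to make arithmetic round to double precision.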
Instructions like FLDCW load values into this word to configure the FPU dynamically. The tag word, another 16-bit register, tracks the contents of the eight data registers by tagging each with a 2-bit code indicating its content type: valid (non-zero, non-special finite), zero, special (NaN, infinity, denormal, or unsupported format), or empty. This tracking lets the FPU detect stack overflow and underflow during stack operations. For instance, loading a value sets the tag for the affected register to valid, zero, or special as appropriate, while popping marks it empty. The tag word assigns 2 bits to each physical register (not to each stack-relative name), starting from physical register R0 at bits 0-1 up to R7 at bits 14-15; the register currently addressed as ST(0) is the one selected by TOP:
| Bits | Physical register | 00 (Valid) | 01 (Zero) | 10 (Special) | 11 (Empty) |
|---|---|---|---|---|---|
| 0-1 | R0 | Finite non-zero | +0 or -0 | NaN, ∞, denormal | No content |
| 2-3 | R1 | Finite non-zero | +0 or -0 | NaN, ∞, denormal | No content |
| 4-5 | R2 | Finite non-zero | +0 or -0 | NaN, ∞, denormal | No content |
| 6-7 | R3 | Finite non-zero | +0 or -0 | NaN, ∞, denormal | No content |
| 8-9 | R4 | Finite non-zero | +0 or -0 | NaN, ∞, denormal | No content |
| 10-11 | R5 | Finite non-zero | +0 or -0 | NaN, ∞, denormal | No content |
| 12-13 | R6 | Finite non-zero | +0 or -0 | NaN, ∞, denormal | No content |
| 14-15 | R7 | Finite non-zero | +0 or -0 | NaN, ∞, denormal | No content |
The FPU automatically updates tags during load and store operations, so software normally only needs to inspect the tag word when saving, restoring, or debugging FPU state.
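Because the tag word is indexed by physical register while programs address registers relative to TOP, reading it takes one extra step. A small Python sketch of that mapping follows; it assumes the full 2-bit-per-register tag word as stored by FSTENV or FSAVE, and the function name `decode_tag_word` is made up for this illustration.

```python
TAG_NAMES = {0b00: "valid", 0b01: "zero", 0b10: "special", 0b11: "empty"}


def decode_tag_word(tw, top):
    """Return the tags of ST(0)..ST(7) for a 16-bit tag word and a TOP value."""
    # Tags are stored per physical register R0..R7, two bits each.
    physical = [TAG_NAMES[(tw >> (2 * r)) & 0b11] for r in range(8)]
    # ST(i) is physical register (TOP + i) mod 8.
    return [physical[(top + i) % 8] for i in range(8)]


# After FINIT every register is empty (tag word FFFFh, TOP = 0).
fresh = decode_tag_word(0xFFFF, 0)

# After one push of a non-zero finite value: TOP = 7 and R7 is tagged valid.
one_value = decode_tag_word(0x3FFF, 7)
```

In the second example, `one_value[0]` (the tag of ST(0)) is "valid" and the remaining seven entries are "empty".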

Supported Data Types

The x87 floating-point unit (FPU) natively supports three real number formats for high-precision arithmetic, including an Intel-specific extended format alongside the single and double precision standards. These formats enable the x87 to handle a wide range of numerical computations with varying levels of accuracy and range, primarily stored in its register stack. The extended format serves as the default internal representation during operations, while single and double precision are used for compatibility with memory loads and stores.

Extended precision is an 80-bit format unique to the x87 architecture, consisting of 1 sign bit, a 15-bit exponent with a bias of 16383, and a 64-bit explicit mantissa that includes the integer bit rather than an implied hidden bit. This structure provides approximately 19 decimal digits of precision and a decimal exponent range of roughly -4931 to +4932 (binary exponents -16382 to +16383 for normalized numbers), making it suitable for intermediate computations requiring maximal accuracy. Unlike the single and double formats, which imply the leading bit, the extended format stores it explicitly, and its 64-bit mantissa allows exact representation of integers up to 2^64 - 1.

Double precision follows the IEEE 754 standard in a 64-bit format, featuring 1 sign bit, an 11-bit exponent biased by 1023, and a 52-bit mantissa with an implied leading 1 for normalized numbers (providing about 15-16 digits of precision). This format is loaded into the x87 registers using instructions like FLD and stored via FSTP, with the internal value rounded to the destination format on the store. The binary exponent range for normalized numbers spans -1022 to +1023.

Single precision adheres to IEEE 754 in a 32-bit format, with 1 sign bit, an 8-bit exponent biased by 127, and a 23-bit stored mantissa plus an implied leading 1 (offering around 6-7 digits of precision). It is supported primarily for legacy compatibility and I/O operations, loaded and stored similarly to double precision, with a binary exponent range from -126 to +127.

In addition to real numbers, the x87 handles several integer types for conversions and scaling operations. These include 16-bit (word), 32-bit (doubleword), and 64-bit (quadword) signed two's-complement integers, which can be loaded via FILD and stored with FIST or FISTP (the 64-bit form can be stored only with FISTP). A specialized 80-bit packed binary-coded decimal (BCD) format encodes up to 18 decimal digits (72 bits) plus a sign bit in the tenth byte, enabling exact decimal integer input and output. Integer values produced internally during operations such as scaling are held in floating-point form and are not separate storable data types.

The x87 fully supports special values across its formats, including NaNs, infinities, and denormalized numbers. NaNs are encoded with an all-1s exponent and a non-zero fraction, distinguishing quiet NaNs (which propagate without exceptions) from signaling NaNs (which raise invalid operation exceptions). Infinities are encoded with an all-1s exponent and a zero fraction, representing the ±∞ that results from overflow or division by zero. Denormals use an all-0s exponent with a non-zero mantissa, lacking the leading 1 to represent subnormal values near zero. Gradual underflow is managed through denormals, which lose precision progressively as values approach zero, while exponent overflow produces infinity or the maximum finite value depending on the rounding mode, and raises a numeric overflow exception if that exception is unmasked in the control word.
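| Format | Bits | Sign | Exponent (Bias) | Mantissa | Precision (Decimal Digits) | Binary Exponent Range (Normalized) |
|---|---|---|---|---|---|---|
| Extended | 80 | 1 | 15 (16383) | 64 (explicit) | ~19 | -16382 to +16383 |
| Double | 64 | 1 | 11 (1023) | 52 (implied 1) | ~15-16 | -1022 to +1023 |
| Single | 32 | 1 | 8 (127) | 23 (implied 1) | ~6-7 | -126 to +127 |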
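The extended format's explicit integer bit makes it easy to decode by hand. The following Python sketch interprets 10 little-endian bytes, as FSTP would write them for an 80-bit operand, and returns an exact `Fraction` for finite values. The function name `decode_extended` is an assumption for this illustration; the sketch does not preserve the sign of zero and does not reject the unsupported encodings (such as unnormals) that real hardware treats specially.

```python
from fractions import Fraction

BIAS = 16383


def decode_extended(raw):
    """Decode 10 little-endian bytes in the x87 double-extended format."""
    if len(raw) != 10:
        raise ValueError("expected exactly 10 bytes")
    bits = int.from_bytes(raw, "little")
    sign = -1 if bits >> 79 else 1
    exponent = (bits >> 64) & 0x7FFF
    significand = bits & ((1 << 64) - 1)   # includes the explicit integer bit
    if exponent == 0x7FFF:
        fraction = significand & ((1 << 63) - 1)
        return sign * float("inf") if fraction == 0 else float("nan")
    if exponent == 0:
        exponent = 1   # zero and denormals use the minimum exponent
    # The significand is an integer scaled by 2**-63.
    return sign * Fraction(significand) * Fraction(2) ** (exponent - BIAS - 63)


one = decode_extended(bytes.fromhex("0000000000000080ff3f"))         # 3FFF 8000...0
minus_2_5 = decode_extended(bytes.fromhex("00000000000000a000c0"))   # C000 A000...0
pi_ext = decode_extended(bytes.fromhex("35c26821a2da0fc90040"))      # 4000 C90FDAA22168C235
```

The last value is the extended-precision approximation of π; converting it with `float(pi_ext)` gives a double that agrees with `math.pi` to within rounding.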

Instruction Set Fundamentals

The x87 instruction set forms the core of floating-point computation in the x86 architecture, comprising over 70 instructions that enable arithmetic, data transfer, comparison, conversion, and control operations on an 8-level register stack. These instructions treat the stack top, ST(0), as the primary operand and destination, and many of them push or pop values, which moves the stack pointer (TOP). Precision and rounding modes are configurable via the FPU control word, which influences the execution of arithmetic and conversion instructions.

Arithmetic operations provide the foundational floating-point computations. The primary instructions are FADD for addition, FSUB for subtraction, FMUL for multiplication, and FDIV for division, each capable of operating on ST(0) and either another stack register ST(i) or a memory operand, with the result replacing the destination value. Variants such as FADDP, FSUBP, FMULP, and FDIVP perform the same operations but store the result in ST(i) and then pop ST(0), facilitating efficient two-operand calculations without explicit exchange. Integer variants like FIADD and FIDIV convert an integer memory operand to floating point before applying the operation.

Comparison instructions evaluate relationships between operands and update the condition flags (C0, C2, C3) in the FPU status word for subsequent branching. FCOM and FCOMP compare ST(0) with ST(i) or a memory value; if either operand is a NaN, the result is unordered and all three flags are set (C0=1, C2=1, C3=1), whereas equality sets only C3. FCOMP additionally pops ST(0). FCOMPP compares ST(0) with ST(1) and pops twice. FXAM inspects ST(0) without a second operand, classifying its contents as zero, normal finite, infinity, NaN, empty, or denormal in C0, C2, and C3, and reporting the sign in C1. Integer comparison variants like FICOM take an integer operand directly from memory.

Data movement instructions handle loading, storing, and exchanging values to and from the stack. FLD pushes a value from memory or ST(i) onto the stack, decrementing TOP so that the new value becomes ST(0). FST copies ST(0) to memory or ST(i) without altering the stack, while FSTP performs the same copy and then pops ST(0) by incrementing TOP. FXCH swaps the contents of ST(0) and ST(i) with no net stack change. Dedicated constant loads include FLD1 for the value +1.0 and FLDPI for π (approximately 3.14159), both pushing the constant onto the stack.

Conversion instructions bridge floating-point and integer or BCD formats. FIST rounds ST(0) to an integer using the rounding mode in the control word and stores it to memory without popping, while FISTP does the same but pops ST(0). FBSTP converts ST(0) to an 18-digit packed BCD representation (with sign) and stores it to memory, popping the stack. The control word's rounding-control (RC) field governs how these conversions round; the separate precision-control (PC) field sets the significand width of arithmetic results to single (24-bit), double (53-bit), or extended (64-bit).

Exception handling and control instructions manage FPU state and errors. FINIT initializes the FPU by loading the default control word (037FH: extended precision, round to nearest, all exceptions masked), clearing exception flags, marking all registers empty, and setting TOP to 0. FNOP performs no operation and is useful for instruction padding without affecting registers or flags. FLDCW loads a new control word from memory to adjust precision, rounding, and exception masks dynamically. FCLEX clears pending floating-point exceptions by resetting the exception flags in the status word.
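The push, pop, and exchange behaviour of the register stack can be sketched with a toy model in Python. This is an illustration only, not a description of any real implementation: the class name `X87Stack` and its method names are hypothetical, values are ordinary Python floats rather than 80-bit extended values, and stack faults are reduced to a Python error instead of the #IS exception.

```python
class X87Stack:
    """Toy model of the x87 register stack (not bit- or cycle-accurate)."""

    def __init__(self):
        self.regs = [None] * 8   # physical registers R0..R7; None marks an empty slot
        self.top = 0             # TOP field; FINIT sets it to 0

    def _phys(self, i):
        # ST(i) maps to physical register (TOP + i) mod 8
        return (self.top + i) % 8

    def st(self, i=0):
        value = self.regs[self._phys(i)]
        if value is None:
            raise RuntimeError(f"stack underflow: ST({i}) is empty")
        return value

    def fld(self, value):
        # Push: decrement TOP, then write the new ST(0)
        new_top = (self.top - 1) % 8
        if self.regs[new_top] is not None:
            raise RuntimeError("stack overflow: all eight registers in use")
        self.top = new_top
        self.regs[new_top] = float(value)

    def fstp(self):
        # Store-and-pop: read ST(0), mark it empty, increment TOP
        value = self.st(0)
        self.regs[self.top] = None
        self.top = (self.top + 1) % 8
        return value

    def fadd(self, i=1):
        # FADD ST(0), ST(i): result replaces ST(0), stack depth unchanged
        self.regs[self._phys(0)] = self.st(0) + self.st(i)

    def faddp(self, i=1):
        # FADDP ST(i), ST(0): result goes to ST(i), then ST(0) is popped
        self.regs[self._phys(i)] = self.st(i) + self.st(0)
        self.fstp()

    def fxch(self, i=1):
        # Swap ST(0) and ST(i); TOP is unchanged
        self.st(0)
        self.st(i)               # both registers must be non-empty
        a, b = self._phys(0), self._phys(i)
        self.regs[a], self.regs[b] = self.regs[b], self.regs[a]


# Usage: compute 1.5 + 2.5 the way FLD / FLD / FADDP / FSTP would.
fpu = X87Stack()
fpu.fld(1.5)
fpu.fld(2.5)
fpu.faddp()
result = fpu.fstp()
```

After the final `fstp()` the model is back at its initial depth, which mirrors the balanced push and pop sequences that x87 code is expected to maintain.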

Performance Aspects

Execution Model

The x87 floating-point unit (FPU) employs a pipelined design featuring a three-stage pipeline for instruction processing: decode, execute, and normalize/round/store. In the decode stage, x87 instructions are interpreted and prepared for operation; later implementations often break them down into micro-operations (μops). The execute stage performs the core arithmetic or logical computation using dedicated subunits, such as those for addition, multiplication, or division. Finally, the normalize/round/store stage adjusts the result for precision and rounding according to the control word settings before storing it in the register stack or memory. This pipelined design allows floating-point operations to be handled in sequence independently of the main CPU's integer pipeline.

Throughput in the x87 FPU reaches 1 instruction per cycle for simple operations like addition (FADD) and multiplication (FMUL) in pipelined implementations, so subsequent instructions can enter the pipeline without stalling on basic arithmetic. Latency varies by operation and precision; for example, FMUL typically takes 3-5 cycles from dispatch to result availability, while FDIV ranges from 10-40 cycles because it is not pipelined and depends on iterative algorithms like SRT division. Some software-optimized paths reduce the effective cost of division by computing the reciprocal of the divisor first and then multiplying, though hardware FDIV remains the standard for direct execution.

Status checking in the x87 FPU relies on the condition code flags (C0 through C3) in the 16-bit status word, which comparison instructions like FCOM set to indicate outcomes such as equal (C3=1, C2=0, C0=0), greater than (all three clear), or less than (C0=1 only). Branching itself is handled by the main CPU, typically via FSTSW to store the status word into a general-purpose register (e.g., AX), followed by a conditional jump on the extracted flags. On P6-family and later processors, FCOMI and FUCOMI write the comparison result directly to EFLAGS, and FCMOVcc (e.g., FCMOVE for equality) conditionally moves a stack register into ST(0) based on those EFLAGS bits without branching.

The x87 exception model defines six maskable numeric exceptions: invalid operation (#I), denormal (#D), divide-by-zero (#Z), overflow (#O), underflow (#U), and inexact result (#P), controlled by mask bits 0-5 in the 16-bit control word. Masked exceptions (default: all masked, control word 037FH) set corresponding flags in the status word and continue execution with a default result, such as ±∞ for overflow. A pending unmasked exception is reported at the next waiting x87 instruction, either as an x87 floating-point error (#MF, vector 16) when CR0.NE is set or through the FERR# pin and an external interrupt otherwise. The handler saves the FPU environment (status, control, tag, instruction pointer, and data pointer words) using instructions like FSTENV, processes the error, and restores the environment via FLDENV to resume.

Power consumption for early standalone coprocessors like the 8087 is approximately 2 W under typical load, reflecting its separate die and clocking. Integrated x87 units in later processors, from the 80486 onward, draw less power overall due to shared die space, reduced pin count, and unified clocking with the CPU core, though specific figures vary by generation and process node.
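The status-word and control-word fields discussed in this section can be unpacked with a few bit masks. The following Python sketch is illustrative: the function names are hypothetical, and the bit positions are those of the documented x87 layout (C0 at bit 8, C1 at bit 9, C2 at bit 10, TOP in bits 11-13, C3 at bit 14, exception masks in control-word bits 0-5).

```python
# Bit positions in the 16-bit x87 status word.
C0, C1, C2, C3 = 1 << 8, 1 << 9, 1 << 10, 1 << 14

# Exception mask bits 0-5 of the control word, in bit order.
EXCEPTIONS = ("invalid", "denormal", "zero-divide",
              "overflow", "underflow", "precision")


def decode_compare(status_word):
    """Interpret C3/C2/C0 as left by an FCOM-style comparison."""
    key = (bool(status_word & C3), bool(status_word & C2), bool(status_word & C0))
    outcomes = {
        (False, False, False): "greater",   # ST(0) > source
        (False, False, True): "less",       # ST(0) < source
        (True, False, False): "equal",
        (True, True, True): "unordered",    # at least one operand is a NaN
    }
    return outcomes.get(key, "undefined")


def top_of_stack(status_word):
    """Extract the 3-bit TOP field (bits 11-13)."""
    return (status_word >> 11) & 0b111


def flags_after_sahf(status_word):
    """Model FSTSW AX followed by SAHF: AH bits 0, 2, 6 land in CF, PF, ZF."""
    ah = (status_word >> 8) & 0xFF
    return {"CF": bool(ah & 0x01), "PF": bool(ah & 0x04), "ZF": bool(ah & 0x40)}


def masked_exceptions(control_word):
    """List the exceptions whose mask bit is set in the control word."""
    return [name for bit, name in enumerate(EXCEPTIONS)
            if control_word & (1 << bit)]
```

For example, `decode_compare(0x4000)` reports an equal comparison, and `masked_exceptions(0x037F)` lists all six exceptions, matching the default control word. Because C0, C2, and C3 land in CF, PF, and ZF after SAHF, the unsigned conditional jumps (JA, JB, JE) and JP for the unordered case can be used directly.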

Optimization and Limitations

The x87 FPU's eight-register stack model, with ST(0) as the top, facilitates data handling through implicit pushes and pops during operations, but mismatched sequences can cause stack overflow (pushing onto a register that is still in use, such as a ninth value) or underflow (reading or popping an empty register), resulting in an invalid-operation exception (#IS). These faults are avoided by keeping pushes and pops balanced and by using the FXCH instruction to swap registers without changing stack depth, which lets programmers reorder operands and reuse values efficiently while tracking the TOP pointer and tag word.

Precision discrepancies between the 80-bit extended format used in registers and the 64-bit double format in memory can lead to loss of accuracy, particularly through the double-rounding problem, where an intermediate result is first rounded to extended precision and then rounded a second time upon storage, yielding results that deviate from IEEE 754 double-precision expectations. This is mitigated by setting the precision control (PC) field in the FPU control word (bits 8-9) to single (24-bit) or double (53-bit) before the arithmetic is performed, so that each result is rounded once to the target significand width. The exponent range remains that of the extended format, so results near the overflow and underflow thresholds can still differ from strict double-precision semantics.

Comparisons via instructions like FCOM update the FPU condition codes, which can be transferred to the integer flags using FSTSW AX followed by SAHF for branching, without requiring FWAIT in integrated FPUs, though this sequence introduces some latency that hampers performance in conditional code paths. When only the class of a value matters, the FXAM instruction classifies ST(0)'s content (e.g., zero, NaN, infinity) directly into condition flags C0, C2, and C3, allowing flag-based decisions without a second operand.

Software techniques further enhance efficiency, such as loop unrolling to sustain FPU pipeline throughput during iterative arithmetic, and replacing transcendentals like FSIN (with reciprocal throughputs of 11-60 cycles on recent AMD Zen processors) with polynomial approximations to reduce stalls.

Inherent limitations of x87 include its scalar-only design, which processes individual values without vectorization support, constraining throughput on data-parallel tasks relative to SIMD alternatives. Departures from the behavior of other IEEE 754 implementations appear in encodings such as pseudo-denormals and in underflow handling that depends on the extended exponent range; the directed rounding modes (toward +∞, -∞, or zero), while functional, complicate portability when intermixed with standard round-to-nearest operations. For legacy compatibility, compilers such as GCC offer the -mfpmath=387 flag to mandate x87 arithmetic, though this implies 80-bit temporaries that may alter numerical reproducibility across platforms.
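The double-rounding effect described above can be reproduced with exact rational arithmetic. The sketch below is a simplified model rather than an emulation of the FPU: `round_to_precision` is a hypothetical helper that rounds a positive value to a given significand width with round-to-nearest-even and an unbounded exponent range, and the input value is chosen by hand to sit just above a double-precision rounding midpoint.

```python
from fractions import Fraction


def round_to_precision(x, p):
    """Round a positive Fraction to p significand bits (nearest, ties to even)."""
    if x <= 0:
        raise ValueError("this sketch only handles positive values")
    # Find e with 2**e <= x < 2**(e + 1).
    e = x.numerator.bit_length() - x.denominator.bit_length()
    if Fraction(2) ** e > x:
        e -= 1
    ulp = Fraction(2) ** (e - p + 1)   # spacing of p-bit values in this binade
    # round() on a Fraction rounds halfway cases to the even integer.
    return round(x / ulp) * ulp


# A value slightly above the midpoint between 1 and the next double.
x = 1 + Fraction(1, 2**53) + Fraction(1, 2**65)

direct = round_to_precision(x, 53)          # one rounding, PC = double
extended = round_to_precision(x, 64)        # first rounding, PC = extended
twice = round_to_precision(extended, 53)    # second rounding on store

print(direct == 1 + Fraction(1, 2**52))     # True: rounds up, as IEEE 754 expects
print(twice == 1)                           # True: the tie created by the first
                                            # rounding resolves downward to even
```

The first rounding discards the small excess above the midpoint, leaving an exact tie that the second rounding resolves to the even neighbour, so the two paths disagree by one unit in the last place.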

Hardware Implementations

8087 Coprocessor

The Intel 8087 Numeric Data Processor, introduced in 1980, served as the inaugural floating-point coprocessor for the 8086 and 8088 microprocessors, enabling high-speed numeric computations in early personal computers. Housed in a 40-pin dual in-line package (DIP), it employed HMOS (high-performance NMOS) fabrication technology on a 3-micrometer process, aligning its operation with the host CPU's clock frequency of 3 to 5 MHz. The chip's die measured approximately 5 mm by 6 mm and incorporated around 40,000 transistors to handle arithmetic, transcendental, and data transfer operations.

At its core, the 8087 supported 68 numeric instructions, delivering 80-bit extended precision internally for real numbers (comprising a 64-bit significand, 15-bit exponent, and sign bit) while accommodating single-precision (32-bit), double-precision (64-bit), integer, and packed BCD data types. These operations were orchestrated via a 4 KB ROM that implemented microprograms for complex functions like logarithms and trigonometric functions, using an innovative two-bits-per-transistor encoding scheme to maximize density despite the era's transistor budget constraints. Power consumption stood at roughly 2 W. The coprocessor's architecture emphasized a stack-based register model, allowing parallel execution with the host CPU for non-conflicting instructions.

The 8087 interfaced with the 8086 family through a shared multiplexed bus, utilizing 16 address/data lines (or 8 for the 8088 variant) and queue status signals for instruction decoding. Synchronization relied on the coprocessor's BUSY output pin, which drove the host CPU's TEST input; the WAIT instruction (opcode 9B) stalled the CPU until BUSY was released, ensuring completion of ongoing operations before proceeding and preventing data corruption. This design supported up to 10 MHz in compatible systems, though standard variants operated at lower speeds.

Production occurred primarily at Intel facilities, with second-sourcing by firms like Harris Semiconductor to meet demand; retail pricing in 1982 hovered around $230 per unit, dropping in subsequent years with volume scaling. Key limitations included protracted execution times due to microcoded complexity, with basic arithmetic like addition requiring on the order of 70 to 100 cycles and division close to 200 cycles, while transcendental functions could take several hundred cycles or more, far slower than later integrated FPUs. Early revisions suffered from quirks, necessitating explicit WAIT prefixes for every instruction to poll the BUSY signal reliably; subsequent steppings refined error handling and timing to mitigate hangs in multi-tasking environments. These traits underscored the 8087's role as a pioneering but transitional component in x86 numeric processing.
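The extended-precision layout mentioned above (64-bit significand, 15-bit exponent, sign bit) can be illustrated by decoding the 10-byte memory image that FSTP writes for an extended value. This is a minimal sketch under stated assumptions: `decode_extended` is a hypothetical helper, it assumes the usual little-endian layout with an exponent bias of 16383 and an explicit integer bit in bit 63, and it rejects infinities and NaNs instead of modelling them.

```python
from fractions import Fraction

BIAS = 16383  # exponent bias of the 80-bit extended format


def decode_extended(raw):
    """Decode a 10-byte little-endian extended-precision value exactly."""
    if len(raw) != 10:
        raise ValueError("expected exactly 10 bytes")
    bits = int.from_bytes(raw, "little")
    significand = bits & ((1 << 64) - 1)   # includes the explicit integer bit
    exponent = (bits >> 64) & 0x7FFF       # 15-bit biased exponent
    sign = -1 if (bits >> 79) & 1 else 1   # top bit is the sign
    if exponent == 0x7FFF:
        raise ValueError("infinity or NaN is not handled in this sketch")
    # An exponent field of 0 (zero and denormals) uses the same scale as 1.
    scale = max(exponent, 1) - BIAS - 63
    return sign * significand * Fraction(2) ** scale


one = decode_extended(bytes.fromhex("0000000000000080ff3f"))
pi = decode_extended(bytes.fromhex("35c26821a2da0fc90040"))
print(one)         # 1
print(float(pi))   # 3.141592653589793
```

Unlike the 32-bit and 64-bit formats, the leading significand bit is stored explicitly, which is why the encoding of 1.0 has bit 63 set rather than an all-zero fraction.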

80287 and 80C187 Variants

The Intel 80287, introduced in 1982 as the coprocessor for the 80286 microprocessor, was fabricated using HMOS-II technology and supported clock speeds ranging from 5 to 12.5 MHz. It was packaged in a 68-pin PLCC, enabling integration with 80286-based systems for enhanced numeric processing. The 80287 featured a three-stage pipeline in its numeric execution unit, which improved throughput over the 8087 by allowing better overlap of bus interface and computation tasks. Key enhancements included superior exception handling for conditions such as overflow and underflow, with support for masking and default fix-ups, and full compatibility with the 80286's protected mode. Individual instructions remained slow by later standards: division required 193-203 cycles, and add and subtract instructions took 70-120 clock cycles. Compatibility with 8086 systems was limited due to pinout and interface differences, necessitating adapters for any attempted substitution. The part was widely adopted in 80286-based personal computers.

The 80C187, released in 1986, represented a CMOS variant optimized for low power, using 1.5-micron CHMOS III technology and operating at 5 V with a maximum power dissipation of 1 W. Available in 40-pin CERDIP or 44-pin PLCC packaging, it supported clock speeds up to 12.5 MHz in direct mode or 16 MHz in divide-by-2 mode, making it suitable for embedded and portable applications. Designed primarily for the 80C186 microcontroller, it extended floating-point capabilities while maintaining backward object-code compatibility with 8087 software, including support for IEEE 754-1985 binary floating-point arithmetic and transcendental functions. This low-power design facilitated its use in early portable systems, such as variants of the Compaq Portable 286, where battery life was critical, and it preserved the three-stage pipeline for efficient execution in constrained environments. The 80C187's enhancements in exception handling mirrored those of the 80287, ensuring reliable operation in such systems while reducing overall system power draw compared to NMOS predecessors.

80387 Integration

The Intel 80387 math coprocessor was introduced in 1987 as the dedicated floating-point coprocessor for the Intel 80386 microprocessor, marking a significant evolution in x87 architecture tailored to the 32-bit CPU. Fabricated using Intel's CHMOS-III technology on a 1.5-micron process, it was housed in an 82-pin ceramic pin grid array (PGA) package and operated at clock speeds ranging from 16 to 25 MHz, fully synchronous with the 80386. This design ensured compatibility with the CPU's 32-bit external data bus, enabling efficient pipelined and non-pipelined memory operations without the 16-bit bus limitations of prior coprocessors.

Key architectural advancements in the 80387 included wider internal data paths supporting 80-bit formats with 64-bit significands for intermediate computations, enhancing accuracy in complex numerical tasks. It delivered improved performance for transcendental functions, such as the FSIN instruction, which typically required 70-100 clock cycles depending on input range and precision mode. New instructions such as FSIN, FCOS, FSINCOS, FPREM1, and the unordered compare FUCOM were added, complementing inherited ones like FYL2XP1, which computes $y \cdot \log_2(x + 1)$ with better accuracy for x near zero than forming x + 1 explicitly. These features positioned the 80387 as the first fully IEEE 754-compliant coprocessor in the x87 lineage.

As an optional component, the 80387 integrated into 80386DX systems via a dedicated socket, while the 80387SX variant supported the 16-bit bus of 80386SX processors in cost-sensitive designs. Although its 82-pin configuration necessitated a new socket distinct from the 68-pin 80287, the interface remained backward-compatible at the instruction level, allowing straightforward upgrades in 386-based motherboards. The chip contained approximately 120,000 transistors and was deployed in high-end workstations like the IBM PS/2 Model 70, where it accelerated scientific and engineering workloads. At its announcement, OEM pricing stood at about $570 per unit in 1986, in quantities of 1,000.

In 1990, Intel introduced the 80387SL variant optimized for mobile use, incorporating low-power modes and reduced voltage operation to suit battery-powered laptops while maintaining core 80387 functionality. This version addressed the growing demand for portable 386 systems by minimizing energy consumption without sacrificing floating-point performance.
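The benefit of an instruction that evaluates y · log2(x + 1) directly, as FYL2XP1 does, can be illustrated in Python by comparing it with the naive formula. This is a sketch of the numerical idea only: the function names are hypothetical, it uses the standard library's `math.log1p` in double precision rather than the x87 instruction, and it ignores the restricted input range that the real instruction imposes on x.

```python
import math


def fyl2xp1(x, y):
    """Compute y * log2(x + 1) without forming x + 1 explicitly."""
    return y * math.log1p(x) / math.log(2.0)


def naive(x, y):
    """Same formula, but x + 1 is rounded to double precision first."""
    return y * math.log2(x + 1.0)


tiny = 1e-20
print(naive(tiny, 1.0))     # 0.0, because 1.0 + 1e-20 rounds to exactly 1.0
print(fyl2xp1(tiny, 1.0))   # about 1.4427e-20
```

For x close to zero, the addition x + 1 discards most or all of the significant bits of x, so the naive form loses the result entirely, while the direct form keeps full relative accuracy.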

80487 and Nx587 Developments

The Intel 80487 was introduced in 1991 as the final major discrete x87 floating-point coprocessor, designed specifically for the 80486SX microprocessor, which featured a disabled internal FPU to reduce costs. Fabricated using Intel's 1-micrometer CHMOS-IV process technology, it was housed in a 169-pin PGA package with a keyed pin to prevent incorrect insertion into the 80486 socket. Available in clock speeds ranging from 20 MHz to 50 MHz, the 80487 matched the host 80486SX frequency for synchronous operation.

Unlike prior coprocessors, the 80487 integrated a complete 80486DX core, including an 8 KB on-chip unified cache that handled both code and data to minimize bus traffic and improve performance. Upon installation, it disabled the original 80486SX and assumed all processing duties, effectively upgrading the system to DX-level capabilities with full x87 instruction support. Key enhancements included a refined floating-point unit, reducing divide latency to approximately 35 cycles for single-precision operations, alongside compatibility with 80486 wait-state protocols. A low-power variant, the 80487SX, targeted battery-operated systems by operating at reduced voltages and frequencies starting from 15 MHz.

The Nx587 designation encompassed third-party x87 developments in the early 1990s, particularly as aftermarket upgrades for 80486SX and older 386/486 systems lacking integrated FPUs. Notable examples included the Cyrix FasMath 83S87 and IIT 1A87XL, which maintained pin-compatible interfaces with Intel's 80387SX/80487. These coprocessors operated at accelerated clock speeds up to 60 MHz, surpassing Intel's offerings, and featured optimized transcendental functions like sine, cosine, and logarithms through enhanced implementations.

A prominent implementation was NexGen's Nx587, released in 1994 as an optional companion to the Nx586 processor, a superscalar 80486-compatible CPU aimed at high-performance upgrades. The Nx587 used a 183-pin CPGA package and a dedicated 64-bit bus for tight integration with the host, supporting x86 floating-point modes with reduced latency in arithmetic pipelines. It bridged the era of discrete x87 units to on-chip integration in subsequent processors, remaining popular in niche aftermarket kits for legacy 386/486 platforms during the mid-1990s.

Manufacturers and Production

Intel served as the primary designer and manufacturer of all official x87 floating-point coprocessors, from the 8087 through the 80487. The company's production dominated the market for these components, particularly as integration became standard in x86 processors starting with the 80486DX in 1989. By the mid-1990s, discrete x87 production had largely ceased, with the FPU subsequently integrated into all x86 CPUs.

Several companies acted as second sources under licensing agreements with Intel during the 1980s, producing compatible versions of early x87 coprocessors like the 8087 and 80287 to ensure supply reliability. AMD manufactured licensed 8087 and 80287 variants, such as the limited-run D8087, contributing to broader availability in the initial years of x87 adoption. Harris Semiconductor (later Intersil) produced second-source 80287 and 80387 coprocessors, focusing on CMOS variants for improved power efficiency. Fujitsu, based in Japan, also served as a second-source manufacturer under Intel agreements, primarily for 80287-compatible units until the late 1980s.

Non-licensed clones and upgrades emerged from other firms to offer cost-effective or enhanced alternatives, particularly for later generations. Cyrix produced the Cx83S87, a compatible upgrade for 80386 systems, as one of its early products following its initial 8087 clone. Integrated Information Technology (IIT) developed the 387SX series, providing pin-compatible improvements over Intel's 80387SX with better performance in select operations. ULSI Technology created the 82C87 and similar 80387-compatible chips, emphasizing low-power designs for embedded and portable applications.

Early production of the 8087 faced significant challenges, including low yields due to the chip's complex design and the limits of 1980s fabrication technology, which drove up costs and delayed widespread adoption. Second-source versions occasionally exhibited variances in timing and speed compared to Intel's originals, requiring careful system integration to avoid compatibility issues.

Legacy and Compatibility

Transition to SIMD Extensions

The transition from x87 to SIMD extensions began with the introduction of Streaming SIMD Extensions (SSE) in 1999 alongside the Pentium III processor, which added eight 128-bit XMM registers dedicated to vectorized single-precision floating-point operations, while x87 continued to handle scalar computations requiring extended 80-bit precision. This marked the first major step toward parallel floating-point processing in x86 architectures, enabling applications like multimedia and scientific computing to exploit SIMD parallelism without relying solely on x87's scalar model. SSE2, released in 2000 with the Pentium 4, extended this shift by incorporating double-precision floating-point support into the XMM registers, providing a complete alternative to x87 for 64-bit scalar operations and establishing SSE as the preferred path for new floating-point code in compilers such as GCC and Microsoft Visual C++, which defaulted to SSE2 generation for improved performance and consistency.

The primary drivers for this evolution included x87's stack-based register architecture, which imposed inefficiencies for vector workloads through the push, pop, and exchange operations needed to manage the eight-register stack, contrasted with SSE's flat, non-stack register file that offered lower latency for arithmetic operations and simpler code generation for SIMD. In the early 2000s, many applications adopted a hybrid model, leveraging SSE for basic arithmetic and vector tasks while falling back to x87 for transcendental functions such as sine and cosine, which had no hardware equivalents in SSE and only later were replaced by software polynomial approximations. Differences in denormal (subnormal) handling also necessitated careful mixing: x87 always performs gradual underflow, whereas SSE offers optional flush-to-zero and denormals-are-zero modes that, when enabled, can lead to divergent numerical outcomes.

The advent of Advanced Vector Extensions (AVX) in 2011 with the Sandy Bridge processors accelerated the decline of x87 by doubling SIMD width to 256 bits via YMM registers, emphasizing fully vectorized floating-point as the modern paradigm and rendering x87's scalar focus increasingly obsolete for performance-critical code. Despite this progression, x87 remains essential for executing legacy binaries in 2025, ensuring backward compatibility in operating systems like Windows and Linux.

Usage in Modern Systems

All x86-64 processors from Intel and AMD include the x87 floating-point unit (FPU) as a mandatory component for maintaining backward compatibility with legacy 32-bit and 64-bit software that relies on it. The x87 FPU is enabled by default upon processor initialization, and saving and restoring its state together with the SSE state is enabled by the operating system via the OSFXSR flag in the CR4 register, which permits the FXSAVE and FXRSTOR instructions.

In contemporary compiler ecosystems, tools like GCC and Clang default to SSE instructions for floating-point operations on x86-64 targets, equivalent to the -mfpmath=sse option, to leverage vectorized performance and avoid x87's stack-based model. Despite this shift, certain system libraries such as glibc continue to employ x87 for operations requiring 80-bit extended precision, such as long double math functions, where higher accuracy is beneficial for intermediate computations.

x87 persists in niche modern applications, including legacy compatibility modes in scientific computing software, where older codebases or precision-sensitive algorithms may invoke x87 for extended-range calculations; legacy game engines from the 1990s and early 2000s, such as those in titles running on DirectX 5-8; and PC emulators, which replicate x87 behavior to accurately run period-correct DOS and early Windows software. Deprecation trends have accelerated in the 2020s, with Microsoft and other OS vendors discouraging new x87 usage in favor of SSE2 and later extensions for consistency and performance; Intel and AMD documentation since the early 2010s similarly positions SSE/AVX as the primary floating-point pathways, labeling x87 as a legacy feature unsuitable for new development.

As of 2025, x87 remains fully implemented in all current-generation CPUs, such as Intel's Meteor Lake (Core Ultra Series 1) and AMD's Zen 5 (Ryzen 9000 Series), to support the vast installed base of compatible software, though it is power-gated (dynamically disabled at the hardware level when idle) to minimize power draw in battery-constrained or server environments. The instruction set has been essentially frozen since the late 1990s: the P6 generation added FCMOVcc and FCOMI in 1995, and the only later addition was FISTTP, introduced alongside SSE3 in 2004.
