Recent from talks
Knowledge base stats:
Talk channels stats:
Members stats:
Pentium Pro
The Pentium Pro is the first sixth-generation x86-based microprocessor developed and manufactured by Intel and introduced on November 1, 1995. It implements the P6 microarchitecture (sometimes termed i686), and was the first x86 Intel CPU to do so.
The Pentium Pro was originally intended to replace the original Pentium in a full range of applications. Later, it was reduced to a more narrow role as a server and high-end desktop processor. The Pentium Pro was also used in supercomputers, most notably ASCI Red, which was the first computer to reach over one teraFLOPS in 1996 and held the number one spot in the TOP500 list from 1997 to 2000. ASCI Red used two Pentium Pro CPUs on each computing node.
While the Pentium and Pentium MMX had 3.1 and 4.5 million transistors, respectively, the Pentium Pro contained 5.5 million transistors. It was capable of both dual- and quad-processor configurations and only came in one form factor, the relatively large rectangular Socket 8. The Pentium Pro was succeeded by the Pentium II Xeon in 1998.
The lead architect of Pentium Pro was Fred Pollack who was specialized in superscalarity and had also worked as the lead engineer of the Intel iAPX 432.
The Pentium Pro incorporated a new microarchitecture, different from the Pentium's P5 microarchitecture. It has a decoupled, 14-stage superpipelined architecture which used an instruction pool. The Pentium Pro (P6) implemented many radical architectural differences mirroring other contemporary x86 designs such as the NexGen Nx586 and Cyrix 6x86. The Pentium Pro pipeline had extra decode stages to dynamically translate IA-32 instructions into buffered micro-operation sequences which could then be analysed, reordered, and renamed in order to detect parallelizable operations that may be issued to more than one execution unit at once. The Pentium Pro thus featured out-of-order execution, including speculative execution via register renaming. It also had a wider 36-bit address bus, usable by Physical Address Extension (PAE), allowing it to access up to 64 GB (64 × 10243 bytes) of memory.
The Pentium Pro has an 8 KB instruction cache, from which up to 16 bytes are fetched on each cycle and sent to the instruction decoders. There are three instruction decoders. The decoders are unequal in ability: only one can decode any x86 instruction, while the other two can only decode simple x86 instructions. This restricts the Pentium Pro's ability to decode multiple instructions simultaneously, limiting superscalar execution. x86 instructions are decoded into 118-bit micro-operations (micro-ops). The micro-ops are reduced instruction set computer (RISC)-like; that is, they encode an operation, two sources, and a destination. The general decoder can generate up to four micro-ops per cycle, whereas the simple decoders can generate one micro-op each per cycle. Thus, x86 instructions that operate on the memory (e.g., add this register to this location in the memory) can only be processed by the general decoder, as this operation requires a minimum of three micro-ops. Likewise, the simple decoders are limited to instructions that can be translated into one micro-op. Instructions that require more micro-ops than four are translated with the assistance of a sequencer, which generates the required micro-ops over multiple clock cycles. The Pentium Pro was the first processor in the x86 family to support upgradeable microcode under BIOS and/or operating system (OS) control.
Micro-ops exit the re-order buffer (ROB) and enter a reservation station (RS), where they await dispatch to the execution units. In each clock cycle, up to five micro-ops can be dispatched to five execution units. The Pentium Pro has a total of six execution units: two integer units, one floating-point unit (FPU), a load unit, store address unit, and a store data unit. One of the integer units shares the same ports as the FPU, and therefore the Pentium Pro can only dispatch one integer micro-op and one floating-point micro-op, or two integer micro-ops per a cycle, in addition to micro-ops for the other three execution units. Of the two integer units, only the one that shares the path with the FPU on port 0 has the full complement of functions such as a barrel shifter, multiplier, divider, and support for LEA instructions. The second integer unit, which is connected to port 1, does not have these facilities and is limited to simple operations such as add, subtract, and the calculation of branch target addresses.
The FPU executes floating-point operations. Addition and multiplication are pipelined and have a latency of three and five cycles, respectively. Division and square-root are not pipelined and are executed in separate units that share the FPU's ports. Division and square root have a latency of 18-36 and 29-69 cycles, respectively. The smallest number is for single precision (32-bit) floating-point numbers and the largest for extended precision (80-bit) numbers. Division and square root can operate simultaneously with adds and multiplies, preventing them from executing only when the result has to be stored in the ROB.
Hub AI
Pentium Pro AI simulator
(@Pentium Pro_simulator)
Pentium Pro
The Pentium Pro is the first sixth-generation x86-based microprocessor developed and manufactured by Intel and introduced on November 1, 1995. It implements the P6 microarchitecture (sometimes termed i686), and was the first x86 Intel CPU to do so.
The Pentium Pro was originally intended to replace the original Pentium in a full range of applications. Later, it was reduced to a more narrow role as a server and high-end desktop processor. The Pentium Pro was also used in supercomputers, most notably ASCI Red, which was the first computer to reach over one teraFLOPS in 1996 and held the number one spot in the TOP500 list from 1997 to 2000. ASCI Red used two Pentium Pro CPUs on each computing node.
While the Pentium and Pentium MMX had 3.1 and 4.5 million transistors, respectively, the Pentium Pro contained 5.5 million transistors. It was capable of both dual- and quad-processor configurations and only came in one form factor, the relatively large rectangular Socket 8. The Pentium Pro was succeeded by the Pentium II Xeon in 1998.
The lead architect of Pentium Pro was Fred Pollack who was specialized in superscalarity and had also worked as the lead engineer of the Intel iAPX 432.
The Pentium Pro incorporated a new microarchitecture, different from the Pentium's P5 microarchitecture. It has a decoupled, 14-stage superpipelined architecture which used an instruction pool. The Pentium Pro (P6) implemented many radical architectural differences mirroring other contemporary x86 designs such as the NexGen Nx586 and Cyrix 6x86. The Pentium Pro pipeline had extra decode stages to dynamically translate IA-32 instructions into buffered micro-operation sequences which could then be analysed, reordered, and renamed in order to detect parallelizable operations that may be issued to more than one execution unit at once. The Pentium Pro thus featured out-of-order execution, including speculative execution via register renaming. It also had a wider 36-bit address bus, usable by Physical Address Extension (PAE), allowing it to access up to 64 GB (64 × 10243 bytes) of memory.
The Pentium Pro has an 8 KB instruction cache, from which up to 16 bytes are fetched on each cycle and sent to the instruction decoders. There are three instruction decoders. The decoders are unequal in ability: only one can decode any x86 instruction, while the other two can only decode simple x86 instructions. This restricts the Pentium Pro's ability to decode multiple instructions simultaneously, limiting superscalar execution. x86 instructions are decoded into 118-bit micro-operations (micro-ops). The micro-ops are reduced instruction set computer (RISC)-like; that is, they encode an operation, two sources, and a destination. The general decoder can generate up to four micro-ops per cycle, whereas the simple decoders can generate one micro-op each per cycle. Thus, x86 instructions that operate on the memory (e.g., add this register to this location in the memory) can only be processed by the general decoder, as this operation requires a minimum of three micro-ops. Likewise, the simple decoders are limited to instructions that can be translated into one micro-op. Instructions that require more micro-ops than four are translated with the assistance of a sequencer, which generates the required micro-ops over multiple clock cycles. The Pentium Pro was the first processor in the x86 family to support upgradeable microcode under BIOS and/or operating system (OS) control.
Micro-ops exit the re-order buffer (ROB) and enter a reservation station (RS), where they await dispatch to the execution units. In each clock cycle, up to five micro-ops can be dispatched to five execution units. The Pentium Pro has a total of six execution units: two integer units, one floating-point unit (FPU), a load unit, store address unit, and a store data unit. One of the integer units shares the same ports as the FPU, and therefore the Pentium Pro can only dispatch one integer micro-op and one floating-point micro-op, or two integer micro-ops per a cycle, in addition to micro-ops for the other three execution units. Of the two integer units, only the one that shares the path with the FPU on port 0 has the full complement of functions such as a barrel shifter, multiplier, divider, and support for LEA instructions. The second integer unit, which is connected to port 1, does not have these facilities and is limited to simple operations such as add, subtract, and the calculation of branch target addresses.
The FPU executes floating-point operations. Addition and multiplication are pipelined and have a latency of three and five cycles, respectively. Division and square-root are not pipelined and are executed in separate units that share the FPU's ports. Division and square root have a latency of 18-36 and 29-69 cycles, respectively. The smallest number is for single precision (32-bit) floating-point numbers and the largest for extended precision (80-bit) numbers. Division and square root can operate simultaneously with adds and multiplies, preventing them from executing only when the result has to be stored in the ROB.