Hubbry Logo
search
logo

Out-of-order execution

logo
Community Hub0 Subscribers
Write something...
Be the first to start a discussion here.
Be the first to start a discussion here.
See all
Out-of-order execution

In computer engineering, out-of-order execution (or more formally dynamic execution) is an instruction scheduling paradigm used in high-performance central processing units (CPUs) to make use of instruction cycles that would otherwise be wasted. In this paradigm, a processor executes instructions in an order governed by the availability of input data and execution units, rather than by their original order in a program. In doing so, the processor can avoid being idle while waiting for the preceding instruction to complete and can, in the meantime, process the next instructions that are able to run immediately and independently.

Out-of-order execution is a restricted form of dataflow architecture, which was a major research area in computer architecture in the 1970s and early 1980s.

Arguably the first machine to use out-of-order execution is the CDC 6600 (1964), which used a scoreboard to resolve conflicts. The 6600 however lacked WAW conflict handling, choosing instead to stall. This situation was termed a "First Order Conflict" by Thornton. Whilst it had both RAW conflict resolution (termed "Second Order Conflict") and WAR conflict resolution (termed "Third Order Conflict") all of which is sufficient to declare it capable of full out-of-order execution, the 6600 did not have precise exception handling. An early and limited form of Branch prediction was possible as long as the branch was to locations on what was termed the "Instruction Stack" which was limited to within a depth of seven words from the Program Counter.

About two years later, the IBM System/360 Model 91 (1966) introduced register renaming with Tomasulo's algorithm, which dissolves false dependencies (WAW and WAR), making full out-of-order execution possible. An instruction addressing a write into a register rn can be executed before an earlier instruction using the register rn is executed, by actually writing into an alternative (renamed) register alt-rn, which is turned into a normal register rn only when all the earlier instructions addressing rn have been executed, but until then rn is given for earlier instructions and alt-rn for later ones addressing rn.

In the Model 91 the register renaming is implemented by a bypass termed Common Data Bus (CDB) and memory source operand buffers, leaving the physical architectural registers unused for many cycles as the oldest state of registers addressed by any unexecuted instruction is found on the CDB. Another advantage the Model 91 has over the 6600 is the ability to execute instructions out-of-order in the same execution unit, not just between the units like the 6600[disputeddiscuss]. This is accomplished by reservation stations, from which instructions go to the execution unit when ready, as opposed to the FIFO queue[disputeddiscuss] of each execution unit of the 6600. The Model 91 is also capable of reordering loads and stores to execute before the preceding loads and stores, unlike the 6600, which only has a limited ability to move loads past loads, and stores past stores, but not loads past stores and stores past loads. Only the floating-point registers of the Model 91 are renamed, making it subject to the same WAW and WAR limitations as the CDC 6600 when running fixed-point calculations. The 91 and 6600 both also suffer from imprecise exceptions, which needed to be solved before out-of-order execution could be applied generally and made practical outside supercomputers.

To have precise exceptions, the proper in-order state of the program's execution must be available upon an exception. By 1985 various approaches were developed as described by James E. Smith and Andrew R. Pleszkun. The CDC Cyber 205 was a precursor, as upon a virtual memory interrupt the entire state of the processor (including the information on the partially executed instructions) is saved into an invisible exchange package, so that it can resume at the same state of execution. However to make all exceptions precise, there has to be a way to cancel the effects of instructions. The CDC Cyber 990 (1984) implements precise interrupts by using a history buffer, which holds the old (overwritten) values of registers that are restored when an exception necessitates the reverting of instructions. Through simulation, Smith determined that adding a reorder buffer (or history buffer or equivalent) to the Cray-1S would reduce the performance of executing the first 14 Livermore loops (unvectorized) by only 3%. Important academic research in this subject was led by Yale Patt with his HPSm simulator.

In the 1980s many early RISC microprocessors, had out-of-order writeback to the registers, invariably resulting in imprecise exceptions. The Motorola 88100 was one of the few early microprocessors that did not suffer from imprecise exceptions despite out-of-order writes, although it did allow both precise and imprecise floating-point exceptions. Instructions started execution in order, but some (e.g. floating-point) took more cycles to complete execution. However, the single-cycle execution of the most basic instructions greatly reduced the scope of the problem compared to the CDC 6600.

Smith also researched how to make different execution units operate more independently of each other and of the memory, front-end, and branching. He implemented those ideas in the Astronautics ZS-1 (1988), featuring a decoupling of the integer/load/store pipeline from the floating-point pipeline, allowing inter-pipeline reordering. The ZS-1 was also capable of executing loads ahead of preceding stores. In his 1984 paper, he opined that enforcing the precise exceptions only on the integer/memory pipeline should be sufficient for many use cases, as it even permits virtual memory. Each pipeline had an instruction buffer to decouple it from the instruction decoder, to prevent the stalling of the front end. To further decouple the memory access from execution, each of the two pipelines was associated with two addressable queues that effectively performed limited register renaming. A similar decoupled architecture had been used a bit earlier in the Culler 7. The ZS-1's ISA, like IBM's subsequent POWER, aided the early execution of branches.

See all
User Avatar
No comments yet.