Delay slot

In computer architecture, a delay slot is an instruction slot that is executed without the effects of a preceding instruction. The most common form is a single arbitrary instruction located immediately after a branch instruction on a RISC or DSP architecture; this instruction executes even if the preceding branch is taken. As a result, the instruction appears to execute out of order relative to its position in the original assembly language code.
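The behaviour of a branch delay slot can be illustrated with a toy interpreter. This is a minimal sketch, not a model of any real instruction set: the three-instruction machine, the `run` function and the register names are all hypothetical, chosen only to show that the instruction following a taken branch still executes.

```python
def run(program, delay_slot=True):
    """Execute a toy program and return the final register values."""
    regs = {"r1": 0, "r2": 0}
    pc = 0
    pending_target = None  # target of a branch whose delay slot has not run yet
    while pc < len(program):
        op, *args = program[pc]
        next_pc = pc + 1
        if pending_target is not None:
            # The current instruction sits in the delay slot: it still
            # executes, and only afterwards does control reach the target.
            next_pc = pending_target
            pending_target = None
        if op == "addi":
            reg, imm = args
            regs[reg] += imm
        elif op == "jump":
            (target,) = args
            if delay_slot:
                pending_target = target  # branch takes effect one instruction late
            else:
                next_pc = target         # branch takes effect immediately
        elif op == "halt":
            break
        pc = next_pc
    return regs


PROGRAM = [
    ("jump", 3),        # 0: unconditional branch to index 3
    ("addi", "r1", 1),  # 1: delay slot, runs only on the delay-slot machine
    ("addi", "r2", 1),  # 2: skipped on both machines
    ("halt",),          # 3
]

print(run(PROGRAM, delay_slot=True))   # {'r1': 1, 'r2': 0}
print(run(PROGRAM, delay_slot=False))  # {'r1': 0, 'r2': 0}
```

On the delay-slot machine `r1` is incremented even though the branch is taken, which is exactly the effect described above.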

Modern processor designs generally do not use delay slots and instead rely on increasingly sophisticated forms of branch prediction. In these systems, the CPU immediately continues along what it predicts will be the correct side of the branch, which removes the need for the code to supply some unrelated instruction to fill the slot; such an instruction is not always easy to find at compile time. If the prediction is wrong and the other side of the branch has to be executed, this can introduce a lengthy delay. This occurs rarely enough that the occasional penalty is outweighed by the benefit of not needing a delay slot.
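The claim that rare mispredictions cost little on average can be made concrete with a back-of-the-envelope calculation. The sketch below assumes a fixed penalty per misprediction; the function name, the 15-cycle penalty and the rates are illustrative placeholders, not measurements of any real processor.

```python
def expected_mispredict_cost(mispredict_rate, penalty_cycles):
    """Average cycles lost per branch, given how often prediction fails."""
    return mispredict_rate * penalty_cycles


# Illustrative inputs only: a hypothetical 15-cycle misprediction penalty.
for rate in (0.20, 0.05, 0.01):
    cost = expected_mispredict_cost(rate, penalty_cycles=15)
    print(f"mispredict rate {rate:.0%}: {cost:.2f} cycles lost per branch on average")
```

Under this simple model, the average cost per branch falls in direct proportion to the misprediction rate, even when each individual misprediction is expensive.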

A central processing unit generally executes instructions from machine code in a four-step process: the instruction is first read from memory, then decoded to determine what needs to be performed, the resulting actions are executed, and finally any results are written back to memory. In early designs, each of these stages was performed in series, so instructions took some multiple of the machine's clock cycle to complete. For instance, in the Zilog Z80, the minimum number of clocks needed to complete an instruction was four, but some (rare) instructions needed as many as 23.

At any given stage of the instruction's processing, only one part of the chip is involved. For instance, during the execution stage, typically only the arithmetic logic unit (ALU) is active, while other units, like those that interact with main memory or decode the instruction, are idle. One way to improve the overall performance of a computer is through the use of an instruction pipeline. This adds some additional circuitry to hold the intermediate states of the instruction as it flows through the units. While this does not improve the cycle timing of any single instruction, the idea is to allow a second instruction to use the other CPU sub-units when the previous instruction has moved on.

For instance, while one instruction is using the ALU, the next instruction from the program can be in the decoder, and a third can be fetched from memory. In this assembly-line arrangement, the number of instructions being processed at any one time can be as high as the number of pipeline stages, so throughput can improve by up to that factor. In the Z80, for example, a four-stage pipeline could improve overall throughput by up to four times. However, due to the complexity of the instruction timing, this would not be easy to implement. The much simpler instruction set architecture (ISA) of the MOS 6502 allowed a two-stage pipeline to be included, which gave it performance that was about double that of the Z80 at any given clock speed.
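The "up to the number of stages" limit can be checked with a small calculation. This is a sketch under idealised assumptions (every stage takes one cycle and nothing ever stalls), and the function names are hypothetical.

```python
def sequential_cycles(n_instructions, n_stages):
    """Cycles needed when each instruction finishes before the next starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages):
    """Cycles needed when a new instruction enters the pipeline every cycle."""
    if n_instructions == 0:
        return 0
    # The first instruction needs n_stages cycles; each later one adds one.
    return n_stages + n_instructions - 1


def speedup(n_instructions, n_stages):
    """Ratio of sequential to pipelined cycle counts."""
    return sequential_cycles(n_instructions, n_stages) / pipelined_cycles(
        n_instructions, n_stages
    )


for n in (1, 10, 1000):
    print(f"{n:>4} instructions, 4 stages: speedup {speedup(n, 4):.2f}")
```

A single instruction gains nothing, while a long instruction stream approaches, but never reaches, a speedup equal to the stage count.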

A major issue with the implementation of pipelines in early systems was that instructions had widely varying cycle counts. For instance, the instruction to add two values would often be offered in multiple versions, or opcodes, which differed in where they read their data from. One version of add might take the value found in one processor register and add it to the value in another, another version might add a value found in memory to a register, while yet another might add the value in one memory location to another memory location. Each of these instructions takes a different number of bytes to represent in memory, so they take different amounts of time to fetch, may require multiple trips through the memory interface to gather values, and so on. This greatly complicates the pipeline logic. One of the goals of the RISC chip design concept was to remove these variants so that the pipeline logic could be simplified, which led to the classic RISC pipeline, which completes one instruction every cycle.

However, one problem that arises in pipelined systems can slow performance down. It occurs when the choice of the next instruction depends on the result of the previous one. In most systems, this happens when a branch occurs. For instance, consider the following pseudo-code:
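A straight-line sequence of the kind discussed next can be sketched as follows. Python statements stand in for machine instructions here; the variable names, the memory layout and the final write are assumptions chosen for illustration.

```python
# Hypothetical memory contents; the keys and values are placeholders.
memory = {"a": 2, "b": 3, "c": 0}

r1 = memory["a"]   # read a number from memory into a register
r2 = memory["b"]   # read another number into a different register
r3 = r1 + r2       # add the two registers into a third register
memory["c"] = r3   # write the result back to memory
```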

In this case, the program is linear and can be easily pipelined. As soon as the first read instruction has been read and is being decoded, the second read instruction can be read from memory. When the first moves to execute, the add is being read from memory while the second read is decoding, and so forth. Although it still takes the same number of cycles to complete the first read, by the time it is complete the value from the second is nearly ready and the CPU can add them almost immediately. In a non-pipelined processor with four stages, the first four instructions take 16 cycles to complete; in an ideal pipelined one, they take only seven: four cycles for the first instruction to pass through, then one more for each of the remaining three.
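The overlap described above can be drawn as a cycle-by-cycle chart. The sketch below assumes an idealised four-stage pipeline with one cycle per stage and no stalls; the stage letters, the `pipeline_chart` helper and the fourth instruction (`write`) are assumptions made for illustration, and cycle counts based on other timing assumptions will differ.

```python
STAGES = ["F", "D", "X", "W"]  # fetch, decode, execute, write back


def pipeline_chart(instructions, stages=STAGES):
    """Return one text row per instruction, one column per clock cycle."""
    total_cycles = len(stages) + len(instructions) - 1
    rows = []
    for i, name in enumerate(instructions):
        cells = ["."] * total_cycles
        # Instruction i enters the pipeline in cycle i and advances one
        # stage per cycle after that.
        for s, stage in enumerate(stages):
            cells[i + s] = stage
        rows.append(f"{name:<6}" + " ".join(cells))
    return rows


for row in pipeline_chart(["read", "read", "add", "write"]):
    print(row)
# read  F D X W . . .
# read  . F D X W . .
# add   . . F D X W .
# write . . . F D X W
```

Each column is one clock cycle. Under these assumptions the chart is seven columns wide, whereas running the same four instructions strictly one after another would need 4 × 4 = 16 cycles.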
