Pipelining

Can idle hardware be used to improve concurrency?

The goal:

More concurrency -> Higher instruction throughput
More work completed in one cycle of the CPU

When an instruction is using some resources in its processing phase, process other instructions on idle resources that are not needed by that instruction. For example, when an instruction is being decoded, fetch the next instruction.

The von Neumann model of processor organisation

Control needs to:

  • input instructions from memory
  • issue signals to control the information flow between the datapath components and to control which operations they perform
  • control instruction sequencing

The datapath needs to have the components needed to execute instructions. It also needs the interconnections between those components so that instructions can be carried out and data can be loaded from and stored to memory.

The control and datapath have a static property (how the components are connected) and a dynamic property (how data moves between the components).

Levels of code

The application software is written in a high level language. The system software contains two key components: the compiler translates high level language code into machine code, and the operating system handles input/output, memory and storage, task scheduling, and resource sharing. The hardware (processor, memory and I/O controllers) provides the implementation.

High level language

This is the level of abstraction closest to the problem domain, and it provides productivity and portability.

Assembly language

Assembly language is a textual representation of instructions (ISA).

Hardware representation

These are the binary bits and encoded instructions and data.

Five stage execution model for MIPS

If each step takes one clock cycle, then a complete instruction takes a maximum of five clock cycles. Every instruction is executed as the same sequence of steps, but because instructions differ, the details of each step are specific to the type of instruction.

Cycles Per Instruction - CPI
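
CPI is the average number of clock cycles each instruction takes to complete. It relates to overall performance through the standard equation below; lowering the effective CPI (for example, through pipelining) directly reduces execution time.

    CPU time = instruction count × CPI × clock cycle time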

FETCH from memory

Send out the address currently held in the program counter. Fetch the instruction from memory into the Instruction Register.

DECODE and read registers concurrently

Decode the instruction and, at the same time, access the register file to read the registers it names.

EXECUTION on the operands

The ALU operates on the operands prepared in the decode stage.

MEMORY access and branch completion

Access the memory if needed to perform a load or store.

WRITE back

Write result back into register file.
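
A minimal C sketch of the five steps as a sequential loop, assuming a toy, invented 8-bit instruction encoding with only an add and a load (real MIPS instructions are 32 bits wide and far richer); it only illustrates how fetch, decode, execute, memory and write back hand results to each other.

    /* Toy model of the five execution steps as a sequential loop.
       The 8-bit encoding (opcode<<6 | rd<<4 | rs<<2 | rt) and the two
       operations are invented for illustration, not real MIPS. */
    #include <stdint.h>
    #include <stdio.h>

    enum { OP_ADD = 0, OP_LW = 1 };

    int main(void) {
        uint8_t  imem[2] = {
            (OP_ADD << 6) | (2 << 4) | (0 << 2) | 1,   /* r2 = r0 + r1   */
            (OP_LW  << 6) | (3 << 4) | (2 << 2) | 0 }; /* r3 = dmem[r2]  */
        int32_t  dmem[16] = {0};
        int32_t  reg[4]   = {3, 4, 0, 0};
        unsigned pc = 0;
        dmem[7] = 42;

        while (pc < 2) {
            uint8_t  instr = imem[pc];                        /* FETCH       */
            unsigned op = instr >> 6, rd = (instr >> 4) & 3,  /* DECODE and  */
                     rs = (instr >> 2) & 3, rt = instr & 3;   /* read the    */
            int32_t  a = reg[rs], b = reg[rt];                /* registers   */
            int32_t  alu = (op == OP_ADD) ? a + b : a;        /* EXECUTE     */
            int32_t  mem = (op == OP_LW) ? dmem[alu] : alu;   /* MEMORY      */
            reg[rd] = mem;                                    /* WRITE back  */
            pc++;
        }
        printf("r2=%d r3=%d\n", reg[2], reg[3]);  /* prints r2=7 r3=42 */
        return 0;
    }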

Instruction level parallelism (ILP)

Essentially all processors use pipelining to overlap instructions in order to improve performance.

Loop level parallelism

In a typical MIPS assembler programme, branch instructions account for 15-20% of the instructions executed. Every third to fifth instruction is a branch, giving quite short runs of sequential instructions, which doesn't give much scope for overlap. However, iterations of long loops (loop level parallelism) may be good candidates for ILP.

A common and typical example is the loop for (i = 0; i < 1000; i++) x[i] += y[i];

Role of the compiler

Compilers can help enhance the ILP available. Most programmes are written in a high level language, not directly in assembly language; the compiler generates the assembly. The compiler can find sequences of instructions that can be overlapped, then match the sequence lengths to the properties of the CPU. One technique is to fully or partially unroll a loop, as in the sketch below.
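
A sketch of what unrolling the loop above by a factor of four might look like (the factor is illustrative, and 1000 happens to divide evenly by 4, so no clean-up loop is needed); each iteration now contains four independent additions that can be overlapped:

    /* Partially unrolled version of: for (i = 0; i < 1000; i++) x[i] += y[i]; */
    for (i = 0; i < 1000; i += 4) {
        x[i]     += y[i];
        x[i + 1] += y[i + 1];
        x[i + 2] += y[i + 2];
        x[i + 3] += y[i + 3];
    }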

The possible danger is that re-ordering instructions involving floating point operations can change the results, because floating point arithmetic is not associative (the example below illustrates this). Mission and safety control code avoids floating point numbers for that reason.
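
A small, self-contained C illustration of that non-associativity (the values are chosen only to make the effect obvious):

    #include <stdio.h>

    int main(void) {
        float a = 1.0e20f, b = -1.0e20f, c = 1.0f;
        /* Mathematically both sums equal 1, but float addition is not
           associative, so reordering changes the answer. */
        printf("(a + b) + c = %f\n", (a + b) + c);  /* 1.000000 */
        printf("a + (b + c) = %f\n", a + (b + c));  /* 0.000000: c vanishes
                                                       when added to b */
        return 0;
    }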

The basic idea

  • Divide the instruction processing cycle into distinct stages of processing.
  • Ensure that there are enough hardware resources available to process one instruction in each stage.
  • Process a different instruction in each stage. Instructions consecutive in program order are processed in consecutive stages.

The benefit of pipelining is that instruction processing throughput is increased. However, there are a number of downsides.
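
An illustrative timing chart for a five-stage pipeline (F = fetch, D = decode, E = execute, M = memory, W = write back); once the pipeline is full, one instruction completes every cycle:

    cycle:    1   2   3   4   5   6   7
    instr 1   F   D   E   M   W
    instr 2       F   D   E   M   W
    instr 3           F   D   E   M   W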

Limitations and clarifications

Each instruction step takes the same time to execute. Time is saved as instructions can be overlapped. Pipeline depth is the maximum number of instructions that can be overlapped. There is a maximum length to the pipeline, and it is ultimately defined by the hardware resources available.

When a dependency is hit during execution the pipeline is stalled, and in some cases the CPU has to flush and refill pipeline state at that point. This adds latency to execution. The slowest step in the execution process determines the total throughput of the pipeline.
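
A rough worked example with illustrative stage delays: suppose the five steps take 200, 100, 200, 200 and 100 ps. Executed sequentially, one instruction takes 800 ps. Pipelined, the clock period must fit the slowest stage (200 ps), so once the pipeline is full an instruction completes every 200 ps, a speedup of 4× rather than the ideal 5×, precisely because the stages are unbalanced.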

Hazards

Hazard
A situation where the instruction pipeline of a CPU microarchitecture cannot proceed as planned: the next instruction cannot execute in the following clock cycle, and executing it anyway could produce incorrect results.

There are three common types of hazard:

  • Data hazards - instructions that exhibit data dependence, reading or modifying data in different stages of the pipeline (a race condition); a small C sketch follows this list.
  • Structural hazards - when two stages in the pipeline need to access the same hardware component, they have to queue up for it. There is a limit to how much hardware can be implemented on a chip.
  • Control flow hazards (branching hazards) - the processor will not know the outcome of the branch when it needs to insert a new instruction into the pipeline (typically during the fetch stage).
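
A small C sketch of the kind of data dependence behind a data hazard (the variables are arbitrary); the second statement reads the value the first one writes, so the corresponding instructions cannot simply be overlapped without forwarding or a stall:

    int b = 2, c = 3, e = 10;
    int a = b + c;   /* writes a                                  */
    int d = a + e;   /* reads a: must wait for, or be forwarded,
                        the result of the statement above         */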

More generally, while hazards relate to ILP, similar problems arise in multi-threaded programming on a multi-core CPU, or when multiple programs access shared resources. However, thread libraries and operating systems include mechanisms that let the software engineer synchronise and serialise access, as in the sketch below.
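
A minimal example of such a mechanism, assuming POSIX threads (the counter, loop count and thread count are arbitrary): a mutex serialises updates to a shared variable so concurrent increments do not race.

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;                          /* shared resource  */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);                /* serialise update */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);           /* always 400000    */
        return 0;
    }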