# CS 110 Computer Architecture Lecture 14: Superscalar CPUs

#### Instructors:

Sören Schwertfeger & Chundong Wang

https://robotics.shanghaitech.edu.cn/courses/ca/20s/

School of Information Science and Technology SIST

ShanghaiTech University

Slides based on UC Berkeley's CS61C

# Agenda

- Processor Performance Overview
- Complex Pipelines
- Static Multiple Issues (VLIW)
- Dynamic Multiple Issues (Superscalar)

#### **Increasing Processor Performance**

#### 1. Clock rate

Limited by technology and power dissipation

#### 2. Pipelining

- "Overlap" instruction execution
- Deeper pipeline: 5 => 10 => 15 stages
  - Less work per stage → shorter clock cycle
  - But more potential for hazards

#### 3. Multi-issue "superscalar" processor



#### Greater Instruction-Level Parallelism (ILP)

- Multiple issue "superscalar"
  - Replicate pipeline stages => multiple pipelines
  - Start multiple instructions per clock cycle
  - CPI < 1, so use Instructions Per Cycle (IPC)
  - E.g., 4GHz 4-way multiple-issue
    - 16 BIPS, peak CPI = 0.25, peak IPC = 4
  - But dependencies reduce this in practice
- "Out-of-Order" execution
  - Reorder instructions dynamically in hardware to reduce impact of hazards
- Hyper-threading
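
The 4 GHz, 4-way example above can be checked with a few lines of Python. This is only a sketch of the peak-rate arithmetic; the function name `peak_throughput` is an illustrative assumption, and real processors fall short of these figures because of dependencies.

```python
def peak_throughput(clock_hz: float, issue_width: int) -> dict:
    """Peak figures for an ideal multiple-issue core (no hazards, no stalls)."""
    peak_ipc = issue_width               # instructions per cycle
    peak_cpi = 1 / issue_width           # cycles per instruction
    bips = clock_hz * peak_ipc / 1e9     # billions of instructions per second
    return {"ipc": peak_ipc, "cpi": peak_cpi, "bips": bips}

result = peak_throughput(4e9, 4)
print(result)  # {'ipc': 4, 'cpi': 0.25, 'bips': 16.0}
```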

#### Pipelined RISC-V RV32I Datapath



#### Hyper-threading (simplified)



- Duplicate all elements that hold the state (registers)
- Use the same CL blocks
- Use muxes to select which state to use every clock cycle
- => run 2 independent processes
  - No hazards between processes: different registers, different control flow, different memory
  - Threads (shared memory): memory hazards must be resolved by software (locking, mutex, ...)
- Speedup?
  - Simple pipeline: no obvious speedup
  - Complex pipeline: one thread can use the CL blocks while the other waits for an unavailable resource (e.g. waiting for memory)

#### Intel Nehalem i7

- Hyperthreading:
  - About 5% die area
  - Up to 30% speed gain (but a slowdown, i.e. a gain < 0%, is also possible)
- Pipeline: 20-24 stages!
- Out-of-order execution
  - 1. Instruction fetch.
  - 2. Instruction dispatch to an instruction queue
  - 3. The instruction waits in the queue until its input operands are available => an instruction can leave the queue before older instructions.
  - 4. The instruction is issued to the appropriate functional unit and executed by that unit.
  - 5. The results are queued.
  - 6. Write to register only after all older instructions have their results written.



# Superscalar Processor



#### **Superscalar = Multicore?**

https://en.wikipedia.org/wiki/Superscalar\_processor

- NO!
- Superscalar: More than one Instruction per clock cycle!
  - Not executing a different thread!
  - Executing instructions from the same program!
  - => Higher throughput
- In Flynn's taxonomy (later in course):
  - a single-core superscalar processor is classified as an SISD processor (Single Instruction stream, Single Data stream)
  - But: most superscalar processors support short vector operations => those are then SIMD (Single Instruction stream, Multiple Data streams).
  - And: nowadays most superscalar processors are multicore, too.

# "Iron Law" of Processor Performance



$$\frac{Time}{Program} = \frac{Instructions}{Program} \times \frac{Cycles}{Instruction} \times \frac{Time}{Cycle}$$

where $CPI = \frac{Cycles}{Instruction}$.
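
As a hedged illustration, the Iron Law (time per program = instructions per program × cycles per instruction × time per cycle) can be evaluated directly. The workload below (one billion instructions at 2 GHz) is made up for the example, and `execution_time` is a hypothetical helper name.

```python
def execution_time(instructions: int, cpi: float, clock_hz: float) -> float:
    """Iron Law: seconds = instructions * (cycles / instruction) / (cycles / second)."""
    return instructions * cpi / clock_hz

# Hypothetical program: 1 billion instructions on a 2 GHz clock.
t_scalar = execution_time(1_000_000_000, cpi=1.0, clock_hz=2e9)
t_superscalar = execution_time(1_000_000_000, cpi=0.5, clock_hz=2e9)
print(t_scalar, t_superscalar)  # 0.5 0.25
```

Halving the CPI at a fixed clock rate and instruction count halves the run time, which is the lever a superscalar design pulls.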

#### Benchmark: CPI of Intel Core i7



# Calculating CPI Another Way

- First calculate CPI for each individual instruction (add, sub, and, etc.)
- Next calculate frequency of each individual instruction
- Finally multiply these two for each instruction and add them up to get final CPI (the weighted sum)

# Example (RISC processor)

| Op     | Freq <sub>i</sub> | $CPI_i$ | Prod | (% Time)              |
|--------|-------------------|---------|------|-----------------------|
| ALU    | 50%               | 1       | 0.5  | (23%)                 |
| Load   | 20%               | 5       | 1.0  | (45%)                 |
| Store  | 10%               | 3       | 0.3  | (14%)                 |
| Branch | 20%               | 2       | 0.4  | (18%)                 |
| Sum    |                   |         | 2.2  | (where time is spent) |
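
The weighted sum can be reproduced with a short script. This is a minimal sketch using the frequencies and per-class CPI values from the table; the names `mix`, `weighted_cpi` and `time_share` are illustrative assumptions.

```python
# (frequency, CPI) per instruction class, taken from the table above.
mix = {
    "ALU":    (0.50, 1),
    "Load":   (0.20, 5),
    "Store":  (0.10, 3),
    "Branch": (0.20, 2),
}

def weighted_cpi(mix):
    """Overall CPI = sum over classes of frequency * CPI."""
    return sum(freq * cpi for freq, cpi in mix.values())

def time_share(mix):
    """Fraction of execution time spent in each instruction class."""
    total = weighted_cpi(mix)
    return {op: freq * cpi / total for op, (freq, cpi) in mix.items()}

total_cpi = weighted_cpi(mix)
shares = time_share(mix)
print(round(total_cpi, 2))          # 2.2
for op, share in shares.items():
    print(f"{op}: {share:.0%}")     # ALU: 23%, Load: 45%, Store: 14%, Branch: 18%
```

Loads make up only 20% of the instructions but account for about 45% of the time, which is why the "% Time" column is the one to look at when deciding what to optimize.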


## **Complex Pipeline**

- More than one Functional Unit
- Floating point execution!
  - Fadd & Fmul: fixed number of cycles; > 1
  - Fdiv: unknown number of cycles!
- Memory access: on Cache miss unknown number of cycles



#### Issues in Complex Pipeline Control

- Structural conflicts at the execution stage if some FPU or memory unit is not pipelined and takes more than one cycle
- Structural conflicts at the write-back stage due to variable latencies of different functional units

- Out-of-order write hazards due to variable latencies of different functional units
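
To make the write-back conflict concrete, the sketch below issues one instruction per cycle and reports the cycles in which more results arrive than there are register-file write ports. The latencies in `LATENCY` are hypothetical values chosen for illustration, not figures for any real functional unit.

```python
from collections import defaultdict

# Hypothetical execution latencies in cycles (assumptions for illustration only).
LATENCY = {"add": 1, "fadd": 2, "fmul": 3}

def writeback_cycles(ops):
    """Ops are issued in order, one per cycle; return {cycle: [ops writing back]}."""
    wb = defaultdict(list)
    for issue_cycle, op in enumerate(ops):
        wb[issue_cycle + LATENCY[op]].append(op)
    return dict(wb)

def conflicts(ops, ports=1):
    """Cycles in which more results arrive than there are write ports."""
    return {c: o for c, o in writeback_cycles(ops).items() if len(o) > ports}

print(conflicts(["fmul", "fadd", "add"]))  # {3: ['fmul', 'fadd', 'add']}
print(conflicts(["add", "add", "add"]))    # {}
```

With equal latencies every result arrives in its own cycle; once latencies differ, results from instructions issued in different cycles can collide at write-back.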



# Modern Complex In-Order Pipeline




#### Static Multiple Issue

- Also known as Very Long Instruction Word (VLIW)
- Compiler bundles instructions together
- Compiler takes care of hazards
- CPU executes the instructions of a bundle at the same time

| Instruction type          | Pipe stages |    |    |     |     |     |     |    |
|---------------------------|-------------|----|----|-----|-----|-----|-----|----|
| ALU or branch instruction | IF          | ID | EX | MEM | WB  |     |     |    |
| Load or store instruction | IF          | ID | EX | MEM | WB  |     |     |    |
| ALU or branch instruction |             | IF | ID | EX  | MEM | WB  |     |    |
| Load or store instruction |             | IF | ID | EX  | MEM | WB  |     |    |
| ALU or branch instruction |             |    | IF | ID  | EX  | MEM | WB  |    |
| Load or store instruction |             |    | IF | ID  | EX  | MEM | WB  |    |
| ALU or branch instruction |             |    |    | IF  | ID  | EX  | MEM | WB |
| Load or store instruction |             |    |    | IF  | ID  | EX  | MEM | WB |
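
The bundling that the compiler performs can be sketched as a greedy packer that pairs one ALU/branch instruction with one load/store per cycle, matching the two slots in the table above. This is a simplified illustration, not how a production compiler schedules: it only looks at adjacent instructions, ignores load-use latency, and the `Instr` class and `bundle` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Instr:
    text: str
    kind: str                 # "alu" (ALU or branch) or "mem" (load or store)
    dst: Optional[str]        # destination register, None for stores/branches
    srcs: Tuple[str, ...]     # source registers

def bundle(program):
    """Greedily pack adjacent instructions into (alu_slot, mem_slot) bundles."""
    bundles, i = [], 0
    while i < len(program):
        first = program[i]
        second = program[i + 1] if i + 1 < len(program) else None
        can_pair = (
            second is not None
            and first.kind != second.kind                        # one per slot
            and first.dst not in second.srcs                     # no RAW inside a bundle
            and (first.dst is None or first.dst != second.dst)   # no WAW inside a bundle
        )
        pair = [first, second] if can_pair else [first]
        slots = {"alu": "nop", "mem": "nop"}                     # unused slots get a nop
        for ins in pair:
            slots[ins.kind] = ins.text
        bundles.append((slots["alu"], slots["mem"]))
        i += len(pair)
    return bundles

program = [
    Instr("lw x5, 0(x10)", "mem", "x5", ("x10",)),
    Instr("addi x10, x10, -4", "alu", "x10", ("x10",)),
    Instr("add x5, x5, x6", "alu", "x5", ("x5", "x6")),
    Instr("sw x5, 4(x10)", "mem", None, ("x5", "x10")),
]
bundles = bundle(program)
for alu_slot, mem_slot in bundles:
    print(f"{alu_slot:20} | {mem_slot}")
```

The `add` and `sw` cannot share a bundle because the store reads the register the `add` writes, so each gets a bundle padded with a `nop`. Empty slots of this kind are the cost of resolving hazards statically.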

# Static Two-Issue RISC-V Datapath



# In-Order Superscalar Pipeline



#### Question

- Which statements are true?
- A. The number of clock cycles a floating point multiplier needs depends on the values of the operands.
- B. The number of clock cycles a floating point divider needs depends on the values of the operands.
- C. A hyperthreading CPU can execute more than one process/thread at a given time.
- D. A superscalar CPU can execute more than one process/thread at a given time.
- E. A multi-core CPU can execute more than one process/thread at a given time.


# Superscalar: Dynamic Multiple Issue

- Hardware guarantees correct execution =>
  - Compiler does not need to (but can) optimize
- Dynamic pipeline scheduling:
  - Re-order instructions based on:
    - What functional units are free
    - Avoiding data hazards
  - Reservation station
    - Buffer of instructions waiting to be executed
    - Holds the operands (registers) each instruction needs
    - Once all operands are available: execute!
  - Commit unit (reorder buffer): supplies operands to the reservation station and writes results to the registers
  - OR: unified physical register file:
     registers are renamed for use in the reservation station and the commit unit
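
The "execute once all operands are available" rule can be sketched as a small dataflow calculation. The model below is deliberately simplified and rests on stated assumptions: one instruction is dispatched per cycle, functional units are unlimited, registers are renamed so only read-after-write dependencies matter, and the latencies in `LATENCY` are hypothetical. The function name `schedule` is illustrative.

```python
# Hypothetical latencies in cycles (assumptions for illustration only).
LATENCY = {"add": 1, "mv": 1, "lw": 2, "div": 10}

def schedule(program):
    """program: list of (op, dst, srcs). Return the start cycle of each instruction."""
    last_writer = {}   # register -> index of the most recent instruction writing it
    start, finish = [], []
    for i, (op, dst, srcs) in enumerate(program):
        # Operands are ready once every producing instruction has finished.
        ready = max((finish[last_writer[r]] for r in srcs if r in last_writer),
                    default=0)
        s = max(i, ready)              # dispatched in cycle i, waits for operands
        start.append(s)
        finish.append(s + LATENCY[op])
        if dst is not None:
            last_writer[dst] = i
    return start

program = [
    ("add", "x9",  ("x9", "x9")),      # 15
    ("div", "x10", ("x9", "x8")),      # 16
    ("mv",  "x12", ("x6",)),           # 17
    ("add", "x12", ("x12", "x6")),     # 18
    ("add", "x11", ("x10", "x12")),    # 19: needs the div result
    ("lw",  "x13", ("x12",)),          # 20
]
starts = schedule(program)
print(starts)  # [0, 1, 2, 3, 11, 5]
```

Instruction 20 starts in cycle 5 while the older instruction 19 is still waiting for the divide, which is out-of-order issue in miniature.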

```
Out-of-Order Issue

15: add x9,  x9,  x9
16: div x10, x9,  x8
17: mv  x12, x6
18: add x12, x12, x6
19: add x11, x10, x12
20: lw  x13, 8(x12)
21: lw  x14, 8(x10)
22: mv  x7,  x15
23: mv  x8,  x16
24: mv  x9,  x17
25: div x7,  x7,  x8
26: sw  x10, 0(x12)
27: mv  x6,  x7

Next instructions:                             ... 30 29 28 27
Reservation station (waiting):                 19, 21, 25, 26
Functional units (ALU, memory) (computing):    16, 20, 24
Commit unit (waiting):                         17, 18, 22, 23
Done instructions:                             15 14 13 12 ...

Architectural state (registers & memory): as if the most recently committed
instruction had finished on an in-order CPU.
Other labels in the figure: Program State, Memory
```

```
Out-of-Order Issue                                         CPU Cycle: 1000

15: add x9,  x9,  x9
16: div x10, x9,  x8
17: mv  x12, x6
18: add x12, x12, x6
19: add x11, x10, x12
20: lw  x13, 8(x12)
21: lw  x14, 8(x10)
22: mv  x7,  x15
23: mv  x8,  x16
24: mv  x9,  x17
25: div x7,  x7,  x8
26: sw  x10, 0(x12)
27: mv  x6,  x7

Next instructions:                             ... 30 29 28 27
Reservation station (waiting):                 19, 21, 25, 26
Functional units (ALU, memory) (computing):    16 div x10 x9 x8
                                               20 lw x13 8(x12)
                                               24 mv x9 x17
Commit unit (waiting):                         17, 18, 22, 23
Done instructions:                             15 14 13 12 ...

Architectural state (registers & memory): as if the most recently committed
instruction had finished on an in-order CPU.
Other labels in the figure: Program State, Memory
```

```
Out-of-Order Issue                                         CPU Cycle: 1001

15: add x9,  x9,  x9
16: div x10, x9,  x8
17: mv  x12, x6
18: add x12, x12, x6
19: add x11, x10, x12
20: lw  x13, 8(x12)
21: lw  x14, 8(x10)
22: mv  x7,  x15
23: mv  x8,  x16
24: mv  x9,  x17
25: div x7,  x7,  x8
26: sw  x10, 0(x12)
27: mv  x6,  x7

Events in this cycle:
* 16 finished => 16, 17, 18 committed
* 16 computed x10 => 19 can run
* division unit free => 25 can run
* 24 finished
* 27, 28 were fetched

Next instructions:                             ... 32 31 30 29
Reservation station (waiting):                 21, 26, 27, 28
Functional units (ALU, memory) (computing):    20 lw x13 8(x12)
                                               19 add x11 x10 x12
                                               25 div x7 x7 x8
Commit unit (waiting):                         22, 23, 24
Done instructions:                             18 17 16 15 ...

Architectural state (registers & memory): as if the most recently committed
instruction had finished on an in-order CPU.
Other labels in the figure: Program State, Memory
```

```
Out-of-Order Issue                                         CPU Cycle: 1002

15: add x9,  x9,  x9
16: div x10, x9,  x8
17: mv  x12, x6
18: add x12, x12, x6
19: add x11, x10, x12
20: lw  x13, 8(x12)
21: lw  x14, 8(x10)
22: mv  x7,  x15
23: mv  x8,  x16
24: mv  x9,  x17
25: div x7,  x7,  x8
26: sw  x10, 0(x12)
27: mv  x6,  x7

Events in this cycle:
* 20 & 19 finished => 20 & 19 committed
* mem unit free => 21 can run
* 26 still waiting for mem unit
* 27 waiting for 25

Next instructions:                             ... 33 32 31 30
Reservation station (waiting):                 26, 27, 28, 29
Functional units (ALU, memory) (computing):    25 div x7 x7 x8
                                               21 lw x14 8(x10)
Commit unit (waiting):                         22, 23, 24
Done instructions:                             20 19 18 17

Architectural state (registers & memory): as if the most recently committed
instruction had finished on an in-order CPU.
Other labels in the figure: Program State, Memory
```

#### Phases of Instruction Execution



#### Separating Completion from Commit

- Re-order buffer (ROB) holds register results from completion until commit
  - Entries allocated in program order during decode
  - Buffers completed values and exception state until in-order commit point
  - Completed values can be used by dependents before committed (bypassing)
  - Each entry holds program counter, instruction type, destination register specifier and value if any, and exception status (info often compressed to save hardware)
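
A minimal sketch of the idea, assuming a much-reduced entry format (program counter, destination register, value, done flag) and omitting exception status: results may complete in any order, but they reach the register file only from the head of the buffer. The class `ReorderBuffer` and the register values are hypothetical.

```python
from collections import deque

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()          # oldest instruction at the left

    def allocate(self, pc, dst):
        """Allocate an entry in program order (done during decode)."""
        entry = {"pc": pc, "dst": dst, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value):
        """A functional unit delivers a result, possibly out of order."""
        entry["value"], entry["done"] = value, True

    def commit(self, regfile):
        """Retire completed entries from the head; stop at the first unfinished one."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            e = self.entries.popleft()
            if e["dst"] is not None:
                regfile[e["dst"]] = e["value"]
            committed.append(e["pc"])
        return committed

rob, regs = ReorderBuffer(), {}
e16 = rob.allocate(16, "x10")
e17 = rob.allocate(17, "x12")
e18 = rob.allocate(18, "x12")
rob.complete(e17, 5)                    # 17 and 18 finish before the older 16
rob.complete(e18, 10)
first = rob.commit(regs)                # nothing commits: 16 is still running
rob.complete(e16, 3)
second = rob.commit(regs)               # 16, 17, 18 commit together, in order
print(first, second, regs)              # [] [16, 17, 18] {'x10': 3, 'x12': 10}
```

This mirrors the step "16 finished => 16, 17, 18 committed" in the cycle-1001 snapshot earlier: the younger instructions were complete but had to wait behind the divide.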

# In-Order versus Out-of-Order Phases

- Instruction fetch/decode/rename always in-order
  - Need to parse ISA sequentially to get correct semantics
  - Proposals for speculative OoO instruction fetch, e.g., Multiscalar.
     Predict control flow and data dependencies across sequential program segments fetched/decoded/executed in parallel, fixup if prediction wrong
- Dispatch (place instruction into machine buffers to wait for issue) also always in-order
  - Some use "Dispatch" to mean "Issue"

#### In-Order Versus Out-of-Order Issue

#### In-order (InO) issue:

- Issue stalls on read-after-write (RAW) dependencies or structural hazards, or possibly on write-after-read (WAR) or write-after-write (WAW) hazards
- Instruction cannot issue to execution units unless all preceding instructions have issued to execution units

#### Out-of-order (OoO) issue:

- Instructions dispatched in program order to reservation stations (or other forms of instruction buffer) to wait for operands to arrive, or other hazards to clear
- While earlier instructions wait in issue buffers, following instructions can be dispatched and issued out-of-order

# In-Order versus Out-of-Order Completion

- All but simplest machines have out-of-order completion, due to different latencies of functional units and desire to bypass values as soon as available
- Classic RISC five-stage integer pipeline just barely has in-order completion
  - Load takes two cycles, but following one-cycle integer op completes at same time, not earlier
  - Adding pipelined FPU immediately brings OoO completion

#### Superscalar Intel Processors

- Pentium 4: Marketing demanded higher clock rate => deeper pipelines & high power consumption
- Afterwards: Multi-core processors

| Microprocessor             | Year | Clock Rate | Pipeline Stages | Issue Width | Out-of-Order/ Speculation | Cores/ Chip | Power |
|----------------------------|------|------------|-----------------|-------------|---------------------------|-------------|-------|
| Intel 486                  | 1989 | 25 MHz     | 5               | 1           | No                        | 1           | 5 W   |
| Intel Pentium              | 1993 | 66 MHz     | 5               | 2           | No                        | 1           | 10 W  |
| Intel Pentium Pro          | 1997 | 200 MHz    | 10              | 3           | Yes                       | 1           | 29 W  |
| Intel Pentium 4 Willamette | 2001 | 2000 MHz   | 22              | 3           | Yes                       | 1           | 75 W  |
| Intel Pentium 4 Prescott   | 2004 | 3600 MHz   | 31              | 3           | Yes                       | 1           | 103 W |
| Intel Core                 | 2006 | 2930 MHz   | 14              | 4           | Yes                       | 2           | 75 W  |
| Intel Core i5 Nehalem      | 2010 | 3300 MHz   | 14              | 4           | Yes                       | 2–4         | 87 W  |
| Intel Core i5 Ivy Bridge   | 2012 | 3400 MHz   | 14              | 4           | Yes                       | 8           | 77 W  |

#### Arm Cortex A53 & Intel Core i7 920

| Processor                     | ARM A53                         | Intel Core i7 920                     |
|-------------------------------|---------------------------------|---------------------------------------|
| Market                        | Personal Mobile Device          | Server, Cloud                         |
| Thermal design power          | 100 milliWatts (1 core @ 1 GHz) | 130 Watts                             |
| Clock rate                    | 1.5 GHz                         | 2.66 GHz                              |
| Cores/Chip                    | 4 (configurable)                | 4                                     |
| Floating point?               | Yes                             | Yes                                   |
| Multiple Issue?               | Dynamic                         | Dynamic                               |
| Peak instructions/clock cycle | 2                               | 4                                     |
| Pipeline Stages               | 8                               | 14                                    |
| Pipeline schedule             | Static In-order                 | Dynamic Out-of-order with Speculation |
| Branch prediction             | Hybrid                          | 2-level                               |
| 1st level caches/core         | 16-64 KiB I, 16-64 KiB D        | 32 KiB I, 32 KiB D                    |
| 2nd level cache/core          | 128-2048 KiB (shared)           | 256 KiB (per core)                    |
| 3rd level cache (shared)      | (platform dependent)            |                                       |

#### **ARM Cortex A53 Pipeline**

- Prediction takes 1 clock cycle! Predicts branches and future function returns; a misprediction costs 8 clock cycles (pipeline flush)



#### Speculative & Out-of-Order Execution

