# CS 110 Computer Architecture Lecture 10: Datapath

Instructor:

Sören Schwertfeger

http://shtech.org/courses/ca/

School of Information Science and Technology SIST

ShanghaiTech University

Slides based on UC Berkeley's CS61C

#### Review

- Timing constraints for Finite State Machines
  - Setup time, Hold Time, Clock to Q time
- Use muxes to select among inputs
  - S control bits select from 2<sup>S</sup> inputs
  - Each input can be n-bits wide, independent of S
  - Can implement muxes hierarchically
- ALU can be implemented using a mux
  - Coupled with basic block elements
  - Adder/Subtractor & AND & OR & shift
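
The hierarchical mux idea from the review can be sketched in a few lines of Python. The names `mux2` and `mux4` are illustrative and do not come from the slides. The inputs can be of any width because the select logic never inspects the data.

```python
def mux2(select, in0, in1):
    """2-to-1 mux: one control bit selects between two inputs."""
    return in1 if select else in0


def mux4(select, inputs):
    """4-to-1 mux built hierarchically from three 2-to-1 muxes.

    select is a 2-bit value; inputs is a sequence of 2**2 = 4 values.
    """
    low, high = select & 1, (select >> 1) & 1
    lower_pair = mux2(low, inputs[0], inputs[1])   # picks input 0 or 1
    upper_pair = mux2(low, inputs[2], inputs[3])   # picks input 2 or 3
    return mux2(high, lower_pair, upper_pair)      # picks between the pairs
```

An ALU can follow the same pattern: compute every candidate result (add, AND, OR, shift) and let a mux pick the one the instruction asks for.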

# Components of a Computer



#### The CPU

- Processor (CPU): the active part of the computer that does all the work (data manipulation and decision-making)
- Datapath: portion of the processor that contains hardware necessary to perform operations required by the processor
- Control: portion of the processor (also in hardware) that tells the datapath what needs to be done

#### One-Instruction-Per-Cycle RISC-V Machine



- One clock tick => one instruction
- Current state outputs => inputs to combinational logic => outputs settle at the values of state before next clock edge
- Rising clock edge:
  - all state elements are updated with combinational logic outputs
  - execution moves to next clock cycle

# Datapath and Control

- Datapath designed to support data transfers required by instructions
- Controller causes correct transfers to happen



#### Stages of the Datapath: Overview

- Problem: a single, "monolithic" block that "executes an instruction" (performs all necessary operations beginning with fetching the instruction) would be too bulky and inefficient
- Solution: break up the process of "executing an instruction" into stages, and then connect the stages to create the whole datapath
  - smaller stages are easier to design
  - easy to optimize (change) one stage without touching the others (modularity)

#### Five Stages of Instruction Execution

- Stage 1: Instruction Fetch (IF)
- Stage 2: Instruction Decode (ID)
- Stage 3: Execute (EX): ALU (Arithmetic-Logic Unit)
- Stage 4: Memory Access (MEM)
- Stage 5: Register Write (WB)

#### Stages of Execution on Datapath



# Stages of Execution (1/5)

- There is a wide variety of RISC-V instructions: so what general steps do they have in common?
- Stage 1: Instruction Fetch
  - no matter what the instruction, the 32-bit instruction word must first be fetched from memory (the cache-memory hierarchy)
  - also, this is where we Increment PC
     (that is, PC = PC + 4, to point to the next instruction: byte addressing so + 4)
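
A minimal sketch of the fetch stage, assuming instruction memory is modeled as a Python list of 32-bit words (the name `imem` and the list representation are assumptions for illustration, not part of the slides):

```python
def fetch(imem, pc):
    """Return (instruction word at pc, next sequential pc).

    imem is a list of 32-bit words; pc is a byte address, so the
    word index is pc // 4 and the next instruction is at pc + 4.
    """
    inst = imem[pc // 4]
    return inst, pc + 4
```

The division by 4 and the increment by 4 both come from byte addressing: each instruction occupies four consecutive byte addresses.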

# Stages of Execution (2/5)

- Stage 2: Instruction Decode
  - upon fetching the instruction, we next gather data from the fields (decode all necessary instruction data)
  - first, read the opcode to determine instruction type and field lengths
  - second, read in data from all necessary registers
    - for add, read two registers
    - for addi, read one register
  - third, generate the immediates
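
The field extraction in the decode stage can be sketched as pure bit slicing. This is an illustrative sketch; the function name `decode_fields` is made up here, and the field positions follow the R-format layout shown later in this lecture.

```python
def decode_fields(inst):
    """Slice the fixed-position fields out of a 32-bit instruction word."""
    return {
        "opcode": inst & 0x7F,          # inst[6:0]
        "rd": (inst >> 7) & 0x1F,       # inst[11:7]
        "funct3": (inst >> 12) & 0x7,   # inst[14:12]
        "rs1": (inst >> 15) & 0x1F,     # inst[19:15]
        "rs2": (inst >> 20) & 0x1F,     # inst[24:20]
        "funct7": (inst >> 25) & 0x7F,  # inst[31:25]
    }


# add x3, x1, x2 assembled by hand: funct7=0, rs2=2, rs1=1, funct3=0, rd=3
fields = decode_fields(0x002081B3)
```

Because the register fields sit in the same bit positions for every format that uses them, the register file can be read before the control logic has finished working out the instruction type.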

# Stages of Execution (3/5)

- Stage 3: ALU (Arithmetic-Logic Unit)
  - the real work of most instructions is done here: arithmetic (+, -, \*, /), shifting, logic (&, |)

- what about loads and stores?
  - lw t0, 40(t1)
  - the address we are accessing in memory = the value in t1 PLUS the value 40
  - so we do this addition in this stage
- also does stuff for other instructions...

# Stages of Execution (4/5)

- Stage 4: Memory Access
  - actually only the load and store instructions do anything during this stage; the others remain idle during this stage or skip it altogether
  - since these instructions have a unique step, we need this extra stage to account for them
  - as a result of the cache system, this stage is expected to be fast

# Stages of Execution (5/5)

- Stage 5: Register Write
  - most instructions write the result of some computation into a register
  - examples: arithmetic, logical, shifts, loads, jumps
  - what about stores, branches?
    - don't write anything into a register at the end
    - these remain idle during this fifth stage or skip it altogether

#### Stages of Execution on Datapath



# Datapath Components: Combinational

Combinational Elements



- Storage Elements + Clocking Methodology
- Building Blocks

#### Datapath Elements: State and Sequencing (1/3)

Register



- Write Enable:
  - Negated (or deasserted) (0): Data Out will not change
  - Asserted (1): Data Out will become Data In on positive edge of clock
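
A behavioral sketch of this register, assuming a 32-bit width and modeling the positive clock edge as an explicit method call (the class and method names are illustrative):

```python
class Register:
    """Edge-triggered register with a write-enable input."""

    def __init__(self, value=0):
        self.data_out = value

    def rising_edge(self, data_in, write_enable):
        """Model one positive clock edge."""
        if write_enable:
            self.data_out = data_in & 0xFFFFFFFF  # Data Out becomes Data In
        # With write_enable deasserted, Data Out keeps its old value.
```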

#### Datapath Elements: State and Sequencing (2/3)

- Register file (regfile, RF) consists of 32 registers
  - Two 32-bit output busses: busA and busB
  - One 32-bit input bus: busW
  - In one clock cycle can read two registers and write another!



- Register is selected by:
  - RA (number) selects the register to put on busA (data)
  - RB (number) selects the register to put on busB (data)
  - RW (number) selects the register to be written via busW (data) when Write Enable is 1
- Clock input (clk)
  - Clk input is a factor ONLY during write operation
  - During read operation, behaves as a combinational logic block:
    - RA or RB valid ⇒ busA or busB valid after "access time."

#### Datapath Elements: State and Sequencing (3/3)

- "Magic" Memory
  - One input bus: Data In
  - One output bus: Data Out



- Memory word is found by:
  - For Read: Address selects the word to put on Data Out
  - For Write: Set Write Enable = 1: address selects the memory word to be written via the Data In bus
- Clock input (CLK)
  - CLK input is a factor ONLY during write operation
  - During read operation, behaves as a combinational logic block:
     Address valid ⇒ Data Out valid after "access time"

#### State Required by RV32I ISA

Each instruction reads and updates this state during execution:

- Registers (x0..x31)
  - Register file (regfile) Reg holds 32 registers x 32 bits/register: Reg[0] .. Reg[31]
  - First register read specified by rs1 field in instruction
  - Second register read specified by rs2 field in instruction
  - Write register (destination) specified by rd field in instruction
  - x0 is always 0 (writes to Reg[0] are ignored)
- Program Counter (PC)
  - Holds address of current instruction
- Memory (MEM)
  - Holds both instructions & data, in one 32-bit byte-addressed memory space
  - We'll use separate memories for instructions (IMEM) and data (DMEM)
    - These are placeholders for instruction and data caches
  - Instructions are read (fetched) from instruction memory (assume IMEM read-only)
  - Load/store instructions access data memory

# Review: Complete RV32I ISA

| Instruction | Fields, from inst[31] down to inst[0] |
|-------------|----------------------------------------|
| LUI     | imm[31:12], rd, 0110111 |
| AUIPC   | imm[31:12], rd, 0010111 |
| JAL     | imm[20 10:1 11 19:12], rd, 1101111 |
| JALR    | imm[11:0], rs1, 000, rd, 1100111 |
| BEQ     | imm[12 10:5], rs2, rs1, 000, imm[4:1 11], 1100011 |
| BNE     | imm[12 10:5], rs2, rs1, 001, imm[4:1 11], 1100011 |
| BLT     | imm[12 10:5], rs2, rs1, 100, imm[4:1 11], 1100011 |
| BGE     | imm[12 10:5], rs2, rs1, 101, imm[4:1 11], 1100011 |
| BLTU    | imm[12 10:5], rs2, rs1, 110, imm[4:1 11], 1100011 |
| BGEU    | imm[12 10:5], rs2, rs1, 111, imm[4:1 11], 1100011 |
| LB      | imm[11:0], rs1, 000, rd, 0000011 |
| LH      | imm[11:0], rs1, 001, rd, 0000011 |
| LW      | imm[11:0], rs1, 010, rd, 0000011 |
| LBU     | imm[11:0], rs1, 100, rd, 0000011 |
| LHU     | imm[11:0], rs1, 101, rd, 0000011 |
| SB      | imm[11:5], rs2, rs1, 000, imm[4:0], 0100011 |
| SH      | imm[11:5], rs2, rs1, 001, imm[4:0], 0100011 |
| SW      | imm[11:5], rs2, rs1, 010, imm[4:0], 0100011 |
| ADDI    | imm[11:0], rs1, 000, rd, 0010011 |
| SLTI    | imm[11:0], rs1, 010, rd, 0010011 |
| SLTIU   | imm[11:0], rs1, 011, rd, 0010011 |
| XORI    | imm[11:0], rs1, 100, rd, 0010011 |
| ORI     | imm[11:0], rs1, 110, rd, 0010011 |
| ANDI    | imm[11:0], rs1, 111, rd, 0010011 |
| SLLI    | 0000000, shamt, rs1, 001, rd, 0010011 |
| SRLI    | 0000000, shamt, rs1, 101, rd, 0010011 |
| SRAI    | 0100000, shamt, rs1, 101, rd, 0010011 |
| ADD     | 0000000, rs2, rs1, 000, rd, 0110011 |
| SUB     | 0100000, rs2, rs1, 000, rd, 0110011 |
| SLL     | 0000000, rs2, rs1, 001, rd, 0110011 |
| SLT     | 0000000, rs2, rs1, 010, rd, 0110011 |
| SLTU    | 0000000, rs2, rs1, 011, rd, 0110011 |
| XOR     | 0000000, rs2, rs1, 100, rd, 0110011 |
| SRL     | 0000000, rs2, rs1, 101, rd, 0110011 |
| SRA     | 0100000, rs2, rs1, 101, rd, 0110011 |
| OR      | 0000000, rs2, rs1, 110, rd, 0110011 |
| AND     | 0000000, rs2, rs1, 111, rd, 0110011 |
| FENCE   | 0000, pred, succ, 00000, 000, 00000, 0001111 |
| FENCE.I | 0000, 0000, 0000, 00000, 001, 00000, 0001111 |
| ECALL   | 000000000000, 00000, 000, 00000, 1110011 |
| EBREAK  | 000000000001, 00000, 000, 00000, 1110011 |
| CSRRW   | csr, rs1, 001, rd, 1110011 |
| CSRRS   | csr, rs1, 010, rd, 1110011 |
| CSRRC   | csr, rs1, 011, rd, 1110011 |
| CSRRWI  | csr, zimm, 101, rd, 1110011 |
| CSRRSI  | csr, zimm, 110, rd, 1110011 |
| CSRRCI  | csr, zimm, 111, rd, 1110011 |

Need datapath and control to implement these instructions

#### Implementing the add instruction



Instruction makes two changes to machine's state:

```
Reg[rd] = Reg[rs1] + Reg[rs2]
PC = PC + 4
```
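
As a hedged illustration, the two state changes can be written as one Python function. The register file is modeled as a plain list of 32 integers, and `execute_add` is a made-up name; the 32-bit wraparound and the x0 check are modeling choices that follow the state rules given earlier.

```python
MASK32 = 0xFFFFFFFF


def execute_add(regs, pc, rd, rs1, rs2):
    """Apply the two state changes of add and return the new PC."""
    result = (regs[rs1] + regs[rs2]) & MASK32  # wrap the sum to 32 bits
    if rd != 0:                                # x0 always stays 0
        regs[rd] = result
    return (pc + 4) & MASK32


regs = [0] * 32
regs[1], regs[2] = 5, 7
new_pc = execute_add(regs, 0x100, rd=3, rs1=1, rs2=2)
```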

# Datapath for add



# Timing Diagram for add



#### Implementing the **sub** instruction

| 31:25   | 24:20 | 19:15 | 14:12 | 11:7 | 6:0     |     |
|---------|-------|-------|-------|------|---------|-----|
| 0000000 | rs2   | rs1   | 000   | rd   | 0110011 | add |
| 0100000 | rs2   | rs1   | 000   | rd   | 0110011 | sub |

#### sub rd, rs1, rs2

- Almost the same as add, except now have to subtract operands instead of adding them
- inst[30] selects between add and subtract
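
A small sketch of how inst[30] can steer the ALU between the two operations. The function name is illustrative, and operands are assumed to be 32-bit unsigned values.

```python
MASK32 = 0xFFFFFFFF


def add_or_sub(inst, a, b):
    """Use instruction bit 30 to choose between a + b and a - b."""
    subtract = (inst >> 30) & 1
    result = a - b if subtract else a + b
    return result & MASK32  # keep the result in 32 bits (two's complement)


ADD_X3_X1_X2 = 0x002081B3  # funct7 = 0000000
SUB_X3_X1_X2 = 0x402081B3  # funct7 = 0100000, so inst[30] = 1
```

The two encodings differ in exactly one bit, which is why a single control wire from inst[30] is enough to extend the add datapath to sub.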

# Datapath for add/sub



#### Implementing other R-Format instructions

| funct7  | rs2 | rs1 | funct3 | rd | opcode  | Instruction |
|---------|-----|-----|--------|----|---------|-------------|
| 0000000 | rs2 | rs1 | 000    | rd | 0110011 | add         |
| 0100000 | rs2 | rs1 | 000    | rd | 0110011 | sub         |
| 0000000 | rs2 | rs1 | 001    | rd | 0110011 | sll         |
| 0000000 | rs2 | rs1 | 010    | rd | 0110011 | slt         |
| 0000000 | rs2 | rs1 | 011    | rd | 0110011 | sltu        |
| 0000000 | rs2 | rs1 | 100    | rd | 0110011 | xor         |
| 0000000 | rs2 | rs1 | 101    | rd | 0110011 | srl         |
| 0100000 | rs2 | rs1 | 101    | rd | 0110011 | sra         |
| 0000000 | rs2 | rs1 | 110    | rd | 0110011 | or          |
| 0000000 | rs2 | rs1 | 111    | rd | 0110011 | and         |

All are implemented by decoding the funct3 and funct7 fields and selecting the appropriate ALU function
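
A behavioral sketch of that selection, assuming 32-bit unsigned operand values and using Python branches in place of the hardware mux. The function names are illustrative; the funct3/funct7 values are the ones listed for the R-format instructions.

```python
MASK32 = 0xFFFFFFFF


def to_signed(x):
    """Interpret a 32-bit unsigned value as two's-complement signed."""
    return x - (1 << 32) if x & 0x80000000 else x


def alu_r_type(funct3, funct7, a, b):
    """Select the ALU function for an R-format instruction."""
    alt = funct7 == 0b0100000      # distinguishes sub from add, sra from srl
    shamt = b & 0x1F               # shifts use only the low 5 bits of rs2
    if funct3 == 0b000:
        result = a - b if alt else a + b                     # sub / add
    elif funct3 == 0b001:
        result = a << shamt                                  # sll
    elif funct3 == 0b010:
        result = int(to_signed(a) < to_signed(b))            # slt
    elif funct3 == 0b011:
        result = int(a < b)                                  # sltu
    elif funct3 == 0b100:
        result = a ^ b                                       # xor
    elif funct3 == 0b101:
        result = to_signed(a) >> shamt if alt else a >> shamt  # sra / srl
    elif funct3 == 0b110:
        result = a | b                                       # or
    else:
        result = a & b                                       # and
    return result & MASK32
```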

# Implementing I-Format - addi instruction

#### RISC-V Assembly Instruction:

addi x15, x1, -50

| imm[11:0]    | rs1   | funct3 | rd    | opcode  |
|--------------|-------|--------|-------|---------|
| 111111001110 | 00001 | 000    | 01111 | 0010011 |
| imm=-50      | rs1=1 | add    | rd=15 | OP-Imm  |
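
The encoding in the table can be checked by packing the fields by hand. This is only a sketch; `encode_i_type` is a hypothetical helper, not an assembler API.

```python
def encode_i_type(imm, rs1, funct3, rd, opcode):
    """Pack I-format fields into a 32-bit instruction word."""
    return (
        ((imm & 0xFFF) << 20)   # 12-bit two's-complement immediate
        | (rs1 << 15)
        | (funct3 << 12)
        | (rd << 7)
        | opcode
    )


# imm = -50, rs1 = x1, funct3 = 000 (add), rd = x15, opcode = OP-Imm
word = encode_i_type(-50, 1, 0b000, 15, 0b0010011)
bits = format(word, "032b")
```

Splitting `bits` at positions 12, 17, 20, and 25 gives back the five fields of the table row.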

#### Datapath for add/sub



# Adding addi to Datapath





#### **I-Format immediates**



imm[31:0]



- High 12 bits of instruction (inst[31:20]) copied to low 12 bits of immediate (imm[11:0])
- Immediate is sign-extended by copying value of inst[31] to fill the upper 20 bits of the immediate value (imm[31:12])
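
The two steps above (copy inst[31:20], then replicate inst[31]) can be sketched as follows. The function name is illustrative, and the result is returned as a 32-bit unsigned bit pattern.

```python
def imm_i(inst):
    """Generate the 32-bit I-format immediate from an instruction word."""
    imm = (inst >> 20) & 0xFFF        # inst[31:20] -> imm[11:0]
    if inst & 0x80000000:             # inst[31] is the sign bit
        imm |= 0xFFFFF000             # fill imm[31:12] with ones
    return imm


# 0xFCE08793 is addi x15, x1, -50; 0x00812703 is lw x14, 8(x2)
negative = imm_i(0xFCE08793)
positive = imm_i(0x00812703)
```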

#### R+I Datapath



#### Peer Instruction

- 1) Program counter is a register
- 2) We should use the main ALU to compute PC=PC+4 in order to save some gates
- 3) The ALU is a synchronous state element

Answer choices (T/F for statements 1, 2, 3):

- A: FFF
- B: FFT
- C: FTF
- D: FTT
- E: TFF
- G: TTF
- H: TTT

#### Add lw

RISC-V Assembly Instruction (I-type): lw x14, 8(x2)



- The 12-bit signed immediate is added to the base address in register rs1 to form the memory address
  - This is very similar to the add-immediate operation, but it is used to create an address, not the final result
- The value loaded from memory is stored in register rd
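
A minimal sketch of lw under simplifying assumptions: data memory is a Python list of 32-bit words, the address is word-aligned, and `execute_lw` is a made-up name. The ALU step is the same addition addi performs; only the destination of the sum differs.

```python
MASK32 = 0xFFFFFFFF


def execute_lw(regs, dmem_words, rd, rs1, imm):
    """Compute the address with the ALU, then load the word into rd."""
    addr = (regs[rs1] + imm) & MASK32    # same adder path as addi
    if rd != 0:                          # x0 always stays 0
        regs[rd] = dmem_words[addr // 4] # word index = byte address / 4
    return addr


regs = [0] * 32
regs[2] = 0x100
dmem_words = [i * 10 for i in range(128)]
addr = execute_lw(regs, dmem_words, rd=14, rs1=2, imm=8)  # lw x14, 8(x2)
```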

# Adding lw to Datapath



#### All RV32 Load Instructions

| imm[11:0] | rs1 | funct3 | rd | opcode  | Instruction |
|-----------|-----|--------|----|---------|-------------|
| imm[11:0] | rs1 | 000    | rd | 0000011 | lb          |
| imm[11:0] | rs1 | 001    | rd | 0000011 | lh          |
| imm[11:0] | rs1 | 010    | rd | 0000011 | lw          |
| imm[11:0] | rs1 | 100    | rd | 0000011 | lbu         |
| imm[11:0] | rs1 | 101    | rd | 0000011 | lhu         |

funct3 field encodes size and 'signedness' of load data

- Supporting the narrower loads requires additional logic to extract the correct byte/halfword from the value loaded from memory, and sign- or zero-extend the result to 32 bits before writing back to register file.
  - In hardware this is just a small mux modification
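
The extraction and extension logic can be sketched by decoding funct3 directly. This assumes data memory is a `bytes` object and that memory is little-endian; the function name is illustrative.

```python
def load(dmem, addr, funct3):
    """Load a byte, halfword, or word and extend it to 32 bits."""
    size = 1 << (funct3 & 0b011)        # 00 -> 1 byte, 01 -> 2, 10 -> 4
    signed = (funct3 & 0b100) == 0      # lb/lh/lw sign-extend, lbu/lhu do not
    value = int.from_bytes(dmem[addr:addr + size], "little", signed=signed)
    return value & 0xFFFFFFFF           # 32-bit pattern written to rd


dmem = bytes([0x80, 0xFF, 0x00, 0x01])
```

The low two bits of funct3 encode the size and the high bit encodes signedness, so the hardware needs only a byte/halfword select and a choice between sign fill and zero fill.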

#### Adding SW Instruction

sw: Reads two registers, rs1 for the base memory address and rs2 for the data to be stored, as well as an immediate offset!

sw x14, 8(x2)



# Datapath with lw



# Adding sw to Datapath



#### **I+S Immediate Generation**



- Just need a 5-bit mux to select between two positions where low five bits of immediate can reside in instruction
- Other bits in immediate are wired to fixed positions in instruction
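
A sketch of the combined immediate generator. The `is_store` flag stands in for the control signal that drives the 5-bit mux; that signal name and the function name are assumptions made here for illustration.

```python
def imm_i_or_s(inst, is_store):
    """Generate the 32-bit immediate for I-format or S-format."""
    # 5-bit mux: low bits come from inst[11:7] (S) or inst[24:20] (I)
    low5 = (inst >> 7) & 0x1F if is_store else (inst >> 20) & 0x1F
    upper7 = (inst >> 25) & 0x7F        # imm[11:5] is inst[31:25] in both
    imm = (upper7 << 5) | low5
    if inst & 0x80000000:               # sign-extend from inst[31]
        imm |= 0xFFFFF000
    return imm


SW_X14_8_X2 = 0x00E12423        # sw x14, 8(x2)
SW_X14_NEG4_X2 = 0xFEE12E23     # sw x14, -4(x2)
LW_X14_8_X2 = 0x00812703        # lw x14, 8(x2)
```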

# Implementing Branches



- B-format is mostly same as S-Format, with two register sources (rs1/rs2) and a 12-bit immediate
- But now immediate represents values -4096 to +4094 in 2-byte increments
- The 12 immediate bits encode *even* 13-bit signed byte offsets (lowest bit of offset is always zero, so no need to store it)
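
The B-format immediate can be reassembled as below. This sketch follows the field layout imm[12 10:5] in inst[31:25] and imm[4:1 11] in inst[11:7] from the ISA table, and returns a signed Python integer; the function name is illustrative.

```python
def imm_b(inst):
    """Rebuild the signed 13-bit branch offset from a B-format word."""
    imm = (
        (((inst >> 31) & 0x1) << 12)    # imm[12]   <- inst[31]
        | (((inst >> 7) & 0x1) << 11)   # imm[11]   <- inst[7]
        | (((inst >> 25) & 0x3F) << 5)  # imm[10:5] <- inst[30:25]
        | (((inst >> 8) & 0xF) << 1)    # imm[4:1]  <- inst[11:8]
    )                                   # imm[0] is always 0
    return imm - (1 << 13) if imm & (1 << 12) else imm


BEQ_X1_X2_PLUS8 = 0x00208463    # beq x1, x2, +8
BEQ_X1_X2_MINUS4 = 0xFE208EE3   # beq x1, x2, -4
```

Setting every immediate bit except the sign gives +4094, and setting only the sign bit gives -4096, which matches the range stated above.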

# Datapath So Far



#### **Branches**

Different change to the state:

$$PC = \begin{cases} PC + 4, & \text{branch not taken} \\ PC + \text{immediate}, & \text{branch taken} \end{cases}$$

Six branch instructions:

```
BEQ, BNE, BLT, BGE, BLTU, BGEU
```

- Need to compute PC + immediate and to compare values of rs1 and rs2
  - But we have only one ALU, so we need more hardware
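
The PC update for branches amounts to a 2-way selection. A minimal sketch, with an illustrative function name and the immediate passed as a signed integer:

```python
def next_pc(pc, imm, branch_taken):
    """Select the next PC: PC + immediate if taken, else PC + 4."""
    target = pc + imm if branch_taken else pc + 4
    return target & 0xFFFFFFFF   # PC is a 32-bit register
```

In the datapath this corresponds to two adders (or the ALU plus a dedicated PC + 4 adder) feeding a mux in front of the PC register.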

# **Adding Branches**



# **Branch Comparator**



- BrEq = 1, if A=B
- BrLT = 1, if A < B
- BrUn = 1 selects unsigned comparison for BrLT, 0 = signed
- BGE branch: $A \geq B$ if $\overline{A < B}$

$$\overline{A < B} = !(A < B)$$
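
A behavioral sketch of the comparator and of how the six branch conditions can be derived from BrEq and BrLT alone. The function names are illustrative, and driving BrUn from the funct3 values of BLTU and BGEU is an assumption about the control logic.

```python
def to_signed(x):
    """Interpret a 32-bit unsigned value as two's-complement signed."""
    return x - (1 << 32) if x & 0x80000000 else x


def branch_comparator(a, b, br_un):
    """Return (BrEq, BrLT) for 32-bit register values a and b."""
    if not br_un:
        a, b = to_signed(a), to_signed(b)
    return int(a == b), int(a < b)


def branch_taken(funct3, a, b):
    """Decide whether a branch is taken using only BrEq and BrLT."""
    br_un = funct3 in (0b110, 0b111)           # BLTU, BGEU
    br_eq, br_lt = branch_comparator(a, b, br_un)
    if funct3 == 0b000:                        # BEQ
        return bool(br_eq)
    if funct3 == 0b001:                        # BNE
        return not br_eq
    if funct3 in (0b100, 0b110):               # BLT, BLTU
        return bool(br_lt)
    if funct3 in (0b101, 0b111):               # BGE, BGEU: A >= B is !(A < B)
        return not br_lt
    raise ValueError("funct3 does not encode a branch")
```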

#### Branch Immediates (In Other ISAs)

- 12-bit immediate encodes PC-relative offset of -4096 to +4094 bytes in multiples of 2 bytes
- Standard approach: Treat immediate as in range -2048..+2047, then shift left by 1 bit to multiply by 2 for branches



Each instruction immediate bit can appear in one of two places in the output immediate value, so we need one 2-way mux per bit