# CS 110 Computer Architecture Lecture 30:

Course Summary
Video 1: Admin

#### Instructors:

Sören Schwertfeger & Chundong Wang

https://robotics.shanghaitech.edu.cn/courses/ca/20s/

School of Information Science and Technology SIST

ShanghaiTech University

Slides based on UC Berkeley's CS61C



## Quiz on Security

Piazza: "Video Lecture 29 Security"

- Which one of the following statements is NOT true?
  - A. In a processor with inclusive cache, Flush+Reload will flush a cache line from all levels of CPU cache.
  - B. Monitoring network traffic may provide a side channel to attack cloud computing applications.
  - C. Smartphones are not affected by Rowhammer because the DRAM chips in them are battery-backed.
  - D. Using Meltdown to dump the entire kernel memory does not require any permission or privilege.




## **Final**

- Date: Tuesday, June 23rd, 2020
- Time: 8:00 to 10:00 (normal lecture slot++)
  - Be there by 7:45 at the latest; we start at 8:00 sharp!
- Venue: 3 rooms; <u>check on egate which room you are in!</u>
  - Teaching Center (教学中心) rooms 201, 202, and 203
- Closed book:
  - You can bring <u>three</u> A4 pages with notes (both sides; in <u>English</u>): Write your Chinese and <u>Pinyin</u> name on the top! <u>Handwritten</u> by you!
  - You will be provided with the RISC-V "green sheet"
  - No other material allowed!

## **Final**

- Wear your Corona mask!
- Switch cell phones off! (not silent mode – off!)
  - Put them in your bags.
- Bags under the table. Nothing on the table except paper, pen, 1 drink, 1 snack, and your student ID card!
- No other electronic devices are allowed!
  - No ear plugs, music, smartwatch...
- Anybody touching any electronic device will FAIL the course!
- Anybody found cheating (copying your neighbor's answers, using additional material, ...) will FAIL the course!
- Content: Everything!

## **Next Lecture**

- Next Online Lecture:
  - Q&A!
  - Prepare your questions
  - We will try to answer live or on Piazza







**Admin** 

[Figure: cover of the textbook *Computer Organization and Design: The Hardware/Software Interface* by David A. Patterson and John L. Hennessy]




## CS 110 Computer Architecture Lecture 30:

Course Summary
Video 2: WSC to CPUs


## New School Computer Architecture (1/3)



## New School Computer Architecture (2/3)





## **Old Machine Structures**



## New-School Machine Structures (It's a bit more complicated!)

Software

- Parallel Requests
  - Assigned to computer
  - e.g., Search "Katz"
- Parallel Threads
  - Assigned to core
  - e.g., Lookup, Ads
- Parallel Instructions
  - >1 instruction @ one time
  - e.g., 5 pipelined instructions
- Parallel Data
  - >1 data item @ one time
  - e.g., Add of 4 pairs of words
- Hardware descriptions
  - All gates functioning in parallel at same time

Programming Languages



## Great Ideas in Computer Architecture

- 1. Design for Moore's Law
- 2. Abstraction to Simplify Design
- 3. Make the Common Case Fast
- 4. Dependability via Redundancy
- 5. Memory Hierarchy
- 6. Performance via Parallelism/Pipelining/Prediction

## Powers of Ten inspired CA Overview

- Going top-down, we cover 3 views:
- 1. Architecture (when possible)
- 2. Physical Implementation of that architecture
- 3. Programming system for that architecture and implementation (when possible)

See http://www.powersof10.com/film

## 10<sup>7</sup> meters

## **Earth**



[Figure: Google map of the Earth; scale bar 2000 km]

### 10<sup>7</sup> meters

[Figure: Google map of the US West Coast, from Vancouver to San Francisco; scale bar 200 km]

## 10<sup>6</sup> meters



## The Dalles, Oregon 10<sup>4</sup> meters



## Google's Oregon WSC 10<sup>3</sup> meters



### 10<sup>4</sup> meters

## Google's Oregon WSC



## Google Warehouse

- 90 meters by 75 meters, 10 Megawatts
- Contains 40,000 servers, 190,000 disks
- Power Usage Effectiveness (PUE): 1.23
  - 85% of the 0.23 overhead goes to cooling losses
  - 15% of the 0.23 overhead goes to power losses
- Contains 45 containers, each 40 feet long
  - 8 feet x 9.5 feet x 40 feet
- 30 stacked as double layer, 15 as single layer
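
The overhead figures above can be turned into a small calculation. This is a minimal sketch that reads the 1.23 figure as the ratio of total facility power to IT equipment power, and that assumes (the slide does not say so) that the 10 Megawatts is the total facility power. The names `PowerSplit` and `split_power` are hypothetical.

```c
/* Breakdown of facility power, all fields in the same unit as the input. */
typedef struct {
    double it;         /* power that reaches the IT equipment */
    double cooling;    /* overhead lost to cooling */
    double power_loss; /* overhead lost in power distribution */
} PowerSplit;

/* Splits total facility power using the efficiency ratio (total / IT power)
   and the share of the overhead that goes to cooling.
   Assumption: `total` is the whole facility's power draw. */
PowerSplit split_power(double total, double ratio, double cooling_share)
{
    PowerSplit s;
    s.it = total / ratio;
    double overhead = total - s.it;       /* (ratio - 1) * IT power */
    s.cooling = cooling_share * overhead;
    s.power_loss = overhead - s.cooling;
    return s;
}
```

Under these assumptions, `split_power(10.0, 1.23, 0.85)` gives roughly 8.13 MW for the servers, 1.59 MW for cooling, and 0.28 MW for power losses.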

## Containers in WSCs

## 10<sup>2</sup> meters



## **Google Container**

## 10<sup>1</sup> meters



## **Google Container**

## 10<sup>0</sup> meters





- 2 long rows, each with 29 racks
- Cooling below raised floor
- Hot air returned behind racks

## **Equipment Inside a Container**



- Server (in rack format)
- 7-foot rack: servers + Ethernet local area network switch in the middle ("rack switch")
- Array (aka cluster): server racks + larger local area network switch ("array switch")
  - 10X faster => cost 100X: cost is f(N<sup>2</sup>)

## Great Ideas in Computer Architecture

- 1. Design for Moore's Law
  - WSC, Container, Rack
- 2. Abstraction to Simplify Design
- 3. Make the Common Case Fast
- 4. Dependability via Redundancy
  - Multiple WSCs, Multiple Racks, Multiple Switches
- 5. Memory Hierarchy
- 6. Performance via Parallelism/Pipelining/Prediction
  - Task-level Parallelism, Data-level Parallelism

# Google Server Internals 10<sup>-1</sup> meters



## Facebook Datacenter





# Software: Often uses MapReduce

- Simple data-parallel programming model and implementation for processing large datasets
- Users specify the computation in terms of
  - a map function, and
  - a reduce function
- Underlying runtime system
  - Automatically *parallelizes* the computation across large-scale clusters of machines
  - Handles machine failure
  - Schedules inter-machine communication to make efficient use of the networks
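
To make the programming model concrete, here is a minimal sequential sketch of the map/reduce split in C. It is not the MapReduce runtime: it drops the key/value pairs, the distribution across machines, and the failure handling, and only shows how a user-supplied map function and reduce function fit together. All names (`map_square`, `reduce_add`, `map_reduce`, `sum_of_squares`) are hypothetical.

```c
#include <stddef.h>

/* Hypothetical user-supplied map function: one intermediate value per input. */
static long map_square(long x) { return x * x; }

/* Hypothetical user-supplied reduce function: combines two values. */
static long reduce_add(long acc, long v) { return acc + v; }

/* Sequential stand-in for the runtime: applies map to every input and
   folds the results with reduce. A real runtime would run the map calls
   on many machines in parallel. */
long map_reduce(const long *in, size_t n,
                long (*map)(long), long (*reduce)(long, long), long init)
{
    long acc = init;
    for (size_t i = 0; i < n; i++)
        acc = reduce(acc, map(in[i]));
    return acc;
}

/* Example job: sum of squares of the inputs. */
long sum_of_squares(const long *in, size_t n)
{
    return map_reduce(in, n, map_square, reduce_add, 0);
}
```

Because the map calls are independent of each other and the reduce function here is associative, the work can be split across machines in any order, which is what lets a runtime parallelize it automatically.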

# Programming Multicore Microprocessor: OpenMP

```
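#include <omp.h>
#include <stdio.h>
#define NUM_STEPS 100000   /* array size must be a compile-time constant */
int value[NUM_STEPS];
int reduce(void)
{
   int i; int sum = 0;
   /* Each thread sums into a private copy of sum; OpenMP adds the copies at the end */
#pragma omp parallel for reduction(+:sum)
   for (i = 0; i < NUM_STEPS; i++) {
     sum = sum + value[i];
   }
   return sum;
}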
```

# Great Ideas in Computer Architecture

- 1. Design for Moore's Law
  - More transistors = Multicore + SIMD
- 2. Abstraction to Simplify Design
- 3. Make the Common Case Fast
- 4. Dependability via Redundancy
- 5. Memory Hierarchy
  - More transistors = Cache Memories
- 6. Performance via Parallelism/Pipelining/Prediction
  - Thread-level Parallelism

# **AMD Opteron Microprocessor**



#### AMD Opteron Microarchitecture



# **AMD Opteron Pipeline Flow**

For integer operations



- 12 stages (Floating Point is 17 stages)
- Up to 106 RISC-ops in progress

# **AMD Opteron Block Diagram**



# **AMD Opteron Microprocessor**



# **AMD Opteron Core**



# Zoom into a Microchip





# Q & A



#### Remember:

- One more Video ;)
- Prepare Questions for next session

# CS 110 Computer Architecture Lecture 30:

Course Summary

Video 3: CPUs to Transistors


# **AMD Opteron Core**



# Programming One Core: C with Intrinsics

```
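#include <xmmintrin.h>  /* SSE intrinsics */

/* C += A * B for n x n column-major float matrices.
   Assumes n is a multiple of 4 and A, C are 16-byte aligned. */
void mmult(int n, float *A, float *B, float *C)
{
  for (int i = 0; i < n; i += 4)
    for (int j = 0; j < n; j++) {
      __m128 c0 = _mm_load_ps(C+i+j*n);   /* 4 consecutive elements of column j of C */
      for (int k = 0; k < n; k++)
        c0 = _mm_add_ps(c0, _mm_mul_ps(_mm_load_ps(A+i+k*n),
                                       _mm_load1_ps(B+k+j*n)));
      _mm_store_ps(C+i+j*n, c0);
    }
}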
```

# Inner loop from gcc -O -S

Assembly snippet from innermost loop:

```
movaps (%rax), %xmm9
mulps %xmm0, %xmm9
addps %xmm9, %xmm8
movaps 16(%rax), %xmm9
mulps %xmm0, %xmm9
addps %xmm9, %xmm7
movaps 32(%rax), %xmm9
mulps %xmm0, %xmm9
addps %xmm9, %xmm6
movaps 48(%rax), %xmm9
mulps %xmm0, %xmm9
addps %xmm9, %xmm5
```

## Great Ideas in Computer Architecture

- 1. Design for Moore's Law
- 2. Abstraction to Simplify Design
  - Instruction Set Architecture, Micro-operations
- 3. Make the Common Case Fast
- 4. Dependability via Redundancy
- 5. Memory Hierarchy
- 6. Performance via Parallelism/Pipelining/Prediction
  - Instruction-level Parallelism (superscalar, pipelining)
  - Data-level Parallelism

#### SIMD Adder

- Four 32-bit adders that operate in parallel
  - Data Level Parallelism
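
From the programmer's side, the four parallel adders show up as a single instruction that adds four pairs of 32-bit integers. A minimal sketch using the SSE2 intrinsic `_mm_add_epi32`, assuming an x86 target; the wrapper name `simd_add4` is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */
#include <stdint.h>

/* Adds four pairs of 32-bit integers with one SIMD add.
   Unaligned loads/stores are used so the arrays need no special alignment. */
void simd_add4(const int32_t a[4], const int32_t b[4], int32_t out[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
}
```

Each of the four lanes is handled by its own 32-bit adder, so no carry propagates from one lane into the next.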



## One 32-bit Adder



# 1 bit of 32-bit Adder



# Complementary MOS Transistors (NMOS and PMOS) of NAND Gate



| x       | y       | z       |
|---------|---------|---------|
| 0 volts | 0 volts | 3 volts |
| 0 volts | 3 volts | 3 volts |
| 3 volts | 0 volts | 3 volts |
| 3 volts | 3 volts | 0 volts |
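
The table can be reproduced with a switch-level model of the gate. This is a sketch rather than a circuit simulation: it assumes the usual CMOS NAND arrangement of two PMOS transistors in parallel between the supply and the output and two NMOS transistors in series between the output and ground. The names `VDD` and `nand_volts` are hypothetical.

```c
#include <stdbool.h>

#define VDD 3  /* supply voltage in volts, matching the table above */

/* Switch-level model of a CMOS NAND gate; inputs and output are in volts. */
int nand_volts(int x, int y)
{
    bool x_high = (x == VDD);
    bool y_high = (y == VDD);
    /* Pull-up network: two PMOS in parallel, each conducts when its input is low. */
    bool pull_up = !x_high || !y_high;
    /* Pull-down network: two NMOS in series, both must conduct. */
    bool pull_down = x_high && y_high;
    /* For valid inputs exactly one network conducts. */
    return (pull_up && !pull_down) ? VDD : 0;
}
```

Because the two networks are complementary, the output is always connected to either the supply or ground, never both.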

# Physical Layout of NAND Gate 10<sup>-7</sup> meters



#### 10<sup>-7</sup> meters

# Scanning Electron Microscope

# 100 nanometers



Top View



**Cross Section** 

# How to make a CMOS chip?

#### 10<sup>-6</sup> meters

# **Block Diagram of Static RAM**



### 1 Bit SRAM in 6 Transistors



# Physical Layout of SRAM Bit



#### **SRAM Cross Section**



#### **DIMM Module**

- DDR = Double Data Rate
  - Transfers bits on Falling AND Rising Clock Edge
- Has Single Error Correcting, Double Error Detecting Redundancy (SEC/DED)
  - 72 bits to store 64 bits of data
  - Uses "chipkill" organization so that if a single DRAM chip fails, the failure can still be detected
- Average server has 22,000 correctable errors and 1 uncorrectable error per year
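
The 72-bits-for-64 figure is consistent with a Hamming code extended by one overall parity bit. The slide does not say which code the DIMM uses, so the following is a sketch under that assumption; the function name `secded_check_bits` is hypothetical.

```c
/* Number of check bits for SEC/DED over `data_bits` data bits, assuming a
   Hamming code plus one overall parity bit:
   - single-error correction needs the smallest r with 2^r >= data_bits + r + 1
   - double-error detection adds one more parity bit over the whole word */
int secded_check_bits(int data_bits)
{
    int r = 0;
    while ((1L << r) < (long)data_bits + r + 1)
        r++;
    return r + 1;
}
```

For 64 data bits this gives 7 + 1 = 8 check bits, for a total of 72 stored bits.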

## **DRAM Bits**



#### **DRAM Cell in Transistors**



# Physical Layout of DRAM Bit



#### 10<sup>-7</sup> meters

# 100 nanometers

## **Cross Section of DRAM Bits**



# **TSMC**



# **AMD Opteron Dependability**

- L1 cache data is SEC/DED protected
- L2 cache and tags are SEC/DED protected
- DRAM is SEC/DED protected with chipkill
- On-chip and off-chip ECC protected arrays include autonomous, background hardware scrubbers
- Remaining arrays are parity protected
  - Instruction cache, tags and TLBs
  - Data tags and TLBs
  - Generally read-only data that can be recovered from lower levels
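
Parity is the cheapest form of redundancy in this list: one extra bit detects any odd number of flipped bits but cannot say which bit flipped, which is why it suits data that can simply be fetched again. A minimal sketch for a 32-bit word, assuming even parity; the names `parity32` and `parity_error` are hypothetical.

```c
#include <stdint.h>

/* Even-parity bit of a 32-bit word: 1 if the number of set bits is odd. */
int parity32(uint32_t w)
{
    w ^= w >> 16;
    w ^= w >> 8;
    w ^= w >> 4;
    w ^= w >> 2;
    w ^= w >> 1;
    return (int)(w & 1u);
}

/* Returns 1 if the word no longer matches its stored parity bit,
   i.e. an odd number of bits have flipped since the parity was computed. */
int parity_error(uint32_t w, int stored_parity)
{
    return parity32(w) != stored_parity;
}
```

A single flipped bit is detected, while two flipped bits cancel out and go unnoticed; SEC/DED closes that gap for the arrays that hold data which cannot be refetched.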

# Programming Memory Hierarchy: Cache Blocked Algorithm

The blocked version of the i-j-k algorithm is written simply as follows (A, B, C are sub-matrices of a, b, c):

```
/* Pseudocode: each operand is an r x r sub-matrix, so += and * are block operations */
for (i=0;i<N/r;i++)
  for (j=0;j<N/r;j++)
    for (k=0;k<N/r;k++)
      C[i][j] += A[i][k]*B[k][j];
```

- r = block (sub-matrix) size (Assume r divides N)
- X[i][j] = a sub-matrix of X, defined by block row i and block column j
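
Expanding the block operations into element loops gives a runnable version. This is a minimal sketch assuming row-major storage in flat arrays and, as stated above, that r divides n; the function name `blocked_mmult` is hypothetical.

```c
#include <stddef.h>

/* c += a * b for n x n row-major matrices, processed in r x r blocks.
   Assumes r divides n. */
void blocked_mmult(size_t n, size_t r,
                   const double *a, const double *b, double *c)
{
    for (size_t bi = 0; bi < n; bi += r)
        for (size_t bj = 0; bj < n; bj += r)
            for (size_t bk = 0; bk < n; bk += r)
                /* One block step: C[bi][bj] += A[bi][bk] * B[bk][bj] */
                for (size_t i = bi; i < bi + r; i++)
                    for (size_t j = bj; j < bj + r; j++) {
                        double sum = c[i * n + j];
                        for (size_t k = bk; k < bk + r; k++)
                            sum += a[i * n + k] * b[k * n + j];
                        c[i * n + j] = sum;
                    }
}
```

The three inner loops touch only three r x r blocks at a time, so choosing r such that those blocks fit in the cache lets each loaded element be reused r times before it is evicted.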

# Great Ideas in Computer Architecture

- 1. Design for Moore's Law
  - Higher capacity caches and DRAM
- 2. Abstraction to Simplify Design
- 3. Make the Common Case Fast
- 4. Dependability via Redundancy
  - Parity, SEC/DED
- 5. Memory Hierarchy
  - Caches, TLBs
- 6. Performance via Parallelism/Pipelining/Prediction
  - Data-level Parallelism

## **Course Summary**

- As the field changes, Computer Architecture courses change, too!
- It is still about the software-hardware interface
  - Programming for performance!
  - Parallelism: Task-, Thread-, Instruction-, and Data-level (MapReduce, OpenMP, C, SSE Intrinsics)
  - Understanding the memory hierarchy and its impact on application performance