

#### **About This Presentation**

This slide show was first developed as a keynote talk for remote delivery at CSICC-2016, Computer Society of Iran Computer Conference, held in Tehran on March 8-10. The talk was presented at a special session on March 9, 11:30 AM to 12:30 PM Tehran time (12:00-1:00 AM PST). An updated version of the talk was prepared and used within B. Parhami's suite of lectures for IEEE Computer Society's Distinguished Visitors Program, 2021-2023. All rights reserved for the author. ©2021 Behrooz Parhami

| Edition | Released  | Revised   | Revised | Revised |
|---------|-----------|-----------|---------|---------|
| First   | Mar. 2016 | July 2017 |         |         |
| Second  | Mar. 2021 |           |         |         |
|         |           |           |         |         |





## My Personal Academic Journey







## Some of the material in this talk come from, or will appear in updated versions of, my two computer architecture textbooks





Mar. 2021



Eight Key Ideas in Computer Architecture



## Eight Key Ideas in Computer Architecture, from Eight Decades of Innovation

Computer architecture became an established discipline when the stored-program concept was incorporated into barebones computers of the 1940s. Since then, the field has seen multiple minor and major innovations in each decade. I will present my pick of the most-important innovation in each of the eight decades, from the 1940s to the 2010s, and show how these ideas, when connected to each other and allowed to interact and cross-fertilize, produced the phenomenal growth of computer performance, now approaching exa-op/s (billion billion operations per second) level, as well as to ultra-low-energy and single-chip systems. I will also offer predictions for what to expect in the 2020s and beyond.





# Background: 1820s-1930s





Program (Instructions)

Data (Variable values)









#### Difference Engine: Fixed Program



f(x) $D^{(1)}$  $D^{(2)}$  $x^2 + x + 41$ 43 10

Babbage's Difference Engine 2

2nd-degree polynomial evaluation Babbage used 7th-degree f(x)







## Analytical Engine: Programmable





Ada Lovelace, world's first programmer

Sample program >

| Number of Operation | 8                   | Operation Variables acted upon              | receiving change             |                                                                                                                                                                                                                                 | ,                                                                                        | Data  |                           |                         |        |        |
|---------------------|---------------------|---------------------------------------------|------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------|-------|---------------------------|-------------------------|--------|--------|
|                     | Nature of Operation |                                             |                              | Indication of<br>change in the<br>value on any<br>Variable                                                                                                                                                                      | any Statement of Results                                                                 |       | 1 V 1<br>0<br>0<br>0<br>2 | 1V,<br>0<br>0<br>0<br>4 | \$0000 | \$0000 |
| Nun                 | Na                  |                                             |                              |                                                                                                                                                                                                                                 |                                                                                          | 1     | 2                         | B                       | Ш      | _      |
| i                   | ×                   | TV2 × TV3                                   | 1V4, 1V6, 1V6                | $\begin{cases} {}^{1}V_{4} = {}^{1}V_{4} \\ {}^{1}V_{3} = {}^{1}V_{3} \end{cases}$                                                                                                                                              | = 2n                                                                                     |       | 2                         | л                       | 2я     | 2:     |
| 2                   | -                   | $^{1}V_{4} - ^{1}V_{1}$                     | *V                           | [ 1 . A. = . A. ]                                                                                                                                                                                                               | = 2n-1                                                                                   | 1     | ,                         | \$220                   | 2n -   | 1      |
| 3                   | +                   | 1V6 + 1V1                                   | ²V4                          | 1V = V                                                                                                                                                                                                                          | = 2n+1                                                                                   | ı     | ***                       | 144                     | 1      | 24-1-  |
| 4                   | +                   | *V. +*V.                                    | ιγ <sub>ι</sub> ,            | $\left\{ \begin{array}{l} 2V_{4} = 0V_{4} \\ 2V_{4} = 0V_{4} \end{array} \right\}$                                                                                                                                              | $=\frac{2n-1}{2n+1}$                                                                     |       | ***                       | 400                     | 0      | 0      |
| 5                   | +                   | 'V11+1V4                                    | *V <sub>11</sub>             | \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\                                                                                                                                                                                          | $=\frac{1}{2}\cdot\frac{2n-1}{2n+1}\cdot\dots$                                           | ***   | 2                         | 215                     | 1025   |        |
| 6                   | -                   | 0V13-3V11                                   | 'V1:                         | $\left\{ \begin{array}{l} {}^{2}V_{11} = {}^{0}V_{11} \\ {}^{0}V_{11} = {}^{1}V_{11} \end{array} \right\}$                                                                                                                      | $= -\frac{1}{2} \cdot \frac{2\kappa - 1}{2\kappa + 1} = \Lambda_0 \dots$                 | 1110  |                           | 555                     | 2555   |        |
| 7                   | -                   | $^{1}V_{1} - ^{2}V_{1}$                     | rV <sub>10</sub>             | $\begin{cases} {}^{2}V_{11} = {}^{0}V_{11} \\ {}^{0}V_{13} = {}^{1}V_{13} \end{cases}$ $\begin{cases} {}^{1}V_{3} = {}^{2}V_{3} \\ {}^{1}V_{3} = {}^{2}V_{1} \end{cases}$                                                       | $= n - 1 (= 3) \dots$                                                                    | 1     | 424                       | n                       |        | 275    |
| 8                   | +                   | 'V, +°V,                                    | ۳۷,                          | 07.00 VAP - 17.00 S - 30.00                                                                                                                                                                                                     | = 2+0 = 2                                                                                | 30255 | 2                         | 2010                    |        | -22    |
| 9                   | ÷                   | ${}^{1}V_{\bullet} \div {}^{1}V_{\uparrow}$ | *V <sub>11</sub>             | {*V = *V }                                                                                                                                                                                                                      | $=\frac{2n}{2}=A_1 \ldots \ldots$                                                        | 311.  |                           |                         | ***    |        |
| 0                   | ×                   | *V,, x *V,,                                 | ν <sub>12</sub>              | $\left\{ \begin{array}{l} \left\{ \begin{array}{l} V_{21} = V_{21} \\ V_{21} = V_{21} \end{array} \right\} \end{array} \right.$                                                                                                 | $= B_1 \cdot \frac{2n}{2} = B_1 A_1 \cdot \dots \cdot \dots$                             |       | 111-                      | 400                     | 1344   | Č.     |
| 1                   | +                   | 'V11+ 'V15                                  | *V <sub>14</sub>             | $\begin{cases} {}^{1}V_{12} = {}^{0}V_{12} \\ {}^{1}V_{12} = {}^{1}V_{12} \end{cases}$                                                                                                                                          | $= -\frac{1}{2} \cdot \frac{2n-1}{2n+1} + B_1 \cdot \frac{2n}{2} \cdot \dots$            |       | -22                       | ***                     | 355    | ***    |
| 12                  | -                   | Vto-IV                                      | 2V10                         | $\begin{cases} {}^{1}V_{12} = {}^{0}V_{12} \\ {}^{1}V_{13} = {}^{1}V_{13} \\ {}^{1}V_{10} = {}^{2}V_{10} \\ {}^{1}V_{1} = {}^{1}V_{1} \end{cases}$                                                                              | = n-2(=2)                                                                                | 1     |                           |                         | •••    |        |
| 3                   | <b>r</b> -          | 'V <sub>4</sub> - 'V <sub>1</sub>           | ۱۸ <sup>e</sup>              | $ \left\{ \begin{array}{l} {}^{4}V_{4} = {}^{8}V_{4} \\ {}^{4}V_{1} = {}^{3}V_{1} \\ {}^{4}V_{2} = {}^{3}V_{1} \\ {}^{4}V_{3} = {}^{3}V_{3} \end{array} \right\} $                                                              | = 2n-1                                                                                   | 1     |                           | 3900                    | ***    |        |
| 11                  | 13                  | V, + 1V,                                    | 'V <sub>†</sub>              | $\left\{ \begin{array}{l} i \nabla_1 = i \nabla_1 \\ i \nabla_1 = i \nabla_2 \end{array} \right\}$                                                                                                                              | = 2+1 = 3                                                                                | 1     | 101                       | ,                       | 100    | 2      |
| 5                   | 1 +                 | V <sub>4</sub> + 2V <sub>7</sub>            | 1V4                          | \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\                                                                                                                                                                                          | $=\frac{2n-1}{3}\dots\dots$                                                              | 2.55  | 97                        | 995                     | 222    | ce     |
| 16                  | (×                  | $^{t}V_{8} \times ^{s}V_{11}$               | 'V <sub>11</sub>             | $\left\{ {}^{1}V_{0} = {}^{0}V_{3} \atop {}^{3}V_{0} = {}^{4}V_{0} \right\}$                                                                                                                                                    | $=\frac{2n}{2}\cdot\frac{2n-1}{3}\cdot\ldots$                                            | 1021  | ,                         |                         |        |        |
| 17                  | 1-                  | $^{1}V_{\bullet} - {^{1}V_{1}}$             | *V,                          |                                                                                                                                                                                                                                 | =2n-2                                                                                    | 1     |                           |                         | ***    |        |
| 18                  | +                   | 2V1 + 2V,                                   | °V,                          | $ \begin{cases}                                    $                                                                                                                                                                            | = 3+1 = 4                                                                                | 1     |                           | ,,-                     | 2.00   |        |
| 9                   | 1+                  | ²V, ÷ ⁴V,                                   | ν,                           | \\ \frac{2V_0}{2V_1} = \frac{2V_0}{2V_1} \\ \}                                                                                                                                                                                  | $=\frac{2n-2}{4}\ldots\ldots$                                                            |       | 100                       | 300                     | 100    | 666    |
| 20                  | (×                  | V <sub>a</sub> × V <sub>11</sub>            | *V11                         | $\begin{cases} {}^{1}V_{\bullet} = {}^{0}V_{\bullet} \\ {}^{0}V_{\bullet} = {}^{0}V_{\bullet} \end{cases}$                                                                                                                      | $=\frac{2n}{2}\cdot\frac{2n-1}{3}\cdot\frac{2n-2}{4}=A_3\ldots.$                         |       |                           |                         | Set.   | ***    |
| 21                  | ×                   | 1V21×1V11                                   | ۰۷ <sub>۱۹</sub>             | $\left\{ {}^{i}\nabla_{i} = {}^{i}\nabla_{i} \right\}$                                                                                                                                                                          | $= B_a \cdot \frac{2\pi}{2} \cdot \frac{2\pi - 1}{3} \cdot \frac{2\pi - 2}{3} = B_a A_8$ |       |                           | 550                     | ***    |        |
| 22                  | +                   | *V12+*V18                                   | »V <sub>14</sub>             | $\left\{ {{{{{{}^{2}}{\mathbf{V}}_{12}}} = {{{{}^{0}}{\mathbf{V}}_{12}}}}\atop{{{{}^{2}}{\mathbf{V}}_{12}}}} \right\}$                                                                                                          | $= A_0 + B_1 A_1 + B_1 A_2 \dots$                                                        |       |                           | ••• •                   |        | ••••   |
| 23 (                | -                   | 1V10-1V1                                    | <sup>3</sup> V <sub>10</sub> | $\begin{cases} {}^{1}V_{12} = {}^{9}V_{12} \\ {}^{2}V_{13} = {}^{2}V_{13} \\ {}^{2}V_{14} = {}^{3}V_{14} \\ {}^{3}V_{1} = {}^{1}V_{1} \end{cases}$                                                                              | =n-3(=1)                                                                                 | 1     | ١                         |                         |        |        |
|                     |                     |                                             |                              |                                                                                                                                                                                                                                 |                                                                                          | F     | lere fo                   | llows                   | a rep  | etitia |
| 24                  | +                   | 4V12+4V24                                   | 1V24                         | $\begin{cases} {}^{4}\nabla_{13} = {}^{4}\nabla_{13} \\ {}^{4}\nabla_{14} = {}^{4}\nabla_{14} \\ {}^{4}\nabla_{1} = {}^{4}\nabla_{1} \\ {}^{4}\nabla_{1} = {}^{4}\nabla_{1} \\ {}^{4}\nabla_{2} = {}^{4}\nabla_{2} \end{cases}$ | ∞ В <sub>7</sub>                                                                         |       |                           |                         |        | 1252   |
| 25                  |                     | 'V, +'V,                                    | 'Va                          | $\left\{ \begin{array}{l} {}^{1}V_{1} = {}^{1}V_{1} \\ {}^{1}V_{2} = {}^{1}V_{3} \end{array} \right\}$                                                                                                                          | $= n+1 = 4+1 = 5 \dots$                                                                  | 1     | -944                      | n+ί                     | 275    | 2207   |
|                     |                     | *1 T *4                                     | X 1 Special                  | *V, = °V,                                                                                                                                                                                                                       | by a Variable-card.<br>by a Variable-card.                                               |       |                           |                         |        |        |

Mar. 2021



Eight Key Ideas in Computer Architecture



#### The Programs of Charles Babbage





IEEE Annals of the History of Computing, January-March 2021 (article by Raul Rojas)

#### THE LIST OF PROGRAMS

The 26 Babbage's tables in London's Museum can be arranged in the following groups. There are (with overlaps):

- Eleven programs for linear algebra (mostly solutions of simultaneous equations).
- Four for the evaluation of polynomial multiplication and division.
- Two for computing recursions (using loopunrolling, i.e., sequential code obtained by writing one loop execution right after the other).
- Four for the simulation of a Difference Engine using the so-called carriages.
- Three for the evaluation of astronomical formulas.
- A program showing how to use "combinatorial cards."
- An example program for the computation of a simple formula.

Mar. 2021



Eight Key Ideas in Computer Architecture



# Electromechanical and Plug-Programmable Computing Machines











Mar. 2021



Eight Key Ideas in Computer Architecture



Slide 10



## 1940s: Stored Program

Exactly who came up with the stored-program concept is unclear

Legally, John Vincent Atanasoff is designated as inventor, but many others deserve to share the credit



Babbage



Turing



**Atanasoff** 



**Eckert** 



Mauchly



von Neumann







Eight Key Ideas in Computer Architecture



## First Stored-Program Computer

#### Manchester Small-Scale Experimental Machine

Ran a stored program on June 21, 1948 (Its successor, Manchester Mark 1, operational in April 1949)

#### **EDSAC** (Cambridge University; Wilkes et al.)

Fully operational on May 6, 1949

EDVAC (IAS, Princeton University; von Neumann et al.)

Conceived in 1945 but not delivered until August 1949

**BINAC (Binary Automatic Computer, Eckert & Mauchly)** 

Delivered on August 22, 1949, but did not function correctly

Source: Wikipedia





#### von Neumann vs. Harvard Architecture



von Neumann architecture (unified memory for code & data)

Programs can be modified like data More efficient use of memory space



Harvard architecture (separate memories for code & data)

Better protection of programs
Higher aggregate memory bandwidth
Memory optimization for access type

Mar. 2021



Eight Key Ideas in Computer Architecture



## 1950s: Microprogramming

Traditional control unit design (multicycle): Specify which control signals are to be asserted in each cycle and synthesize



Gives rise to random logic

Error-prone and inflexible

Hardware bugs hard to fix after deployment

Design from scratch for each system

Mar. 2021



Eight Key Ideas in Computer Architecture



## The Birth of Microprogramming

The control state machine resembles a program (microprogram) comprised of instructions (microinstructions) and sequencing



Mar. 2021



Eight Key Ideas in Computer Architecture



## Microprogramming Implementation

Each microinstruction controls the data path for one clock cycle



## 1960s: Parallel Processing

Associative (content-addressed) memories and other forms of parallelism (compute-I/O overlap, functional parallelism) had been in existence since the 1940s







Highly parallel machine, proposed by Daniel Slotnick in 1964, later morphed into ILLIAC IV in 1968 (operational in 1975)

Michael J. Flynn devised his now-famous 4-way taxonomy (SISD, SIMD, MISD, MIMD) in 1966 and Amdahl formulated his speed-up law and rules for system balance in 1967

UCSB

Britis

#### The ILLIAC IV Concept: SIMD Parallelism

Common control unit fetches and decodes instructions, broadcasting the control signals to all PEs

Each PE executes or ignores the instruction based on local, datadependent conditions

The interprocessor routing network is only partially shown

Mar. 2021





#### Various Forms of MIMD Parallelism

#### Global shared memory

- Memory latency
- Memory bandwidth
- Cache coherence





## Distributed shared memory or message-passing architecture

- Scalable network performance
- Flexible and more robust
- Memory consistency model

Eight Key Ideas in Computer Architecture



#### Warehouse-Sized Data Centers



Mar. 2021



Eight Key Ideas in Computer Architecture



## Top 500 Supercomputers in the World



## The Shrinking Supercomputer







#### Status of Computing Power circa 2000 (2010 2020)

 $Giga = 10^9$ 

**GFLOPS** on desktop:

**TFLOPS** 

PFLOPS NVIDIA AI chip

Apple Macintosh, with G4 processor

Tera =  $10^{12}$ 

**TFLOPS** in supercomputer center:

**PFLOPS** 

**EFLOPS** folding@home

1152-proc IBM RS/6000 SP; Cray T3E

Peta =  $10^{15}$ 

 $Exa = 10^{18}$ 

 $Zeta = 10^{21}$ 

**PFLOPS** on the drawing board:

**EFLOPS** 

**ZFLOPS** 

1M-processor IBM Blue Gene (2005?)

32 proc/chip, 64 chips/board, 8 boards/tower, 64 towers Processor: 8 threads, on-chip memory, no data cache

Chip: defect-tolerant, row/column rings in a  $6 \times 6$  array

Board:  $8 \times 8$  chip grid organized as  $4 \times 4 \times 4$  cube

Tower: Boards linked to 4 neighbors in adjacent towers

System: 32×32×32 cube of chips, 1.5 MW (water-cooled)



Briting

## 1970s: Cache Memory

First paper on "buffer" memory: Maurice Wilkes, 1965

First implementation of a general cache memory: IBM 360 Model 85 (J. S. Liptay; *IBM Systems J.*, 1968)

Broad understanding, varied implementations, and studies of optimization and performance issues in the 1970s

#### **Modern cache implementations**

- Harvard arch for L1 cashes
- von Neumann arch higher up
- Many other caches in system besides processor cashes





#### Memory Hierarchy

Hierarchical memory provides the illusion that high speed and large size are achieved simultaneously



Mar. 2021



Eight Key Ideas in Computer Architecture



## Cache Memory Analogy

Hierarchical memory provides the illusion that high speed and large size are achieved simultaneously



Mar. 2021



Eight Key Ideas in Computer Architecture



#### Hit/Miss Rate, and Effective Cycle Time



Go to main 1 - h of the time (say, cache miss rate of 2%)

#### One level of cache with hit rate h

$$C_{\text{eff}} = hC_{\text{fast}} + (1-h)(C_{\text{slow}} + C_{\text{fast}}) = C_{\text{fast}} + (1-h)C_{\text{slow}}$$



## The Locality Principle

#### **Addresses**

#### **Temporal:**

Accesses to the same address are typically clustered in time

#### **Spatial:**

When a location is accessed, nearby locations tend to be accessed also



Illustration of temporal and spatial localities





Computer Architecture, Memory System Design



## Summary of Memory Hierarchy

Cache memory: provides illusion of very high speed

Main memory: reasonable cost, but slow & small

Virtual memory: provides illusion of very large size



Nov. 2014



Computer Architecture, Memory System Design



#### Translation Lookaside Buffer



Program page in virtual memory

```
$t0,0($s1)
   lw
          $t1,$zero,0
   addi
L: add
          $t1,$t1,1
          $t1,$s2,D
   beg
   add
          $t2,$t1,$t1
          $t2,$t2,$t2
   add
          $t2,$t2,$s1
   add
          $t3,0($t2)
   lw
          $t4,$t0,$t3
   slt
   beq
          $t4,$zero,L
          $t0,$t3,0
   addi
D:
```

All instructions on this page have the same virtual page address and thus entail the same translation

Virtual-to-physical address translation by a TLB and how the resulting physical address is used to access the cache memory.



BRITIT

#### Disk Caching and Other Applications

Entire track copied into fast cache

Disk Cache (DRAM) **3.** Disk rotation until sector has passed under the head: **Data transfer time** (< 1 ms)

2. Disk rotation until the desired sector arrives under the head:
Rotational latency (0-10s ms)



#### Web caching

- Client-side caching
- Caching within the cloud
- Server-side caching



Mar. 2021



Eight Key Ideas in Computer Architecture



## 1980s: Pipelining

An important form of parallelism that is given its own name Used from early days of digital circuits in various forms





Mar. 2021



Eight Key Ideas in Computer Architecture



## Vector Processor Implementation



Mar. 2021





#### Overlapped Load/Store and Computation



Vector processing via segmented load/store of vectors in registers in a double-buffering scheme. Solid (dashed) lines show data flow in the current (next) segment.





#### Simple Instruction-Execution Pipeline



Nov. 2014





#### Pipeline Stalls or Bubbles

#### Data dependency and its possible resolution via forwarding



#### Problems Arising from Deeper Pipelines



Forwarding more complex and not always workable Interlocking/stalling mechanisms needed to prevent errors





## Branching and Other Complex Pipelines



Front end: Instr. issue: Write-back: In-order or out-of-order In-order or out-of-order In-order or out-of-order In-order or out-of-order

The more OoO stages, the higher the complexity





**Commit:** 

#### 1990s: FPGAs

Programmable logic arrays were developed in the 1970s

PLAs provided cost-effective and flexible replacements for random logic or ROM/PROM

The related programmable array logic devices came later PALs were less flexible than PLAs, but more cost-effective



## Why FPGA Represents a Paradigm Shift

Modern FPGAs can implement any functionality

Initially used only for prototyping

Even a complete c CPU needs a small fraction of an FPGA's resources

FPGAs come with multipliers and IP cores (CPUs/SPs)



Mar. 2021



Eight Key Ideas in Computer Architecture



## FPGAs Are Everywhere

#### Applications are found in virtually all industry segments:

Aerospace and defense Medical electronics Automotive control Software-defined radio Encoding and decoding











## Example: Bit-Serial 2nd-Order Digital Filter



LUTs, registers, and an adder are all we need for linear expression evaluation:  $y^{(i)} = ax^{(i)} + bx^{(i-1)} + cx^{(i-2)} + dy^{(i-1)} + ey^{(i-2)}$ 





## 2000s: GPUs

Simple graphics and signal processing units were used since the 1970s

In the early 2000s, the two major players, ATI and Nvidia, produced powerful chips to improve the speed of shading

In the late 2000s, GPGPUs (extended stream processors) emerged and were used in lieu of, or in conjunction with, CPUs in high-performance supercomputers

GPUs are faster and more power-efficient than CPUs.

GPUs use a mixture of parallel processing and functional specialization to achieve super-high performance





## CPU vs. GPU Organization

# Small number of powerful cores versus Very large number of simple stream processors

Demo (analogy for MPP): https://www.youtube.com/watch?v=fKK933KK6Gg







#### CPU vs. GPU Performance

Peak performance (GFLOPS) and peak data rate (GB/s)



Mar. 2021





## General-Purpose Computing on GPUs

Suitable for numerically intensive matrix computations

First application to run faster on a GPU was LU factorization

Users can ignore GPU features and focus on problem solving

- Nvidia CUDA Programming System
- Matlab Parallel Computing Toolbox
- C++ Accelerated Massive Parallelism

Many vendors now give users direct access to GPU features



Example system (Titan):
Cray XK7 at DOE's Oak
Ridge Nat'l Lab used more
than ¼ M Nvidia K20x cores
to accelerate computations
(energy-efficient: 2+ gigaflops/W)

Mar. 2021



Eight Key Ideas in Computer Architecture





## Innovations for Improved Performance

(Parhami: Computer Architecture, 2005)

Computer performance grew by a factor of about 10000 between 1980 and 2000 100 due to faster technology 100 due to better architecture

Available computing power ca. 2000: GFLOPS on desktop TFLOPS in supercomputer center PFLOPS on drawing board

#### Architectural method

#### <u>Improvement factor</u>

| <del>o</del> o      | 1. Pipelining (and superpipelining)         | 3-8 √  |   | s <mark>ly</mark><br>ed |
|---------------------|---------------------------------------------|--------|---|-------------------------|
| Established methods | 2. Cache memory, 2-3 levels                 | 2-5 √  |   | noi<br>Noi              |
|                     |                                             | 2-3 √  |   | revi                    |
|                     | 4. Multiple instruction issue (superscalar) | 2-3 √  |   | ਨੂੰ ਤੁ                  |
| Ш                   | 5. ISA extensions (e.g., for multimedia)    | 1-3 √  |   |                         |
| Newer<br>methods    | 6. Multithreading (super-, hyper-)          | 2-5 ?  |   | . <u>=</u>              |
|                     | 7. Speculation and value prediction         | 2-3?   | ١ | red<br>t <              |
|                     | 8. Hardware acceleration [e.g., GPU]        | 2-10 ? |   | Ver                     |
|                     | 9. Vector and array processing              | 2-10 ? |   | O T                     |
|                     | 10. Parallel/distributed computing 2-1      | 000s?  | J |                         |

UCSB

BRITI

## **Shares of Technology and Architecture** in Processor Performance Improvement



Source: "CPU DB: Recording Microprocessor History," CACM, April 2012.





## 2010s: Specialization

Specialization entails high initial cost and creates a danger of obsolescence



Mar. 2021





**FPGA** 

## Hennessy/Patterson's Turing Lecture

Recipients of 2017 ACM Turing Award heralded in their 2018 joint lecture "A New Golden Age of Computer Architecture" in which domain-specific hardware dominates

Further progress in "conventional" architectures impeded by slowing Moore scaling, end of Dennard scaling (power wall), diminishing return at high energy cost for ILP, multi-core, etc., sorry state of security (software-only solutions not working)

- Domain-specific languages along with domain-specific arch's
- Agile hardware-development methodologies (for, otherwise, developing many domain-specific systems would be costly)
- Free open architectures and open-source implementations



## Shades of Domain-Specificity

Domain-specificity is achieved via ASIC design

Nothing new: ASICs have been around since the 1970s

Controller chips for DVD players, automotive electronics, phones

Generality lowers per-unit cost but hurts performance

ASIC's popularity: Standards & high-volume products Higher volume → We can afford to make it more specific





## Google's Tensor-Processing Unit

Al-accelerator ASIC aimed at machine-learning applications Developed in 2015 and released for third-party use in 2018 Pixel Neural Core announced in 2019 for Pixel 4 phone

| Integer | FLP → | Table from Wikipedia |
|---------|-------|----------------------|
|         |       |                      |

| Google TPU           | TPUv1    | TPUv2    | TPUv3    | TPUv4 <sup>[11]</sup> | Edge v1 |
|----------------------|----------|----------|----------|-----------------------|---------|
| Date Introduced      | 2016     | 2017     | 2018     | 2020                  | 2018    |
| Process Node         | 28 nm    | 20 nm?   | 12 nm?   | ?                     |         |
| Die Size (mm2)       | 331      | ?        | ?        | ?                     |         |
| On chip memory (MiB) | 28       | ?        | ?        | ?                     |         |
| Clock Speed (MHz)    | 700      | ?        | ?        | ?                     |         |
| Memory (GB)          | 8GB DDR3 | 16GB HBM | 32GB HBM | ?                     |         |
| TDP(W)               | 40       | 200      | 250      | ?                     | 2       |
| TOPS                 | 23       | 45       | 90       | ?                     | 4       |

**NVIDIA: Tensor Core** 

 $\rightarrow$  2x

 $\rightarrow$  2x

 $\rightarrow$  2x?

Mar. 2021





## Tensor-Processing Unit Architecture

Optimized for matrix multiplication, a key computation in ML Faster than CPU/GPU, 15-30x; More energy-frugal, 30-80x





Mar. 2021



Eight Key Ideas in Computer Architecture



## Perils of Technology Forecasting



Mar. 2021





## 2020s and Beyond: Looking Ahead

#### **Design improvements**

- Adaptation and self-optimization (learning)
- Security (hardware-implemented)
- Reliability via redundancy and self-repair
- Logic-in-memory designs (memory wall)
- Mixed analog/digital design style



#### **Performance improvements**

- Revolutionary new technologies
- New computational paradigms
- Brain-inspired and biological computing
- Speculation and value prediction
- Better performance per watt (power wall)









## Three Directions for Future Systems







# Trends in Processor Chip Density, Performance, Clock Speed, Power, and Number of Cores



NRC Report (2011): The Future of Computing Performance: Game Over or Next Level?





## The Quest for Higher Performance

Top-Five Supercomputers in November 2020 (http://www.top500.org)

| Rank<br>(previous) \$ | Rmax<br>Rpeak +<br>(PFLOPS) | Name \$              | Model ≑                    | CPU cores \$                         | Accelerator<br>(e.g. GPU) \$<br>cores | Interconnect \$            | Manufacturer + |  |
|-----------------------|-----------------------------|----------------------|----------------------------|--------------------------------------|---------------------------------------|----------------------------|----------------|--|
| 1                     | 442.010<br>537.212          | Fugaku               | Supercomputer<br>Fugaku    | 158,976 × 48<br>A64FX<br>@2.2 GHz    | 0                                     | Tofu<br>interconnect D     | Fujitsu        |  |
| 2 <b>(</b> 1)         | 148.600<br>200.795          | Summit               | IBM Power System<br>AC922  | 9,216 × 22<br>POWER9<br>@3.07 GHz    | 27,648 × 80<br>Tesla V100             | InfiniBand EDR             | IBM            |  |
| 3▼ (2)                | 94.640<br>125.7 <b>1</b> 2  | Sierra               | IBM Power System<br>S922LC | 8,640 × 22<br>POWER9<br>@3.1 GHz     | 17,280 × 80<br>Tesla V100             | InfiniBand EDR             | IBM            |  |
| 4▼ (3)                | 93.015<br>125.436           | Sunway<br>TaihuLight | Sunway MPP                 | 40,960 × 260<br>SW26010<br>@1.45 GHz | 0                                     | Sunway <sup>[26]</sup>     | NRCPC          |  |
| 5▲ (7)                | 63.460<br>79.215            | Selene               | Nvidia                     | 1,120 × 64<br>Epyc 7742<br>@2.25 GHz | 4,480 × 108<br>Ampere A100            | Mellanox HDR<br>Infiniband | R Nvidia       |  |

Mar. 2021





#### We Need More than Sheer Performance

#### **Environmentally responsible design**

Reusable designs, parts, and material

#### **Power efficiency**

Starting publication in 2016: IEEE Transactions on Sustainable Computing













#### Peak Performance of Supercomputers



Dongarra, J., "Trends in High Performance Computing," *Computer J.*, Vol. 47, No. 4, pp. 399-403, 2004. [Dong04]

Mar. 2021



Eight Key Ideas in Computer Architecture



#### Past and Current Performance Trends



Mar. 2021



Eight Key Ideas in Computer Architecture



## Energy Consumption is Getting out of Hand



Mar. 2021



Eight Key Ideas in Computer Architecture



#### Amdahl's Law



f = fraction
 unaffected

p = speedup
 of the rest

$$s = \frac{1}{f + (1 - f)/p}$$

$$\leq \min(p, 1/f)$$

## Amdahl's System Balance Rules of Thumb

#### The need for high-capacity, high-throughput secondary (disk) memory

| Processor speed | RAM<br>size | <u> </u> | Number of disks | Disk<br>capacity | Number of disks |
|-----------------|-------------|----------|-----------------|------------------|-----------------|
| 1 GIPS          | 1 GB        | 100 MB/s | 1               | 100 GB           | 1               |
| 1 TIPS          | 1 TB        | 100 GB/s | 1000            | 100 TB           | 100             |
| 1 PIPS          | 1 PB        | 100 TB/s | 1 Million       | 100 PB           | 100 000         |
| 1 EIPS          | 1 EB        | 100 PB/s | 1 Billion       | 100 EB           | 100 Million     |

1 RAM byte for each IPS

1 I/O bit per sec for each IPS

100 disk bytes for each RAM byte G Giga

Tera

Peta

E Exa



Part IV – Errors: Informational Distortions



#### The Flynn/Johnson Classification







#### **Shared-Control Systems**



From completely shared control to totally separate controls.



#### MIMD Architectures

Control parallelism: executing several instruction streams in parallel

GMSV: Shared global memory – symmetric multiprocessors

DMSV: Shared distributed memory – asymmetric multiprocessors

DMMP: Message passing – multicomputers





Centralized shared memory

Distributed memory

Mar. 2021



Eight Key Ideas in Computer Architecture



## Implementing Symmetric Multiprocessors



Very wide, high-bandwidth bus

Structure of a generic bus-based symmetric multiprocessor.





#### Interconnection Networks



Examples of direct and indirect interconnection networks.



# Direct Interconnection Networks



(a) 2D torus



(b) 4D hypercube



(c) Chordal ring



(d) Ring of rings

A sampling of common direct interconnection networks.

Only routers are shown; a computing node is implicit for each router.

Mar. 2021



Eight Key Ideas in Computer Architecture



## Design Space for Superscalar Pipelines

Front end: In-order or out-of-order

Instr. issue: In-order or out-of-order

Writeback: In-order or out-of-order

Commit: In-order or out-of-order

The more OoO stages, the higher the complexity

External Interface

Example of complexity due to out-of-order processing:
MIPS R10000

Control Logic

Source: Ahi, A. et al., "MIPS R10000 Superscalar Microprocessor,"

Proc. Hot Chips Conf., 1995.







Ext

#### Instruction-Level Parallelism





Available instruction-level parallelism and the speedup due to multiple instruction issue in superscalar processors [John91].



#### **Speculative Loads**





(a) Control speculation

(b) Data speculation

Examples of software speculation in IA-64.





#### Value Prediction



Value prediction for multiplication or division via a memo table.





#### Graphic Processors, Network Processors, ...



Simplified block diagram of Toaster2, Cisco Systems' network processor.





## Computing in the Cloud

Computational resources, both hardware and software, are provided by, and managed within, the cloud

Users pay a fee for access

Managing / upgrading is much more efficient in large, centralized facilities (warehouse-sized data centers or server farms)



This is a natural continuation of the outsourcing trend for special services, so that companies can focus their energies on their main business



