#### Computer Architecture Crash course



Frédéric Haziza <daz@it.uu.se>

Department of Computer Systems Uppsala University

Summer 2009

## Conclusions

- The multicore era is already here
- cost of parallelism is dropping dramatically
  - Cache/Thread is dropping
  - Memory bandwidth/thread is dropping
- The introduction of multicore processors will have profound impact on computational software

"For high-performance computing (HPC) applications, multicore processors introduces an additional layer of complexity, which will force users to go through a phase change in how they address parallelism or risk being left behind."

[IDC presentation on HPC at International Supercomputing Conference, ICS07, Dresden, June 26-29 2007]

#### Why multiple cores/threads? (even in your laptops)

- Instruction level parallelism is running out
- Wire delay is hurting
- Power dissipation is hurting
- The memory latency/bandwidth bottleneck is becoming even worse



## Focus: Performance

Create and explore:

- 1 Parallelism
  - Instruction level parallelism (ILP)
  - In some cases: Parallelism at the program/algorithm level ("Parallel computing")

#### 2 Locality of data

- Caches
- Temporal locality
- Cache blocks/lines: Spatial locality



 Concurrency

Hardware?

#### Old Trend 1: Deeper pipelines (exploring ILP)



Why? •••• Concurrency

Hardware?

#### Old Trend 2: Wider pipelines (exploring more ILP)



 Concurrency

Hardware?

#### Old Trend 3: Deeper memory hierarchy (exploring locality of data)



## Are we hitting the wall now?



## Classical microprocessors: Whatever it takes to run ONE program fast

Bad Const 1: Aready explored nost 1.P Exploring ILP (instruction-level parallelis

- Faster clocks → Deep pipelines
- Superscalar Pipelines
- Branch Prediction
- Out-of-Order Execution
- Trace Cache
- Speculation
- Predicate Ex
- Advanced (
- Return Addre

Concurrency

## #2: Future technology





Concurrency

# #2: How much of the chip area can you reach in one cycle?



(16) OS2'09

Concurrency

### Bad News #3: Mem latency/bandwidth



## Bad News #4: Power is the limit

#### Power consumption is the bottleneck

- Cooling servers is hard
- Battery lifetime for mobile computers
- Energy is money

- Dissipated effect is proportional to
  - ~ Frequency
  - ~ Voltage<sup>2</sup>





#### 1 Why mutliple threads/cores?

- Old Trends
- Bad news
- Solutions?

#### Introducing concurrency

- Scenario
- Definitions
- Amdhal's law





 Concurrency

Hardware?

## Bad News #1: Not enough ILP $\rightarrow$ feed one CPU with instr. from many threads



Concurrency

## SimItaneous multithreading



Concurrency

Hardware?

#### Bad News #2: Wire delay → Multiple small CPUs with private L1\$



Concurrency

Hardware?

#### Bad News #2: Wire delay → Multiple small CPUs with private L1\$



Concurrency

Hardware?

#### Bad News #3: Memory latency/bandwidth → memory accesses from many threads



Thread-Level Parallellism → Memory-Level Parallelism (MLP) (23) 052'09 Computer Architecture (Crash course)

Concurrency

## Bad News #4: Power consumption $\rightarrow$ Lower the frequency $\rightarrow$ lower voltage

$$P_{dyn} = C * f * V^2 pprox$$
 area  $*$  freq  $*$  voltage $^2$ 



Concurrency

Hardware?

#### Solving all the problems (?): Exploring thread parallelism

**#1** Running out of ILP

#2 Wire delay is starting to hurt

#3 Memory is the bottleneck

**#4** Power is the limit



Concurrency

Hardware?

#### In all computers very soon...

Multicore processors, probably also with simultaneous multithreading





Several cars want to drive from point A to point B.

They can compete for space on the same road and end up either:

- following each other
- or competing for positions (and having accidents!).

Or they could drive in parallel lanes, thus arriving at about the same time without getting in each other's way.

Or they could travel different routes, using separate roads.

#### Scenario

Several cars want to drive from point A to point B.

#### Programming

They can compete for space on the same road and end up either:

- following each other
- or competing for positions (and having accidents!).

#### Programming

Or they could drive in parallel lanes,

thus arriving at about the same time without getting in each other's way.

#### Programming

Or they could travel different routes, using separate roads.

## Definitions

#### Concurrent Program

- 2<sup>+</sup> processes working together to perform a task.
- Each process is a sequential program

(= sequence of statements executed one after another)

Single thread of control vs multiple thread of control

Communication

Synchronization

#### Correctness

Wanna write a concurrent program?

- What kinds of processes?
- How many processes?
- How should they interact?

#### Correctness

Ensure that processes interaction is properly synchronized

Ensuring the critical sections of statements do not execute at the same time

Delaying a process until a given condition is true

Our focus: imperative programs and asynchronous execution 31 OS2'09 | Computer Architecture (Crash course)

## Amdhal's law

- *P* is the fraction of a calculation that can be parallelized
- (1 P) is the fraction that is sequential
  - (i.e. cannot benefit from parallelization)
- N processors



#### Example

If  $P = 90\% \Rightarrow$  max speedup of 10 no matter how large the value of N used (ie  $N \rightarrow \infty$ )

Concurrency

## Single-Processor machine



Concurrency

Hardware?

## **Memory Hierarchy**



## Why do we miss in the cache?

## miss

Touching the data for the first time

#### niss

The cache is too small

#### miss

Non-ideal cache implementation (data hash to the same cache

line)







locality



Concurrency

Hardware?

### MultiProcessor world - Taxonomy



Concurrency

Hardware?

## Shared-Memory Multiprocessors



Concurrency

## **Programming Model**



Concurrency

## **Cache coherency**







# **Memory Ordering**

The defines a per-datum order of value changes.
The defines the order of value changes for all the data.

What ordering does the memory system guarantees? "Contract" between the HW and the SW developers Without it, we can't say much about the result of a parallel execution

# What order for these threads?

A' denotes a modified value to the datum at address A



Concurrency

Hardware?

### Other possible orders?





Concurrency

Hardware?

## Memory model flavors

: Programmer's intuition : Almost Programmer's intuition : No guaranty



Hardware?

# Dekker's algorithm



Left: The read (ie, test if B==0) can bypass the store (A := 1) Right: The read (ie, test if A==0) can bypass the store (B := 1)  $\Rightarrow$  Both loads can be performed before any of the stores  $\Rightarrow$  Yes, it is possible that both win!

Hardware?

# Dekker's algorithm for Total Store Order



Membar: the read is started after all previous stores have been "globally ordered"

- $\Rightarrow$  Behaves like a sequentially consistent machine
- $\Rightarrow$  No, they won't both win. Good job Mister Programmer!

Hardware?

# Dekker's algorithm, in general



So....





### New issues

- Compulsory miss
- Capacity miss
- Conflict miss



Side-effect from large cache lines

What about the compiler?

Code reordering? volatile keyword in C...





Good to know

# $\begin{array}{l} \mbox{Performance} \Rightarrow \mbox{Use of Cache} \\ \mbox{Memory hierarchy} \Rightarrow \mbox{Consistency problems} \end{array}$

#### To get maximal performance on a given machine,

the programmer has to know about the characteristics of the memory system and has to write programs to account them

Hardware?

# **Distributed Memory Architecture**



- Communication through Message Passing
- Own cache, but memory not shared ⇒ No coherency problems



Concurrency

Hardware?

# Isn't a CMP just a SMP on a chip?



Concurrency

Hardware?

### Cost of communication?



## Impact on Algorithms

For performance, we need to understand the interaction between algorithms and architecture.



We need to question old algorithms and results!

# Criteria for algorithm design

- Pre-CMP:
  - Communication is expensive: Minimize communication
  - Data locality is important
  - Maximize scalability for large-scale applications
- Within a CMP chip today:
  - (On-chip) communication is almost to free
  - Data locality is even more important
  - (SMT may help by hiding some poor locality)
  - Scalability to 2-32 threads
- In a multi-CMP system tomorrow:
  - Communication is sometimes almost free (on-chip), sometimes (very) expensive (between chips)
  - Data locality (minimizing of-chip references) is a key to efficiency
  - "Hierarchical scalability"