# Deep Learning

A course about theory & practice



## Hardware for Deep Learning

Marco Piastra

Deep Learning 2024–2025 Hardware for Deep Learning [1]

## GPU vs. CPU

The GPU resides on a separate board

An almost independent computer

GPU Board, with its own DRAM



RAM chips

Power Supplier

CPU, with ventilation

[image from https://www.researchgate.net/publication/322525660]


## GPU vs. CPU

#### Different hardware architectures

Different computing paradigms

#### A trade-off between

- fully independent cores (CPU)
- interdependent cores (GPU) with some (limited) degrees of independence





[images from http://www.nvidia.com/docs/]


## GPU vs. CPU

#### Different hardware architectures

#### Different computing paradigms

|                                     | Cores                                         | Clock<br>Speed | Memory              | Price           | Speed                                                   |
|-------------------------------------|-----------------------------------------------|----------------|---------------------|-----------------|---------------------------------------------------------|
| CPU<br>(Intel Core<br>i7-7700k)     | 10                                            | 4.3 GHz        | System<br>RAM       | \$385           | ~640 GFLOPs FP32                                        |
| GPU<br>(NVIDIA<br>RTX 3090)         | 10496                                         | 1.6 GHz        | 24 GB<br>GDDR<br>6X | \$1499          | ~35.6 TFLOPs FP32                                       |
| GPU<br>(Data Center)<br>NVIDIA A100 | 6912 CUDA,<br>432 Tensor                      | 1.5 GHz        | 40/80<br>GB<br>HBM2 | \$3/hr<br>(GCP) | ~9.7 TFLOPs FP64<br>~20 TFLOPs FP32<br>~312 TFLOPs FP16 |
| TPU<br>Google Cloud<br>TPUv3        | 2 Matrix Units<br>(MXUs) per<br>core, 4 cores | ?              | 128 GB<br>HBM       | \$8/hr<br>(GCP) | ~420 TFLOPs<br>(non-standard FP)                        |

[image http://cs231n.stanford.edu/slides/2021/lecture\_6.pdf]


## SIMT Parallelism

#### Single Instruction, Multiple Data (SIMD)

**Execution is parallel** 

All cores are executing the same instruction, in sync

Each core works on specific data



[images from https://www.sciencedirect.com/topics/computer-science/single-instruction-multiple-data]


## SIMT Parallelism

#### Single Instruction, Multiple Threads (SIMT)

**Execution** is parallel

All <u>active</u> cores are executing the same instruction, in sync

Each core works on specific data

The control system activates and deactivates cores on each <u>execution branch</u>

Moral: any computation might be performed, but divergent threads will be <u>sequentialized</u>
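The serialization of divergent threads can be illustrated with a small pure-Python simulation. This is only a sketch: `run_warp` and its `passes` counter are hypothetical names, the "lanes" are list positions rather than hardware cores, and the branch (`x * 2` for even values, `x + 1` for odd ones) is an arbitrary example.

```python
def run_warp(data):
    """Simulate one group of lockstep lanes executing:
    y = x * 2 if x is even else x + 1
    """
    lanes = len(data)
    out = [None] * lanes
    passes = 0  # how many branch paths had to be issued, one after the other

    # Every lane evaluates the branch condition on its own data item.
    is_even = [x % 2 == 0 for x in data]

    # Path 1: only lanes whose condition holds are active, the others idle.
    if any(is_even):
        passes += 1
        for lane in range(lanes):
            if is_even[lane]:
                out[lane] = data[lane] * 2

    # Path 2: the remaining lanes are activated, the first group idles.
    if not all(is_even):
        passes += 1
        for lane in range(lanes):
            if not is_even[lane]:
                out[lane] = data[lane] + 1

    return out, passes


print(run_warp([2, 4, 6, 8]))  # no divergence: ([4, 8, 12, 16], 1)
print(run_warp([1, 2, 3, 4]))  # divergence:    ([2, 4, 4, 8], 2)
```

When all lanes take the same branch, a single pass is enough; when they diverge, both paths are issued in sequence and each lane is idle during one of them.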



[images from https://www.sciencedirect.com/topics/computer-science/single-instruction-multiple-data]


## Selective parallelization

Not all parts of a program are worth executing in parallel...
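One common way to quantify this, not named on the slide, is Amdahl's law: if a fraction `p` of the running time can be spread over `n` cores, the overall speedup is at most `1 / ((1 - p) + p / n)`. The sketch below uses a hypothetical helper `amdahl_speedup`; the core count 10496 is borrowed from the RTX 3090 row of the comparison table purely as an illustration.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup when only part of a program runs in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Even with thousands of cores, the serial part dominates quickly.
for p in (0.5, 0.9, 0.99):
    print(f"parallel fraction {p:.2f}: speedup <= {amdahl_speedup(p, 10496):.1f}")
```

Under this model a program that is 90% parallelizable cannot run more than 10 times faster, regardless of how many cores are available.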



[images from http://www.nvidia.com/docs/]


## GPU Processing Cycle

#### CPU > Memory Transfer > GPU and back

The program on the CPU drives the execution:

1. All data (program + actual data) are transferred from main memory to GPU DRAM
2. The GPU kernel is launched
3. The kernel is executed in parallel on the GPU cores, using GPU DRAM
4. Results are copied back from GPU DRAM to main memory
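The shape of this cycle can be sketched in plain Python. No GPU is involved here: the "device" buffers are ordinary lists, the logical threads run one after the other, and `vector_add_kernel` and `run_on_device` are hypothetical names chosen for the illustration.

```python
def vector_add_kernel(thread_id, a, b, out):
    """Kernel body: each logical thread handles exactly one element."""
    out[thread_id] = a[thread_id] + b[thread_id]


def run_on_device(host_a, host_b):
    # Step 1: copy the inputs from main memory to (simulated) GPU DRAM.
    dev_a = list(host_a)
    dev_b = list(host_b)
    dev_out = [0] * len(host_a)

    # Steps 2-3: launch the kernel, one logical thread per element.
    # On a real GPU these threads would execute in parallel.
    for thread_id in range(len(dev_out)):
        vector_add_kernel(thread_id, dev_a, dev_b, dev_out)

    # Step 4: copy the result back from GPU DRAM to main memory.
    return list(dev_out)


print(run_on_device([1, 2, 3], [10, 20, 30]))  # [11, 22, 33]
```

The point of the sketch is the structure: the CPU-side function owns the control flow, and the kernel only ever touches device-side buffers.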



[image from https://commons.wikimedia.org/wiki/File:CUDA\_processing\_flow\_(En).PNG]


## PyTorch and GPUs

- PyTorch computations are optimized to be run on GPUs
   For the programmer, these implementation details are (mostly) transparent
   PyTorch can also run on the CPU only, but with lower performance.
- PyTorch automatically manages memory transfers to/from GPUs
   Memory transfers are very costly, due to the limited bandwidth of the PCIe bus
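A back-of-the-envelope estimate shows why transfers matter. The figures are assumptions: the PCIe bandwidth of about 32 GB/s is roughly the peak of a PCIe 4.0 x16 link in one direction (not stated in these slides), and the GPU throughput is the peak FP32 figure for the RTX 3090 from the comparison table. Real workloads reach neither peak, so the result is only indicative.

```python
PCIE_BYTES_PER_S = 32e9   # assumed: approximate PCIe 4.0 x16 peak, one direction
GPU_FLOPS = 35.6e12       # RTX 3090 peak FP32, from the comparison table
BYTES_PER_FP32 = 4


def transfer_seconds(num_values):
    """Time to move num_values FP32 numbers across the PCIe bus."""
    return num_values * BYTES_PER_FP32 / PCIE_BYTES_PER_S


def compute_seconds(num_flops):
    """Time to execute num_flops floating-point operations at peak rate."""
    return num_flops / GPU_FLOPS


n = 4096                                 # square matrices of size n x n
t_transfer = transfer_seconds(3 * n * n)  # two inputs in, one result out
t_add = compute_seconds(n * n)            # element-wise sum: n^2 operations
t_matmul = compute_seconds(2 * n ** 3)    # matrix product: about 2 n^3 operations

print(f"transfer: {t_transfer * 1e3:.2f} ms")
print(f"add:      {t_add * 1e3:.5f} ms")
print(f"matmul:   {t_matmul * 1e3:.2f} ms")
```

Under these assumptions, moving the data takes longer than the matrix product itself and several orders of magnitude longer than the element-wise sum, which is why data should stay on the GPU across as many operations as possible.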






## High Speed Interconnect

Available for large GPUs (data center)

Dedicated direct link between GPUs:

- High bandwidth, for faster data communication
- Low latency
- Scalability
- Energy efficiency





[image from https://www.cudocompute.com/blog/a-beginners-guide-to-nvidia-gpus]


## GPU Multiprocessing

Until recently, GPUs could only serve one process at a time

Now they can be partitioned among several processes



[images from https://docs.nvidia.com/deploy/mps/]


## Tensor Cores

Specialized processing units to accelerate tensor algebra operations

- Matrix Multiply-Accumulate (MMA) units
   Each MMA unit can perform a 4x4 matrix multiply-accumulate operation in a single clock cycle
- Warp schedulers
   MMA units are kept busy and the data flow is optimized
- High-speed registers and shared memory
   For storing and sharing intermediate results among threads
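For reference, the operation performed by one MMA unit is `D = A × B + C` on 4x4 matrices. The sketch below spells it out in plain Python; `mma_4x4` is a hypothetical name, and the loops only show what is computed, not how the hardware computes it.

```python
def mma_4x4(A, B, C):
    """Return D = A @ B + C for 4x4 matrices given as nested lists."""
    return [
        [C[i][j] + sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
        for i in range(4)
    ]


identity = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
B = [[4 * i + j for j in range(4)] for i in range(4)]
ones = [[1] * 4 for _ in range(4)]

# With A = identity, the result is simply B + 1 element by element.
for row in mma_4x4(identity, B, ones):
    print(row)
```

Each output element needs 4 multiply-add steps, so one 4x4 operation amounts to 4 × 4 × 4 = 64 multiply-adds, which the sequential code above performs one at a time.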




## In-Cloud TPUs

#### Tensor Processing Units (TPUs)

They are ASICs (Application-Specific Integrated Circuits) and are not for sale

As computation resources, they are only available in the cloud (at Google)

TPUs are mounted on separate boards, much like GPUs





[Image from https://cloud.google.com/blog/products/ai-machine-learning/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu]


## Systolic Parallel Processing

#### Data flow through cores

TPU architecture is optimized natively for tensor processing and not for graphics

Arithmetic Logic Units are organized in a pipeline

Tensor data are made to 'flow through' the pipeline
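The 'flow through' idea can be illustrated with a small simulation of a systolic grid computing a matrix product. This is a simplified sketch and is not claimed to match the TPU's actual design: it assumes a square `n x n` grid in which each cell keeps its own running sum, rows of `A` enter from the left and columns of `B` from the top (each delayed by one cycle per row or column), and values move one cell per cycle. `systolic_matmul` is a hypothetical name.

```python
def systolic_matmul(A, B):
    """Multiply two n x n matrices on a simulated n x n systolic grid."""
    n = len(A)
    C = [[0] * n for _ in range(n)]      # one accumulator per cell
    a_reg = [[0] * n for _ in range(n)]  # values travelling left to right
    b_reg = [[0] * n for _ in range(n)]  # values travelling top to bottom
    cycles = 3 * n - 2                   # cycles until the last cell is done

    for t in range(cycles):
        # Shift A values one cell to the right; feed row i with delay i.
        for i in range(n):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            k = t - i
            a_reg[i][0] = A[i][k] if 0 <= k < n else 0

        # Shift B values one cell down; feed column j with delay j.
        for j in range(n):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            k = t - j
            b_reg[0][j] = B[k][j] if 0 <= k < n else 0

        # Every cell multiplies the two values passing through it and
        # accumulates; on real hardware all cells do this simultaneously.
        for i in range(n):
            for j in range(n):
                C[i][j] += a_reg[i][j] * b_reg[i][j]

    return C, cycles


print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# ([[19, 22], [43, 50]], 4)
```

Each input value is read from memory once and then reused by every cell it passes through, which is the source of the efficiency the slide refers to.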



[Figure: matrix multiplication with registers and ALUs, CPU and GPU vs. TPU]

#### TPUs can be much more efficient for tensor computations



[Image from https://cloud.google.com/blog/products/ai-machine-learning/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu]
