#### GPU Algorithms II Models and CUDA

MADALGO Summer School on Algorithms for Modern Parallel and Distributed Models

> Suresh Venkatasubramanian University of Utah







### The GPU Pipeline



- Every pixel acts like a streaming SIMD processor
  - actually, some fixed number are processed in parallel
- Fragment processor could perform simple *straight-line operations* and conditionals (no looping)
- (limited) texture memory for local storage
- Each pixel processor could do a simple reduce (add, blend)
- Computation proceeds in *passes*: output could be rendered or stored in memory for next pass.
- All computation on GPU from start of the vertex pipeline

#### Algorithms via Lower Envelopes[AKMV03]



Voronoi diagram is *lower envelope* of collection of distance functions



• Let  $f_1, \ldots f_n$  be a set of functions from  $\mathbb{R}^2 \to \mathbb{R}$ 



• Let  $f_1, \ldots f_n$  be a set of functions from  $\mathbb{R}^2 \to \mathbb{R}$ 

 $U(x) = \max_i f_i(x)$ 



• Let  $f_1, \ldots f_n$  be a set of functions from  $\mathbb{R}^2 \to \mathbb{R}$ 

 $U(x) = \max_{i} f_{i}(x)$  $L(x) = \min_{i} f_{i}(x)$ 



• Let  $f_1, \ldots f_n$  be a set of functions from  $\mathbb{R}^2 \to \mathbb{R}$ 

 $U(x) = \max_{i} f_{i}(x)$  $L(x) = \min_{i} f_{i}(x)$ E(x) = U(x) - L(x)



• Let  $f_1, \ldots f_n$  be a set of functions from  $\mathbb{R}^2 \to \mathbb{R}$ 

 $U(x) = \max_{i} f_{i}(x)$  $L(x) = \min_{i} f_{i}(x)$ E(x) = U(x) - L(x)

Compute  $(\min, \max)_x E(x), L(x), U(x)$ 













$$\Delta(P) = \max_u f(u) - f(-u)$$









 $MEB(P) = \min_{c} \max_{p} d_{p}(c)$ 



 $MEB(P) = \min_{c} \max_{p} d_{p}(c)$ 



 $MEB(P) = \min_{c} \max_{p} d_{p}(c)$ 1-median(P) = min\_{x} g(x)

Hausdorff Distance





- Penetration depth between two surfaces
- Best-fit circle, minimum width annulus



- Penetration depth between two surfaces
- Best-fit circle, minimum width annulus
- Discretize the space to get an appropriate approximation



- Penetration depth between two surfaces
- Best-fit circle, minimum width annulus
- Discretize the space to get an appropriate approximation
- Some computations happen in dual space

### **Development Support**

#### Language Support

- Brook[BFH<sup>+</sup>04]
- Cg[MGAK03]
- GLSL/HLSL[Fou, Mic]
- LibSh[MDT04]

#### **Coding Support**

- Cg debugger
- Code optimizer

GPU programming moves closer to "real" programming

- API is still delinked from hardware, and this is deliberate
- Parallelism is still partly a fiction.





• A multipass GPU computation requires crossing the CPU-GPU interface



- A multipass GPU computation requires crossing the CPU-GPU interface
- Geometry shaders allow programming the vertex pipeline



- A multipass GPU computation requires crossing the CPU-GPU interface
- Geometry shaders allow programming the vertex pipeline
- Geometry can be generated inside the vertex pipeline



### Revisiting the GPU model



# Revisiting the GPU model

| (0,0) | (0,1) | (0,2) |
|-------|-------|-------|
| (1,0) | (1,1) | (1,2) |
| (2,0) | (2,1) | (2,2) |


$$(0,0) (0,1) (0,2)$$

$$(1,0) (1,1) (1,2)$$

$$C[i,j] = \sum_{k} A[i,k] \cdot B[k,j]$$

$$(2,0) (2,1) (2,2)$$

#### SIMD execution across all elements of grid

| ID <sub>0</sub> | $ID_1$ | ID <sub>2</sub> |
|-----------------|--------|-----------------|
| ID <sub>3</sub> | $ID_4$ | $ID_5$          |
| ID <sub>6</sub> | ID7    | $ID_8$          |







A "block" of threads executing in SIMD

- Lightweight threads that run SIMD (SIMT) in "blocks"
- Blocks run in "SPMD" mode (single program, multiple data)
- Memory at multiple levels (thread, blocks, global)
- Threads are very lightweight, and there are many of them.
- Two views: programmer-centric and hardware-centric

Block



• A block is a collection of threads



- A block is a collection of threads
- A block can have different "shapes"



- A block is a collection of threads
- A block can have different "shapes"
- All threads run the same instructions and can synchronize



- A block is a collection of threads
- A block can have different "shapes"
- All threads run the same instructions and can synchronize
- Theads have local memory



- A block is a collection of threads
- A block can have different "shapes"
- All threads run the same instructions and can synchronize
- Theads have local memory (and so do blocks)



- A block is a collection of threads
- A block can have different "shapes"
- All threads run the same instructions and can synchronize
- Theads have local memory (and so do blocks)
- Block memory is low-latency and shared among threads



• A grid is a collection of blocks



- A grid is a collection of blocks
- A grid can have different shapes



- A grid is a collection of blocks
- A grid can have different shapes
- A grid of blocks is initiated by a request from the host



- A grid is a collection of blocks
- A grid can have different shapes
- A grid of blocks is initiated by a request from the host
- A grid has shared memory



- A grid is a collection of blocks
- A grid can have different shapes
- A grid of blocks is initiated by a request from the host
- A grid has shared memory
- Blocks cannot coordinate with each other and are run independently















#### Problem

Multiply two  $64 \times 64$  matrices.

```
CUDAalloc(Md, 64 \times 64) {Allocate device memory}
CUDAalloc(Nd, 64 \times 64)
CUDAalloc(Pd, 64 \times 64)
```

CUDAcopy(Md, M) {Transfer matrices to device memory} CUDAcopy(Nd, N)

Initiate Kernel

CUDAcopy(P, Pd) {Retrieve result from device}

## "Hello World": CUDA Matrix Multiplication

- Thread (*i*, *j*) will compute the dot product of row *i* of *M* and column *j* of *N*
- All threads will be in a single block of a single grid

```
(tx, ty) \leftarrow (threadIdx.x, threadIdx.y)

P \leftarrow 0 {Local thread storage}

for i = 1...64 do

P + = Md[64 \cdot ty + i] \cdot Nd[64 \cdot i + tx]

end for

Pd[tx, ty] \leftarrow P {Write to global memory}
```

- Kernel is invoked as matmult  $\langle\langle(1,1),(64,64)\rangle\rangle$
- Blocks can only allocate a maximum of 512 threads

- CUDA looks different to programmer and hardware (like MapReduce)
- Understanding the execution model helps with design of algorithms



Nickolls, Buck, Garland, Skadron, ACM Queue, Mar 2008[NBGS08]













Streaming multiprocessor

"Grid" shared memory








• Each block is assigned to a single SP



- Each block is assigned to a single SP
- Grid is a software construct



- Each block is assigned to a single SP
- Grid is a software construct
- Block memory managed by SM



SP



• Each block is divided into groups of 32 threads called "warps"



- Each block is divided into groups of 32 threads called "warps"
- Warp threads are scheduled SIMD on the processor



- Each block is divided into groups of 32 threads called "warps"
- Warp threads are scheduled SIMD on the processor
- Warps are scheduled concurrently

- Each warp consists of at most 32 threads taken from a single block
- All threads in a warp are executed in parallel with zero overhead
- In each clock cycle, a GO command is issued to all threads of warp to execute same command
- If there's branching, branches are executed sequentially non-executing threads are inactive.
- Maximize throughput by minimizing branching

#### This is Single Instruction Multiple Threads (SIMT)





















- At each clock tick, SM determines which warp is ready to execute
- This is done by "scoreboarding": hardware table that tracks
  - instructions
  - resources
  - which instructions use which registers
- Using scoreboard, SM can figure out who's ready for execution next.

# Sorting with CUDA

#### Many different implementations of sorting algorithms

- Radix sort
- Merge sort
- Quick sort
- Sample sort
- Bitonic sort
- Hybrid sorting methods

For fixed keys, radix sort is fastest













# 1 0 1 0 0 1 1 0 0 0 1 A B C D E F G H I J L

# 1 0 1 0 0 1 1 0 0 0 1 A B C D E F G H I J L

#### 0 0 0 0 0 0 0 1 1 1 1 1 B D E H I J A C F G L







For each digit

- O Construct flag vector locally and write to shared memory
- O parallel reduce on flag vector to find offsets
- Move items to correct locations in global array
- Repeat

- If all reducers in one block, easy to synchronize
- If not, need to use global memory to communicate: Expensive !
- Create multiple kernels for different levels of the reduce tree (kernel creates sync)

- Distribute reduce operations to blocks
- Factor out branches to reduce divergence penalty
- Unroll operations in reduce when possible.

Overall 1 GKeys/second, 3-4x over Larrabee

#### Level Selection and Two-Sided Tests[GKMV03]



• Let  $f_1, \ldots f_n$  be a set of functions from  $\mathbb{R}^2 \to \mathbb{R}$ 

 $U(x) = \max_{i} f_{i}(x)$  $L(x) = \min_{i} f_{i}(x)$ E(x) = U(x) - L(x)


• Let  $f_1, \ldots f_n$  be a set of functions from  $\mathbb{R}^2 \to \mathbb{R}$ 

 $U(x) = \max_{i} f_{i}(x)$   $L(x) = \min_{i} f_{i}(x)$  E(x) = U(x) - L(x)  $M_{k}(x) = \text{kth-smallest} (f_{1}(x), f_{2}(x), \dots f_{n}(x))$ 







Best-Fit(P) = min<sub>c,R</sub>  $\sum |||p - c|| - R|$ 



Best-Fit(P) = min<sub>c,R</sub>  $\sum |||p - c|| - R|$ 

Solution is to minimize over the *median* layer of an associated arrangement

```
Fragment Program: takes input x_1, x_2, \ldots x_n and k
 (lo, hi) \leftarrow \arg \min x_i, \arg \max x_i
while hi - lo > 1 do
   Pick random mid between lo, hi
   c = number of elements x such that x_{lo} \le x \le x_{mid} {two-sided
   test}
   if c \ge k then
      \texttt{hi} \leftarrow \texttt{mid}
   else
      lo \leftarrow mid
   end if
end while
 Return 10
```

# QuickSelect

- QuickSelect as a fragment program extracts the *k*<sup>th</sup> level of the arrangement.
- It uses three conditionals (one for the branching, and two for the two-sided test
- Two-sided test evaluated many times.
- Overall complexity is  $O(\log n)$  passes on average

#### Lemma

*A fragment procesor that only uses a one-sided test, or is not randomized, must take n passes.* 

Tradeoff between penalty of more conditional branching and number of passes

- To make maximum use of SIMT, minimize branching
- Memory bank conflicts have to be dealt with
- If there are too many blocks, you pay switching overhead on an SM
- Two-level model allows for flexibility: CUDA program can be adapted to different hardware configurations easily
  - (or even run on a single core machine!)

- Examples of the streaming SIMD view of the GPU
  - Lower envelope computations
  - Multipass streaming median
- The CUDA model:
  - The programmer's view
  - Matrix multiplication
  - The hardware view
  - Radix Sorting

Solving different problems using CUDA:

- Multipole methods
- Sparse Matrix Operations
- Graphs I: BFS
- Graphs II: Coloring

# Questions?

## References I

P. Agarwal, S. Krishnan, N. Mustafa, and S. Venkatasubramanian. Streaming geometric optimization using graphics hardware. *Algorithms-ESA 2003*, pages 544–555, 2003.

I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan.

Brook for gpus: stream computing on graphics hardware.

In *ACM Transactions on Graphics (TOG)*, volume 23, pages 777–786. ACM, 2004.

- OpenGL Foundation.

Opengl shading language.

http://www.opengl.org/documentation/glsl/.

 S. Guha, S. Krishnan, K. Munagala, and S. Venkatasubramanian. Application of the two-sided depth test to csg rendering. In *Proceedings of the 2003 symposium on Interactive 3D graphics*, pages 177–180. ACM, 2003.

## **References II**

- M.D. McCool and S. Du Toit. *Metaprogramming GPUs with Sh.* AK Peters Wellesley, 2004.
- D.G. Merrill and A.S. Grimshaw.

Revisiting sorting for gpgpu stream architectures.

In *Proceedings of the 19th international conference on Parallel architectures and compilation techniques*, pages 545–546. ACM, 2010.

W.R. Mark, R.S. Glanville, K. Akeley, and M.J. Kilgard.

Cg: A system for programming graphics hardware in a c-like language. In *ACM Transactions on Graphics (TOG)*, volume 22, pages 896–907. ACM, 2003.

- Microsoft.

Hlsl.

http://msdn.microsoft.com/en-us/library/windows/desktop/ bb509561(v=vs.85).aspx.

#### 

J. Nickolls, I. Buck, M. Garland, and K. Skadron. Scalable parallel programming with cuda. *Queue*, 6(2):40–53, 2008.

#### NVIDIA.

Parallel programming and computing platform | CUDA. http://www.nvidia.com/object/cuda\_home\_new.html.