CUDA Programming

Sherlock edited this page Mar 12, 2021 · 3 revisions

CUDA programming basics

Understand the hardware
- Architecture Generations
  - P100: Pascal / sm_60
  - V100: Volta / sm_70
  - A100: Ampere / sm_80
- CUDA Core vs. Tensor Core
Programming model
- Thread
- Block
- Grid
- Stream
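
To make the four terms above concrete, here is a minimal sketch of an element-wise add. Each thread computes one element, threads are grouped into blocks, the blocks form the grid, and all work is queued on one stream. The names `VectorAdd` and `AddOnGpu` are hypothetical and not from this page. The sketch assumes both inputs have the same length and omits error checking for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per element. The global index combines the block's position in
// the grid (blockIdx), the block size (blockDim), and the thread's position
// inside its block (threadIdx).
__global__ void VectorAdd(const float* a, const float* b, float* c, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) c[i] = a[i] + b[i];  // guard: the last block may be partly idle
}

// Hypothetical host helper. Assumes a.size() == b.size(); no error checks.
std::vector<float> AddOnGpu(const std::vector<float>& a,
                            const std::vector<float>& b) {
  const int n = static_cast<int>(a.size());
  std::vector<float> c(n);
  if (n == 0) return c;  // a grid with zero blocks is an invalid launch
  const size_t bytes = n * sizeof(float);

  float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
  cudaMalloc(&d_a, bytes);
  cudaMalloc(&d_b, bytes);
  cudaMalloc(&d_c, bytes);

  // A stream is an ordered queue of GPU work: copies and kernels issued to
  // the same stream run in issue order.
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  cudaMemcpyAsync(d_a, a.data(), bytes, cudaMemcpyHostToDevice, stream);
  cudaMemcpyAsync(d_b, b.data(), bytes, cudaMemcpyHostToDevice, stream);

  const int threads_per_block = 256;
  const int blocks_per_grid = (n + threads_per_block - 1) / threads_per_block;
  VectorAdd<<<blocks_per_grid, threads_per_block, 0, stream>>>(d_a, d_b, d_c, n);

  cudaMemcpyAsync(c.data(), d_c, bytes, cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);  // wait until everything queued has finished

  cudaStreamDestroy(stream);
  cudaFree(d_a);
  cudaFree(d_b);
  cudaFree(d_c);
  return c;
}
```

The block size of 256 is an arbitrary but common choice; the grid size is rounded up so every element is covered.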
Must-know functions
- cudaMalloc() vs. cudaFree()
- cudaMemcpy() vs. cudaMemcpyAsync()
- cudaMemset() vs. cudaMemsetAsync()
- cudaStreamSynchronize() vs. cudaDeviceSynchronize()
- cudaEventRecord() vs. cudaStreamWaitEvent()
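
The sketch below shows how several of the functions listed above fit together, assuming a simple two-stream setup. One stream produces data, an event marks the point where the data is ready, and a second stream waits on that event without blocking the host. `AddOne` and `SumAfterTwoStreams` are hypothetical names used only for illustration, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

__global__ void AddOne(int* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] += 1;
}

// Hypothetical helper: zero a buffer and increment it on one stream, then
// read it back on another stream. Returns the sum of the elements, which
// should equal n if the ordering is correct. Assumes n > 0.
int SumAfterTwoStreams(int n) {
  const size_t bytes = n * sizeof(int);
  int* d_data = nullptr;
  cudaMalloc(&d_data, bytes);

  cudaStream_t producer, consumer;
  cudaStreamCreate(&producer);
  cudaStreamCreate(&consumer);
  cudaEvent_t ready;
  cudaEventCreateWithFlags(&ready, cudaEventDisableTiming);

  // Async variants only enqueue work on the stream; the host does not wait.
  cudaMemsetAsync(d_data, 0, bytes, producer);
  const int threads = 256;
  AddOne<<<(n + threads - 1) / threads, threads, 0, producer>>>(d_data, n);

  // Mark "data is ready" in the producer stream ...
  cudaEventRecord(ready, producer);
  // ... and make the consumer stream wait for it on the device side.
  cudaStreamWaitEvent(consumer, ready, 0);

  std::vector<int> host(n);
  // Note: with pageable host memory this copy may block the host, but it is
  // still ordered after the event within the consumer stream.
  cudaMemcpyAsync(host.data(), d_data, bytes, cudaMemcpyDeviceToHost, consumer);

  // Wait only for the stream we need, not the whole device.
  cudaStreamSynchronize(consumer);

  cudaEventDestroy(ready);
  cudaStreamDestroy(producer);
  cudaStreamDestroy(consumer);
  cudaFree(d_data);

  int sum = 0;
  for (int v : host) sum += v;
  return sum;
}
```

`cudaDeviceSynchronize()` would also make the result safe to read, but it waits for every stream on the device, which is usually more than this kind of code needs.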

Common tricks

Avoid memcpy
Avoid unnecessary synchronization
Preprocess data on the CPU
When to use #pragma unroll?
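
As one possible answer to the `#pragma unroll` question: unrolling tends to help when the loop trip count is a small compile-time constant, so the compiler can remove the loop overhead and schedule the independent loads together. The sketch below assumes each thread handles a fixed number of elements. `kItemsPerThread`, `ScaleUnrolled`, and `ScaleOnGpu` are hypothetical names, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kItemsPerThread = 4;  // compile-time trip count (assumed value)

__global__ void ScaleUnrolled(const float* in, float* out, float alpha, int n) {
  // Each block covers blockDim.x * kItemsPerThread consecutive elements.
  // Within one iteration, neighboring threads touch neighboring addresses,
  // which keeps global memory accesses coalesced.
  int base = blockIdx.x * blockDim.x * kItemsPerThread + threadIdx.x;
#pragma unroll
  for (int k = 0; k < kItemsPerThread; ++k) {
    int i = base + k * blockDim.x;
    if (i < n) out[i] = alpha * in[i];
  }
}

// Hypothetical host helper. No error checks.
std::vector<float> ScaleOnGpu(const std::vector<float>& in, float alpha) {
  const int n = static_cast<int>(in.size());
  std::vector<float> out(n);
  if (n == 0) return out;
  const size_t bytes = n * sizeof(float);

  float *d_in = nullptr, *d_out = nullptr;
  cudaMalloc(&d_in, bytes);
  cudaMalloc(&d_out, bytes);
  cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);

  const int threads = 256;
  const int per_block = threads * kItemsPerThread;
  const int blocks = (n + per_block - 1) / per_block;
  ScaleUnrolled<<<blocks, threads>>>(d_in, d_out, alpha, n);

  // A blocking cudaMemcpy on the default stream waits for the kernel first.
  cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
  cudaFree(d_in);
  cudaFree(d_out);
  return out;
}
```

When the trip count is only known at run time, the pragma is a hint the compiler may not be able to act on fully, so it is worth measuring before and after.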

CUDA Kernel Examples

Easy: Dropout/DropGrad
Medium: SoftmaxCrossEntropyLoss(Grad)
Hard: LayerNormalization, ReduceSum, GatherGrad
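
For the "easy" case, a Dropout forward kernel can be sketched as follows. This is an illustrative version, not necessarily how any particular framework implements it. It assumes the cuRAND device API with a Philox generator, one random number per element, and a drop ratio in [0, 1). `DropoutKernel` and `DropoutOnGpu` are hypothetical names, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <vector>

// Forward pass: keep each element with probability (1 - ratio) and scale the
// kept values by 1 / (1 - ratio) so the expected value is unchanged.
__global__ void DropoutKernel(const float* x, float* y, unsigned char* mask,
                              int n, float ratio, unsigned long long seed) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;

  // Each thread uses its own subsequence of the same seeded generator.
  curandStatePhilox4_32_10_t state;
  curand_init(seed, /*subsequence=*/i, /*offset=*/0, &state);

  // curand_uniform returns a value in (0, 1].
  bool keep = curand_uniform(&state) > ratio;
  mask[i] = keep ? 1 : 0;
  y[i] = keep ? x[i] / (1.0f - ratio) : 0.0f;
}

// Hypothetical host helper. Assumes 0 <= ratio < 1. No error checks.
std::vector<float> DropoutOnGpu(const std::vector<float>& x, float ratio,
                                unsigned long long seed,
                                std::vector<unsigned char>* mask) {
  const int n = static_cast<int>(x.size());
  std::vector<float> y(n);
  mask->assign(n, 0);
  if (n == 0) return y;

  float *d_x = nullptr, *d_y = nullptr;
  unsigned char* d_mask = nullptr;
  cudaMalloc(&d_x, n * sizeof(float));
  cudaMalloc(&d_y, n * sizeof(float));
  cudaMalloc(&d_mask, n);
  cudaMemcpy(d_x, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);

  const int threads = 256;
  DropoutKernel<<<(n + threads - 1) / threads, threads>>>(d_x, d_y, d_mask, n,
                                                         ratio, seed);

  cudaMemcpy(y.data(), d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
  cudaMemcpy(mask->data(), d_mask, n, cudaMemcpyDeviceToHost);
  cudaFree(d_x);
  cudaFree(d_y);
  cudaFree(d_mask);
  return y;
}
```

The saved mask is what a DropGrad kernel would reuse: the gradient for element `i` is the incoming gradient times `mask[i] / (1 - ratio)`. The harder examples add cross-thread reductions on top of this element-wise pattern.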

Debugging CUDA kernels

printf() works inside CUDA kernel code
Copy data back to the CPU with cudaMemcpy() for inspection
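
Both debugging approaches can be combined in a small sketch like the one below. It prints from a couple of threads only, because every thread that reaches the `printf` produces a line, and then copies the full result back to the host for inspection. `Square` and `SquareAndInspect` are hypothetical names.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

__global__ void Square(const float* in, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  out[i] = in[i] * in[i];
  // Limit device-side printing to a few threads to keep the output readable.
  if (i < 2) {
    printf("block %d thread %d: in=%f out=%f\n", blockIdx.x, threadIdx.x,
           in[i], out[i]);
  }
}

// Hypothetical host helper: runs the kernel and returns the output so it can
// be examined on the CPU (in a debugger, with asserts, or by printing).
std::vector<float> SquareAndInspect(const std::vector<float>& in) {
  const int n = static_cast<int>(in.size());
  std::vector<float> out(n);
  if (n == 0) return out;
  const size_t bytes = n * sizeof(float);

  float *d_in = nullptr, *d_out = nullptr;
  cudaMalloc(&d_in, bytes);
  cudaMalloc(&d_out, bytes);
  cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);

  const int threads = 128;
  Square<<<(n + threads - 1) / threads, threads>>>(d_in, d_out, n);

  // Launch-configuration errors show up here.
  cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess) {
    fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
  }
  // Device printf output is buffered and flushed at synchronization points;
  // errors raised while the kernel runs are also reported here.
  err = cudaDeviceSynchronize();
  if (err != cudaSuccess) {
    fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
  }

  cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
  cudaFree(d_in);
  cudaFree(d_out);
  return out;
}
```

The extra `cudaDeviceSynchronize()` is useful while debugging but is the kind of unnecessary sync that should be removed again afterwards.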

Understanding IO bound and compute bound
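
One rough way to tell the two cases apart is to time a kernel and compare the memory bandwidth it achieves with the peak bandwidth of the device. If the kernel is already close to peak bandwidth it is IO (memory) bound, and more arithmetic optimization will not help much. The sketch below measures a plain copy kernel with CUDA events, assuming a single device and `n > 0`. `CopyKernel` and `MeasureCopyBandwidthGBps` are hypothetical names, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// A copy does no arithmetic at all, so it is a typical memory-bound kernel.
__global__ void CopyKernel(const float* in, float* out, size_t n) {
  size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
  if (i < n) out[i] = in[i];
}

// Hypothetical helper: returns the achieved bandwidth in GB/s for copying
// n floats (n reads plus n writes). No error checks.
double MeasureCopyBandwidthGBps(size_t n) {
  const size_t bytes = n * sizeof(float);
  float *d_in = nullptr, *d_out = nullptr;
  cudaMalloc(&d_in, bytes);
  cudaMalloc(&d_out, bytes);
  cudaMemset(d_in, 0, bytes);

  const int threads = 256;
  const unsigned int blocks =
      static_cast<unsigned int>((n + threads - 1) / threads);

  // Warm-up launch so one-time startup costs are not included in the timing.
  CopyKernel<<<blocks, threads>>>(d_in, d_out, n);

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start, 0);
  CopyKernel<<<blocks, threads>>>(d_in, d_out, n);
  cudaEventRecord(stop, 0);
  cudaEventSynchronize(stop);  // wait until the stop event has been reached

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  cudaFree(d_in);
  cudaFree(d_out);

  const double seconds = ms * 1e-3;
  return (2.0 * static_cast<double>(bytes)) / seconds / 1e9;
}
```

A single timed launch is noisy; averaging over several launches, or using a profiler, gives a more reliable number. A compute-bound kernel shows the opposite picture: achieved bandwidth well below peak while the arithmetic units are busy.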