CUDA Programming

Sherlock edited this page Mar 12, 2021 · 3 revisions

CUDA programming basics

  • Understand the hardware

    • Architecture Generations

      • P100: Pascal / sm_60
      • V100: Volta / sm_70
      • A100: Ampere / sm_80
    • CUDA Core vs. Tensor Core

  • Programming model

    • Thread
    • Block
    • Grid
    • Stream
  • Must-know functions

    • cudaMalloc() vs. cudaFree()
    • cudaMemcpy() vs. cudaMemcpyAsync()
    • cudaMemset() vs. cudaMemsetAsync()
    • cudaStreamSynchronize() vs. cudaDeviceSynchronize()
    • cudaEventRecord() vs. cudaStreamWaitEvent()
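
The programming model and the functions listed above can be tied together in one small sketch. Everything named here (`scale_kernel`, `run_scale_example`, the `CUDA_CHECK` macro, the two stream names) is hypothetical and chosen for illustration; only the `cuda*` runtime calls come from the list. The example assumes a single GPU and uses pinned host memory so the asynchronous copies can actually overlap with other work.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Illustrative error check: prints the failing call and returns -1 from the
// enclosing function. Early returns skip cleanup, which is fine for a sketch only.
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s failed: %s\n", #call,             \
                    cudaGetErrorString(err_));                    \
            return -1;                                            \
        }                                                         \
    } while (0)

// One thread per element: blockIdx, blockDim and threadIdx locate the thread in the grid.
__global__ void scale_kernel(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: returns the number of wrong elements, or -1 on a CUDA error.
int run_scale_example(int n) {
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float* h = nullptr;
    float* d = nullptr;
    cudaStream_t compute_stream, copy_stream;
    cudaEvent_t kernel_done;

    CUDA_CHECK(cudaMallocHost((void**)&h, bytes));  // pinned host memory
    CUDA_CHECK(cudaMalloc((void**)&d, bytes));
    CUDA_CHECK(cudaStreamCreate(&compute_stream));
    CUDA_CHECK(cudaStreamCreate(&copy_stream));
    CUDA_CHECK(cudaEventCreate(&kernel_done));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    // Work queued on one stream runs in issue order; the host does not wait.
    CUDA_CHECK(cudaMemsetAsync(d, 0, bytes, compute_stream));
    CUDA_CHECK(cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, compute_stream));
    int threads = 256;                          // threads per block
    int blocks = (n + threads - 1) / threads;   // blocks in the grid
    scale_kernel<<<blocks, threads, 0, compute_stream>>>(d, 2.0f, n);
    CUDA_CHECK(cudaGetLastError());

    // The event makes copy_stream wait for the kernel without blocking the host.
    CUDA_CHECK(cudaEventRecord(kernel_done, compute_stream));
    CUDA_CHECK(cudaStreamWaitEvent(copy_stream, kernel_done, 0));
    CUDA_CHECK(cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, copy_stream));
    CUDA_CHECK(cudaStreamSynchronize(copy_stream));  // wait for this stream only

    int wrong = 0;
    for (int i = 0; i < n; ++i) {
        if (h[i] != 2.0f) ++wrong;
    }

    CUDA_CHECK(cudaEventDestroy(kernel_done));
    CUDA_CHECK(cudaStreamDestroy(compute_stream));
    CUDA_CHECK(cudaStreamDestroy(copy_stream));
    CUDA_CHECK(cudaFree(d));
    CUDA_CHECK(cudaFreeHost(h));
    return wrong;
}
```

Calling `run_scale_example(1 << 20)` from `main` should return 0 on a working device. Replacing `cudaStreamSynchronize(copy_stream)` with `cudaDeviceSynchronize()` would also be correct, but it waits for every stream on the device rather than just the one whose result is needed.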

Common tricks

  • Avoid memcpy
  • Avoid unnecessary synchronization
  • Preprocess data on the CPU
  • When to use #pragma unroll?
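
One possible reading of these tricks, as a minimal sketch: a scalar is computed once on the CPU and passed as a kernel argument instead of being copied into device memory, the buffer is initialized on the device with `cudaMemset()` rather than uploaded, the only host wait is the blocking copy of the result, and `#pragma unroll` is applied to a loop whose trip count is a compile-time constant. The names `add_bias_kernel`, `run_add_bias_example`, `ITEMS_PER_THREAD` and `CUDA_CHECK` are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Illustrative error check: prints the failing call and returns -1 from the
// enclosing function. Early returns skip cleanup, which is fine for a sketch only.
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s failed: %s\n", #call,             \
                    cudaGetErrorString(err_));                    \
            return -1;                                            \
        }                                                         \
    } while (0)

// Each thread handles ITEMS_PER_THREAD elements. The trip count is known at
// compile time, so the compiler can fully unroll the loop.
template <int ITEMS_PER_THREAD>
__global__ void add_bias_kernel(float* data, float bias, int n) {
    int base = blockIdx.x * blockDim.x * ITEMS_PER_THREAD + threadIdx.x;
#pragma unroll
    for (int k = 0; k < ITEMS_PER_THREAD; ++k) {
        int i = base + k * blockDim.x;  // neighbouring threads touch neighbouring elements
        if (i < n) data[i] += bias;
    }
}

// Hypothetical helper: returns the number of wrong elements, or -1 on a CUDA error.
int run_add_bias_example(int n) {
    // Preprocess on the CPU: the mean is computed once on the host and passed
    // by value as a kernel argument, so no device buffer or memcpy is needed for it.
    const float samples[4] = {1.0f, 2.0f, 3.0f, 2.0f};
    float bias = 0.0f;
    for (float s : samples) bias += s;
    bias /= 4.0f;  // 2.0f

    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float* d = nullptr;
    CUDA_CHECK(cudaMalloc((void**)&d, bytes));
    CUDA_CHECK(cudaMemset(d, 0, bytes));  // initialize on the device, no upload

    constexpr int kItems = 4;
    int threads = 256;
    int per_block = threads * kItems;
    int blocks = (n + per_block - 1) / per_block;
    add_bias_kernel<kItems><<<blocks, threads>>>(d, bias, n);
    CUDA_CHECK(cudaGetLastError());

    // The blocking copy already waits for the kernel on the same stream,
    // so no extra cudaDeviceSynchronize() is issued before it.
    std::vector<float> h(n);
    CUDA_CHECK(cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost));

    int wrong = 0;
    for (int i = 0; i < n; ++i) {
        if (h[i] != 2.0f) ++wrong;
    }
    CUDA_CHECK(cudaFree(d));
    return wrong;
}
```

Unrolling tends to pay off when the trip count is small and fixed, as here; for loops with a runtime bound the compiler can only unroll partially, and the benefit is best confirmed by measuring.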

CUDA Kernel Examples

  • Easy: Dropout/DropGrad
  • Medium: SoftmaxCrossEntropyLoss(Grad)
  • Hard: LayerNormalization, ReduceSum, GatherGrad
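
For the "easy" case, a dropout kernel and its gradient might look like the sketch below. This is not the implementation the page refers to; it is a minimal version that assumes the cuRAND device API with a Philox generator, one random draw per element, and inverted dropout (kept values are scaled by 1 / (1 - p)). The names `dropout_kernel`, `dropout_grad_kernel`, `run_dropout_example` and `CUDA_CHECK` are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <cstdio>
#include <vector>

// Illustrative error check: prints the failing call and returns -1 from the
// enclosing function. Early returns skip cleanup, which is fine for a sketch only.
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s failed: %s\n", #call,             \
                    cudaGetErrorString(err_));                    \
            return -1;                                            \
        }                                                         \
    } while (0)

// Forward pass: drop each element with probability p and record the mask.
__global__ void dropout_kernel(const float* x, float* y, unsigned char* mask,
                               int n, float p, unsigned long long seed) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    curandStatePhilox4_32_10_t state;
    curand_init(seed, i, 0, &state);         // one subsequence per element
    bool keep = curand_uniform(&state) > p;  // curand_uniform returns a value in (0, 1]
    mask[i] = keep;
    y[i] = keep ? x[i] / (1.0f - p) : 0.0f;
}

// Backward pass: the gradient flows only through the kept elements.
__global__ void dropout_grad_kernel(const float* dy, const unsigned char* mask,
                                    float* dx, int n, float p) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dx[i] = mask[i] ? dy[i] / (1.0f - p) : 0.0f;
}

// Hypothetical helper: returns the number of inconsistent elements, or -1 on error.
int run_dropout_example(int n) {
    const float p = 0.5f;
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    std::vector<float> h_x(n, 1.0f), h_y(n), h_dx(n);
    float *d_x = nullptr, *d_y = nullptr, *d_dx = nullptr;
    unsigned char* d_mask = nullptr;
    CUDA_CHECK(cudaMalloc((void**)&d_x, bytes));
    CUDA_CHECK(cudaMalloc((void**)&d_y, bytes));
    CUDA_CHECK(cudaMalloc((void**)&d_dx, bytes));
    CUDA_CHECK(cudaMalloc((void**)&d_mask, n));
    CUDA_CHECK(cudaMemcpy(d_x, h_x.data(), bytes, cudaMemcpyHostToDevice));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    dropout_kernel<<<blocks, threads>>>(d_x, d_y, d_mask, n, p, 1234ULL);
    CUDA_CHECK(cudaGetLastError());
    // The all-ones input doubles as the upstream gradient dy.
    dropout_grad_kernel<<<blocks, threads>>>(d_x, d_mask, d_dx, n, p);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaMemcpy(h_y.data(), d_y, bytes, cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaMemcpy(h_dx.data(), d_dx, bytes, cudaMemcpyDeviceToHost));

    // With x = dy = 1 and p = 0.5, every output is 0 or 2 and dx matches y.
    int wrong = 0;
    for (int i = 0; i < n; ++i) {
        bool valid = (h_y[i] == 0.0f || h_y[i] == 2.0f) && h_dx[i] == h_y[i];
        if (!valid) ++wrong;
    }
    CUDA_CHECK(cudaFree(d_x));
    CUDA_CHECK(cudaFree(d_y));
    CUDA_CHECK(cudaFree(d_dx));
    CUDA_CHECK(cudaFree(d_mask));
    return wrong;
}
```

The kernel is "easy" because each element is independent: there is no reduction and no communication between threads, which is what separates it from the softmax, layer normalization and reduction examples further down the list.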

Debugging CUDA kernels

  • printf() works inside CUDA device code
  • Memcpy data to CPU for inspection?
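
Both debugging approaches can be combined in a few lines. In this sketch (the names `square_kernel`, `run_debug_example` and `CUDA_CHECK` are hypothetical), the device-side `printf()` is guarded so that only one thread prints, and the output buffer is copied back to the host so individual values can be inspected there.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Illustrative error check: prints the failing call and returns -1 from the
// enclosing function. Early returns skip cleanup, which is fine for a sketch only.
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s failed: %s\n", #call,             \
                    cudaGetErrorString(err_));                    \
            return -1;                                            \
        }                                                         \
    } while (0)

__global__ void square_kernel(int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i * i;
    // Guard the print: without it, every thread in the grid would emit a line.
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        printf("square_kernel: n=%d gridDim.x=%u blockDim.x=%u\n",
               n, gridDim.x, blockDim.x);
    }
}

// Hypothetical helper: returns out[probe] as seen on the host, or -1 on error.
// Assumes 0 <= probe < n.
int run_debug_example(int n, int probe) {
    size_t bytes = static_cast<size_t>(n) * sizeof(int);
    int* d_out = nullptr;
    CUDA_CHECK(cudaMalloc((void**)&d_out, bytes));

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    square_kernel<<<blocks, threads>>>(d_out, n);
    CUDA_CHECK(cudaGetLastError());       // catches launch configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());  // flushes device printf, surfaces runtime errors

    // Copy the whole buffer back and inspect it with ordinary host code.
    std::vector<int> h_out(n);
    CUDA_CHECK(cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost));
    printf("host view: out[%d] = %d\n", probe, h_out[probe]);

    CUDA_CHECK(cudaFree(d_out));
    return h_out[probe];
}
```

Device `printf()` output is buffered and only appears at synchronization points, so a kernel that crashes before the next synchronization may show nothing; copying intermediate buffers back to the host is usually the more dependable of the two.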

Understanding IO-bound vs. compute-bound kernels
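
One way to build intuition for the difference is to time the same kernel at different arithmetic intensities. The sketch below is an illustration under stated assumptions, not a benchmark: `fma_kernel`, `time_fma_kernel`, `FMA_ITERS` and `CUDA_CHECK` are hypothetical names, and "IO" is taken to mean global memory traffic. Each element costs 8 bytes of traffic (one load, one store) and 2 × `FMA_ITERS` floating-point operations, so a small `FMA_ITERS` should be limited by memory bandwidth and a large one by arithmetic throughput.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Illustrative error check: prints the failing call and returns -1 from the
// enclosing function. Early returns skip cleanup, which is fine for a sketch only.
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s failed: %s\n", #call,             \
                    cudaGetErrorString(err_));                    \
            return -1;                                            \
        }                                                         \
    } while (0)

// Loads one float, applies FMA_ITERS dependent fused multiply-adds, stores one float.
// a and b are runtime arguments so the compiler cannot fold the loop away.
template <int FMA_ITERS>
__global__ void fma_kernel(const float* in, float* out, int n, float a, float b) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = in[i];
    for (int k = 0; k < FMA_ITERS; ++k) x = fmaf(x, a, b);
    out[i] = x;
}

// Hypothetical helper: returns the kernel time in milliseconds, or -1 on error.
template <int FMA_ITERS>
float time_fma_kernel(int n) {
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    cudaEvent_t start, stop;
    CUDA_CHECK(cudaMalloc((void**)&d_in, bytes));
    CUDA_CHECK(cudaMalloc((void**)&d_out, bytes));
    CUDA_CHECK(cudaMemset(d_in, 0, bytes));
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    fma_kernel<FMA_ITERS><<<blocks, threads>>>(d_in, d_out, n, 1.0f, 0.0f);  // warm-up
    CUDA_CHECK(cudaGetLastError());

    CUDA_CHECK(cudaEventRecord(start, 0));
    fma_kernel<FMA_ITERS><<<blocks, threads>>>(d_in, d_out, n, 1.0f, 0.0f);
    CUDA_CHECK(cudaEventRecord(stop, 0));
    CUDA_CHECK(cudaEventSynchronize(stop));
    float ms = 0.0f;
    CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));

    // Compare these figures against the GPU's peak bandwidth and peak FLOP rate.
    double gbytes = 2.0 * static_cast<double>(bytes) / 1e9;
    double gflops = 2.0 * FMA_ITERS * static_cast<double>(n) / 1e9;
    printf("FMA_ITERS=%d: %.4f ms, %.2f FLOP/byte, %.1f GB/s, %.1f GFLOP/s\n",
           FMA_ITERS, ms, 2.0 * FMA_ITERS / 8.0,
           gbytes / (ms * 1e-3), gflops / (ms * 1e-3));

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
    CUDA_CHECK(cudaFree(d_in));
    CUDA_CHECK(cudaFree(d_out));
    return ms;
}
```

Running `time_fma_kernel<1>(1 << 24)` and `time_fma_kernel<1024>(1 << 24)` side by side would be expected to show a runtime that barely moves while the kernel is bandwidth-limited and then grows roughly in proportion to `FMA_ITERS` once arithmetic dominates. The crossover point depends on the GPU, so the numbers are only meaningful relative to that device's own peak figures.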