CUDA Programming
Sherlock edited this page Mar 12, 2021 · 3 revisions
-
Understand the hardware
-
Architecture Generations
- P100: Pascal / sm_60
- V100: Volta / sm_70
- A100: Ampere / sm_80
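
To check which generation a machine actually has, the compute capability can be queried at runtime. The following is a minimal sketch using `cudaGetDeviceProperties()`; the printed format is only illustrative, and the mapping in the comment repeats the list above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int device_count = 0;
    cudaError_t err = cudaGetDeviceCount(&device_count);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int i = 0; i < device_count; ++i) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, i);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
            return 1;
        }
        // major.minor is the compute capability:
        // 6.0 = Pascal (P100), 7.0 = Volta (V100), 8.0 = Ampere (A100).
        std::printf("device %d: %s, sm_%d%d, %d SMs\n",
                    i, prop.name, prop.major, prop.minor, prop.multiProcessorCount);
    }
    return 0;
}
```

When compiling by hand, the same number is what `nvcc` expects in its architecture flag, for example `nvcc -arch=sm_70 query.cu` for a V100 (the file name is a placeholder).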
-
CUDA Core vs. Tensor Core
-
-
Programming model
- Thread
- Block
- Grid
- Stream
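
The four terms above fit together in a single kernel launch. The sketch below is not taken from the ORT code base; the kernel name `AddOne` and the `CUDA_CHECK` macro are made up for illustration. Each thread handles one element, threads are grouped into blocks, the blocks form the grid, and all the work is queued on one stream.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(e_));                     \
            std::exit(1);                                             \
        }                                                             \
    } while (0)

// One thread per element: the global index combines block and thread ids.
__global__ void AddOne(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {  // the last block may be only partially used
        data[idx] += 1.0f;
    }
}

int main() {
    const int n = 1000;
    std::vector<float> host(n, 1.0f);
    float* dev = nullptr;
    cudaStream_t stream;

    CUDA_CHECK(cudaStreamCreate(&stream));
    CUDA_CHECK(cudaMalloc(&dev, n * sizeof(float)));
    CUDA_CHECK(cudaMemcpyAsync(dev, host.data(), n * sizeof(float),
                               cudaMemcpyHostToDevice, stream));

    const int threads_per_block = 256;
    const int blocks_per_grid = (n + threads_per_block - 1) / threads_per_block;
    // <<<grid, block, dynamic shared memory bytes, stream>>>
    AddOne<<<blocks_per_grid, threads_per_block, 0, stream>>>(dev, n);
    CUDA_CHECK(cudaGetLastError());

    CUDA_CHECK(cudaMemcpyAsync(host.data(), dev, n * sizeof(float),
                               cudaMemcpyDeviceToHost, stream));
    CUDA_CHECK(cudaStreamSynchronize(stream));  // wait for everything queued on the stream

    std::printf("host[0] = %.1f, host[%d] = %.1f\n", host[0], n - 1, host[n - 1]);  // 2.0 and 2.0

    CUDA_CHECK(cudaFree(dev));
    CUDA_CHECK(cudaStreamDestroy(stream));
    return 0;
}
```

Operations queued on the same stream run in order, which is why no synchronization is needed between the copy, the kernel, and the copy back.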
-
Must-know functions
- cudaMalloc() vs. cudaFree()
- cudaMemcpy() vs. cudaMemcpyAsync()
- cudaMemset() vs. cudaMemsetAsync()
- cudaStreamSynchronize() vs. cudaDeviceSynchronize()
- cudaEventRecord() vs. cudaStreamWaitEvent()
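
The event functions are easiest to understand next to the stream functions. The following sketch, with made-up stream names `producer` and `consumer`, fills a device buffer on one stream and copies it back on another. `cudaEventRecord()` marks a point in the first stream and `cudaStreamWaitEvent()` makes the second stream wait for that point without blocking the CPU. Pinned host memory from `cudaMallocHost()` is used here so the asynchronous copy can really run asynchronously.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(e_));                     \
            std::exit(1);                                             \
        }                                                             \
    } while (0)

int main() {
    const size_t bytes = 1 << 20;
    unsigned char* dev = nullptr;
    unsigned char* host = nullptr;
    cudaStream_t producer, consumer;
    cudaEvent_t filled;

    CUDA_CHECK(cudaStreamCreate(&producer));
    CUDA_CHECK(cudaStreamCreate(&consumer));
    CUDA_CHECK(cudaEventCreateWithFlags(&filled, cudaEventDisableTiming));
    CUDA_CHECK(cudaMalloc(&dev, bytes));
    CUDA_CHECK(cudaMallocHost(&host, bytes));  // pinned host memory

    // Stream 1: fill the buffer, then record an event after the fill.
    CUDA_CHECK(cudaMemsetAsync(dev, 7, bytes, producer));
    CUDA_CHECK(cudaEventRecord(filled, producer));

    // Stream 2: wait on the GPU side for the event, then copy back.
    CUDA_CHECK(cudaStreamWaitEvent(consumer, filled, 0));
    CUDA_CHECK(cudaMemcpyAsync(host, dev, bytes, cudaMemcpyDeviceToHost, consumer));

    // Only the stream whose result is needed is synchronized,
    // rather than the whole device with cudaDeviceSynchronize().
    CUDA_CHECK(cudaStreamSynchronize(consumer));

    bool ok = true;
    for (size_t i = 0; i < bytes; ++i) {
        if (host[i] != 7) { ok = false; break; }
    }
    std::printf("%s\n", ok ? "buffer filled as expected" : "unexpected buffer contents");

    CUDA_CHECK(cudaFreeHost(host));
    CUDA_CHECK(cudaFree(dev));
    CUDA_CHECK(cudaEventDestroy(filled));
    CUDA_CHECK(cudaStreamDestroy(consumer));
    CUDA_CHECK(cudaStreamDestroy(producer));
    return ok ? 0 : 1;
}
```

`cudaDeviceSynchronize()` would also make this program correct, but it waits for every stream on the device, which is usually more than is needed.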
- Avoid memcpy
- Avoid unnecessary Sync
- Preprocess data in CPU
- When to use #pragma unroll?
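
On the `#pragma unroll` question, one common case is a short loop whose trip count is known at compile time, for example when each thread processes a fixed number of elements. The kernel below is an illustrative sketch, not ORT code; `Scale` and `kItemsPerThread` are made-up names. Whether unrolling helps in a given kernel should be confirmed by profiling.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(e_));                     \
            std::exit(1);                                             \
        }                                                             \
    } while (0)

// Each thread handles kItemsPerThread elements, strided by blockDim.x so
// that neighbouring threads touch neighbouring addresses.
template <int kItemsPerThread>
__global__ void Scale(const float* in, float* out, float alpha, int n) {
    int base = blockIdx.x * blockDim.x * kItemsPerThread + threadIdx.x;
    // The trip count is a compile-time constant, so the compiler can
    // replace the loop with kItemsPerThread copies of the body.
#pragma unroll
    for (int i = 0; i < kItemsPerThread; ++i) {
        int idx = base + i * blockDim.x;
        if (idx < n) {
            out[idx] = alpha * in[idx];
        }
    }
}

int main() {
    const int n = 1000;
    constexpr int kItems = 4;
    const int threads = 128;
    const int blocks = (n + threads * kItems - 1) / (threads * kItems);

    std::vector<float> host_in(n), host_out(n, 0.0f);
    for (int i = 0; i < n; ++i) host_in[i] = static_cast<float>(i);

    float* dev_in = nullptr;
    float* dev_out = nullptr;
    CUDA_CHECK(cudaMalloc(&dev_in, n * sizeof(float)));
    CUDA_CHECK(cudaMalloc(&dev_out, n * sizeof(float)));
    CUDA_CHECK(cudaMemcpy(dev_in, host_in.data(), n * sizeof(float), cudaMemcpyHostToDevice));

    Scale<kItems><<<blocks, threads>>>(dev_in, dev_out, 2.0f, n);
    CUDA_CHECK(cudaGetLastError());

    // cudaMemcpy on the default stream waits for the kernel, so no extra sync is needed.
    CUDA_CHECK(cudaMemcpy(host_out.data(), dev_out, n * sizeof(float), cudaMemcpyDeviceToHost));

    bool ok = true;
    for (int i = 0; i < n; ++i) {
        if (host_out[i] != 2.0f * host_in[i]) { ok = false; break; }
    }
    std::printf("%s\n", ok ? "ok" : "mismatch");

    CUDA_CHECK(cudaFree(dev_in));
    CUDA_CHECK(cudaFree(dev_out));
    return ok ? 0 : 1;
}
```

Loops with large or data-dependent trip counts are weaker candidates, since unrolling them mostly increases code size and register use.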
- Easy: Dropout/DropGrad
- Medium: SoftmaxCrossEntropyLoss(Grad)
- Hard: LayerNormalization, ReduceSum, GatherGrad
- printf() works inside CUDA kernels
- Memcpy data to CPU for inspection?
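
Both debugging approaches can be sketched in a few lines. The kernel name `Square` below is made up for illustration. Device-side `printf()` output is buffered and is flushed at synchronization points such as `cudaDeviceSynchronize()`, so a sync is needed before the program exits. It is usually worth limiting the print to a few threads to keep the output readable.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(e_));                     \
            std::exit(1);                                             \
        }                                                             \
    } while (0)

__global__ void Square(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        data[idx] = data[idx] * data[idx];
        // Option 1: print from the device, restricted to the first few threads.
        if (idx < 4) {
            printf("[device] idx=%d value=%f\n", idx, data[idx]);
        }
    }
}

int main() {
    const int n = 256;
    std::vector<float> host(n);
    for (int i = 0; i < n; ++i) host[i] = static_cast<float>(i);

    float* dev = nullptr;
    CUDA_CHECK(cudaMalloc(&dev, n * sizeof(float)));
    CUDA_CHECK(cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice));

    Square<<<(n + 127) / 128, 128>>>(dev, n);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());  // also flushes the device printf buffer

    // Option 2: copy the buffer back and inspect it on the CPU.
    CUDA_CHECK(cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost));
    for (int i = 0; i < 4; ++i) {
        std::printf("[host]   idx=%d value=%f\n", i, host[i]);
    }

    CUDA_CHECK(cudaFree(dev));
    return 0;
}
```

Copying to the CPU costs a transfer and a synchronization, so it is best kept to debugging builds.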
Please use the learning roadmap on the home wiki page to build a general understanding of ORT.