
Commit a232ff3

orion160 and svekars authored
Added CUDA graph, Tensor Core and Core pinning explanation (#2912)
* [Docs] Update performance tuning guide

  * Added CUDA graph explanation
  * Added core pinning section
  * Added Tensor Core usage section

---------

Co-authored-by: Svetlana Karslioglu <[email protected]>
1 parent 67819bb commit a232ff3

File tree

1 file changed: +32 -0 lines changed

recipes_source/recipes/tuning_guide.py

@@ -213,6 +213,7 @@ def gelu(x):
 
 ###############################################################################
 # Typically, the following environment variables are used to set CPU affinity with the GNU OpenMP implementation. ``OMP_PROC_BIND`` specifies whether threads may be moved between processors. Setting it to CLOSE keeps OpenMP threads close to the primary thread in contiguous place partitions. ``OMP_SCHEDULE`` determines how OpenMP threads are scheduled. ``GOMP_CPU_AFFINITY`` binds threads to specific CPUs.
+# An important tuning parameter is core pinning, which prevents threads from migrating between CPUs, improving data locality and minimizing inter-core communication.
 #
 # .. code-block:: sh
 #
@@ -318,6 +319,37 @@ def gelu(x):
 # GPU specific optimizations
 # --------------------------
 
+###############################################################################
+# Enable Tensor cores
+# ~~~~~~~~~~~~~~~~~~~~~~~
+# Tensor cores are specialized hardware designed to compute matrix-matrix multiplication
+# operations, primarily used in deep learning and AI workloads. Tensor cores have
+# specific precision requirements which can be adjusted manually or via the Automatic
+# Mixed Precision API.
+#
+# In particular, tensor operations take advantage of lower-precision workloads,
+# which can be controlled via ``torch.set_float32_matmul_precision``.
+# The default setting is 'highest', which uses the full float32 data type.
+# However, PyTorch offers alternative precision settings: 'high' and 'medium'.
+# These options prioritize computational speed over numerical precision.
+
+###############################################################################
+# Use CUDA Graphs
+# ~~~~~~~~~~~~~~~~~~~~~~~
+# When using a GPU, work must first be launched from the CPU, and in some cases
+# the context switch between CPU and GPU can lead to poor resource utilization.
+# CUDA graphs are a way to keep computation within the GPU without
+# paying the extra cost of kernel launches and host synchronization.
+
+# CUDA graphs can be enabled using
+torch.compile(m, mode="reduce-overhead")
+# or
+torch.compile(m, mode="max-autotune")
+
+###############################################################################
+# Support for CUDA graphs is in development; using them can increase
+# device memory consumption, and some models might not compile.
+
 ###############################################################################
 # Enable cuDNN auto-tuner
 # ~~~~~~~~~~~~~~~~~~~~~~~
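To make the core pinning note in the first hunk concrete, here is a minimal sketch (an illustration, not part of the commit) that pins the current process to a fixed set of cores from Python itself. It uses the standard-library ``os.sched_setaffinity``, which is Linux-only and complements the OpenMP environment variables the tutorial describes; the core IDs are placeholders.

    import os

    # Pin the current process (pid 0 = self) to cores 0-3 (placeholder IDs).
    # Threads created afterwards inherit this affinity mask, so OpenMP/PyTorch
    # worker threads stay on these cores instead of migrating between CPUs.
    os.sched_setaffinity(0, {0, 1, 2, 3})

    print(os.sched_getaffinity(0))  # verify the mask, e.g. {0, 1, 2, 3}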

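The float32 matmul precision knob described in the Tensor cores section can be exercised as follows; a minimal sketch assuming PyTorch 2.x, with matrix sizes chosen arbitrarily (not part of the commit).

    import torch

    # 'highest' (the default) keeps full float32 precision; 'high' and 'medium'
    # allow faster reduced-precision paths (e.g. TF32) on tensor-core hardware.
    torch.set_float32_matmul_precision("high")

    device = "cuda" if torch.cuda.is_available() else "cpu"
    a = torch.randn(1024, 1024, device=device)
    b = torch.randn(1024, 1024, device=device)
    c = a @ b  # on supported GPUs this float32 matmul may now use tensor cores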
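And a runnable sketch of the ``torch.compile`` calls in the CUDA Graphs section; the toy model here is a hypothetical stand-in for the commit's ``m``, which the diff leaves undefined.

    import torch

    # Hypothetical stand-in for the commit's ``m``: any nn.Module works.
    m = torch.nn.Sequential(
        torch.nn.Linear(64, 64),
        torch.nn.ReLU(),
        torch.nn.Linear(64, 8),
    )

    # mode="reduce-overhead" lets torch.compile capture CUDA graphs to cut
    # kernel-launch cost; mode="max-autotune" additionally tunes kernel choices.
    compiled = torch.compile(m, mode="reduce-overhead")

    x = torch.randn(32, 64)
    out = compiled(x)  # first call compiles; later calls replay cheaply (on GPU)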