
CUDA GEMM

This folder contains two projects exploring different approaches to optimizing matrix multiplication on NVIDIA GPUs:

1. CUDA GEMM with CUDA Cores

Implements matrix multiplication using traditional CUDA cores
Applies optimization techniques such as warp tiling and vectorized memory access
Achieves up to 85% of cuBLAS performance on an NVIDIA A6000 GPU
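
As a rough illustration of the vectorized memory access mentioned above, here is a minimal sketch of a shared-memory tiled SGEMM that loads and stores global memory as `float4`. It is not the project's kernel: it uses plain block tiling rather than warp tiling, and the names `sgemm_tiled_vec4`, `max_abs_error`, and the `TILE` size are hypothetical. It assumes square row-major matrices with `N` a multiple of `TILE`.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cmath>
#include <vector>

constexpr int TILE = 32;  // tile edge; an assumed value, not taken from the project

// C = A * B for row-major N x N matrices, N a multiple of TILE.
// Launch with dim3(TILE / 4, TILE) threads per block: each thread loads one
// float4 per input tile and produces four adjacent elements of C.
__global__ void sgemm_tiled_vec4(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    const int tx = threadIdx.x;                    // 0 .. TILE/4 - 1
    const int ty = threadIdx.y;                    // 0 .. TILE - 1
    const int row = blockIdx.y * TILE + ty;        // row of C owned by this thread
    const int col0 = blockIdx.x * TILE + 4 * tx;   // first of its four columns

    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};

    for (int k0 = 0; k0 < N; k0 += TILE) {
        // One 16-byte load per thread per tile (addresses are 16-byte aligned
        // because N, k0 and col0 are all multiples of 4).
        const float4 a = *reinterpret_cast<const float4*>(&A[row * N + k0 + 4 * tx]);
        const float4 b = *reinterpret_cast<const float4*>(&B[(k0 + ty) * N + col0]);
        As[ty][4 * tx + 0] = a.x; As[ty][4 * tx + 1] = a.y;
        As[ty][4 * tx + 2] = a.z; As[ty][4 * tx + 3] = a.w;
        Bs[ty][4 * tx + 0] = b.x; Bs[ty][4 * tx + 1] = b.y;
        Bs[ty][4 * tx + 2] = b.z; Bs[ty][4 * tx + 3] = b.w;
        __syncthreads();

        for (int k = 0; k < TILE; ++k) {
            const float av = As[ty][k];
            for (int j = 0; j < 4; ++j) acc[j] += av * Bs[k][4 * tx + j];
        }
        __syncthreads();  // finish reading the tiles before they are overwritten
    }

    // One 16-byte store for the four results.
    *reinterpret_cast<float4*>(&C[row * N + col0]) =
        make_float4(acc[0], acc[1], acc[2], acc[3]);
}

// Host helper: runs the kernel and returns the largest absolute difference
// from a plain CPU triple loop. Error checking of CUDA calls is omitted.
float max_abs_error(int N) {
    std::vector<float> hA(N * N), hB(N * N), hC(N * N);
    for (int i = 0; i < N * N; ++i) {
        hA[i] = static_cast<float>(i % 7) * 0.25f;
        hB[i] = static_cast<float>(i % 5) * 0.5f;
    }
    const size_t bytes = sizeof(float) * N * N;
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    const dim3 block(TILE / 4, TILE);
    const dim3 grid(N / TILE, N / TILE);
    sgemm_tiled_vec4<<<grid, block>>>(dA, dB, dC, N);
    cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);

    float err = 0.0f;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            float ref = 0.0f;
            for (int k = 0; k < N; ++k) ref += hA[i * N + k] * hB[k * N + j];
            err = std::max(err, std::fabs(ref - hC[i * N + j]));
        }
    return err;
}
```

Calling `max_abs_error(64)` from a `main` function compares the kernel against the CPU reference. Warp tiling would go further by having each warp own a sub-tile and each thread accumulate a small register block.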

2. Matrix Multiplication with Tensor Cores

Implements matrix multiplication using NVIDIA Tensor Cores
Uses inline PTX and shared-memory optimizations
Achieves up to 95% of cuBLAS performance on an NVIDIA A6000 GPU
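
To give a feel for how a Tensor Core multiply is issued, the sketch below has one warp compute a single 16x16x16 tile. It uses the CUDA WMMA C++ API from `<mma.h>` instead of the inline PTX this project uses, so treat it as a simpler stand-in rather than the project's approach. The names `wmma_tile_16x16x16` and `wmma_max_abs_error` are hypothetical, and the code needs compute capability 7.0 or newer (for example `nvcc -arch=sm_86` on an A6000).

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <mma.h>
#include <algorithm>
#include <cmath>
#include <vector>

using namespace nvcuda;

// One warp computes C (16x16, float, row-major) = A (16x16, half, row-major)
// * B (16x16, half, column-major). Launch with exactly 32 threads.
__global__ void wmma_tile_16x16x16(const half* A, const half* B, float* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, A, 16);   // leading dimension is 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // c = a * b + c
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}

// Host helper: runs the kernel on small integer inputs (exact in half) and
// returns the largest absolute difference from a CPU reference.
// Error checking of CUDA calls is omitted.
float wmma_max_abs_error() {
    const int n = 16;
    std::vector<float> fA(n * n), fB(n * n);  // logical (row, col) values
    std::vector<half> hA(n * n), hB(n * n);
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k) {
            fA[i * n + k] = static_cast<float>((i + k) % 3);
            hA[i * n + k] = __float2half(fA[i * n + k]);        // row-major
        }
    for (int k = 0; k < n; ++k)
        for (int j = 0; j < n; ++j) {
            fB[k * n + j] = static_cast<float>((k + 2 * j) % 4);
            hB[j * n + k] = __float2half(fB[k * n + j]);        // column-major
        }

    half *dA, *dB;
    float* dC;
    cudaMalloc(&dA, n * n * sizeof(half));
    cudaMalloc(&dB, n * n * sizeof(half));
    cudaMalloc(&dC, n * n * sizeof(float));
    cudaMemcpy(dA, hA.data(), n * n * sizeof(half), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), n * n * sizeof(half), cudaMemcpyHostToDevice);

    wmma_tile_16x16x16<<<1, 32>>>(dA, dB, dC);

    std::vector<float> hC(n * n);
    cudaMemcpy(hC.data(), dC, n * n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);

    float err = 0.0f;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            float ref = 0.0f;
            for (int k = 0; k < n; ++k) ref += fA[i * n + k] * fB[k * n + j];
            err = std::max(err, std::fabs(ref - hC[i * n + j]));
        }
    return err;
}
```

A full GEMM would loop this tile operation over the K dimension and stage operands through shared memory. Inline PTX `mma` instructions expose the same hardware with finer control over fragment layout and instruction scheduling.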

Both projects include performance benchmarks, optimization strategies, and references to relevant resources.
