
Squeezing out faster inference on a 3090? Is CUDA_USE_TENSOR_CORES something I can compile for? #8422

Answered by ggerganov
wwoodsTM asked this question in Q&A

Try to quantize the KV cache and enable Flash Attention:

-ctk q8_0 -ctv q8_0 -fa 1

This should give you some room to offload extra layers to the GPU.
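
For reference, these options are simply appended to a normal llama.cpp command line. A minimal sketch, assuming the llama-cli binary and a single GGUF model (the model path, context size, and -ngl layer count below are placeholders to tune for your model and VRAM):

./llama-cli -m models/your-model.gguf -ngl 99 -c 8192 -ctk q8_0 -ctv q8_0 -fa 1 -p "Hello"

Quantizing the K and V cache to q8_0 roughly halves the KV-cache memory compared to the default f16, and Flash Attention avoids materializing the full attention matrix; that is where the headroom for additional -ngl layers comes from.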

Answer selected by wwoodsTM