
Squeezing out faster inference on a 3090? Is CUDA_USE_TENSOR_CORES something I can compile for? #8422

Answered by ggerganov
wwoodsTM asked this question in Q&A

Try to quantize the KV cache and enable Flash Attention:

-ctk q8_0 -ctv q8_0 -fa 1

This should give you some room to offload extra layers to the GPU.
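
For reference, these options are simply appended to a normal llama.cpp command line. A minimal sketch, assuming the llama-cli binary and a single GGUF model (the model path, context size, and -ngl layer count below are placeholders to tune for your model and VRAM):

./llama-cli -m models/your-model.gguf -ngl 99 -c 8192 -ctk q8_0 -ctv q8_0 -fa 1 -p "Hello"

Quantizing the K and V cache to q8_0 roughly halves the KV-cache memory compared to the default f16, and Flash Attention avoids materializing the full attention matrix; that is where the headroom for additional -ngl layers comes from.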

Answer selected by wwoodsTM