I know inference (token generation) uses less compute on Maverick than on, say, Llama 70B. Shouldn't the same apply to prompt processing?

Prompt processing does speed up going from 70B to Maverick when running CPU-only. But after adding a GPU, 70B gets a huge speed boost while Maverick actually slows down a little.
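Back-of-envelope for the compute claim, assuming Maverick activates roughly 17B of its ~400B parameters per token (it's a MoE), versus all 70B in the dense model:

```
FLOPs per token ≈ 2 × active parameters
Maverick:    2 × 17e9 ≈  34 GFLOP/token
Llama 70B:   2 × 70e9 ≈ 140 GFLOP/token   (~4× more compute per token)
```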
Machine is an EPYC 7F52 (16 cores) + 1x RTX 3090 (PCIe x16 gen3).
Maverick, CPU only:

```
CUDA_VISIBLE_DEVICES=-1 ./llama-server -m Maverick.gguf -c 16384

prompt eval time =  54376.33 ms /  1611 tokens (   33.75 ms per token,    29.63 tokens per second)
       eval time =  34414.90 ms /   310 tokens (  111.02 ms per token,     9.01 tokens per second)
```
Maverick, CPU + GPU:

```
./llama-server -m Maverick.gguf -c 16384 -ngl 49 -ot ".ffn_.*_exps.*=CPU"

prompt eval time =  71585.41 ms /  1611 tokens (   44.44 ms per token,    22.50 tokens per second)
       eval time =  10805.00 ms /   297 tokens (   36.38 ms per token,    27.49 tokens per second)
```
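As a side note on the `-ot` pattern: the intent is to keep the huge, sparsely-activated expert FFN tensors in system RAM while `-ngl 49` offloads everything else. A quick way to sanity-check what a regex like this matches, against some typical GGUF `blk.*` tensor names (the names below are illustrative, not dumped from my file):

```sh
# Feed a few representative tensor names through the same regex used in -ot.
# Only the expert FFN tensors (*_exps) should match and stay on the CPU.
printf '%s\n' \
  blk.0.ffn_gate_exps.weight \
  blk.0.ffn_down_exps.weight \
  blk.0.ffn_up_exps.weight \
  blk.0.attn_q.weight \
  blk.0.ffn_gate_shexp.weight \
  | grep -E '.ffn_.*_exps.*'
# prints only the three *_exps tensors; attention and the shared expert
# are left for -ngl to offload to the GPU
```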
Llama 3.3 70B, CPU only:

```
CUDA_VISIBLE_DEVICES=-1 ./llama-server -m Llama-3.3.gguf -c 16384

prompt eval time = 196771.44 ms /  1622 tokens (  121.31 ms per token,     8.24 tokens per second)
```
Llama 3.3 70B, CPU + GPU:

```
./llama-server -m Llama-3.3.gguf -c 16384 -ngl 20

prompt eval time =  13547.21 ms /  1617 tokens (    8.38 ms per token,   119.36 tokens per second)
```
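Putting the four runs side by side:

| Run | Prompt eval (t/s) | Generation (t/s) |
| --- | --- | --- |
| Maverick, CPU only | 29.63 | 9.01 |
| Maverick, CPU + GPU | 22.50 | 27.49 |
| Llama 3.3 70B, CPU only | 8.24 | n/a |
| Llama 3.3 70B, CPU + GPU | 119.36 | n/a |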
On Maverick, my PCIe bandwidth was basically saturated at ~14 GB/s for the whole 54 seconds of prompt eval, which is roughly the practical ceiling of gen3 x16.
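(For anyone wanting to reproduce the measurement, one way to watch the bus live is nvidia-smi's device monitor; the `t` stat group reports PCIe Rx/Tx throughput in MB/s.)

```sh
# sample PCIe Rx/Tx throughput (MB/s) once per second while the prompt is processed
nvidia-smi dmon -s t
```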
Just wondering: is this expected because Maverick is so large, do I have bad settings, or are there optimizations I'm missing?