Use F16 for memory_k and memory_v (as suggested in #146) #154

ty-everett · 2023-03-15T06:16:30Z

As suggested in #146, we can save a lot of memory by using float16 instead of float32 for memory_k and memory_v. I implemented the suggested changes and tested them with the 7B and 13B models; there were no issues on my Intel-based MacBook Pro.

Merging these changes should let more models run well on a wider range of hardware.

Green-Sky · 2023-03-15T14:15:04Z

can confirm ggml ctx size 4529.34 MB -> 4273.34 MB
speed stayed the same.
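
The reported drop can be checked with a little arithmetic. The sketch below is a rough estimate, not the project's code. It assumes the key/value memory holds n_layer * n_ctx * n_embd elements per tensor, and it assumes 7B dimensions (32 layers, embedding size 4096) with a context of 512 tokens. None of those numbers are stated in this thread, and the helper names are made up for illustration.

```python
# Sketch: size of the key/value memory for a LLaMA-style model.
# The 7B dimensions and n_ctx=512 below are assumptions, not taken from the thread.

BYTES_PER_ELEMENT = {"f32": 4, "f16": 2}


def kv_memory_bytes(n_layer, n_ctx, n_embd, dtype):
    """Bytes needed for memory_k plus memory_v at the given element type."""
    elements_per_tensor = n_layer * n_ctx * n_embd
    return 2 * elements_per_tensor * BYTES_PER_ELEMENT[dtype]  # 2 tensors: K and V


def to_mb(n_bytes):
    """Convert bytes to MB using 1 MB = 1024 * 1024 bytes."""
    return n_bytes / (1024 * 1024)


if __name__ == "__main__":
    dims = dict(n_layer=32, n_ctx=512, n_embd=4096)  # assumed 7B settings
    f32 = kv_memory_bytes(dtype="f32", **dims)
    f16 = kv_memory_bytes(dtype="f16", **dims)
    print(f"F32: {to_mb(f32):.2f} MB")
    print(f"F16: {to_mb(f16):.2f} MB")
    print(f"saved: {to_mb(f32 - f16):.2f} MB")
```

Under these assumptions the saving is 256 MB, which is the same as the difference between 4529.34 MB and 4273.34 MB. The saving grows linearly with the context size.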

it is hard to tell if the quality changes, but the prediction does (obviously).
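
A likely reason the predictions shift is that every stored key and value gets rounded to half precision. The sketch below only illustrates that rounding on individual numbers, using Python's standard `struct` module; it says nothing about the effect on output quality, and the sample values are arbitrary.

```python
import struct


def round_to_f16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def relative_error(x):
    """Relative error introduced by storing x as half precision."""
    return abs(round_to_f16(x) - x) / abs(x)


if __name__ == "__main__":
    # Arbitrary sample values, chosen only for illustration.
    for value in (0.1, 1.2345, -3.14159, 100.7):
        print(value, round_to_f16(value), relative_error(value))
```

For values in the normal half-precision range the relative error stays at or below 2^-11 (about 0.05%). Small per-element differences like this can still change which token is sampled.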

ggerganov · 2023-03-15T19:14:41Z

I was worried that it might degrade quality, but I have no evals as you can guess.
I think it is best to gate this through a command line argument. Have it F32 by default, and if requested by the user - set it to F16
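
The gating described here can be sketched as an opt-in boolean switch. The example uses Python's `argparse` only to show the shape of the logic; it is not the project's C++ argument parser, and the flag name `--memory-f16` and the helper names are hypothetical.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="KV memory type sketch")
    # Hypothetical flag name. It is off unless given, so F32 stays the default.
    parser.add_argument(
        "--memory-f16",
        action="store_true",
        help="store memory_k and memory_v as F16 instead of F32",
    )
    return parser


def memory_dtype(argv):
    """Return the element type to use for the key/value memory."""
    args = build_parser().parse_args(argv)
    return "f16" if args.memory_f16 else "f32"


if __name__ == "__main__":
    print(memory_dtype([]))                # f32
    print(memory_dtype(["--memory-f16"]))  # f16
```

Keeping F32 as the default means existing invocations behave exactly as before, and users who need the memory saving opt in explicitly.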

ggerganov

.

Green-Sky · 2023-03-16T14:52:53Z

I ran some more non-scientific tests:

7B:

30B:

Both were run with -t 4 -n 2048 --repeat_penalty 1.176 --repeat_last_n 256 --temp 0.8 --top_p 0.1 -c 2048 --color -i -r "User:" -f prompts/i_example1.txt

Green-Sky · 2023-03-18T00:00:17Z

@ty-everett are you going to write the CLI-param conditional version? If not, I will do it.

…154) (#294) * Use F16 for memory_k and memory_v * add command line switch to use f16 instead of f32 for memory k+v --------- Co-authored-by: Ty Everett <[email protected]>

Use F16 for memory_k and memory_v

1b73521

ggerganov requested changes Mar 15, 2023

setzer22 mentioned this pull request Mar 15, 2023

Good ideas from llama.cpp rustformers/llm#15

Closed

6 tasks

This was referenced Mar 18, 2023

Add requirements to readme #269

Merged

RISC-V (TH1520&D1) benchmark and hack for <1GB DDR device #288

Closed

Command line switch to use F16 for memory_k and memory_v (refactor of #154) #294

Merged

ggerganov closed this Mar 19, 2023

univerz mentioned this pull request Mar 21, 2023

Use F16 for memory_k and memory_v rustformers/llm#51

Closed

Deadsg pushed a commit to Deadsg/llama.cpp that referenced this pull request Dec 19, 2023

Merge pull request ggml-org#154 from th-neu/th-neu-dockerfile-slim

38b8eee

Slim-Bullseye based docker image

Bearsaerker mentioned this pull request Mar 12, 2025

Eval bug: Gemma 3 extremly slow prompt processing when using quantized kv cache. #12352

Open
