cublas can run now but max token size is greatly reduced. #231
Replies: 4 comments
-
There are many parameters to set - what's the batch size you are using? Is f16 enabled?
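For reference, here is a minimal sketch of where the batch size and the f16 KV-cache option are set when loading a model with go-llama. The option names (SetContext, SetGPULayers, SetNBatch, EnableF16Memory) and the model path are assumptions based on go-skynet/go-llama.cpp and may differ between versions of the bindings.

```go
package main

import (
	"fmt"

	llama "github.com/go-skynet/go-llama.cpp"
)

func main() {
	// All option names below are assumed from go-skynet/go-llama.cpp and
	// may not match every version of the bindings.
	model, err := llama.New(
		"./models/model.bin",   // hypothetical model path
		llama.SetContext(1920), // context window size
		llama.SetGPULayers(43), // layers offloaded to the GPU via cuBLAS
		llama.SetNBatch(512),   // prompt batch size; larger values use more VRAM
		llama.EnableF16Memory,  // f16 KV cache; leave this line out to keep it disabled
	)
	if err != nil {
		panic(err)
	}
	defer model.Free()
	fmt.Println("model loaded")
}
```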
-
@mudler I'm only setting mirostat = 2, temp = 0.3, ngl = 43, t = 1, ctx = 1920, n = 1920. These are the prompt parameters I use with llama.cpp, and they work there. How do I disable F16Mem?
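A rough mapping of those llama.cpp flags onto go-llama calls might look like the sketch below. The option names are assumptions taken from go-skynet/go-llama.cpp and may differ in your version. Note that F16 memory is a load-time option that is off by default, so it stays disabled as long as llama.EnableF16Memory is never passed to llama.New.

```go
package main

import (
	"fmt"

	llama "github.com/go-skynet/go-llama.cpp"
)

func main() {
	// ctx and ngl are load-time options.
	model, err := llama.New(
		"./models/model.bin",   // hypothetical model path
		llama.SetContext(1920), // ctx = 1920
		llama.SetGPULayers(43), // ngl = 43
		// no llama.EnableF16Memory here, so the f16 KV cache stays off
	)
	if err != nil {
		panic(err)
	}
	defer model.Free()

	// mirostat, temp, t and n are per-prediction options.
	out, err := model.Predict(
		"Hello",
		llama.SetMirostat(2),      // mirostat = 2
		llama.SetTemperature(0.3), // temp = 0.3
		llama.SetThreads(1),       // t = 1
		llama.SetTokens(1920),     // n = 1920
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```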
-
@mudler any help getting go-llama running with the llama.cpp settings mentioned above?
-
Bringing this discussion up to the latest question.
-
I can run it now, but I can't seem to generate the same number of tokens as I can without Go. Why?
With an RTX 4060, I can do 1920 max tokens using pure llama.cpp with 100% CUDA offload.
With go-llama, I can only use a ctx size of around 650 before hitting OOM.
@mudler do you know why? How do I fix this?
Same settings as llama.cpp, but in Go...
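One way to narrow this down is a minimal probe, assuming a CUDA allocation failure actually comes back as an error from llama.New rather than aborting the process (which may not hold for every build): step the context size down until the model loads, to see how far the 4060 gets with all 43 layers offloaded. Option names are assumed from go-skynet/go-llama.cpp and may differ between versions.

```go
package main

import (
	"fmt"

	llama "github.com/go-skynet/go-llama.cpp"
)

func main() {
	// Probe decreasing context sizes to find the largest one that loads
	// without running out of VRAM.
	for ctx := 1920; ctx >= 512; ctx -= 128 {
		model, err := llama.New(
			"./models/model.bin", // hypothetical model path
			llama.SetContext(ctx),
			llama.SetGPULayers(43),
		)
		if err != nil {
			fmt.Printf("ctx %d: load failed: %v\n", ctx, err)
			continue
		}
		fmt.Printf("ctx %d: loaded OK\n", ctx)
		model.Free()
		break
	}
}
```

If the batch size question from the first reply is the culprit, lowering NBatch may also free enough VRAM to keep ctx at 1920, since the prompt batch buffers consume GPU memory on top of the KV cache.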