Generation with cuBLAS not deterministic for long prompts #1340
Comments
I accidentally opened this issue prematurely by pressing CTRL+Enter. I am not yet done with ensuring that everything is correct.
Everything should be in order now; sorry for the inconvenience.
Have you noticed if this also happens with smaller models (7B, 13B)?
The bug also occurs with 13B.
I think there are two possible causes for this: cuBLAS itself does not guarantee bitwise-reproducible results, and the multi-stream implementation may change the order in which operations are performed. Unless there is a bug with the multi-stream synchronization, I am not sure that we should do anything about it, unless this affects the generation quality. Note that the generation quality needs to be evaluated in an objective way, such as the perplexity.
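To illustrate the ordering point, here is a minimal standalone C++ sketch (not llama.cpp code; the values are made up for illustration) showing that accumulating the same numbers in a different order, as can happen when work is partitioned across CUDA streams or scheduled differently inside cuBLAS kernels, can give a bitwise-different floating-point result. A tiny difference in a logit can then be enough to change which token is sampled.

```cpp
// Illustration only: floating-point addition is not associative, so summing
// the same values in a different order can produce a different result.
#include <cstdio>
#include <vector>

int main() {
    std::vector<float> v = {1e8f, 1.0f, -1e8f, 0.5f, 3.14159f, -0.25f};

    // Sequential accumulation, front to back.
    float forward = 0.0f;
    for (size_t i = 0; i < v.size(); ++i) forward += v[i];

    // Same values, accumulated back to front (a stand-in for a different
    // scheduling/partitioning of the work).
    float backward = 0.0f;
    for (size_t i = v.size(); i-- > 0; ) backward += v[i];

    printf("forward  = %.9g\n", forward);
    printf("backward = %.9g\n", backward);
    printf("bitwise identical: %s\n", forward == backward ? "yes" : "no");
    return 0;
}
```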
I can confirm that the bug also occurs with 7B.
I did not do any objective measurement of generation quality. Subjectively I was not able to tell a difference in terms of quality. In any case, if cuBLAS does not guarantee reproducibility anyway, then this is probably the reason. I was simply confused because this behavior made me question whether I had accidentally introduced race conditions in #1341; perhaps a warning should be printed when the user specifies a seed in combination with cuBLAS? In any case, I agree that this would probably not be worth sacrificing performance for.
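Something along these lines could work for such a warning. This is only a sketch with a hypothetical function name (`warn_if_seed_with_cublas` does not exist in llama.cpp), assuming the cuBLAS build defines a preprocessor symbol such as GGML_USE_CUBLAS and that a non-negative seed means a fixed seed was requested.

```cpp
// Hypothetical sketch, not existing llama.cpp code: warn when a fixed seed is
// requested in a cuBLAS build, since results may not be bitwise reproducible.
#include <cstdint>
#include <cstdio>

// Assumes GGML_USE_CUBLAS is defined when building with LLAMA_CUBLAS=1.
void warn_if_seed_with_cublas(int64_t seed) {
#ifdef GGML_USE_CUBLAS
    if (seed >= 0) { // a fixed seed was requested
        fprintf(stderr,
                "warning: a fixed seed (%lld) was specified, but generation with "
                "cuBLAS enabled is not guaranteed to be deterministic\n",
                (long long) seed);
    }
#else
    (void) seed; // nothing to warn about in CPU-only builds
#endif
}

int main() {
    warn_if_seed_with_cublas(1337); // e.g. the seed from the repro command
    return 0;
}
```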
I just ran perplexity tests for 8 CUDA streams vs. 1 stream. The perplexity of 7B q4_0 was 6.2838 for both configurations. 8 streams was 6% faster than 1 stream, with 8.66 ms / token vs. 9.20 ms / token.
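For reference, the perplexity figure quoted above is the exponential of the average negative log-likelihood over the evaluation tokens. Below is a minimal sketch of that computation, not the actual ./perplexity tool from llama.cpp, and with made-up per-token probabilities standing in for real model output.

```cpp
// Minimal sketch: perplexity = exp(mean negative log-likelihood).
// Illustration only, not the actual ./perplexity implementation.
#include <cmath>
#include <cstdio>
#include <vector>

double perplexity(const std::vector<double> & token_probs) {
    double nll = 0.0; // accumulated negative log-likelihood
    for (double p : token_probs) {
        nll += -std::log(p);
    }
    return std::exp(nll / token_probs.size());
}

int main() {
    // Hypothetical probabilities the model assigned to each correct next token.
    std::vector<double> probs = {0.21, 0.05, 0.48, 0.12, 0.33};
    printf("perplexity = %.4f\n", perplexity(probs));
    return 0;
}
```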
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
When I set a seed and repeat a generation with the exact same parameters I expect to get the exact same text again.
Current Behavior
I re-run a generation with the same seed and parameters and the generated text is not always the same between generations. It is sometimes the same, but not always.
Environment and Context
Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.
git commit: 173d0e6
Physical (or virtual) hardware you are using, e.g. for Linux:
$ lscpu
$ uname -a
Linux johannes-pc 6.3.0-1-MANJARO #1 SMP PREEMPT_DYNAMIC Mon Apr 3 10:46:56 UTC 2023 x86_64 GNU/Linux
Failure Information (for bugs)
I suspect that there is a race condition somewhere that affects the generated text, and depending on the race condition one of several outputs is produced. I only get the bug when compiling with LLAMA_CUBLAS=1. I only get the bug with a prompt that is sufficiently long (navy seals copypasta, 399 tokens) but not with a short prompt ("People die when they are killed.", 8 tokens). The number of threads does not matter. The quantization scheme does not matter.
Steps to Reproduce
make clean && LLAMA_CUBLAS=1 make
./main --model models/llama-33b-ggml-q4_0.bin --ignore-eos --n_predict 16 --ctx_size 2048 --batch_size 512 --threads 6 --seed 1337 --file navy_seals_copypasta.txt
with the file navy_seals_copypasta.txt containing the navy seals copypasta as a prompt (399 tokens).
Failure Logs
Below is a log of my console when repeatedly running the same seed and parameters. The generated outputs are, in order:
Labels: 4chan, epic win, fail, fun
Labels: 4chan, epic win, fail, fun
(thing) by Kalkin Tue Jul 10
You think this is abuse? This is how I treat people who