Code Llama outputs gibberish code. #883
Comments
I am experiencing the same issue on both, running on an A10G at commit
I have the same problem. The results output by any version of CodeLlama are nonsense.
I think vLLM does not work with Code Llama. Try WizardCoder instead.
I am experiencing the same gibberish for both CodeLlama and WizardCoder using vLLM via SkyPilot. Has anyone tried turning down the temperature? I was hoping to turn it down using the OpenAI API feature but had trouble getting that working on SkyPilot.
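For anyone who wants to experiment with a lower temperature against the OpenAI-compatible server, here is a minimal sketch, assuming the legacy `openai<1.0` Python client and a server reachable at `http://localhost:8000` (adjust the base URL for a SkyPilot deployment; model name and prompt are illustrative):

```python
# Sketch: send a low-temperature completion request to a vLLM
# OpenAI-compatible server using the legacy openai<1.0 client.
import openai

openai.api_key = "EMPTY"                     # vLLM's server does not check the key
openai.api_base = "http://localhost:8000/v1" # assumed server address

completion = openai.Completion.create(
    model="codellama/CodeLlama-7b-Instruct-hf",
    prompt="<s>[INST]Write a python function that reverses a string[/INST]",
    max_tokens=256,
    temperature=0.0,  # greedy decoding; raise slightly if the output degenerates
)
print(completion.choices[0].text)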
The Hugging Face version of Code Llama has `"torch_dtype": "bfloat16"`, which is not supported by H100. I had to change this to float16 to get it working. In that setting, even with temperature 0, the outputs were gibberish. I managed to get my hands on an A100, which supports bfloat16. With bfloat16 the responses are similar to the ones we see from Meta's inference code.
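To make the dtype explicit instead of relying on the checkpoint's `torch_dtype`, vLLM's offline `LLM` API accepts a `dtype` argument. A minimal sketch, assuming a GPU with bfloat16 support; the model name and prompt are illustrative:

```python
# Sketch: force the weight dtype when loading Code Llama offline in vLLM.
# dtype="bfloat16" needs bf16-capable hardware; "float16" is the fallback
# that several comments in this thread report as producing gibberish.
from vllm import LLM, SamplingParams

llm = LLM(model="codellama/CodeLlama-7b-Instruct-hf", dtype="bfloat16")
params = SamplingParams(temperature=0.0, max_tokens=128)

# The tokenizer adds <s> itself, so the prompt omits it here.
prompt = "[INST]Write a python function that returns the first 10 numbers of the fibonacci sequence[/INST]"
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```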
Thanks to the helpful comment from @matheper in #900 about safetensors being the culprit, I got the instruct version working tonight 🎉

```
INFO 09-06 02:59:49 llm_engine.py:70] Initializing an LLM engine with config: model='codellama/CodeLlama-7b-Instruct-hf', tokenizer='codellama/CodeLlama-7b-Instruct-hf', tokenizer_mode=auto, trust_remote_code=True, dtype=torch.bfloat16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)
```

Hints:
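If the safetensors shards really are the culprit as #900 suggests, one hypothetical workaround is to download only the `.bin` weights and point vLLM at the local copy. A sketch, assuming `huggingface_hub` is installed; the model name is illustrative:

```python
# Hypothetical workaround sketch, assuming the #900 diagnosis: skip the
# *.safetensors shards so vLLM falls back to the pytorch_model-*.bin files.
from huggingface_hub import snapshot_download
from vllm import LLM

local_dir = snapshot_download(
    "codellama/CodeLlama-7b-Instruct-hf",
    ignore_patterns=["*.safetensors"],  # download only the .bin weights
)
llm = LLM(model=local_dir, dtype="bfloat16")
```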
Code Llama 13B Instruct works for me with the latest vLLM source code, like this.

Dockerfile

requirements.txt

entrypoint.sh:

```
/usr/bin/python3 -m vllm.entrypoints.openai.api_server \
    --model $MODEL_NAME \
    --host $HOST \
    --port $PORT
```

Request:

```
curl "http://$EXTERNAL_SERVER_IP:8000/v1/completions" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "codellama/CodeLlama-13b-Instruct-hf",
        "prompt": "<s>[INST]Write a python function that returns the first 10 number of the fibonacci sequence[/INST]",
        "max_tokens": 1000,
        "temperature": 0
    }' | jq -r '.choices[0].text'
```

============= Output ============
This function uses a recursive approach to calculate the Fibonacci sequence. It starts by checking if the input
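For completeness, the same request can be issued from Python instead of curl. A small sketch using `requests`; the server address is a placeholder, matching `$EXTERNAL_SERVER_IP` above:

```python
# Sketch: Python equivalent of the curl request above against the
# vLLM OpenAI-compatible completions endpoint.
import requests

resp = requests.post(
    "http://EXTERNAL_SERVER_IP:8000/v1/completions",  # placeholder address
    json={
        "model": "codellama/CodeLlama-13b-Instruct-hf",
        "prompt": "<s>[INST]Write a python function that returns the first 10 number of the fibonacci sequence[/INST]",
        "max_tokens": 1000,
        "temperature": 0,
    },
    timeout=300,
)
print(resp.json()["choices"][0]["text"])
```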
@LopezGG @the-crypt-keeper @c3-avidmych @TomExMachina AFAIK vLLM doesn't support Code Llama with float16; RoPE has precision problems there.
You can solve this by applying #870. You need to build vLLM yourself. Good luck! 😇
1. Use `VLLM_ENABLE_RUNTIME_DEQUANT=1` to run with runtime dequantize.
2. Use `VLLM_DMOE_DYNAMIC_SCALE=1` to run with dynamic dequantize + dynamic MOE.
3. Accuracy looks good, as below:

```
VLLM_DMOE_DYNAMIC_SCALE=1 python scripts/run_lm_eval.py -l 64 --batch_size 8
{"gsm8k": {"alias": "gsm8k", "exact_match,strict-match": 0.96875, "exact_match_stderr,strict-match": 0.021921011700381302, "exact_match,flexible-extract": 0.96875, "exact_match_stderr,flexible-extract": 0.021921011700381302}}
{"e2e time(secs)": 938.2986768169999}
```

Signed-off-by: Chendi Xue <[email protected]>
Test on this: https://huggingface.co/docs/transformers/main/model_doc/code_llama

Env:
- Code Llama 7b Instruct
- vLLM main branch (0.1.4, commit 4b6f069)

Generated text from transformers:

Generated text from vLLM:

Engine Log:
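For reference, a rough sketch of the transformers side of this comparison, loosely following the linked doc page; the prompt is illustrative, not the one used in the original report, and `device_map="auto"` assumes `accelerate` is installed:

```python
# Sketch: generate from Code Llama with transformers in bfloat16,
# for comparison against the vLLM output.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codellama/CodeLlama-7b-Instruct-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# The tokenizer prepends <s> on its own, so it is omitted from the prompt.
prompt = "[INST]Write a python function that returns the first 10 numbers of the fibonacci sequence[/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```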