vllm #1030
yiliu30 pushed a commit to yiliu30/vllm-fork that referenced this issue on Apr 16, 2025:
migrated from a PR to habana_main: HabanaAI#1014

For best performance, this PR is recommended to run with INC: [[SW-223553] [VLLM] Merge deepseek changes into habana_main - Habana Labs](https://jira.habana-labs.com/browse/SW-223553)

**test acc of G3**:

```bash
huggingface-cli download Yi30/inc-woq-default-pile-one-cache-408 --local-dir ./scripts/nc_workspace_measure_kvache

cat inc_quant_with_fp8kv_config.json
{
    "mode": "QUANTIZE",
    "observer": "maxabs",
    "scale_method": "maxabs_hw",
    "scale_format": "const",
    "allowlist": {
        "types": [],
        "names": []
    },
    "blocklist": {
        "types": [],
        "names": [
            "lm_head",
            "mlp\\.gate\\b",
            "block2batch_matmul"
        ]
    },
    "dump_stats_path": "./inc-woq-default-pile-one-cache-408-for-fp8-mla/inc_measure_output"
}

QUANT_CONFIG=inc_quant_with_fp8kv_config.json \
PT_HPU_LAZY_MODE=1 \
VLLM_SKIP_WARMUP=true \
PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
PT_HPU_WEIGHT_SHARING=0 \
VLLM_MLA_DISABLE_REQUANTIZATION=1 \
lm_eval --model vllm \
  --model_args "pretrained=/mnt/weka/data/pytorch/DeepSeek-R1/,tensor_parallel_size=8,distributed_executor_backend=mp,trust_remote_code=true,max_model_len=4096,use_v2_block_manager=True,dtype=bfloat16,kv_cache_dtype=fp8_inc" \
  --tasks gsm8k --num_fewshot "5" --limit "256" \
  --batch_size "8"
```

**test acc of G2**:

**convert original DeepSeek-R1** using [convert_for_g2.py](https://github.com/yangulei/vllm-fork/blob/deepseek_r1_g2/scripts/convert_for_g2.py) (this step will be removed as INC is updated).

```bash
huggingface-cli download Yi30/inc-woq-default-pile-one-cache-412-g2 --local-dir ./scripts/nc_workspace_measure_kvache

cat inc_quant_with_fp8kv_config.json
{
    "mode": "QUANTIZE",
    "observer": "maxabs",
    "scale_method": "maxabs_hw",
    "scale_format": "const",
    "allowlist": {
        "types": [],
        "names": []
    },
    "blocklist": {
        "types": [],
        "names": [
            "lm_head",
            "mlp\\.gate\\b",
            "block2batch_matmul"
        ]
    },
    "dump_stats_path": "./nc_workspace_measure_kvache/inc_measure_output"
}
```

vllm (pretrained=/mnt/weka/data/pytorch/DeepSeek-R1/,tensor_parallel_size=8,distributed_executor_backend=mp,trust_remote_code=true,max_model_len=4096,use_v2_block_manager=True,dtype=bfloat16,kv_cache_dtype=fp8_inc), gen_kwargs: (None), limit: 256.0, num_fewshot: 5, batch_size: 128

|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9492|± |0.0137|
| | |strict-match | 5|exact_match|↑ |0.9453|± |0.0142|

----------

Need to use vllm-hpu-extension: https://github.com/HabanaAI/vllm-hpu-extension/tree/dev/chendi/deepseek_r1

Status: runnable with DeepSeek-R1.

Accuracy check:
- block fp8 weight => garbage output
- BF16 weight => looks good

test scripts:

```python
from vllm import LLM, SamplingParams
import os

os.environ['VLLM_SKIP_WARMUP'] = 'true'
os.environ['PT_HPU_LAZY_MODE'] = '1'
os.environ['PT_HPU_ENABLE_LAZY_COLLECTIVES'] = 'true'
os.environ['PT_HPU_WEIGHT_SHARING'] = '0'
# os.environ['HABANA_LOGS'] = "vllm_inc_debug"
# os.environ["LOG_LEVEL_ALL"] = "3"
os.environ['VLLM_MLA_DISABLE_REQUANTIZATION'] = '1'
# os.environ["QUANT_CONFIG"] = "inc_quant_with_fp8kv_config.json"
# os.environ["LOGLEVEL"] = "DEBUG"

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

if __name__ == "__main__":
    # Create a sampling params object.
    sampling_params = SamplingParams(temperature=0.0, max_tokens=16, ignore_eos=True)

    # Create an LLM.
    model_path = "/data/models/DeepSeek-R1"
    llm = LLM(
        model=model_path,
        trust_remote_code=True,
        enforce_eager=True,
        dtype="bfloat16",
        use_v2_block_manager=True,
        max_model_len=1024,
        max_num_seqs=1,
        tensor_parallel_size=8,
        distributed_executor_backend='mp',
        gpu_memory_utilization=0.8,
        # kv_cache_dtype="fp8_inc",
        seed=2024,
    )

    # Generate texts from the prompts. The output is a list of RequestOutput objects
    # that contain the prompt, generated text, and other information.
    outputs = llm.generate(prompts, sampling_params)

    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

    if os.environ.get("QUANT_CONFIG", None) is not None:
        llm.llm_engine.model_executor.shutdown()
```

---------

Signed-off-by: Chendi.Xue <[email protected]>
Signed-off-by: kwisniewski98 <[email protected]>
Signed-off-by: Chendi Xue <[email protected]>
Co-authored-by: kwisniewski98 <[email protected]>
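Editor's note: the commented-out lines in the test script above indicate how the INC FP8 KV-cache path would be switched on. Below is a minimal sketch, not part of the referenced commit, of the same smoke test with those knobs enabled. It assumes `inc_quant_with_fp8kv_config.json` exists in the working directory, that the model path is valid on your machine, and that the HPU vLLM fork with INC support is installed.

```python
import os

# Set the HPU/INC environment before vLLM initializes, mirroring the script above.
os.environ["VLLM_SKIP_WARMUP"] = "true"
os.environ["PT_HPU_LAZY_MODE"] = "1"
os.environ["PT_HPU_ENABLE_LAZY_COLLECTIVES"] = "true"
os.environ["PT_HPU_WEIGHT_SHARING"] = "0"
os.environ["VLLM_MLA_DISABLE_REQUANTIZATION"] = "1"
# Point INC at the quantization config before the model is loaded (assumed path).
os.environ["QUANT_CONFIG"] = "inc_quant_with_fp8kv_config.json"

from vllm import LLM, SamplingParams

if __name__ == "__main__":
    llm = LLM(
        model="/data/models/DeepSeek-R1",   # adjust to your local checkpoint
        trust_remote_code=True,
        enforce_eager=True,
        dtype="bfloat16",
        use_v2_block_manager=True,
        max_model_len=1024,
        max_num_seqs=1,
        tensor_parallel_size=8,
        distributed_executor_backend="mp",
        gpu_memory_utilization=0.8,
        kv_cache_dtype="fp8_inc",           # FP8 KV cache via INC
        seed=2024,
    )
    sampling_params = SamplingParams(temperature=0.0, max_tokens=16, ignore_eos=True)
    outputs = llm.generate(["The capital of France is"], sampling_params)
    for out in outputs:
        print(f"Prompt: {out.prompt!r}, Generated text: {out.outputs[0].text!r}")
    # The original script shuts the executor down explicitly when INC is active.
    llm.llm_engine.model_executor.shutdown()
```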
yiliu30 added a commit to yiliu30/vllm-fork that referenced this issue on May 8, 2025:
JIRA: https://jira.habana-labs.com/browse/SW-227174

Cherry-picked vllm-project#1030 and fixed conflicts after rebase.

Dependency: HabanaAI/vllm-hpu-extension#161

Verified with the 3 methods below:
1. test with DeepSeek-V2 BF16 weight => Passed
2. evaluate acc on DeepSeek-R1 with out-of-box block fp8 weight => Passed
3. evaluate acc on DeepSeek-R1 with out-of-box block fp8 weight + INC calibrated per-channel scale => Passed acc check, performance reaches goal (numbers are in the JIRA ticket)

== Details ==

1. test with DeepSeek-V2 BF16 weight:

```
PT_HPU_LAZY_MODE=1 python run_example_tp.py --model DeepSeek-V2-Lite --tokenizer DeepSeek-V2-Lite --osl 32
```

```
(VllmWorkerProcess pid=1039) WARNING 04-25 03:01:53 [hpu_model_runner.py:1039] Configuration: ('decode', 4, 128) was not warmed-up!
(VllmWorkerProcess pid=1038) WARNING 04-25 03:01:53 [hpu_model_runner.py:1039] Configuration: ('decode', 4, 128) was not warmed-up!
(VllmWorkerProcess pid=1041) WARNING 04-25 03:01:53 [hpu_model_runner.py:1039] Configuration: ('decode', 4, 128) was not warmed-up!
WARNING 04-25 03:01:53 [hpu_model_runner.py:1039] Configuration: ('decode', 4, 128) was not warmed-up!
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████| 4/4 [00:02<00:00, 1.57it/s, est. speed input: 12.59 toks/s, output: 50.37 toks/s]
e2e took 2.5509743690199684 seconds
====================================
Prompt: 'Hello, my name is'
Generated text: '\nI am a 20 year old student from the UK. I am currently studying for a degree in English Literature and Creative Writing at the University of East'
Ground truth: None
====================================
====================================
Prompt: '0.999 compares to 0.9 is '
Generated text: '100%\n0.9999999999999999999999999'
Ground truth: None
====================================
====================================
Prompt: 'The capital of France is'
Generated text: ' Paris, which is also the largest city in the country. The city is located on the Seine River and is known for its beautiful architecture, museums, and art'
Ground truth: None
====================================
====================================
Prompt: 'The future of AI is'
Generated text: ' in the hands of the people\nThe future of AI is in the hands of the people\nThe future of AI is in the hands of the people\nThe'
Ground truth: None
====================================
```

2. evaluate acc on DeepSeek-R1 with out-of-box block fp8 weight (limit 256)

|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9648|± |0.0115|
| | |strict-match | 5|exact_match|↑ |0.9648|± |0.0115|

3. evaluate acc on DeepSeek-R1 with out-of-box block fp8 weight + INC calibrated per-channel scale

|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9688|± |0.0109|
| | |strict-match | 5|exact_match|↑ |0.9688|± |0.0109|

---------

Signed-off-by: Chendi.Xue <[email protected]>
Signed-off-by: kwisniewski98 <[email protected]>
Signed-off-by: Chendi Xue <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Co-authored-by: kwisniewski98 <[email protected]>
Co-authored-by: Youlei Yang <[email protected]>
Co-authored-by: Yi Liu <[email protected]>
Co-authored-by: Yi Liu <[email protected]>
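Editor's note: the accuracy tables above come from lm_eval runs, but only the DeepSeek-V2-Lite command is shown in this commit. The snippet below is a hedged convenience wrapper, not part of the commit, that drives the same lm_eval gsm8k invocation shown in the first referenced commit from Python. It assumes lm-evaluation-harness with the vllm backend is installed and that the model path and INC config file exist on your system.

```python
import os
import subprocess

# Environment taken from the lm_eval command in the first referenced commit.
env = dict(
    os.environ,
    QUANT_CONFIG="inc_quant_with_fp8kv_config.json",
    PT_HPU_LAZY_MODE="1",
    VLLM_SKIP_WARMUP="true",
    PT_HPU_ENABLE_LAZY_COLLECTIVES="true",
    PT_HPU_WEIGHT_SHARING="0",
    VLLM_MLA_DISABLE_REQUANTIZATION="1",
)

model_args = ",".join([
    "pretrained=/mnt/weka/data/pytorch/DeepSeek-R1/",
    "tensor_parallel_size=8",
    "distributed_executor_backend=mp",
    "trust_remote_code=true",
    "max_model_len=4096",
    "use_v2_block_manager=True",
    "dtype=bfloat16",
    "kv_cache_dtype=fp8_inc",
])

# Same CLI flags as the earlier lm_eval invocation; fails loudly if the run errors.
subprocess.run(
    ["lm_eval", "--model", "vllm",
     "--model_args", model_args,
     "--tasks", "gsm8k", "--num_fewshot", "5",
     "--limit", "256", "--batch_size", "8"],
    env=env,
    check=True,
)
```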