vllm #1030

Closed
sitabulaixizawaluduo opened this issue Sep 13, 2023 · 0 comments

No description provided.

yiliu30 pushed a commit to yiliu30/vllm-fork that referenced this issue Apr 16, 2025
Migrated from a PR to habana_main:
HabanaAI#1014

For best performance, it is recommended to run this PR with INC:
[[SW-223553] [VLLM] Merge deepseek changes into habana_main - Habana Labs](https://jira.habana-labs.com/browse/SW-223553)

**Test accuracy on G3**:
```bash
huggingface-cli download Yi30/inc-woq-default-pile-one-cache-408  --local-dir ./scripts/nc_workspace_measure_kvache

cat inc_quant_with_fp8kv_config.json
{
    "mode": "QUANTIZE",
    "observer": "maxabs",
    "scale_method": "maxabs_hw",
    "scale_format": "const",
    "allowlist": {
        "types": [],
        "names": []
    },
    "blocklist": {
        "types": [],
        "names": [
            "lm_head",
            "mlp\\.gate\\b",
            "block2batch_matmul"
        ]
    },
    "dump_stats_path": "./inc-woq-default-pile-one-cache-408-for-fp8-mla/inc_measure_output"
}


QUANT_CONFIG=inc_quant_with_fp8kv_config.json \
PT_HPU_LAZY_MODE=1 \
VLLM_SKIP_WARMUP=true \
PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
PT_HPU_WEIGHT_SHARING=0 \
VLLM_MLA_DISABLE_REQUANTIZATION=1 \
lm_eval --model vllm \
  --model_args "pretrained=/mnt/weka/data/pytorch/DeepSeek-R1/,tensor_parallel_size=8,distributed_executor_backend=mp,trust_remote_code=true,max_model_len=4096,use_v2_block_manager=True,dtype=bfloat16,kv_cache_dtype=fp8_inc" \
  --tasks gsm8k --num_fewshot "5" --limit "256" \
  --batch_size "8"
```

**Test accuracy on G2**:
First, **convert the original DeepSeek-R1 checkpoint** using
[convert_for_g2.py](https://github.com/yangulei/vllm-fork/blob/deepseek_r1_g2/scripts/convert_for_g2.py)
(this step will be removed once INC is updated).
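The conversion command itself isn't shown in the commit message; a minimal sketch, where the script path and the input/output arguments are assumptions:

```bash
# Hypothetical invocation; argument names are assumptions, check the script for its actual interface.
python scripts/convert_for_g2.py \
    --src /mnt/weka/data/pytorch/DeepSeek-R1/ \
    --dst ./DeepSeek-R1-g2/
```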

```bash

huggingface-cli download Yi30/inc-woq-default-pile-one-cache-412-g2  --local-dir ./scripts/nc_workspace_measure_kvache

cat inc_quant_with_fp8kv_config.json
{
    "mode": "QUANTIZE",
    "observer": "maxabs",
    "scale_method": "maxabs_hw",
    "scale_format": "const",
    "allowlist": {
        "types": [],
        "names": []
    },
    "blocklist": {
        "types": [],
        "names": [
            "lm_head",
            "mlp\\.gate\\b",
            "block2batch_matmul"
        ]
    },
    "dump_stats_path": "./nc_workspace_measure_kvache/inc_measure_output"
}
```
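The G2 run command isn't included either; a minimal sketch, reusing the model_args from the reported lm_eval header below and carrying the environment variables over from the G3 command above (both are assumptions):

```bash
# Sketch only: env vars carried over from the G3 command; model_args taken from the reported lm_eval header below.
QUANT_CONFIG=inc_quant_with_fp8kv_config.json \
PT_HPU_LAZY_MODE=1 \
VLLM_SKIP_WARMUP=true \
PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
PT_HPU_WEIGHT_SHARING=0 \
VLLM_MLA_DISABLE_REQUANTIZATION=1 \
lm_eval --model vllm \
  --model_args "pretrained=/mnt/weka/data/pytorch/DeepSeek-R1/,tensor_parallel_size=8,distributed_executor_backend=mp,trust_remote_code=true,max_model_len=4096,use_v2_block_manager=True,dtype=bfloat16,kv_cache_dtype=fp8_inc" \
  --tasks gsm8k --num_fewshot "5" --limit "256" \
  --batch_size "128"
```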


vllm (pretrained=/mnt/weka/data/pytorch/DeepSeek-R1/,tensor_parallel_size=8,distributed_executor_backend=mp,trust_remote_code=true,max_model_len=4096,use_v2_block_manager=True,dtype=bfloat16,kv_cache_dtype=fp8_inc), gen_kwargs: (None), limit: 256.0, num_fewshot: 5, batch_size: 128

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9492|±  |0.0137|
|     |       |strict-match    |     5|exact_match|↑  |0.9453|±  |0.0142|


----------
Need to use vllm-hpu-extension:
https://github.com/HabanaAI/vllm-hpu-extension/tree/dev/chendi/deepseek_r1
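Only the branch is linked; a minimal sketch of one way to install it (the pip-from-git method is an assumption, not stated in the PR):

```bash
# One possible install of the referenced branch; the method is an assumption.
pip install "git+https://github.com/HabanaAI/vllm-hpu-extension.git@dev/chendi/deepseek_r1"
```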

Status:

- Runnable with DeepSeek-R1.
- Accuracy check with block fp8 weights => garbage output.
- Accuracy check with BF16 weights => looks good.

Test script:
```python
from vllm import LLM, SamplingParams
import os

os.environ['VLLM_SKIP_WARMUP'] = 'true'
os.environ['PT_HPU_LAZY_MODE'] = '1'
os.environ['PT_HPU_ENABLE_LAZY_COLLECTIVES']='true'
os.environ['PT_HPU_WEIGHT_SHARING']='0'
#os.environ['HABANA_LOGS']="vllm_inc_debug"
#os.environ["LOG_LEVEL_ALL"]="3"
os.environ['VLLM_MLA_DISABLE_REQUANTIZATION']='1'
#os.environ["QUANT_CONFIG"] = "inc_quant_with_fp8kv_config.json"
#os.environ["LOGLEVEL"] = "DEBUG"

 
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
 
if __name__ == "__main__":
    # Create a sampling params object.
    sampling_params = SamplingParams(temperature=0.0, max_tokens=16, ignore_eos=True)
 
    # Create an LLM.
    model_path = "/data/models/DeepSeek-R1"
 
    llm = LLM(model=model_path,
            trust_remote_code=True,
            enforce_eager=True,
            dtype="bfloat16",
            use_v2_block_manager=True,
            max_model_len=1024,
            max_num_seqs=1,
            tensor_parallel_size=8,
            distributed_executor_backend='mp',
            gpu_memory_utilization=0.8,
            #kv_cache_dtype="fp8_inc",
            seed=2024)
 
    # Generate texts from the prompts. The output is a list of RequestOutput objects
    # that contain the prompt, generated text, and other information.
    outputs = llm.generate(prompts, sampling_params)
    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
    # When an INC quant config is active, shut down the model executor explicitly on exit.
    if os.environ.get("QUANT_CONFIG", None) is not None:
        llm.llm_engine.model_executor.shutdown()
```

---------

Signed-off-by: Chendi.Xue <[email protected]>
Signed-off-by: kwisniewski98 <[email protected]>
Signed-off-by: Chendi Xue <[email protected]>
Co-authored-by: kwisniewski98 <[email protected]>
yiliu30 added a commit to yiliu30/vllm-fork that referenced this issue May 8, 2025
JIRA: https://jira.habana-labs.com/browse/SW-227174

Cherry-picked vllm-project#1030 and fixed conflicts after rebase.
Dependency: HabanaAI/vllm-hpu-extension#161

Verified with the three methods below:

1. Test with DeepSeek-V2 BF16 weights => Passed
2. Evaluate accuracy on DeepSeek-R1 with out-of-the-box block fp8 weights => Passed
3. Evaluate accuracy on DeepSeek-R1 with out-of-the-box block fp8 weights + INC-calibrated per-channel scales => Passed accuracy check; performance reaches the goal (numbers are in the JIRA ticket)

== Details ==

1. Test with DeepSeek-V2 BF16 weights:
```bash
PT_HPU_LAZY_MODE=1 python run_example_tp.py --model DeepSeek-V2-Lite --tokenizer DeepSeek-V2-Lite --osl 32 
```
```
(VllmWorkerProcess pid=1039) WARNING 04-25 03:01:53 [hpu_model_runner.py:1039] Configuration: ('decode', 4, 128) was not warmed-up!
(VllmWorkerProcess pid=1038) WARNING 04-25 03:01:53 [hpu_model_runner.py:1039] Configuration: ('decode', 4, 128) was not warmed-up!
(VllmWorkerProcess pid=1041) WARNING 04-25 03:01:53 [hpu_model_runner.py:1039] Configuration: ('decode', 4, 128) was not warmed-up!
WARNING 04-25 03:01:53 [hpu_model_runner.py:1039] Configuration: ('decode', 4, 128) was not warmed-up!
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████| 4/4 [00:02<00:00,  1.57it/s, est. speed input: 12.59 toks/s, output: 50.37 toks/s]
e2e took 2.5509743690199684 seconds
====================================
Prompt: 'Hello, my name is'
Generated text: '\nI am a 20 year old student from the UK. I am currently studying for a degree in English Literature and Creative Writing at the University of East'
Ground truth: None
====================================
====================================
Prompt: '0.999 compares to 0.9 is '
Generated text: '100%\n0.9999999999999999999999999'
Ground truth: None
====================================
====================================
Prompt: 'The capital of France is'
Generated text: ' Paris, which is also the largest city in the country. The city is located on the Seine River and is known for its beautiful architecture, museums, and art'
Ground truth: None
====================================
====================================
Prompt: 'The future of AI is'
Generated text: ' in the hands of the people\nThe future of AI is in the hands of the people\nThe future of AI is in the hands of the people\nThe'
Ground truth: None
====================================
```

2. Evaluate accuracy on DeepSeek-R1 with out-of-the-box block fp8 weights (limit 256); the exact command isn't included, but a sketch is given after the table:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9648|±  |0.0115|
|     |       |strict-match    |     5|exact_match|↑  |0.9648|±  |0.0115|
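The evaluation command for these runs isn't included in this commit message; a minimal sketch, assuming it mirrors the lm_eval invocation from the earlier commit (model path, flags, and batch size are assumptions; run 3 presumably also sets QUANT_CONFIG to point at the INC-calibrated scales):

```bash
# Sketch only: mirrors the earlier lm_eval command; paths and flags are assumptions, not from this commit.
PT_HPU_LAZY_MODE=1 \
PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
lm_eval --model vllm \
  --model_args "pretrained=/mnt/weka/data/pytorch/DeepSeek-R1/,tensor_parallel_size=8,distributed_executor_backend=mp,trust_remote_code=true,max_model_len=4096,dtype=bfloat16" \
  --tasks gsm8k --num_fewshot "5" --limit "256" \
  --batch_size "8"
```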

3. Evaluate accuracy on DeepSeek-R1 with out-of-the-box block fp8 weights + INC-calibrated per-channel scales:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9688|±  |0.0109|
|     |       |strict-match    |     5|exact_match|↑  |0.9688|±  |0.0109|

---------

Signed-off-by: Chendi.Xue <[email protected]>
Signed-off-by: kwisniewski98 <[email protected]>
Signed-off-by: Chendi Xue <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Co-authored-by: kwisniewski98 <[email protected]>
Co-authored-by: Youlei Yang <[email protected]>
Co-authored-by: Yi Liu <[email protected]>
Co-authored-by: Yi Liu <[email protected]>