[Deepseek R1][v0] Porting deepseek r1 to habana_main (vllm-project#1161)
JIRA: https://jira.habana-labs.com/browse/SW-227174
Cherry-picked vllm-project#1030 and fixed conflicts after rebase.
Dependency: HabanaAI/vllm-hpu-extension#161
Verified with the following three methods:
1. Test with DeepSeek-V2 BF16 weights => Passed
2. Evaluate accuracy on deepseek-r1 with out-of-box block FP8 weights => Passed
3. Evaluate accuracy on deepseek-r1 with out-of-box block FP8 weights + INC-calibrated per-channel scales => Passed the accuracy check; performance reached the goal (numbers are in the JIRA ticket)
== Details ==
1. Test with DeepSeek-V2 BF16 weights:
```
PT_HPU_LAZY_MODE=1 python run_example_tp.py --model DeepSeek-V2-Lite --tokenizer DeepSeek-V2-Lite --osl 32
```
```
(VllmWorkerProcess pid=1039) WARNING 04-25 03:01:53 [hpu_model_runner.py:1039] Configuration: ('decode', 4, 128) was not warmed-up!
(VllmWorkerProcess pid=1038) WARNING 04-25 03:01:53 [hpu_model_runner.py:1039] Configuration: ('decode', 4, 128) was not warmed-up!
(VllmWorkerProcess pid=1041) WARNING 04-25 03:01:53 [hpu_model_runner.py:1039] Configuration: ('decode', 4, 128) was not warmed-up!
WARNING 04-25 03:01:53 [hpu_model_runner.py:1039] Configuration: ('decode', 4, 128) was not warmed-up!
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████| 4/4 [00:02<00:00, 1.57it/s, est. speed input: 12.59 toks/s, output: 50.37 toks/s]
e2e took 2.5509743690199684 seconds
====================================
Prompt: 'Hello, my name is'
Generated text: '\nI am a 20 year old student from the UK. I am currently studying for a degree in English Literature and Creative Writing at the University of East'
Ground truth: None
====================================
====================================
Prompt: '0.999 compares to 0.9 is '
Generated text: '100%\n0.9999999999999999999999999'
Ground truth: None
====================================
====================================
Prompt: 'The capital of France is'
Generated text: ' Paris, which is also the largest city in the country. The city is located on the Seine River and is known for its beautiful architecture, museums, and art'
Ground truth: None
====================================
====================================
Prompt: 'The future of AI is'
Generated text: ' in the hands of the people\nThe future of AI is in the hands of the people\nThe future of AI is in the hands of the people\nThe'
Ground truth: None
====================================
```
2. Evaluate accuracy on deepseek-r1 with out-of-box block FP8 weights (limit 256):
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9648|± |0.0115|
| | |strict-match | 5|exact_match|↑ |0.9648|± |0.0115|
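For reference, a command along these lines could produce such a gsm8k run. This is only a sketch: the exact harness flags, model path, and tensor-parallel size are assumptions, not taken from this PR.
```bash
# Assumed lm-evaluation-harness invocation; model path and TP size are placeholders.
PT_HPU_LAZY_MODE=1 lm_eval --model vllm \
  --model_args pretrained=deepseek-ai/DeepSeek-R1,tensor_parallel_size=8 \
  --tasks gsm8k --num_fewshot 5 --batch_size auto --limit 256
```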
3. Evaluate accuracy on deepseek-r1 with out-of-box block FP8 weights + INC-calibrated per-channel scales:
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9688|± |0.0109|
| | |strict-match | 5|exact_match|↑ |0.9688|± |0.0109|
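Method 3 loads the INC-calibrated scales at runtime. A minimal sketch, assuming the usual `QUANT_CONFIG` mechanism for Intel Neural Compressor on HPU; the JSON file name and the rest of the command are placeholders, not from this PR.
```bash
# Assumed: point QUANT_CONFIG at an INC quantization config that references the
# calibrated per-channel scales, then rerun the same evaluation as above.
QUANT_CONFIG=inc_quant_per_channel.json \
PT_HPU_LAZY_MODE=1 lm_eval --model vllm \
  --model_args pretrained=deepseek-ai/DeepSeek-R1,tensor_parallel_size=8 \
  --tasks gsm8k --num_fewshot 5 --batch_size auto --limit 256
```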
---------
Signed-off-by: Chendi.Xue <[email protected]>
Signed-off-by: kwisniewski98 <[email protected]>
Signed-off-by: Chendi Xue <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Co-authored-by: kwisniewski98 <[email protected]>
Co-authored-by: Youlei Yang <[email protected]>
Co-authored-by: Yi Liu <[email protected]>
Co-authored-by: Yi Liu <[email protected]>
README_GAUDI.md: 3 additions & 0 deletions
```diff
@@ -408,6 +408,9 @@ measurements for a given model. The quantization configuration is used during in
 > If you are prototyping or testing your model with FP8, you can use the `VLLM_SKIP_WARMUP=true` environment variable to disable the warmup stage, which is time-consuming.
 However, disabling this feature in production environments is not recommended, as it can lead to a significant performance decrease.
 
+> [!TIP]
+> If you are benchmarking an FP8 model with `scale_format=const`, setting `VLLM_DISABLE_MARK_SCALES_AS_CONST=true` can help speed up the warmup stage.
+
 > [!TIP]
 > When using FP8 models, you may experience timeouts caused by the long compilation time of FP8 operations. To mitigate this, set the following environment variables:
 > - `VLLM_ENGINE_ITERATION_TIMEOUT_S` - to adjust the vLLM server timeout. You can set the value in seconds, e.g., 600 equals 10 minutes.
```
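As a usage note, the documented variables compose. A minimal sketch reusing the test script from above; the timeout value is illustrative, not taken from this PR.
```bash
# Illustrative only: speed up FP8 warmup during benchmarking and extend the
# engine timeout to accommodate long FP8 compilation.
VLLM_DISABLE_MARK_SCALES_AS_CONST=true \
VLLM_ENGINE_ITERATION_TIMEOUT_S=600 \
PT_HPU_LAZY_MODE=1 python run_example_tp.py --model DeepSeek-V2-Lite --tokenizer DeepSeek-V2-Lite --osl 32
```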