Integrating torchao quantization into vllm #13588


Closed
jerryzh168 wants to merge 10 commits.

Conversation

jerryzh168 (Contributor) commented on Feb 20, 2025:

Summary:
As titled: initial PR that adds support for torchao as a quantization option for vLLM.

Test Plan:

pytest tests/quantization/test_torchao.py

=== process_weights_after_loading ===
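
For context, the idea in this path is to apply a torchao quantization config to each linear layer once its high-precision weights have been loaded. A minimal sketch of that idea (not the PR's actual code; the layer shape is illustrative, and group_size=128 corresponds to the int4wo-128 setting):

```python
# Minimal sketch of the idea (not the PR's exact code): quantize a linear
# layer in place with torchao after its bf16 weights have been loaded.
# The layer shape here is illustrative.
import torch
from torchao.quantization import int4_weight_only, quantize_

layer = torch.nn.Linear(4096, 4096, dtype=torch.bfloat16, device="cuda")
# ... checkpoint weights would be copied into `layer.weight` here ...
quantize_(layer, int4_weight_only(group_size=128))  # corresponds to "int4wo-128"
```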

Tested on an A100 machine.

python benchmarks/benchmark_throughput.py --input-len 256 --output-len 256 --model meta-llama/Meta-Llama-3-8B
python benchmarks/benchmark_throughput.py --input-len 256 --output-len 256 --model meta-llama/Meta-Llama-3-8B --quantization torchao --torchao-config int4wo-128

bfloat16:
Throughput: 14.23 requests/s, 7285.92 total tokens/s, 3642.96 output tokens/s

torchao int4wo-128
Throughput: 2.34 requests/s, 1197.03 total tokens/s, 598.52 output tokens/s

Note: int4wo-128 only gives a speedup at batch size 1; we expect float8 to give an overall boost, which can be tested later (rough sketch below).
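
For reference, the float8 path mentioned above would go through torchao's dynamic float8 API. A rough sketch, assuming a GPU with float8 support (the layer shape is illustrative; this configuration was not benchmarked in this PR):

```python
# Rough sketch of the float8 path mentioned above (not benchmarked here);
# float8_dynamic_activation_float8_weight is torchao's API, the layer shape
# is illustrative, and a GPU with float8 support is assumed.
import torch
from torchao.quantization import (float8_dynamic_activation_float8_weight,
                                  quantize_)

layer = torch.nn.Linear(4096, 4096, dtype=torch.bfloat16, device="cuda")
quantize_(layer, float8_dynamic_activation_float8_weight())
```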

python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model meta-llama/Meta-Llama-3-8B --batch-size 1
python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model meta-llama/Meta-Llama-3-8B --batch-size 1 --quantization torchao --torchao-config int4wo-128

bfloat16:

Avg latency: 4.563035774851839 seconds
10% percentile latency: 3.3907309625297786 seconds
25% percentile latency: 3.424301794730127 seconds
50% percentile latency: 3.727481307461858 seconds
75% percentile latency: 5.625706314109266 seconds
90% percentile latency: 5.935522455349565 seconds
99% percentile latency: 7.159032964520157 seconds

torchao int4wo-128:

Avg latency: 2.1109102573245764 seconds
10% percentile latency: 2.089452960342169 seconds
25% percentile latency: 2.092462383210659 seconds
50% percentile latency: 2.1093569844961166 seconds
75% percentile latency: 2.119327493943274 seconds
90% percentile latency: 2.132301461696625 seconds
99% percentile latency: 2.192038407512009 seconds

int8wo without compile:

Avg latency: 11.678632816672325 seconds
10% percentile latency: 11.624498645961285 seconds
25% percentile latency: 11.638677136972547 seconds
50% percentile latency: 11.690440801903605 seconds
75% percentile latency: 11.707305849529803 seconds
90% percentile latency: 11.724864188954234 seconds
99% percentile latency: 11.745748021267355 seconds

int8wo with compile (enabled by default in the PR):

Avg latency: 2.144259312748909 seconds
10% percentile latency: 2.111430574581027 seconds
25% percentile latency: 2.117820209823549 seconds
50% percentile latency: 2.1458087861537933 seconds
75% percentile latency: 2.1610238971188664 seconds
90% percentile latency: 2.175424510613084 seconds
99% percentile latency: 2.201516461186111 seconds

=== loading pre quantized model results ===

Requires pytorch/ao#1791 from torchao, or install the torchao nightly once it lands.
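
For context, a sketch of one way such a pre-quantized int8 weight-only checkpoint could be produced (an assumption for illustration, not necessarily how jerryzh168/llama3-8b-int8wo was created; torchao tensor subclasses generally need safe_serialization=False to save):

```python
# Sketch of producing a pre-quantized int8 weight-only checkpoint (an
# illustrative assumption, not necessarily the flow used for the checkpoint
# benchmarked below). torchao tensor subclasses generally need
# safe_serialization=False.
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import int8_weight_only, quantize_

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16, device_map="cuda")
quantize_(model, int8_weight_only())
model.save_pretrained("llama3-8b-int8wo", safe_serialization=False)
```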

python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model jerryzh168/llama3-8b-int8wo --batch-size 1 --quantization torchao --torchao-config int8wo

Avg latency: 2.145327783127626 seconds
10% percentile latency: 2.1314279098063706 seconds
25% percentile latency: 2.137767093256116 seconds
50% percentile latency: 2.144422769546509 seconds
75% percentile latency: 2.150923326611519 seconds
90% percentile latency: 2.1607464250177144 seconds
99% percentile latency: 2.1807862490043046 seconds

Benchmark against an existing quantization method:

python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model nm-testing/Meta-Llama-3-8B-Instruct-W4A16-G128 --batch-size 1
Avg latency: 1.780059675872326 seconds
10% percentile latency: 1.7601410120725631 seconds
25% percentile latency: 1.768859044648707 seconds
50% percentile latency: 1.775211926549673 seconds
75% percentile latency: 1.7830392718315125 seconds
90% percentile latency: 1.8060954470187425 seconds
99% percentile latency: 1.843151821307838 seconds

Next:

Reviewers:

Subscribers:

Tasks:

Tags:


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run the other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀


mergify bot commented Feb 20, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jerryzh168.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mgoin (Member) commented on Feb 20, 2025:

Exciting! It would be cool to see if this can work especially well with torch.compile in vLLM. Feel free to ping when ready for review

jerryzh168 (Contributor, Author) commented on Feb 20, 2025:

thanks, this is still WIP. I suspect we might have to do something special for it to work with the torch.compile integration in vLLM

jerryzh168 changed the title from "Integrating torchao quantization into vllm" to "[WIP] Integrating torchao quantization into vllm" on Feb 20, 2025
jerryzh168 changed the title from "[WIP] Integrating torchao quantization into vllm" to "Integrating torchao quantization into vllm" on Feb 26, 2025
mergify bot added the documentation (Improvements or additions to documentation) label on Feb 27, 2025
jerryzh168 (Contributor, Author) commented:

@mgoin the initial integration is done, please take a look.

I can reproduce the speedup we get from torchao, e.g. ~2x with int4wo, and somehow a slightly better speedup for int8wo than we see in torchao itself (also around 2x).

Wondering if it's required to benchmark against existing techniques in vLLM, or whether we can merge now and follow up on these later.

Summary:
As titled: initial PR that adds support for torchao as a quantization option for vLLM.

Test Plan:
```
python benchmarks/benchmark_throughput.py --input-len 256 --output-len 256 --model meta-llama/Meta-Llama-3-8B
python benchmarks/benchmark_throughput.py --input-len 256 --output-len 256 --model meta-llama/Meta-Llama-3-8B --quantization torchao
```
Reviewers:

Subscribers:

Tasks:

Tags:
mgoin (Member) left a comment:

Nice work! I don't think it is necessary to benchmark against other methods, although it would be nice to validate through an evaluation that the model is running correctly through vLLM.

One thing I don't love about this proposal is adding a torchao-specific arg to the ModelConfig, i.e. the --torchao-config int4wo-128 argument.
We've run into this issue with other dynamic quantization backends, where we currently don't have a way to pass specific overrides to the quant_config. So I think it would be nice to achieve this in a general way:

  • One idea is reusing the --hf-overrides arg to override the model's existing (or non-existent) quantization_config:

        parser.add_argument('--hf-overrides',
                            type=json.loads,
                            default=EngineArgs.hf_overrides,
                            help='Extra arguments for the HuggingFace config. '
                                 'This should be a JSON string that will be '
                                 'parsed into a dictionary.')

    So thinking something like --hf-overrides '{"quantization_config": {"torchao_config": "int4wo-128"}}'

  • Another idea is to add an explicit --quant-config or --quant-overrides argument that overrides the kwargs we pass into the quant_config (a sketch follows below).
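
A minimal sketch of that second idea; the flag name --quant-overrides, its default, and its help text are hypothetical, mirroring the --hf-overrides parser entry shown above:

```python
# Hypothetical sketch of a generic --quant-overrides flag (the flag name and
# help text are assumptions, mirroring the existing --hf-overrides entry).
import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument(
    '--quant-overrides',
    type=json.loads,
    default=None,
    help='Extra kwargs passed through to the quantization config. '
         'This should be a JSON string that will be parsed into a '
         'dictionary, e.g. \'{"torchao_config": "int4wo-128"}\'.')
```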

Comment on lines +20 to +25
# Lazy import to suppress some warnings
from torchao.quantization import (float8_dynamic_activation_float8_weight,
                                  int4_weight_only,
                                  int8_dynamic_activation_int8_weight,
                                  int8_weight_only, quantize_)
from torchao.quantization.observer import PerRow, PerTensor

Can you wrap the torchao import in a try-except so we can give a nice error message instructing the user how to install torchao? See deepspeed as an example

try:
    import deepspeed
    if deepspeed.__version__ < "0.14.2":
        raise ImportError("deepspeed version is wrong. Please "
                          "install deepspeed>=0.14.2.")
    from deepspeed.ops.fp_quantizer import FP_Quantize
except ImportError as err:
    raise ImportError("Please install deepspeed>=0.14.2 via "
                      "`pip install deepspeed>=0.14.2` to use "
                      "deepspeedfp quantizer.") from err

jerryzh168 (Contributor, Author) commented:

Thanks for the review @mgoin. I'm out of office at the moment, and @drisspg will continue this work while I'm out.

drisspg mentioned this pull request on Mar 4, 2025

drisspg (Contributor) commented on Mar 4, 2025:

@mgoin I might end up opening up a new PR since it will be easier for me to iterate on

mgoin (Member) commented on Mar 4, 2025:

@drisspg sure that's fine with me, just give me a ping when ready!

jerryzh168 (Contributor, Author) commented:

merged in #14231

jerryzh168 closed this on May 15, 2025.