Integrating torchao quantization into vllm #13588
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger a full CI run by default; instead, only a limited subset runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 …
This pull request has merge conflicts that must be resolved before it can be merged.
Exciting! It would be cool to see if this can work especially well with torch.compile in vLLM. Feel free to ping when ready for review.
Thanks, this is still WIP. I suspect we might have to do something special for it to work with the torch.compile integration in vLLM.
@mgoin initial integration is done, please take a look. I can repro the speedup we get from torchao, e.g. a 2x speedup with int4wo and, somehow, a slightly better speedup for int8wo than in torchao itself (also around 2x). Wondering if it's required to benchmark against existing techniques in vLLM, or if we can merge now and follow up with these later.
Summary: att, initial PR that adds support for torchao as a quantization option for vllm

Test Plan:
```
python benchmarks/benchmark_throughput.py --input-len 256 --output-len 256 --model meta-llama/Meta-Llama-3-8B
python benchmarks/benchmark_throughput.py --input-len 256 --output-len 256 --model meta-llama/Meta-Llama-3-8B --quantization torchao
```

Reviewers:
Subscribers:
Tasks:
Tags:
Nice work! I don't think it is necessary to benchmark against other methods, although it would be nice to validate through an evaluation that the model is running correctly through vLLM.
One thing I don't love about this proposal is adding a torchao-specific arg to the ModelConfig, i.e. the `--torchao-config int4wo-128` argument.
We've run into this issue with other dynamic quantization backends, where we currently don't have a way to pass specific overrides to the quant_config. So I think it would be nice to achieve this in a general way:
- One idea is reusing the `--hf-overrides` arg to override the model's existing (or non-existent) `quantization_config` (Lines 598 to 603 in 084bbac):

        parser.add_argument('--hf-overrides',
                            type=json.loads,
                            default=EngineArgs.hf_overrides,
                            help='Extra arguments for the HuggingFace config. '
                                 'This should be a JSON string that will be '
                                 'parsed into a dictionary.')

  So thinking something like `--hf-overrides '{"quantization_config": {"torchao_config": "int4wo-128"}}'` (see the sketch after these two ideas).
- Another idea is to add an explicit `--quant-config` or `--quant-overrides` argument that overrides the kwargs we pass into the quant_config.
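A minimal sketch of how such an override could be consumed (purely illustrative; `get_torchao_setting` and its default value are hypothetical names, not vLLM APIs):

```python
import json

def get_torchao_setting(hf_overrides_json: str, default: str = "int4wo-128") -> str:
    """Pull torchao_config out of an --hf-overrides style JSON string."""
    overrides = json.loads(hf_overrides_json)  # same parsing as the --hf-overrides arg above
    return overrides.get("quantization_config", {}).get("torchao_config", default)

# What the CLI value from the first idea would resolve to:
print(get_torchao_setting('{"quantization_config": {"torchao_config": "int4wo-128"}}'))
# -> int4wo-128
```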
# Lazy import to suppress some warnings
from torchao.quantization import (float8_dynamic_activation_float8_weight,
                                  int4_weight_only,
                                  int8_dynamic_activation_int8_weight,
                                  int8_weight_only, quantize_)
from torchao.quantization.observer import PerRow, PerTensor
Can you wrap the torchao import in a try-except so we can give a nice error message instructing the user how to install torchao? See deepspeed as an example
vllm/vllm/model_executor/layers/quantization/deepspeedfp.py
Lines 144 to 153 in 084bbac
try:
    import deepspeed
    if deepspeed.__version__ < "0.14.2":
        raise ImportError("deepspeed version is wrong. Please "
                          "install deepspeed>=0.14.2.")
    from deepspeed.ops.fp_quantizer import FP_Quantize
except ImportError as err:
    raise ImportError("Please install deepspeed>=0.14.2 via "
                      "`pip install deepspeed>=0.14.2` to use "
                      "deepspeedfp quantizer.") from err
@mgoin I might end up opening a new PR since it will be easier for me to iterate on.
@drisspg sure, that's fine with me, just give me a ping when ready!
Merged in #14231.
Summary:
att, initial PR that adds support for torchao as a quantization option for vllm
Test Plan:
pytest tests/quantization/test_torchao.py
=== process_weights_after_loading ===
Tested on an A100 machine
bfloat16:
Throughput: 14.23 requests/s, 7285.92 total tokens/s, 3642.96 output tokens/s
torchao int4wo-128:
Throughput: 2.34 requests/s, 1197.03 total tokens/s, 598.52 output tokens/s
Note: int4wo-128 can only give a speedup at batch size 1; we expect float8 to give an overall boost, which we can test later.
bfloat16:
torchao int4wo-128:
int8wo without compile:
int8wo with compile (enabled by default in the PR):
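For context, "with compile" refers to running the torchao-quantized modules under torch.compile. Below is a standalone sketch of that general torchao recipe on a toy module (illustrative only; not necessarily how this PR wires compilation into vLLM), with a commented alternative for the float8 path mentioned in the note above:

```python
# Generic torchao + torch.compile recipe (toy module; requires a CUDA GPU).
import torch
from torchao.quantization import int8_weight_only, quantize_

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096, bias=False, dtype=torch.bfloat16, device="cuda"))

# Swap Linear weights to int8 weight-only quantized tensors in place.
quantize_(model, int8_weight_only())
# Float8 alternative (the "overall boost" path mentioned above), assuming the
# granularity kwarg of current torchao releases:
#   from torchao.quantization import float8_dynamic_activation_float8_weight
#   from torchao.quantization.observer import PerRow
#   quantize_(model, float8_dynamic_activation_float8_weight(granularity=PerRow()))

# Compiling the quantized modules is what typically recovers the speedup.
model = torch.compile(model, mode="max-autotune")

x = torch.randn(1, 4096, dtype=torch.bfloat16, device="cuda")
with torch.no_grad():
    print(model(x).shape)
```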
=== loading pre quantized model results ===
Needs pytorch/ao#1791 from torchao, or install the nightly after it lands.
python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model jerryzh168/llama3-8b-int8wo --batch-size 1 --quantization torchao --torchao-config int8wo
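A rough Python-API equivalent of the command above, as a hedged sketch: the `torchao_config` kwarg assumes `LLM()` forwards it as an engine arg the same way `--torchao-config` does on the CLI, and the prompt and sampling settings are arbitrary.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="jerryzh168/llama3-8b-int8wo",
    quantization="torchao",
    torchao_config="int8wo",  # assumed kwarg, mirroring --torchao-config int8wo
)
outputs = llm.generate(["The capital of France is"],
                       SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```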
Benchmarks against existing quant method:
Next:
Reviewers:
Subscribers:
Tasks:
Tags: