Integrating torchao quantization into vllm #13588


Closed
jerryzh168 wants to merge 10 commits.

Conversation

jerryzh168 (Contributor) commented on Feb 20, 2025:

Summary:
As titled: initial PR that adds support for torchao as a quantization option for vLLM.

Test Plan:

pytest tests/quantization/test_torchao.py

=== process_weights_after_loading ===
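
For context, the idea in this path is to apply a torchao quantization config to each linear layer once its high-precision weights have been loaded. A minimal sketch of that idea (not the PR's actual code; the layer shape is illustrative, and group_size=128 corresponds to the int4wo-128 setting):

```python
# Minimal sketch of the idea (not the PR's exact code): quantize a linear
# layer in place with torchao after its bf16 weights have been loaded.
# The layer shape here is illustrative.
import torch
from torchao.quantization import int4_weight_only, quantize_

layer = torch.nn.Linear(4096, 4096, dtype=torch.bfloat16, device="cuda")
# ... checkpoint weights would be copied into `layer.weight` here ...
quantize_(layer, int4_weight_only(group_size=128))  # corresponds to "int4wo-128"
```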

Tested on an A100 machine.

python benchmarks/benchmark_throughput.py --input-len 256 --output-len 256 --model meta-llama/Meta-Llama-3-8B
python benchmarks/benchmark_throughput.py --input-len 256 --output-len 256 --model meta-llama/Meta-Llama-3-8B --quantization torchao --torchao-config int4wo-128

bfloat16:
Throughput: 14.23 requests/s, 7285.92 total tokens/s, 3642.96 output tokens/s

torchao int4wo-128
Throughput: 2.34 requests/s, 1197.03 total tokens/s, 598.52 output tokens/s

Note: int4wo-128 only gives a speedup at batch size 1; we expect float8 to give an overall boost, which can be tested later (rough sketch below).
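
For reference, the float8 path mentioned above would go through torchao's dynamic float8 API. A rough sketch, assuming a GPU with float8 support (the layer shape is illustrative; this configuration was not benchmarked in this PR):

```python
# Rough sketch of the float8 path mentioned above (not benchmarked here);
# float8_dynamic_activation_float8_weight is torchao's API, the layer shape
# is illustrative, and a GPU with float8 support is assumed.
import torch
from torchao.quantization import (float8_dynamic_activation_float8_weight,
                                  quantize_)

layer = torch.nn.Linear(4096, 4096, dtype=torch.bfloat16, device="cuda")
quantize_(layer, float8_dynamic_activation_float8_weight())
```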

python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model meta-llama/Meta-Llama-3-8B --batch-size 1
python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model meta-llama/Meta-Llama-3-8B --batch-size 1 --quantization torchao --torchao-config int4wo-128

bfloat16:

Avg latency: 4.563035774851839 seconds
10% percentile latency: 3.3907309625297786 seconds
25% percentile latency: 3.424301794730127 seconds
50% percentile latency: 3.727481307461858 seconds
75% percentile latency: 5.625706314109266 seconds
90% percentile latency: 5.935522455349565 seconds
99% percentile latency: 7.159032964520157 seconds

torchao int4wo-128:

Avg latency: 2.1109102573245764 seconds
10% percentile latency: 2.089452960342169 seconds
25% percentile latency: 2.092462383210659 seconds
50% percentile latency: 2.1093569844961166 seconds
75% percentile latency: 2.119327493943274 seconds
90% percentile latency: 2.132301461696625 seconds
99% percentile latency: 2.192038407512009 seconds

int8wo without compile:

Avg latency: 11.678632816672325 seconds
10% percentile latency: 11.624498645961285 seconds
25% percentile latency: 11.638677136972547 seconds
50% percentile latency: 11.690440801903605 seconds
75% percentile latency: 11.707305849529803 seconds
90% percentile latency: 11.724864188954234 seconds
99% percentile latency: 11.745748021267355 seconds

int8wo with compile (enabled by default in the PR):

Avg latency: 2.144259312748909 seconds
10% percentile latency: 2.111430574581027 seconds
25% percentile latency: 2.117820209823549 seconds
50% percentile latency: 2.1458087861537933 seconds
75% percentile latency: 2.1610238971188664 seconds
90% percentile latency: 2.175424510613084 seconds
99% percentile latency: 2.201516461186111 seconds

=== loading pre quantized model results ===

Requires pytorch/ao#1791 from torchao, or install the torchao nightly once it lands.
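
For context, a sketch of one way such a pre-quantized int8 weight-only checkpoint could be produced (an assumption for illustration, not necessarily how jerryzh168/llama3-8b-int8wo was created; torchao tensor subclasses generally need safe_serialization=False to save):

```python
# Sketch of producing a pre-quantized int8 weight-only checkpoint (an
# illustrative assumption, not necessarily the flow used for the checkpoint
# benchmarked below). torchao tensor subclasses generally need
# safe_serialization=False.
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import int8_weight_only, quantize_

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16, device_map="cuda")
quantize_(model, int8_weight_only())
model.save_pretrained("llama3-8b-int8wo", safe_serialization=False)
```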

python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model jerryzh168/llama3-8b-int8wo --batch-size 1 --quantization torchao --torchao-config int8wo

Avg latency: 2.145327783127626 seconds
10% percentile latency: 2.1314279098063706 seconds
25% percentile latency: 2.137767093256116 seconds
50% percentile latency: 2.144422769546509 seconds
75% percentile latency: 2.150923326611519 seconds
90% percentile latency: 2.1607464250177144 seconds
99% percentile latency: 2.1807862490043046 seconds

Benchmark against an existing quantization method:

python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model nm-testing/Meta-Llama-3-8B-Instruct-W4A16-G128 --batch-size 1
Avg latency: 1.780059675872326 seconds
10% percentile latency: 1.7601410120725631 seconds
25% percentile latency: 1.768859044648707 seconds
50% percentile latency: 1.775211926549673 seconds
75% percentile latency: 1.7830392718315125 seconds
90% percentile latency: 1.8060954470187425 seconds
99% percentile latency: 1.843151821307838 seconds

Next:

Reviewers:

Subscribers:

Tasks:

Tags:


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run the other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀


mergify bot commented Feb 20, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jerryzh168.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mgoin (Member) commented on Feb 20, 2025:

Exciting! It would be cool to see if this can work especially well with torch.compile in vLLM. Feel free to ping when ready for review

jerryzh168 (Contributor, Author) commented on Feb 20, 2025:

thanks, this is still WIP. I suspect we might have to do something special for it to work with the torch.compile integration in vLLM

jerryzh168 changed the title from "Integrating torchao quantization into vllm" to "[WIP] Integrating torchao quantization into vllm" on Feb 20, 2025
jerryzh168 changed the title from "[WIP] Integrating torchao quantization into vllm" to "Integrating torchao quantization into vllm" on Feb 26, 2025
mergify bot added the documentation (Improvements or additions to documentation) label on Feb 27, 2025
jerryzh168 (Contributor, Author) commented:

@mgoin the initial integration is done, please take a look.

I can reproduce the speedup we get from torchao, e.g. ~2x with int4wo, and somehow a slightly better speedup for int8wo than we see in torchao itself (also around 2x).

Wondering if it's required to benchmark against existing techniques in vLLM, or whether we can merge now and follow up on these later.

Summary:
As titled: initial PR that adds support for torchao as a quantization option for vLLM.

Test Plan:
```
python benchmarks/benchmark_throughput.py --input-len 256 --output-len 256 --model meta-llama/Meta-Llama-3-8B
python benchmarks/benchmark_throughput.py --input-len 256 --output-len 256 --model meta-llama/Meta-Llama-3-8B --quantization torchao
```
Reviewers:

Subscribers:

Tasks:

Tags:
mgoin (Member) left a comment:

Nice work! I don't think it is necessary to benchmark against other methods, although it would be nice to validate through an evaluation that the model is running correctly through vLLM.

One thing I don't love about this proposal is adding a torchao-specific arg to the ModelConfig, i.e. the --torchao-config int4wo-128 argument.
We've run into this issue with other dynamic quantization backends, where we currently don't have a way to pass specific overrides to the quant_config. So I think it would be nice to achieve this in a general way:

  • One idea is reusing the --hf-overrides arg to override the model's existing (or non-existent) quantization_config:

        parser.add_argument('--hf-overrides',
                            type=json.loads,
                            default=EngineArgs.hf_overrides,
                            help='Extra arguments for the HuggingFace config. '
                                 'This should be a JSON string that will be '
                                 'parsed into a dictionary.')

    So thinking something like --hf-overrides '{"quantization_config": {"torchao_config": "int4wo-128"}}'

  • Another idea is to add an explicit --quant-config or --quant-overrides argument that overrides the kwargs we pass into the quant_config (a sketch follows below).
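
A minimal sketch of that second idea; the flag name --quant-overrides, its default, and its help text are hypothetical, mirroring the --hf-overrides parser entry shown above:

```python
# Hypothetical sketch of a generic --quant-overrides flag (the flag name and
# help text are assumptions, mirroring the existing --hf-overrides entry).
import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument(
    '--quant-overrides',
    type=json.loads,
    default=None,
    help='Extra kwargs passed through to the quantization config. '
         'This should be a JSON string that will be parsed into a '
         'dictionary, e.g. \'{"torchao_config": "int4wo-128"}\'.')
```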

Comment on lines +20 to +25
# Lazy import to suppress some warnings
from torchao.quantization import (float8_dynamic_activation_float8_weight,
                                  int4_weight_only,
                                  int8_dynamic_activation_int8_weight,
                                  int8_weight_only, quantize_)
from torchao.quantization.observer import PerRow, PerTensor

Can you wrap the torchao import in a try-except so we can give a nice error message instructing the user how to install torchao? See deepspeed as an example

try:
    import deepspeed
    if deepspeed.__version__ < "0.14.2":
        raise ImportError("deepspeed version is wrong. Please "
                          "install deepspeed>=0.14.2.")
    from deepspeed.ops.fp_quantizer import FP_Quantize
except ImportError as err:
    raise ImportError("Please install deepspeed>=0.14.2 via "
                      "`pip install deepspeed>=0.14.2` to use "
                      "deepspeedfp quantizer.") from err

jerryzh168 (Contributor, Author) commented:

Thanks for the review @mgoin. I'm out of office at the moment, and @drisspg will continue this work while I'm out.

drisspg mentioned this pull request on Mar 4, 2025

drisspg (Contributor) commented on Mar 4, 2025:

@mgoin I might end up opening up a new PR since it will be easier for me to iterate on

mgoin (Member) commented on Mar 4, 2025:

@drisspg sure that's fine with me, just give me a ping when ready!

jerryzh168 (Contributor, Author) commented:

merged in #14231

jerryzh168 closed this on May 15, 2025.