[TPU][V1] Refine tpu_model_runner to mitigate future recompilation issues #16275

yaochengji · 2025-04-08T16:34:01Z

- Wrap all TPU computation in torch.compile. The compiled functions are currently:
  - forward: total_bucket_size = token_bucket_num
  - select_hidden_states: total_bucket_size = token_bucket_num * req_bucket_num
  - sample_from_hidden: total_bucket_size = req_bucket_num * 2
- Remove PyTorch operations from TPUSupportedSamplingMetadata.from_input_batch.
- Replace the torch.where implementation with if/else branches.
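
As a rough sketch of how the bucket formulas above translate into the number of graphs to precompile: the helper name `expected_graph_counts` and the bucket lists are hypothetical, chosen only for illustration. Reading the factor of 2 in sample_from_hidden as the two if/else branches is also an assumption.

```python
def expected_graph_counts(token_buckets, req_buckets):
    """Count compiled graphs per function using the formulas from the description."""
    token_bucket_num = len(token_buckets)
    req_bucket_num = len(req_buckets)
    return {
        "forward": token_bucket_num,
        "select_hidden_states": token_bucket_num * req_bucket_num,
        # Assumption: the factor of 2 reflects two if/else branches.
        "sample_from_hidden": req_bucket_num * 2,
    }


# Hypothetical padding buckets, not taken from vLLM.
token_buckets = [16, 32, 64, 128]
req_buckets = [8, 16, 32]

counts = expected_graph_counts(token_buckets, req_buckets)
total_graphs = sum(counts.values())
print(counts, total_graphs)
```

Because every count depends only on the number of buckets, the set of graphs is fixed up front, which is what makes precompilation possible.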

NickLucche · 2025-04-08T16:39:47Z

I can't review rn sorry. Is it faster than main?

yaochengji · 2025-04-08T16:41:50Z

> I can't review rn sorry. Is it faster than main?

Yes, both compilation and execution are faster.

NickLucche

Thanks for the PR, overall this lgtm, especially the from_input_batch change.

I don't fully grasp where the optimization lies with the if/else branch as we're just forking two different graphs now, but that's about it.

yaochengji · 2025-04-08T18:00:23Z

> I don't fully grasp where the optimization lies with the if/else branch as we're just forking two different graphs now, but that's about it.

If we use torch.where(pred, A, B), we only need to compile one graph, which covers both pred being True and pred being False, but both A and B are computed regardless of the value of pred.

If we use

if pred:
    A
else:
    B

then we need to compile two graphs, one for pred being True and one for pred being False. The benefit is that only one of A and B is computed at execution time.

Also cc @robertgshaw2-redhat, I remember you had similar questions.
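
The trade-off described above can be sketched in plain Python, without torch or XLA. This is a toy model, not the vLLM code: `compute_a`, `compute_b`, and the `graphs` bookkeeping are hypothetical stand-ins for the two branches and for the compiled-graph cache.

```python
calls = {"A": 0, "B": 0}
graphs = {"where": set(), "branch": set()}


def compute_a(x):
    calls["A"] += 1
    return [v * 2.0 for v in x]


def compute_b(x):
    calls["B"] += 1
    return [v + 1.0 for v in x]


def where_style(pred, x):
    # One graph for any pred: the predicate is data, not control flow.
    graphs["where"].add("single")
    a = compute_a(x)  # always computed
    b = compute_b(x)  # always computed
    return a if pred else b


def branch_style(pred, x):
    # One graph per predicate value: the branch is resolved before tracing.
    graphs["branch"].add(bool(pred))
    return compute_a(x) if pred else compute_b(x)


def run(fn):
    """Call fn once with each predicate value and count branch evaluations."""
    calls["A"] = calls["B"] = 0
    fn(True, [1.0, 2.0])
    fn(False, [1.0, 2.0])
    return calls["A"] + calls["B"]


where_work = run(where_style)
branch_work = run(branch_style)
print(len(graphs["where"]), where_work)    # graphs compiled, branches evaluated
print(len(graphs["branch"]), branch_work)
```

In this model the where-style version compiles one graph but evaluates both branches on every call, while the if/else version compiles two graphs and evaluates one branch per call.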

NickLucche · 2025-04-08T18:19:51Z

> but both A and B will be computed no matter what the value of pred is

OK, I see the issue now. The simpler if/else solution surely fits our use case better. Thanks for explaining!

vanbasten23 · 2025-04-09T00:48:34Z

vllm/v1/sample/tpu/metadata.py

-        Eg. 3 requests, tensors padded to 4 
-            temperature: [0.7, 0.2, 0.9]=>[0.7, 0.2, 0.9, 0.0]
-            sample indices: [4, 10, 11]=>indices_do_sample: [4, 10, 11, 0]
+        ops to CPU and produces tensors of fixed `padded_num_reqs` size.


nit: it seems that the implementation directly uses the CPU tensors and moves them to the XLA device.

I think it aligns with the description?
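
For reference, the padding example from the removed docstring lines can be reproduced with a minimal sketch. The helper name `pad_to_fixed_size` is hypothetical, and plain Python lists stand in for the CPU tensors; the actual implementation in metadata.py may differ.

```python
def pad_to_fixed_size(values, padded_num_reqs, fill_value=0.0):
    """Pad a per-request list to a fixed length so its shape never changes."""
    if len(values) > padded_num_reqs:
        raise ValueError("more requests than the padded size allows")
    return list(values) + [fill_value] * (padded_num_reqs - len(values))


# 3 requests, padded to 4, as in the docstring example.
temperature = pad_to_fixed_size([0.7, 0.2, 0.9], 4)
sample_indices = pad_to_fixed_size([4, 10, 11], 4, fill_value=0)
print(temperature, sample_indices)
```

Doing this padding on the CPU side keeps the shapes seen by the device fixed at `padded_num_reqs`, so a new request count within the same bucket does not introduce a new shape.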

mgoin · 2025-04-09T14:49:23Z

Looks like the TPU V1 test is failing


ERROR collecting tests/v1/tpu/worker/test_tpu_model_runner.py 
  | ImportError while importing test module '/workspace/vllm/tests/v1/tpu/worker/test_tpu_model_runner.py'.
  | Hint: make sure your test modules/packages have valid Python names.
  | Traceback:
  | /usr/local/lib/python3.10/importlib/__init__.py:126: in import_module
  | return _bootstrap._gcd_import(name[level:], package, level)
  | tests/v1/tpu/worker/test_tpu_model_runner.py:10: in <module>
  | from vllm.v1.worker.tpu_model_runner import (TPUModelRunner,
  | E   ImportError: cannot import name '_get_paddings' from 'vllm.v1.worker.tpu_model_runner' (/workspace/vllm/vllm/v1/worker/tpu_model_runner.py)

yaochengji · 2025-04-09T16:03:13Z

> Looks like the TPU V1 test is failing

@mgoin thanks for the reminder, it should be fixed now. BTW, can we stop treating the TPU CI test as a soft-fail now? cc @robertgshaw2-redhat

mgoin

Nice work, I feel this is a good path to go down! Thank you

yaochengji requested review from WoosukKwon, robertgshaw2-redhat, njhill, ywang96, comaniac and alexm-redhat as code owners April 8, 2025 16:34

mergify bot added the v1 and tpu labels Apr 8, 2025

yaochengji changed the title ~~[TPU][V1] Refine tpu_model_runner to mitigate the future recompilation issues~~ [TPU][V1] Refine tpu_model_runner to mitigate future recompilation issues Apr 8, 2025

NickLucche approved these changes Apr 8, 2025

yaochengji added the ready label Apr 8, 2025

yaochengji mentioned this pull request Apr 8, 2025

[RFC]: How to handle the compilation of PyTorch/XLA in vLLM #16282

Closed

1 task

yaochengji marked this pull request as draft April 8, 2025 20:08

yaochengji marked this pull request as ready for review April 8, 2025 20:23

vanbasten23 reviewed Apr 9, 2025

mgoin reviewed Apr 9, 2025

mgoin reviewed Apr 9, 2025

vanbasten23 reviewed Apr 9, 2025

vanbasten23 reviewed Apr 9, 2025

NickLucche mentioned this pull request Apr 9, 2025

[TPU][V1][DEBUG] Provide Env Variable To Disable Sampler #16063

Closed

yaochengji added 4 commits April 9, 2025 15:57

- init (e273dd4)
- format (99726e3)
- fix comments (af53da5)
- fix test (d113a59)

yaochengji added 2 commits April 9, 2025 15:57

- fix test (8cf83ee)
- fix test (824a5b7)

Each commit is signed off by Chengji Yao.

NickLucche suggested changes Apr 9, 2025

fix comment (dc5c641), signed off by Chengji Yao

yaochengji force-pushed the chengji/improve-recompile branch from f2a4dfd to dc5c641 Compare April 9, 2025 15:58

NickLucche approved these changes Apr 9, 2025

mgoin approved these changes Apr 10, 2025

mgoin merged commit a454748 into vllm-project:main Apr 10, 2025
41 checks passed

yaochengji mentioned this pull request Apr 16, 2025

[Feature][Hardware][TPU]:Reduce the compile time #14582

Closed

1 task
