[V1] Move usage stats to worker and start logging TPU hardware #16211

Merged Apr 25, 2025 · 58 commits
Changes from 13 commits
f9d82ea
Track TPU usages in vLLM's data dashboards
dyli-google Mar 27, 2025
731b68a
Merge branch 'vllm-project:main' into main
dyli-google Mar 27, 2025
d2d9b9e
Make the code more robust
dyli-google Mar 27, 2025
f168647
Merge branch 'main' of https://github.com/dyli-google/vllm
dyli-google Mar 27, 2025
ee00cf7
Merge branch 'vllm-project:main' into main
dyli-google Apr 7, 2025
39d610f
Your descriptive message about the changes you made
dyli-google Apr 7, 2025
558c60f
format
dyli-google Apr 7, 2025
639f77b
use new API
dyli-google Apr 7, 2025
d5e7533
Merge branch 'vllm-project:main' into main
dyli-google Apr 7, 2025
d9b9d61
Merge branch 'vllm-project:main' into main
dyli-google Apr 7, 2025
8f055c9
address Simon's comments
dyli-google Apr 7, 2025
63bea36
Silence ImportError
dyli-google Apr 7, 2025
25fa30b
Merge branch 'vllm-project:main' into main
dyli-google Apr 8, 2025
8124c99
Merge branch 'vllm-project:main' into main
dyli-google Apr 9, 2025
6a4eea4
Use torch_xla.tpu.get_tpu_type() to get TPU version
dyli-google Apr 9, 2025
ae2f5a6
Merge branch 'vllm-project:main' into main
dyli-google Apr 10, 2025
5d2f2b6
Merge branch 'vllm-project:main' into main
dyli-google Apr 11, 2025
9b3a67c
Merge branch 'vllm-project:main' into main
dyli-google Apr 14, 2025
35fb26b
Merge branch 'vllm-project:main' into main
dyli-google Apr 14, 2025
b0912f0
Merge branch 'vllm-project:main' into main
dyli-google Apr 20, 2025
88dd6c6
Merge branch 'vllm-project:main' into main
dyli-google Apr 22, 2025
727bed5
Add usage to more engines
dyli-google Apr 22, 2025
4f94631
Merge branch 'vllm-project:main' into main
dyli-google Apr 22, 2025
619e496
fix error
dyli-google Apr 22, 2025
a1ae7ff
format
dyli-google Apr 23, 2025
1667fab
Merge branch 'vllm-project:main' into main
dyli-google Apr 23, 2025
9f725f6
Revert "format"
dyli-google Apr 23, 2025
b17dbc9
format
dyli-google Apr 23, 2025
5286466
Merge branch 'vllm-project:main' into main
dyli-google Apr 23, 2025
3bd0c9b
Use import torch_xla
dyli-google Apr 23, 2025
625d21c
Merge branch 'main' of https://github.com/dyli-google/vllm
dyli-google Apr 23, 2025
718729a
format
dyli-google Apr 23, 2025
6e61fba
format
dyli-google Apr 23, 2025
737646d
format
dyli-google Apr 23, 2025
0e093cc
Merge branch 'vllm-project:main' into main
dyli-google Apr 23, 2025
9940dad
Merge branch 'vllm-project:main' into main
dyli-google Apr 23, 2025
f825349
Try Qiliang's idea
dyli-google Apr 23, 2025
7798bde
Merge branch 'vllm-project:main' into main
dyli-google Apr 23, 2025
bbd7f5a
Use Yarong's 2nd idea
dyli-google Apr 24, 2025
5bf9f34
Merge branch 'main' into main
dyli-google Apr 24, 2025
4e38e67
revert vllm/engine/async_llm_engine.py
dyli-google Apr 24, 2025
fc18a7a
simplify code
dyli-google Apr 24, 2025
cf7997a
simplify
dyli-google Apr 24, 2025
3bd5730
fix typo
dyli-google Apr 24, 2025
4374c3c
format
dyli-google Apr 24, 2025
6829371
simplify
dyli-google Apr 24, 2025
3c55fc7
silence error
dyli-google Apr 24, 2025
bbee546
Suppress all exceptions
dyli-google Apr 24, 2025
429b6aa
format
dyli-google Apr 24, 2025
8939235
remove comment
dyli-google Apr 24, 2025
bc284db
Merge branch 'vllm-project:main' into main
dyli-google Apr 24, 2025
bac067a
report usage of TPU and GPU during worker init time
dyli-google Apr 24, 2025
3ad33a2
remove useless import
dyli-google Apr 24, 2025
5b0ab6d
format
dyli-google Apr 24, 2025
1f592e4
Merge branch 'vllm-project:main' into main
dyli-google Apr 24, 2025
98e7ae0
Merge branch 'vllm-project:main' into main
dyli-google Apr 24, 2025
689d343
Merge branch 'vllm-project:main' into main
dyli-google Apr 25, 2025
4eea0a9
Merge branch 'vllm-project:main' into main
dyli-google Apr 25, 2025
9 changes: 9 additions & 0 deletions vllm/usage/usage_lib.py
@@ -174,6 +174,15 @@ def _report_usage_once(self, model_architecture: str,
             self.gpu_memory_per_device = device_property.total_memory
         if current_platform.is_cuda():
             self.cuda_runtime = torch.version.cuda
+        if current_platform.is_tpu():
+            try:
+                import torch_xla.runtime as xr
+                from torch_xla.core import xla_model as xm
+                self.gpu_count = xr.world_size()
+                self.gpu_type = xm.xla_device_hw(xm.xla_device())
+                self.gpu_memory_per_device = xm.get_memory_info().bytes_limit
Collaborator: xm.xla_device_hw(xm.xla_device()) returns TPU as the result. Or do we want something like v6e, v5e?

Contributor Author: @yarongmu-google @simon-mo What do you think? I believe TPU should be OK?

Collaborator: A version number will be useful.

Contributor Author: Thanks Simon. @yaochengji Do we have a way to get the version number?

Collaborator: You can use torch_xla.tpu.get_tpu_type()

Contributor Author: Cool, thanks. I just updated the code to use torch_xla.tpu.get_tpu_type()
+            except ImportError:
+                pass
         self.provider = _detect_cloud_provider()
         self.architecture = platform.machine()
         self.platform = platform.platform()
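The review thread converged on torch_xla.tpu.get_tpu_type() for the TPU generation, and a later commit ("Suppress all exceptions") broadened the error handling so usage reporting can never crash the worker. A minimal sketch of that best-effort lookup, assuming torch_xla exposes tpu.get_tpu_type() as discussed above; the helper name detect_tpu_hardware is illustrative and not part of vLLM's actual API:

```python
import contextlib


def detect_tpu_hardware():
    """Best-effort TPU generation lookup (e.g. 'v5e').

    Returns None when torch_xla is absent or the lookup fails for any
    reason; all exceptions are suppressed so usage logging stays
    non-fatal, mirroring the PR's "Suppress all exceptions" commit.
    """
    tpu_type = None
    with contextlib.suppress(Exception):
        import torch_xla.tpu as tpu  # only present on TPU installs
        tpu_type = tpu.get_tpu_type()
    return tpu_type
```

On a machine without torch_xla, detect_tpu_hardware() simply returns None rather than raising, which is the behavior the PR wants for opportunistic telemetry.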