Motivation.
vLLM currently utilizes PyTorch/XLA to provide TPU backend support. However, PyTorch/XLA differs significantly from native PyTorch in how it is used: it is a compilation-only framework with no real eager mode. In particular, for LLM serving, recompilation must be avoided once the server is running.
When compiling, it is important to consider which code creates PyTorch operations (e.g., tensor.copy_(), tensor[:index], torch.ones(...)) and when graph capture and compilation are triggered (e.g., xm.mark_step(), xla_tensor.cpu(), if xla_tensor:, torch.compile(backend="openxla")). Because of the complexity of PyTorch/XLA, this document only provides basic rules to simplify vLLM development on TPU.
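To make the distinction concrete, here is a minimal sketch (not taken from the vLLM code base) showing that operations on XLA tensors are only recorded lazily, while materializing a value on the host or calling xm.mark_step() forces graph capture and compilation:

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()

# These calls only record operations on lazy XLA tensors; nothing is compiled yet.
x = torch.ones(4, 4, device=device)
y = (x * 2)[:2]

# Materializing a value on the host forces graph capture, compilation, and execution.
y_host = y.cpu()

# xm.mark_step() would likewise cut the pending graph and trigger compilation.
xm.mark_step()
```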
Ways to avoid recompilation
The model executor has two primary components:

1. preparing the model and sampler inputs;
2. executing the model and sampler.
Step 1
It is recommended to avoid TPU operations when preparing the model and sampler inputs. Prepare CPU tensors and transfer them to the XLA device with cpu_tensor.to(xla_device); this triggers only a CPU-to-TPU copy and does not cause compilation.
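A minimal sketch of this pattern; the tensor names and padded shapes are illustrative assumptions, not the actual vLLM input-preparation code:

```python
import torch
import torch_xla.core.xla_model as xm

xla_device = xm.xla_device()

# Build all inputs as plain CPU tensors; no XLA ops are recorded here.
input_ids_cpu = torch.zeros(8, 128, dtype=torch.int64)                # padded token ids
position_ids_cpu = torch.arange(128, dtype=torch.int64).repeat(8, 1)  # padded positions

# A single .to(xla_device) is just a host-to-device copy and does not
# trigger graph capture or compilation.
input_ids = input_ids_cpu.to(xla_device)
position_ids = position_ids_cpu.to(xla_device)
```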
Step 2
The TPU execution should be decomposed into subgraphs (4 at the moment):

1. the main model;
2. selecting the hidden states for each request;
3. the sampler;
4. the encoder.
Each subgraph should be wrapped in torch.compile. This ensures that we have the same subgraph topology in both dummy_run and execute_model. The results from these subgraphs should either be passed to other subgraphs or transferred from TPU to CPU using xla_tensor.cpu() for subsequent processing on the CPU.
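A sketch of what this decomposition can look like; the function names (run_model, select_hidden_states, sample), the model call signature, and the greedy sampler are placeholders rather than the actual vLLM implementation:

```python
import torch

@torch.compile(backend="openxla", fullgraph=True, dynamic=False)
def run_model(model, input_ids, position_ids):
    return model(input_ids, position_ids)      # subgraph 1: main model

@torch.compile(backend="openxla", fullgraph=True, dynamic=False)
def select_hidden_states(hidden_states, logits_indices):
    return hidden_states[logits_indices]       # subgraph 2: pick per-request hidden states

@torch.compile(backend="openxla", fullgraph=True, dynamic=False)
def sample(logits):
    return torch.argmax(logits, dim=-1)        # subgraph 3: (greedy) sampler

# Results either feed the next subgraph directly, or are moved to the host
# with .cpu() before any Python-side post-processing, e.g.:
# token_ids = sample(select_hidden_states(run_model(model, ids, pos), idx)).cpu()
```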
Step 3
The dummy_run should be comprehensive: all potential input shapes and branch conditions should be covered as subgraph inputs, so that every graph is pre-compiled before serving.
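A sketch of such a warm-up loop, reusing the compiled run_model subgraph from the previous sketch; the bucket sizes, the dummy_run signature, and the already-loaded model object are illustrative assumptions:

```python
import torch
import torch_xla.core.xla_model as xm

def dummy_run(model, num_tokens, xla_device):
    # Zero-filled inputs are enough: only the shapes matter for compilation.
    input_ids = torch.zeros(num_tokens, dtype=torch.int64, device=xla_device)
    position_ids = torch.zeros(num_tokens, dtype=torch.int64, device=xla_device)
    run_model(model, input_ids, position_ids)  # same compiled subgraph used by execute_model
    xm.wait_device_ops()                       # wait until compilation/execution finishes

# Pre-compile one graph per padded token-count bucket before serving starts.
for num_tokens in (16, 32, 64, 128, 256, 512):
    dummy_run(model, num_tokens, xm.xla_device())
```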
Feedback Period.
No response
CC List.
@robertgshaw2-redhat @NickLucche @WoosukKwon @yarongmu-google @bvrockwell
Any Other Things.
No response