
[Usage]: Is pipeline parallelism supported on machines that are not in the same local network? #11285


Closed
1 task done
oldcpple opened this issue Dec 18, 2024 · 4 comments
Labels
stale (Over 90 days of inactivity), usage (How to use vllm)

Comments

@oldcpple

oldcpple commented Dec 18, 2024

How would you like to use vllm

Hi there, since communication between nodes is done by NCCL (which I guess typically relies on RDMA), I wonder if I can set up an inference pipeline with machines on different networks, for example one on Google Cloud and another on AWS, through vLLM's pipeline parallelism?
Thanks a lot if anyone can answer this.
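For context, multi-node pipeline parallelism in vLLM is normally launched within a single low-latency network on top of a Ray cluster. A minimal sketch (the model name and head-node address are placeholders, not from this thread):

```shell
# On the head node (assumption: both nodes share a fast local network)
ray start --head --port=6379

# On each worker node, join the same Ray cluster
ray start --address=<head-node-ip>:6379

# Back on the head node: serve with 2 pipeline stages, one per node
# (model name is a placeholder)
vllm serve meta-llama/Llama-3.1-8B-Instruct --pipeline-parallel-size 2
```

The question here is whether the same setup works when the nodes sit in different clouds, where NCCL's usual fast transports are unavailable.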

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@oldcpple oldcpple added the usage How to use vllm label Dec 18, 2024
@noooop
Contributor

noooop commented Dec 18, 2024

Do you really want to do this?

A typical vLLM step takes about 20 ms, while copying an intermediate result (a large tensor) over the network is very slow.

vLLM is also currently scheduled synchronously, so the delay from transmitting intermediate results over the network will greatly reduce GPU utilization, increase latency, and reduce throughput.
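A rough back-of-envelope comparison illustrates the point. All numbers below are assumptions for illustration (batch size, hidden dimension, bandwidths, and RTTs are not from this thread); only the ~20 ms step time comes from the comment above.

```python
# Each pipeline-parallel hop must ship the hidden-state tensor for the
# current batch to the next stage at every decoding step.

def transfer_ms(tensor_bytes: int, bandwidth_gbps: float, rtt_ms: float) -> float:
    """Time to move one intermediate tensor: round-trip latency plus wire time."""
    return rtt_ms + (tensor_bytes * 8) / (bandwidth_gbps * 1e9) * 1e3

# Assumed workload: batch of 32 sequences, hidden dim 4096, fp16 (2 bytes)
hidden_bytes = 32 * 4096 * 2

lan_ms = transfer_ms(hidden_bytes, bandwidth_gbps=100, rtt_ms=0.05)  # RDMA-class link
wan_ms = transfer_ms(hidden_bytes, bandwidth_gbps=1, rtt_ms=50)      # cross-cloud link

step_ms = 20  # typical vLLM step, per the comment above
print(f"LAN hop: {lan_ms:.2f} ms, WAN hop: {wan_ms:.2f} ms, step: {step_ms} ms")
```

Under these assumptions a single cross-cloud hop costs more than an entire compute step, so with synchronous scheduling the GPUs would spend most of their time idle waiting on the network.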

You can keep an eye on the progress of disaggregated prefilling instead.

It seems to be asynchronous, which is awesome.

@lihuahua123
Contributor

Do you really want to do this?

A typical vLLM step takes about 20 ms, while copying an intermediate result (a large tensor) over the network is very slow.

vLLM is also currently scheduled synchronously, so the delay from transmitting intermediate results over the network will greatly reduce GPU utilization, increase latency, and reduce throughput.

You can keep an eye on the progress of disaggregated prefilling instead.

It seems to be asynchronous, which is awesome.

Is disaggregated prefilling a better fit when the network between the machines is slow? And I guess it also needs roughly double the memory?


This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

@github-actions github-actions bot added the stale Over 90 days of inactivity label Mar 22, 2025

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Apr 21, 2025
Projects
None yet
Development

No branches or pull requests

3 participants