
[V1] DP scale-out (2/N): Decouple engine process management and comms #15977


Merged
merged 24 commits into vllm-project:main from the decouple-engines branch on May 13, 2025

Conversation

njhill
Member

@njhill njhill commented Apr 3, 2025

This decouples the management of engine processes from the IPC, and adds support for a mix of local and/or remote engines (where remote engines run on a different node).

When there are any remote engines, TCP transport is used for the ZMQ sockets; otherwise IPC (Unix domain socket based) transport is used.
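
A minimal sketch of that transport choice, just to illustrate the idea; the function name and socket path below are made up for the example and are not vLLM's actual code:

# Illustrative only: pick a ZMQ endpoint based on whether any engines are remote.
def choose_input_queue_endpoint(dp_size: int, dp_size_local: int,
                                head_ip: str, rpc_port: int) -> str:
    any_remote = dp_size_local < dp_size
    if any_remote:
        # Remote engines exist -> TCP so other nodes can connect.
        return f"tcp://{head_ip}:{rpc_port}"
    # All engines are on this node -> Unix domain socket (hypothetical path).
    return "ipc:///tmp/vllm_engine_input.sock"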

Engines are bootstrapped with the input queue address and use this to perform a handshake with the front-end running on the head node, which provides the other necessary configuration details:

  Front-End                    Engine Core (N)
      |                              |
 (1)  | <-------- HELLO ------------ |
 (2)  | ---- config / conn info ---> |  (address of output queue and data parallel torch process group)
      |                              | 
      |                  [ engine init - load model ]
      |                              |
 (3)  | <-------- READY ------------ |
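
A minimal sketch of the engine side of this exchange, assuming a plain ZMQ DEALER socket and JSON payloads; the message shapes and names are illustrative, not vLLM's actual wire format:

import zmq

def engine_handshake(input_queue_addr: str, dp_rank: int) -> dict:
    ctx = zmq.Context()
    sock = ctx.socket(zmq.DEALER)
    sock.connect(input_queue_addr)  # tcp://... or ipc://... from bootstrap

    # (1) HELLO: announce this engine rank to the front-end.
    sock.send_json({"type": "HELLO", "rank": dp_rank})

    # (2) Front-end replies with config / connection info, e.g. the output
    #     queue address and the DP torch process group address.
    config = sock.recv_json()

    # ... engine init / model load would happen here ...

    # (3) READY: tell the front-end this rank can accept requests.
    sock.send_json({"type": "READY", "rank": dp_rank})
    return config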

There is a new --headless option for vllm serve to run on secondary nodes, which launches one or more engines (data parallel ranks) without the front-end / API server.

Examples

This will run DP=4 with DP ranks 0 and 1 on the head node and ranks 2 and 3 on the second node:

# Node 1  (with ip address 10.99.48.128)
vllm serve $MODEL --data-parallel-size 4 --data-parallel-size-local 2 \
                  --data-parallel-address 10.99.48.128 --data-parallel-rpc-port 13345
# Node 2
vllm serve $MODEL --headless --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 \
                  --data-parallel-address 10.99.48.128 --data-parallel-rpc-port 13345

This will run DP=4 with only the API server on the first node and all engines on the second node:

# Node 1  (with ip address 10.99.48.128)
vllm serve $MODEL --data-parallel-size 4 --data-parallel-size-local 0 \
                  --data-parallel-address 10.99.48.128 --data-parallel-rpc-port 13345
# Node 2
vllm serve $MODEL --headless --data-parallel-size 4 --data-parallel-size-local 4 \
                  --data-parallel-address 10.99.48.128 --data-parallel-rpc-port 13345

It's assumed that local engine ranks (if any) will always be lower than remote engine ranks. Note that it's not actually necessary to specify the (global) DP size on the secondary nodes since this is obtained during the handshake. It would be straightforward to extend this to any other config which must be consistent across the ranks, so that you only need to specify it in the head-node CLI args.
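
To make the rank layout concrete, a small sketch of the assumption above (nothing here is vLLM API, just arithmetic):

# Local ranks on a node are [start_rank, start_rank + size_local).
def node_ranks(start_rank: int, size_local: int) -> list[int]:
    return list(range(start_rank, start_rank + size_local))

# First example above: head node runs ranks [0, 1], second node ranks [2, 3].
assert node_ranks(0, 2) == [0, 1]
assert node_ranks(2, 2) == [2, 3]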

TODO (this PR):

  • Feedback
  • Some code simplification
  • Fix offline DP compatibility
  • CI tests

Next PR:

  • API server scale-out


github-actions bot commented Apr 3, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label Apr 3, 2025
@zxfan-cpu
Contributor

My PR #15863 is based on Ray's support for multi-node DP. Could you provide some feedback on modifications so we can merge it with your PR?

@youkaichao
Member

Is this "frontend" one API server process?

default=EngineArgs.data_parallel_start_rank,
help='Starting data parallel rank for secondary '
'nodes.')
parser.add_argument('--data-parallel-address',
Member

We have data_parallel_master_ip and data_parallel_master_port. Can we use those two fields and rename the CLI args? I feel data_parallel_master_ip and data_parallel_master_port are easier to understand than --data-parallel-address and --data-parallel-rpc-port.

In addition, I reserved 10 ports starting from data_parallel_master_port, so you can use, say, data_parallel_master_port + 2 as --data-parallel-rpc-port, which will make the user interface easier to use.

see

if dp_port <= port < dp_port + 10:
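
A sketch of that suggestion (not the merged behavior): derive the RPC port from data_parallel_master_port rather than adding a separate flag; the offset of 2 is just an illustrative choice within the 10 reserved ports.

def derive_dp_rpc_port(data_parallel_master_port: int, offset: int = 2) -> int:
    assert 0 <= offset < 10, "stay within the 10 reserved ports"
    return data_parallel_master_port + offset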

Member Author

I could rename data-parallel-address to data-parallel-ip-address? or data-parallel-head-ip? I was also trying to avoid using the "master" term :)

Re the port: the specified port in this case is actually the ZMQ socket port; the torch process group is assigned an arbitrary available port (which is propagated to the other nodes).

Comment on lines 441 to 452
parser.add_argument('--data-parallel-size-local',
'-dpl',
type=int,
default=EngineArgs.data_parallel_size_local,
help='Number of data parallel replicas to run on '
'this node.')
parser.add_argument('--data-parallel-start-rank',
'-dpr',
type=int,
default=EngineArgs.data_parallel_start_rank,
help='Starting data parallel rank for secondary '
'nodes.')
Member

these two args should go to vllm/entrypoints/cli/serve.py since they are for online serving only.

Member Author

@youkaichao I moved the start-rank one, but the other one is used within the AsyncLLM impl, which in theory could be used in other contexts (e.g. a different front-end), so I think it makes sense for it to be in the config.

I could move it from EngineArgs into AsyncEngineArgs, but then it would be on its own, so I thought it better to keep it with the other data parallel args.

@njhill
Member Author

njhill commented Apr 4, 2025

Is this "frontend" one API server process?

So far, yes. I've been trying to open incremental PRs, and I'm working on multi-API-server support right now as the next one (on top of this PR). It won't change much in terms of the deployment semantics though - just maybe one additional arg to specify how many API server procs (this applies only to the head node and so is mutually exclusive with --headless).


mergify bot commented Apr 4, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @njhill.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 4, 2025
@njhill njhill force-pushed the decouple-engines branch from 8e1425f to a662169 Compare April 5, 2025 00:05
@mergify mergify bot removed the needs-rebase label Apr 5, 2025
@njhill njhill added the needs-tests Tests needed for this PR label Apr 5, 2025
@afeldman-nm
Contributor

afeldman-nm commented Apr 7, 2025

TODO (this PR):

  • Feedback
  • Some code simplification
  • Fix offline DP compatibility
  • CI tests

Next PR:

  • API server scale-out

Hey @njhill , just want to clarify - when you include "CI tests" in the TODO, does this mean you will be adding new unit tests? Maybe it would be valuable to unit-test the communication process in scaled-out scenarios for example.

Contributor

@afeldman-nm afeldman-nm left a comment

Hi @njhill , I'm probably going to spend a little more time reviewing this but I left some initial nits.

super().__init__(vllm_config, executor_class, log_stats)

self.step_fn = (self.step if self.batch_queue is None else
self.step_with_batch_queue)

self.global_unfinished_reqs = False

# Send ready message.
Contributor

assigning on_head_node to the local field suggests that local has an absolute meaning, i.e. "on the head node" as opposed to on any other node. However, I thought that "local" had a relative meaning, i.e. "on the same node as a particular engine core instance"?

Member Author

@afeldman-nm in this context it will be running on the front-end server which is always on the head node.

Contributor

I see. Maybe it's out of scope for this PR, but why not change the "local" field of the dict to "on_head_node" (and make equivalent terminological changes in other parts of the code)?

It may just be me but I find I am getting confused when local is used in an absolute sense to refer to being on the head node.

local_dp_rank: int = 0,
):
self.index = index
def __init__(self, index: int = 0, local: bool = True):
Contributor

Nit: Unclear what local means here; see my previous comment on this topic. Is it possible that we have overloaded the word "local"?

Member Author

Local means that it's local to the front-end (same node), otherwise it's remote.

Comment on lines 370 to 371
# SPMD mode is where there is an LLM instance per DP rank and one
# core engine per LLM, see examples/offline_inference/data_parallel.py.
Contributor

Nit: I checked examples/offline_inference/data_parallel.py on main and there appears to be no reference to SPMD, nor do I see a tweak of the example code in this PR. Would it make sense to add a comment mentioning SPMD in the example?
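
For context, a rough sketch of the SPMD-style pattern that comment refers to, loosely following examples/offline_inference/data_parallel.py: one process (one LLM instance, hence one core engine) per DP rank. The VLLM_DP_* environment variable names are taken from that example and should be treated as assumptions here:

import os
from multiprocessing import Process
from vllm import LLM, SamplingParams

def dp_worker(rank: int, dp_size: int, master_ip: str, master_port: int):
    # Each process configures its own DP rank before constructing the LLM.
    os.environ["VLLM_DP_RANK"] = str(rank)
    os.environ["VLLM_DP_SIZE"] = str(dp_size)
    os.environ["VLLM_DP_MASTER_IP"] = master_ip
    os.environ["VLLM_DP_MASTER_PORT"] = str(master_port)

    llm = LLM(model="facebook/opt-125m")  # one engine per rank
    out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
    print(rank, out[0].outputs[0].text)

if __name__ == "__main__":
    procs = [Process(target=dp_worker, args=(r, 2, "127.0.0.1", 13345))
             for r in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()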

@njhill
Member Author

njhill commented Apr 7, 2025

TODO (this PR):

  • Feedback
  • Some code simplification
  • Fix offline DP compatibility
  • CI tests

Next PR:

  • API server scale-out

Hey @njhill , just want to clarify - when you include "CI tests" in the TODO, does this mean you will be adding new unit tests? Maybe it would be valuable to unit-test the communication process in scaled-out scenarios for example.

Yes, I plan to add tests to cover launching some engine(s) remotely, but I wanted to get some feedback on the design first.

Signed-off-by: Nick Hill <[email protected]>

# Conflicts:
#	vllm/v1/engine/core_client.py
@mergify mergify bot removed the needs-rebase label May 11, 2025
@njhill
Member Author

njhill commented May 11, 2025

We've talked about this some offline, but for the record, I would prefer that we don't introduce any additional multi-node interfaces that are not secured in any way, and worse, allow arbitrary code execution via pickle. I understand the argument that PyTorch says its distributed communications are insecure, but that's not a great reason to make our own code worse and harder to fix in the future.

The security concerns are hopefully addressed now that we have #17490 merged.

@simon-mo
Collaborator

simon-mo commented May 11, 2025

@russellb @youkaichao can you please help with a final round of review?

@simon-mo simon-mo added the ready ONLY add when PR is ready to merge/full CI is needed label May 11, 2025
@russellb
Member

We've talked about this some offline, but for the record, I would prefer that we don't introduce any additional multi-node interfaces that are not secured in any way, and worse, allow arbitrary code execution via pickle. I understand the argument that PyTorch says its distributed communications are insecure, but that's not a great reason to make our own code worse and harder to fix in the future.

The security concerns are hopefully addressed now that we have #17490 merged.

Confirmed, yes -- thank you for working with me on this!

@njhill
Member Author

njhill commented May 12, 2025

Fixing rebase issues

@russellb russellb dismissed their stale review May 12, 2025 16:29

my security concern has been resolved

@njhill
Member Author

njhill commented May 12, 2025

I've opened a separate PR for one of the new CI test failures which is unrelated: #18007


mergify bot commented May 12, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @njhill.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 12, 2025
@mergify mergify bot removed the needs-rebase label May 12, 2025
@simon-mo simon-mo merged commit 55aa7af into vllm-project:main May 13, 2025
53 of 60 checks passed
@njhill njhill deleted the decouple-engines branch May 13, 2025 18:20
huachenheli pushed a commit to huachenheli/vllm that referenced this pull request May 22, 2025
@GGKOP

GGKOP commented May 23, 2025

When will vllm.entrypoints.openai.api_server support --headless and --data-parallel-start-rank? @njhill

zzzyq pushed a commit to zzzyq/vllm that referenced this pull request May 24, 2025
@SparkJiao

Hi, does this approach currently support IPv6 addresses?

@russellb
Member

Hi, does this approach currently support IPv6 addresses?

I've been trying to fix IPv6 support anywhere I see it not working correctly. If you see it not working anywhere, let me know. I looked at it in this PR and think it should be OK now, though I didn't try it myself.

@SparkJiao

@russellb Hi, I think this PR can fix the problem: #18991. I have tried it.

Labels
frontend, needs-tests (Tests needed for this PR), ready (ONLY add when PR is ready to merge/full CI is needed), v1
9 participants