[V1] DP scale-out (2/N): Decouple engine process management and comms #15977
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small, essential subset of tests runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
My PR #15863 is based on Ray's support for multi-node DP. Could you provide some feedback on modifications so it can be merged with your PR?
is this "frontend" is one api server process? |
vllm/engine/arg_utils.py (outdated):

    default=EngineArgs.data_parallel_start_rank,
    help='Starting data parallel rank for secondary '
    'nodes.')
parser.add_argument('--data-parallel-address',
we have `data_parallel_master_ip` and `data_parallel_master_port`. can we use those two fields and rename the cli args? i feel `data_parallel_master_ip` and `data_parallel_master_port` are easier to understand compared to `--data-parallel-address` and `--data-parallel-rpc-port`. in addition, I reserved 10 ports starting from `data_parallel_master_port`, so you can use, say, `data_parallel_master_port + 2` as `--data-parallel-rpc-port`, which will make the user interface easier to use.
see line 581 in 2386803: `if dp_port <= port < dp_port + 10:`
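(For illustration: a hypothetical helper, not part of vLLM, showing the reserved-range convention described above, where ports [data_parallel_master_port, data_parallel_master_port + 10) are set aside so a fixed offset can serve as the RPC port.)

```python
# Hypothetical helper (not vLLM code) for the reserved-port convention above.
def derive_dp_rpc_port(data_parallel_master_port: int, offset: int = 2) -> int:
    # Ports [master_port, master_port + 10) are reserved for DP coordination,
    # so any offset in [0, 10) will not collide with other vLLM ports.
    assert 0 <= offset < 10, "offset must stay inside the 10-port reserved range"
    return data_parallel_master_port + offset
```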
I could rename `data-parallel-address` to `data-parallel-ip-address`? or `data-parallel-head-ip`? I was also trying to avoid using the "master" term :)

Re the port: the specified port in this case is actually the zmq socket port; the torch process group is assigned an available port arbitrarily (which is propagated to the other nodes).
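(For illustration: a minimal sketch of the pattern just described; `find_free_port` is a stand-in, not vLLM's actual code.)

```python
import socket

def find_free_port() -> int:
    # Binding to port 0 asks the OS for any available port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

rpc_port = 13345            # fixed up front, e.g. via --data-parallel-rpc-port
pg_port = find_free_port()  # arbitrary; propagated to other nodes at handshake
```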
vllm/engine/arg_utils.py
Outdated
parser.add_argument('--data-parallel-size-local', | ||
'-dpl', | ||
type=int, | ||
default=EngineArgs.data_parallel_size_local, | ||
help='Number of data parallel replicas to run on ' | ||
'this node.') | ||
parser.add_argument('--data-parallel-start-rank', | ||
'-dpr', | ||
type=int, | ||
default=EngineArgs.data_parallel_start_rank, | ||
help='Starting data parallel rank for secondary ' | ||
'nodes.') |
these two args should go to `vllm/entrypoints/cli/serve.py` since they are for online serving only.
@youkaichao I moved the start-rank one, but the other one is used within the AsyncLLM impl, which in theory could be used in other contexts (e.g. a different front-end), so I think it makes sense for it to be in the config. I could move it from `EngineArgs` into `AsyncEngineArgs`, but then it would be on its own, so I thought it better to keep it with the other data parallel args.
So far, yes. I've been trying to open incremental PRs, and am working on multi-API-server support right now as the next one (on top of this PR). It won't change much in terms of the deployment semantics though - just maybe one additional arg to specify how many API server procs (applies only to the head node and so is mutually exclusive with …).
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Nick Hill <[email protected]>
Hey @njhill, just want to clarify: when you include "CI tests" in the TODO, does this mean you will be adding new unit tests? Maybe it would be valuable to unit-test the communication process in scaled-out scenarios, for example.
Hi @njhill, I'm probably going to spend a little more time reviewing this, but I left some initial nits.
vllm/v1/engine/core.py (outdated):

super().__init__(vllm_config, executor_class, log_stats)

self.step_fn = (self.step if self.batch_queue is None else
                self.step_with_batch_queue)

self.global_unfinished_reqs = False

# Send ready message.
assigning `on_head_node` to the `local` field suggests that `local` has an absolute meaning, i.e. "on the head node" as opposed to on any other node. However, I thought that "local" had a relative meaning, i.e. "on the same node as a particular engine core instance"?
@afeldman-nm in this context it will be running on the front-end server which is always on the head node.
I see. Maybe it's out of scope for this PR, but why not change the "local" field of the dict to "on_head_node" (and make equivalent terminological changes in other parts of the code)?
It may just be me, but I find I get confused when "local" is used in an absolute sense to refer to being on the head node.
    local_dp_rank: int = 0,
):
    self.index = index
def __init__(self, index: int = 0, local: bool = True):
Nit: Unclear what local means here; see my previous comment on this topic. Is it possible that we have overloaded the word "local"?
Local means that it's local to the front-end (same node), otherwise it's remote.
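(For illustration: a stand-in dataclass, not the actual vLLM class, showing this relative meaning of `local`.)

```python
from dataclasses import dataclass

@dataclass
class CoreEngineHandle:  # hypothetical stand-in for the real class
    index: int = 0
    local: bool = True   # True: same node as the front-end; False: remote

# DP=4 with two engines beside the front-end and two on a headless node:
engines = [CoreEngineHandle(i, local=(i < 2)) for i in range(4)]
```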
vllm/v1/engine/core_client.py (outdated):

# SPMD mode is where there is an LLM instance per DP rank and one
# core engine per LLM, see examples/offline_inference/data_parallel.py.
Nit: I checked `examples/offline_inference/data_parallel.py` on main and there appears to be no reference to SPMD, nor do I see a tweak of the example code in this PR. Would it make sense to add a comment mentioning SPMD in the example?
Yes, I plan to add tests to cover launching some engine(s) remotely, but wanted to get some feedback on the design first.
Signed-off-by: Nick Hill <[email protected]>
# Conflicts:
#   vllm/v1/engine/core_client.py
The security concerns are hopefully addressed now that we have #17490 merged.
@russellb @youkaichao can you please help with a final round of review?
Confirmed, yes -- thank you for working with me on this!
Fixing rebase issues
Signed-off-by: Nick Hill <[email protected]>
I've opened a separate PR for one of the new CI test failures, which is unrelated: #18007
This pull request has merge conflicts that must be resolved before it can be merged.
# Conflicts:
#   vllm/v1/engine/core.py
When will `vllm.entrypoints.openai.api_server` support `--headless` and `--data-parallel-start-rank`? @njhill
Hi, does this approach currently support IPv6 addresses?
I've been trying to fix IPv6 support anywhere I see it doesn't work correctly. If you see it not working anywhere, let me know. I looked at it in this PR and think it should be OK now, though I didn't try it myself.
This decouples the management of engine processes from the IPC, and adds support for a mix of local and/or remote engines (where remote engines run on a different node).
When there are any remote engines, TCP transport is used for the ZMQ sockets; otherwise IPC (Unix domain sockets) is used.
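A minimal sketch of that transport choice (naming assumed, not vLLM's actual helper):

```python
def make_zmq_address(any_remote_engines: bool, host: str, port: int,
                     identity: str) -> str:
    if any_remote_engines:
        return f"tcp://{host}:{port}"     # cross-node: TCP sockets
    return f"ipc:///tmp/{identity}.ipc"   # single node: Unix domain sockets
```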
Engines are bootstrapped with the input queue address and use this to perform a handshake with the front-end running on the head node, which provides the other necessary configuration details.
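A rough sketch of that handshake from the engine side; the socket type and message shapes here are assumptions, not the PR's actual protocol:

```python
import zmq  # pyzmq

def engine_handshake(input_address: str, dp_rank: int) -> dict:
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.DEALER)
    try:
        sock.setsockopt(zmq.IDENTITY, f"engine-{dp_rank}".encode())
        sock.connect(input_address)  # bootstrap address the engine was given
        sock.send_json({"status": "READY", "dp_rank": dp_rank})
        return sock.recv_json()  # e.g. output address, global DP size, etc.
    finally:
        sock.close()
```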
There is a new `--headless` option for `vllm serve` to run on secondary nodes, which launches one or more engines (data parallel ranks) without the front-end / API server.

Examples
This will run DP=4 with DP ranks 0 and 1 on the head node and ranks 2 and 3 on the second node:
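(The original example command wasn't captured in this snapshot; a plausible sketch using the flags discussed above, with a hypothetical model name and head-node IP:)

```bash
# Head node: API server plus DP ranks 0 and 1 (address/port are examples)
vllm serve $MODEL --data-parallel-size 4 --data-parallel-size-local 2 \
    --data-parallel-address 10.0.1.1 --data-parallel-rpc-port 13345

# Second node: headless, DP ranks 2 and 3
vllm serve $MODEL --headless --data-parallel-size 4 --data-parallel-size-local 2 \
    --data-parallel-start-rank 2 \
    --data-parallel-address 10.0.1.1 --data-parallel-rpc-port 13345
```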
This will run DP=4 with only the API server on the first node and all engines on the second node:
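(Again a plausible sketch under the same assumptions, not the original command:)

```bash
# First node: API server only, no local engines
vllm serve $MODEL --data-parallel-size 4 --data-parallel-size-local 0 \
    --data-parallel-address 10.0.1.1 --data-parallel-rpc-port 13345

# Second node: headless, all four engines
vllm serve $MODEL --headless --data-parallel-size 4 --data-parallel-size-local 4 \
    --data-parallel-address 10.0.1.1 --data-parallel-rpc-port 13345
```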
It's assumed that local engine ranks (if any) will always be lower than remote engine ranks. Note that it's not actually necessary to specify the (global) DP size on the secondary nodes, since this is obtained during the handshake. It would be straightforward to extend this to any other config that must be consistent across the ranks, so that you only need to specify it in the head node CLI args.
TODO (this PR):
- CI tests (see discussion above)
Next PR:
- Multiple API server processes (see discussion above)