
[core] set up data parallel communication #13591


Merged
youkaichao merged 70 commits into vllm-project:main from manual_dp on Feb 22, 2025

Conversation

@youkaichao (Member) commented Feb 20, 2025

We need to explore data parallelism in many cases, e.g. for DeepSeek models and MoE models.

While the end-user interface is still to be designed, this PR first creates the necessary communication channel for data parallel and leaves the interface for future design.

  • In the future, as long as an external launcher sets up VLLM_DP_RANK, VLLM_DP_SIZE, VLLM_DP_MASTER_IP, VLLM_DP_MASTER_PORT, and CUDA_VISIBLE_DEVICES correctly, it will be compatible with this PR (see the launcher sketch after this list).
  • The main communication setup inside the worker now includes a DP group.
  • The engine process also has a separate DP group to communicate across DP instances.
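
As an illustration of the launcher contract above, here is a minimal single-node sketch of an external launcher that sets these variables per DP rank. The script name, master IP/port values, and GPUs-per-rank mapping are illustrative assumptions, not part of this PR:

    import os
    import subprocess

    DP_SIZE = 2
    GPUS_PER_DP_RANK = 2  # e.g. TP=2 inside each DP rank (assumption)

    procs = []
    for dp_rank in range(DP_SIZE):
        env = os.environ.copy()
        env["VLLM_DP_RANK"] = str(dp_rank)
        env["VLLM_DP_SIZE"] = str(DP_SIZE)
        env["VLLM_DP_MASTER_IP"] = "127.0.0.1"  # single-node example
        env["VLLM_DP_MASTER_PORT"] = "29501"    # arbitrary free port
        # give each DP rank its own slice of GPUs
        first = dp_rank * GPUS_PER_DP_RANK
        env["CUDA_VISIBLE_DEVICES"] = ",".join(
            str(first + i) for i in range(GPUS_PER_DP_RANK))
        # my_vllm_script.py is a hypothetical per-rank entry point
        procs.append(subprocess.Popen(["python", "my_vllm_script.py"], env=env))

    for p in procs:
        p.wait()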

Example command to run with data parallel: torchrun --nproc-per-node=2 examples/offline_inference/data_parallel.py

Note: this PR only sets up the communication channel; it is not used in the model forward pass yet. To get the benefit of data parallel, especially in combination with expert parallel, we need to:

  • Implement execute_dummy_batch in the engines when should_execute_dummy_batch == True (see the sketch after this list for why idle ranks must stay in lockstep).
  • Synchronize use_cuda_graph in the model runner across DP groups. This is not strictly necessary today, but if any collective operation behaves differently with and without CUDA graphs, the sync becomes necessary.
  • Change the MoE loading logic to shard experts across the world size instead of the TP size.
  • Add all-to-all communication before and after the MoE computation to gather selection logits across DP ranks.
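
To make the dummy-batch item above concrete, here is a hedged sketch of why idle DP ranks still have to run a forward pass. The helper and its arguments are illustrative, not the code added in this PR, and it assumes an already-initialized DP process group:

    import torch
    import torch.distributed as dist

    def maybe_step(dp_group, has_local_work: bool, run_batch, run_dummy_batch):
        # Does any DP rank still have real work this step?
        flag = torch.tensor([int(has_local_work)], device="cuda")
        dist.all_reduce(flag, op=dist.ReduceOp.MAX, group=dp_group)
        if flag.item() == 0:
            return  # every rank is idle; nothing to do
        if has_local_work:
            run_batch()
        else:
            # MoE all-to-all (and other collectives) need every DP rank to
            # participate, so an idle rank runs a dummy batch instead of
            # skipping the step and leaving the busy ranks hanging.
            run_dummy_batch()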

NOTE: I think PP is currently not really compatible with DP; the combination is quite complicated to reason about right now.


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to be added to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

backend,
group_name="dp")

logger.info(
Member Author


Example of the rank assignment for DP=2 x TP=2:

rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0
rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 1
rank 2 in world size 4 is assigned as DP rank 1, PP rank 0, TP rank 0
rank 3 in world size 4 is assigned as DP rank 1, PP rank 0, TP rank 1
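
A small sketch of the decomposition implied by this mapping (DP outermost, then PP, then TP innermost); the helper is illustrative, not the code in this PR:

    def decompose_rank(rank: int, dp_size: int, pp_size: int, tp_size: int):
        assert rank < dp_size * pp_size * tp_size
        dp_rank, rem = divmod(rank, pp_size * tp_size)
        pp_rank, tp_rank = divmod(rem, tp_size)
        return dp_rank, pp_rank, tp_rank

    for r in range(4):
        print(r, decompose_rank(r, dp_size=2, pp_size=1, tp_size=2))
    # 0 (0, 0, 0)
    # 1 (0, 0, 1)
    # 2 (1, 0, 0)
    # 3 (1, 0, 1)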

@tlrmchlsmth (Collaborator) left a comment


JFYI: I ran into an issue with the master port already being in use (see comment in config.py)

self.data_parallel_size = envs.VLLM_DP_SIZE
self.data_parallel_rank = envs.VLLM_DP_RANK
self.data_parallel_master_ip = envs.VLLM_DP_MASTER_IP
self.data_parallel_master_port = envs.VLLM_DP_MASTER_PORT
Collaborator


Note that I'm hitting issues like:

RuntimeError: The server socket has failed to listen on any local network address. port: 29500, useIpv6: 0, code: -98, name: EADDRINUSE, message: address already in use

This is true even if I change the master port with torchrun --master-port .... Currently hacking around it by changing this to self.data_parallel_master_port = envs.VLLM_DP_MASTER_PORT + 1

Member Author


That's strange. I also hit it once, but then it disappeared.

Member Author


It seems this disappeared when I removed torchrun in af53b4b.

Comment on lines +1344 to +1345
answer = self.data_parallel_master_port
self.data_parallel_master_port += 1
Collaborator


What if the port is already being used by other services?

Member Author


Then it will error.

We can document that we will use more than one port, starting from the specified one; that assumption should usually be fine.

NOTE: even if we only use the specified port, there is still a chance that some other service grabs it before we do. That is unavoidable when running multiple services on the same host, but for cloud deployments, where each service runs in a separate container, it should be fine.
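
A minimal sketch of the scheme described here, i.e. handing out consecutive ports starting from the configured base. The class and method names are illustrative; the PR itself simply increments self.data_parallel_master_port as shown in the diff above:

    class DPPortAllocator:
        # Hands out consecutive ports starting from a configured base port.
        # If one of those ports is already taken by an unrelated service,
        # binding will simply fail; that is the trade-off discussed here.
        def __init__(self, base_port: int):
            self.next_port = base_port

        def get_port(self) -> int:
            port = self.next_port
            self.next_port += 1
            return port

    allocator = DPPortAllocator(base_port=29500)
    dp_master_port = allocator.get_port()  # 29500
    another_port = allocator.get_port()    # 29501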

Collaborator


Alternatively, we could check whether the port is in use via a socket, and keep searching for the next available port.

Member Author


This is not feasible, because non-zero ranks connect directly to the specified port and cannot tell whether they are talking to the master rank or to some other service; they also need to wait for a while in case the master rank has not started yet.

Member Author


I added the code in 267cd82; at least vLLM's internal port usage will not conflict with the DP master ports.

Comment on lines 186 to 188
if self.should_execute_dummy_batch:
self.should_execute_dummy_batch = False
# TODO: execute a dummy batch to sync across ranks
Collaborator


Looks like this is not the right place for this logic? I feel it should be in the EngineCore's busy loop.

Member Author


I'm not familiar with the engine part; can you show me where I should put it?


Member Author


A bitter lesson: we need to place this logic at the top level, which is the LLMEngine level for offline inference.

We cannot put it in the EngineCore's busy loop; otherwise, the LLMEngine will exit directly without checking the status of the other DP ranks.
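
A conceptual sketch of that top-level placement (the names here are illustrative, not vLLM's actual API): the offline engine keeps stepping until every DP rank is finished, and a rank that runs out of requests early executes dummy batches instead of exiting:

    def offline_dp_loop(engine, dp_any_unfinished):
        # `engine` stands in for the offline LLMEngine; `dp_any_unfinished`
        # is assumed to reach consensus across DP ranks, e.g. via an
        # all-reduce over the DP group.
        while True:
            local_unfinished = engine.has_unfinished_requests()
            if not dp_any_unfinished(local_unfinished):
                break  # every DP rank is done
            if local_unfinished:
                engine.step()
            else:
                engine.execute_dummy_batch()  # keep collectives aligned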

@youkaichao (Member Author)

> @youkaichao instead of calling dummy forward as a utility method, could we instead modify the step() method in core.py like this, and have the model runner's execute_model call _dummy_run if it gets None as the scheduler output?

    def step(self) -> EngineCoreOutputs:
        """Schedule, execute, and make output."""

        if not self.scheduler.has_unfinished_requests():
            self.model_executor.execute_model(None)
            return EngineCoreOutputs(
                outputs=[], scheduler_stats=self.scheduler.make_stats())

@njhill I tried that approach as well, but didn't succeed. It needs more changes, e.g. we would need to change the semantics of execute_model to define what None as input means, and it breaks several other pieces of code. I gave up because I'm not familiar with that part of the code, but feel free to give it a try after this PR.

@tlrmchlsmth (Collaborator) left a comment


LGTM

@youkaichao youkaichao enabled auto-merge (squash) February 22, 2025 06:44
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 22, 2025
@youkaichao (Member Author)

Failed tests are due to an HF timeout; merging.

@youkaichao youkaichao disabled auto-merge February 22, 2025 11:28
@youkaichao youkaichao merged commit 3e472d8 into vllm-project:main Feb 22, 2025
67 of 72 checks passed
@youkaichao youkaichao deleted the manual_dp branch February 22, 2025 11:29
@youkaichao youkaichao mentioned this pull request Feb 22, 2025
if len(prompts) == 0:
    # if any rank has no prompts to process,
    # we need to set a placeholder prompt
    prompts = ["Placeholder"]
Member


I know this is just an example but in practice I guess you'd want to set max_tokens to 1 for any placeholder prompts.
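
A hedged illustration of that suggestion, extending the example snippet above (SamplingParams and max_tokens are standard vLLM; the particular parameter values and wiring are assumptions):

    from vllm import SamplingParams

    prompts = []  # e.g. this DP rank received no prompts
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
    if len(prompts) == 0:
        # This rank still has to join the collective communication, so give
        # it a throwaway prompt and keep the wasted work minimal by
        # generating a single token.
        prompts = ["Placeholder"]
        sampling_params = SamplingParams(max_tokens=1)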

@lewisword

lewisword commented Mar 14, 2025

May I ask whether this feature can be used for online serving? From the example in examples/offline_inference/data_parallel.py, I see it uses an offline multi-process invocation approach. @youkaichao

@njhill
Member

njhill commented Mar 17, 2025

@lewisword not yet, but it will be coming via #13923.

@QiuMike

QiuMike commented May 9, 2025

@youkaichao I ran the offline example on H20 with two nodes; each node has 8 cards.

export VLLM_DP_MASTER_IP=10.13.3.163
export GLOO_SOCKET_IFNAME=eth0
export TP_SOCKET_IFNAME=eth0

python3 examples/offline_inference/data_parallel.py --node-size 2 --node-rank 0 --master-addr 10.13.3.163 --model /home/xxxxx/DeepSeek-R1 --master-port 13345 --dp-size 2 --tp-size 8

python3 examples/offline_inference/data_parallel.py --node-size 2 --model /home/xxxxx/DeepSeek-R1/ --node-rank 1 --master-addr 10.13.3.163 --master-port 13345 --dp-size 2 --tp-size 8

DP rank 0, Prompt: 'Hello, my name is', Generated text: ' Danielle. I’m a new Master’s student in the Sustainability and Energy'
DP rank 0, Prompt: 'The president of the United States is', Generated text: ' the head of state and head of government of the United States, indirectly elected to'
DP rank 0, Prompt: 'The capital of France is', Generated text: ' Paris, and the three major cities are Paris, Lyon, and Marseille. France'
DP rank 0, Prompt: 'The future of AI is', Generated text: ' a topic that has been discussed and debated by experts, researchers, and enthusiasts alike'
DP rank 0, Prompt: 'Hello, my name is', Generated text: " Mr. Sato.\nLet's learn how to identify themes in literature.\nFirst of"
(EngineCore_0 pid=28846) INFO 05-08 13:09:23 [core.py:372] EngineCore exiting with signum 15
(EngineCore_0 pid=28846) Process EngineCore_0:
(EngineCore_0 pid=28846) Traceback (most recent call last):
(EngineCore_0 pid=28846) File "/home/admin/michael/vllm/vllm/v1/engine/core.py", line 394, in run_engine_core
(EngineCore_0 pid=28846) engine_core.run_busy_loop()
(EngineCore_0 pid=28846) File "/home/admin/michael/vllm/vllm/v1/engine/core.py", line 687, in run_busy_loop
(EngineCore_0 pid=28846) self.execute_dummy_batch()
(EngineCore_0 pid=28846) File "/home/admin/michael/vllm/vllm/v1/engine/core.py", line 281, in execute_dummy_batch
(EngineCore_0 pid=28846) self.model_executor.collective_rpc("execute_dummy_batch")
(EngineCore_0 pid=28846) File "/home/admin/michael/vllm/vllm/v1/executor/multiproc_executor.py", line 215, in collective_rpc
(EngineCore_0 pid=28846) result = get_response(w, dequeue_timeout)
(EngineCore_0 pid=28846) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=28846) File "/home/admin/michael/vllm/vllm/v1/executor/multiproc_executor.py", line 198, in get_response
(EngineCore_0 pid=28846) status, result = w.worker_response_mq.dequeue(
(EngineCore_0 pid=28846) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=28846) File "/home/admin/michael/vllm/vllm/distributed/device_communicators/shm_broadcast.py", line 479, in dequeue
(EngineCore_0 pid=28846) with self.acquire_read(timeout, cancel) as buf:
(EngineCore_0 pid=28846) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=28846) File "/usr/lib/python3.12/contextlib.py", line 137, in __enter__
(EngineCore_0 pid=28846) return next(self.gen)
(EngineCore_0 pid=28846) ^^^^^^^^^^^^^^
(EngineCore_0 pid=28846) File "/home/admin/michael/vllm/vllm/distributed/device_communicators/shm_broadcast.py", line 425, in acquire_read
(EngineCore_0 pid=28846) sched_yield()
(EngineCore_0 pid=28846) File "/home/admin/michael/vllm/vllm/distributed/device_communicators/shm_broadcast.py", line 41, in sched_yield
(EngineCore_0 pid=28846) os.sched_yield()
(EngineCore_0 pid=28846) File "/home/admin/michael/vllm/vllm/v1/engine/core.py", line 376, in signal_handler
(EngineCore_0 pid=28846) raise SystemExit()
(EngineCore_0 pid=28846) SystemExit
(EngineCore_0 pid=28846)
(EngineCore_0 pid=28846) During handling of the above exception, another exception occurred:
(EngineCore_0 pid=28846)
(EngineCore_0 pid=28846) Traceback (most recent call last):
(EngineCore_0 pid=28846) File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
