
[Perf] API-server scaleout with many-to-many server-engine comms #17546


Merged: 64 commits into vllm-project:main from the all-to-all branch, May 30, 2025

Conversation

@njhill (Member) commented May 1, 2025

This is a follow-on from #15977.

A new --api-server-count arg to vllm serve can be used to specify an arbitrary number of API servers to run. When used in conjunction with --data-parallel-size, there is all-to-all ZMQ-based communication between the API servers and the data parallel engines.
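For concreteness, a hypothetical invocation combining the two flags might look like the following (the model name and the counts are illustrative, not taken from this PR):

```bash
# Illustrative only: serve a model with 4 data parallel engines and 4 API
# server processes, giving all-to-all communication between them.
vllm serve meta-llama/Llama-3.2-1B-Instruct \
    --data-parallel-size 4 \
    --api-server-count 4
```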

It works with multi-node as described in #15977. All of the API servers run on the head node.

A separate "coordinator" process is now used for DP>1. It is responsible for ensuring that the engines run in tandem, and for publishing real-time request count information (and, later, likely other engine state info) back to the API server(s) for load-balancing purposes.

[Architecture diagram]
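To make the load-balancing idea concrete, here is a minimal sketch (not the actual vLLM coordinator or API server code) of how an API server could use the per-engine request counts published by the coordinator to pick the least-loaded engine:

```python
# Minimal illustration, not vLLM's implementation: choose the data parallel
# engine with the fewest in-flight requests, based on counts published by the
# coordinator to each API server.
from typing import Sequence


def pick_engine(request_counts: Sequence[int]) -> int:
    """Return the index of the engine with the fewest in-flight requests."""
    return min(range(len(request_counts)), key=lambda i: request_counts[i])


# Example: counts for 4 DP engines; ties are broken by the lowest index.
assert pick_engine([12, 7, 15, 7]) == 1
```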

More design discussion: https://docs.google.com/document/d/10jhCNxJYvsUhtMtiMAaW2MxU5LU8HVje2pGDnj49gH4/edit?tab=t.0

Performance now scales much better with DP size; note the improvement in TTFT in particular below.

Benchmark with 2xA100, Llama-3.2-1B, and the ShareGPT dataset at a request rate of 120 req/s:
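One plausible way to reproduce this setup with vLLM's serving benchmark is sketched below (the script path, dataset file, and exact flags are assumptions, not taken from this PR):

```bash
# Assumed benchmark invocation: 10000 ShareGPT prompts at 120 req/s against a
# running `vllm serve` instance; adjust paths and flags to your checkout.
python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model meta-llama/Llama-3.2-1B-Instruct \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 10000 \
    --request-rate 120
```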

DP=2 before

============ Serving Benchmark Result ============
Successful requests:                     10000     
Benchmark duration (s):                  130.74    
Total input tokens:                      2206428   
Total generated tokens:                  1994815   
Request throughput (req/s):              76.49     
Output token throughput (tok/s):         15258.46  
Total Token throughput (tok/s):          32135.56  
---------------Time to First Token----------------
Mean TTFT (ms):                          13176.40  
Median TTFT (ms):                        13953.03  
P99 TTFT (ms):                           26842.02  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          19.83     
Median TPOT (ms):                        22.28     
P99 TPOT (ms):                           36.10     
---------------Inter-token Latency----------------
Mean ITL (ms):                           24.19     
Median ITL (ms):                         21.98     
P99 ITL (ms):                            81.11     
==================================================

DP=2 with --api-server-count=2

============ Serving Benchmark Result ============
Successful requests:                     10000     
Benchmark duration (s):                  116.84    
Total input tokens:                      2206428   
Total generated tokens:                  1994815   
Request throughput (req/s):              85.59     
Output token throughput (tok/s):         17073.43  
Total Token throughput (tok/s):          35958.03  
---------------Time to First Token----------------
Mean TTFT (ms):                          67.54     
Median TTFT (ms):                        60.81     
P99 TTFT (ms):                           329.10    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          23.90     
Median TPOT (ms):                        24.13     
P99 TPOT (ms):                           36.89     
---------------Inter-token Latency----------------
Mean ITL (ms):                           23.62     
Median ITL (ms):                         22.14     
P99 ITL (ms):                            51.75     
==================================================

This is working functionally, but there are still a number of tasks remaining:

  • Initial benchmark results for DP=1 and multiple API servers are disappointing; I am looking into why this currently hurts ITL and throughput slightly (though TTFT improves slightly).
  • (medium) Multiple API servers don't currently work properly with metrics publishing/logging. I have discussed this with @markmc but it needs a bit more work. @kouroshHakha is helping to look at this; I will add some more notes below.
  • (small) The multi-modal embeddings cache currently won't work with DP and/or multi-API, so it will need to be auto-disabled when dp > 1 and/or api-server-count > 1. Hopefully the scale-out will hide the performance downside of that, however (discussed with @ywang96 and @DarkLight1337).
  • (small) When there are many API servers, a lot of the startup logs are duplicated. We probably want to suppress some of these.
  • (tbd) Need to look into implications for LoRA adapter loading.
  • (medium) Some more work on error handling and clean shutdown with the new process topologies.
  • (medium) Full test coverage of the various permutations.

Follow-on work (not for this PR):

  • Rework how the multi-modal feature cache is implemented to make it compatible with the any-to-any process architecture.

njhill added 19 commits April 4, 2025 17:04

github-actions bot commented May 1, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Avoid exception, but still needs more work to be functional with multiple API server procs.

njhill added 2 commits May 5, 2025 14:26
@yinghai (Contributor) commented May 6, 2025

How does run_rpc work if we want to broadcast this to each engine and run it exactly once? How do we guarantee that each engine core runs it in lockstep if we want that?

@yinghai (Contributor) commented May 6, 2025

There isn't a lot of work in the API server itself that needs multiprocessing, right? It's mostly async_llm, and specifically MM data handling, that needs to scale out?

njhill and others added 3 commits May 28, 2025 09:33
@mergify mergify bot added ci/build and removed needs-rebase labels May 28, 2025
@njhill (Member, Author) commented May 29, 2025

I think all the CI issues are fixed and the remaining failures should be unrelated; we should let it finish, though.


mergify bot commented May 29, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @njhill.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 29, 2025
@mergify mergify bot removed the needs-rebase label May 29, 2025
@simon-mo simon-mo enabled auto-merge (squash) May 29, 2025 22:58
@simon-mo simon-mo merged commit 2dbe8c0 into vllm-project:main May 30, 2025
92 of 94 checks passed
@njhill njhill deleted the all-to-all branch May 30, 2025 18:43
@lgeiger (Contributor) commented May 31, 2025

I am now seeing the following warning on main when running some tests, e.g.:

pytest tests/v1/engine/test_async_llm.py::test_load -s
ERROR 05-31 01:04:14 [prometheus.py:77] Error during metrics cleanup: expected str, bytes or os.PathLike object, not NoneType

This warning is thrown during the Prometheus cleanup, though I'm not sure exactly where it's coming from.

@lgeiger (Contributor) commented May 31, 2025

Looks like the error comes from this Prometheus function.

@njhill (Member, Author) commented May 31, 2025

Thanks @lgeiger. I think the message is harmless but I'll fix this.

njhill added a commit to njhill/vllm that referenced this pull request May 31, 2025
Introduced in vllm-project#17546. We should only call mark_process_dead when we're using Prometheus multiprocessing mode (i.e. with more than one API server).

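As a rough sketch of the guard described in that commit (the helper name and env-var handling here are illustrative, not the exact vLLM code):

```python
import os

from prometheus_client import multiprocess


def cleanup_metrics(pid: int) -> None:
    # mark_process_dead() resolves the multiprocess directory from the
    # PROMETHEUS_MULTIPROC_DIR env var; when it is unset the path is None and
    # the call fails with "expected str, bytes or os.PathLike object, not
    # NoneType". Only call it when multiprocess mode is actually configured
    # (i.e. when more than one API server process is running).
    if os.environ.get("PROMETHEUS_MULTIPROC_DIR"):
        multiprocess.mark_process_dead(pid)
```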
@njhill (Member, Author) commented May 31, 2025

@lgeiger fixed in #18992.

amitm02 pushed a commit to amitm02/vllm that referenced this pull request Jun 1, 2025
amitm02 pushed a commit to amitm02/vllm that referenced this pull request Jun 1, 2025
Labels
ci/build, frontend, ready, v1
Projects
None yet
10 participants