[Perf] API-server scaleout with many-to-many server-engine comms #17546
Avoid exception, but this still needs more work to be functional with multiple API server processes.
How does run_rpc work if we want to broadcast this to each engine and run it exactly once? How do we guarantee that each engine core runs it in lock step if we want that?
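As a minimal sketch of the pattern this question is getting at (illustrative only, not the actual run_rpc implementation; `engines` and `call_async` are hypothetical stand-ins for per-engine RPC handles): send the call to every engine exactly once and await all replies, which also acts as a barrier so the engines complete the call in lock step.

```python
# Minimal sketch, not vLLM's run_rpc: `call_async` is a hypothetical
# per-engine RPC handle used only for illustration.
import asyncio


async def broadcast_rpc(engines: list, method: str, *args):
    """Run `method` exactly once on every engine and wait for all replies."""
    # Each engine receives exactly one copy of the request...
    calls = [engine.call_async(method, *args) for engine in engines]
    # ...and we only return once every engine has responded, which gives
    # barrier (lock-step) semantics for this call.
    return await asyncio.gather(*calls)
```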
There isn't a lot of work in the API server that needs multiprocessing, right? It's mostly async_llm, and specifically multimodal (MM) data handling, that needs scale-out?
I think all the CI issues are fixed and the remaining failures should be unrelated; we should let it finish though.
I am now seeing the following warnings on main when running some tests, e.g. `pytest tests/v1/engine/test_async_llm.py::test_load -s`.
This warning is thrown during the Prometheus cleanup, though I'm not sure where exactly it's coming from.
Looks like the error comes from this Prometheus function.
Thanks @lgeiger. I think the message is harmless, but I'll fix this.
Introduced in vllm-project#17546. We should only call mark_process_dead when we're using Prometheus multiprocessing mode (i.e. with more than one API server).
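A minimal sketch of that guard (not the exact vLLM code; `cleanup_metrics` and its `api_server_count` parameter are illustrative), using the real `prometheus_client.multiprocess.mark_process_dead` API, which only applies when multiprocess mode is enabled via `PROMETHEUS_MULTIPROC_DIR`:

```python
# Sketch only: call mark_process_dead solely in Prometheus multiprocess mode,
# i.e. when more than one API server process shares a metrics directory.
import os

from prometheus_client import multiprocess


def cleanup_metrics(pid: int, api_server_count: int) -> None:
    # Multiprocess mode is keyed off PROMETHEUS_MULTIPROC_DIR; with a single
    # API server there are no per-process metric files to clean up.
    if api_server_count > 1 and os.environ.get("PROMETHEUS_MULTIPROC_DIR"):
        # Clean up the per-process metric files for the exited worker.
        multiprocess.mark_process_dead(pid)
```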
This is a follow-on from #15977.
A new `--api-server-count` arg to `vllm serve` can be used to specify an arbitrary number of API servers to run. When used in conjunction with `--data-parallel-size`, there is all-to-all ZMQ-based communication between the API servers and the data parallel engines.

It works with multi-node as described in #15977. All of the API servers run on the head node.
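For example, the setup benchmarked below would be launched with something like `vllm serve <model> --data-parallel-size 2 --api-server-count 2` (model placeholder left unspecified here; any served model works the same way).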
A separate "coordinator" process is now used for DP>1. It is responsible for ensuring that the engines run in tandem, and for publishing real-time request count information (and later, likely other engine state info) back to the API server(s) for load-balancing purposes.
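As a rough sketch of that publishing pattern using pyzmq PUB/SUB (the address, message format, and function names are illustrative, not the actual vLLM wire protocol):

```python
# Illustrative PUB/SUB sketch of the coordinator pattern; the endpoint and
# message layout are made up for this example, not vLLM's actual protocol.
import zmq

STATS_ADDR = "tcp://127.0.0.1:5560"  # hypothetical coordinator endpoint


def coordinator_publish(request_counts: list[int]) -> None:
    """Coordinator side: broadcast per-engine in-flight request counts."""
    ctx = zmq.Context.instance()
    pub = ctx.socket(zmq.PUB)
    pub.bind(STATS_ADDR)
    pub.send_json({"req_counts": request_counts})


def api_server_poll() -> dict:
    """API-server side: receive counts and pick the least-loaded engine."""
    ctx = zmq.Context.instance()
    sub = ctx.socket(zmq.SUB)
    sub.connect(STATS_ADDR)
    sub.setsockopt(zmq.SUBSCRIBE, b"")  # subscribe to all messages
    stats = sub.recv_json()
    counts = stats["req_counts"]
    target_engine = counts.index(min(counts))  # least-loaded engine rank
    return {"engine": target_engine, "counts": counts}
```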
More design discussion: https://docs.google.com/document/d/10jhCNxJYvsUhtMtiMAaW2MxU5LU8HVje2pGDnj49gH4/edit?tab=t.0
Performance now scales much better with DP size. Observe TTFT in particular below.
Benchmark with 2xA100, Llama-3.2-1B, ShareGPT, request rate 120 req/sec:
- DP=2 before
- DP=2 with `--api-server-count=2`
This is working functionally, but there are still a number of tasks remaining:
Follow-on work (not for this PR):