[V1] DP scale-out (2/N): Decouple engine process management and comms #15977
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small, essential subset of tests runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
My PR #15863 is based on Ray's support for multi-node DP. Could you provide some feedback on modifications so it can be merged with your PR?
is this "frontend" is one api server process? |
vllm/engine/arg_utils.py (outdated):

    default=EngineArgs.data_parallel_start_rank,
    help='Starting data parallel rank for secondary '
    'nodes.')
parser.add_argument('--data-parallel-address',
we have `data_parallel_master_ip` and `data_parallel_master_port`. can we use those two fields and rename the cli args? i feel `data_parallel_master_ip` and `data_parallel_master_port` are easier to understand compared to `--data-parallel-address` and `--data-parallel-rpc-port`. in addition, I reserved 10 ports starting from `data_parallel_master_port`, so you can use, say, `data_parallel_master_port + 2` as `--data-parallel-rpc-port`, which will make the user interface easier to use.
see line 581 in 2386803: `if dp_port <= port < dp_port + 10:`
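(For illustration: a hypothetical helper, not part of vLLM, showing the reserved-range convention described above, where ports [data_parallel_master_port, data_parallel_master_port + 10) are set aside so a fixed offset can serve as the RPC port.)

```python
# Hypothetical helper (not vLLM code) for the reserved-port convention above.
def derive_dp_rpc_port(data_parallel_master_port: int, offset: int = 2) -> int:
    # Ports [master_port, master_port + 10) are reserved for DP coordination,
    # so any offset in [0, 10) will not collide with other vLLM ports.
    assert 0 <= offset < 10, "offset must stay inside the 10-port reserved range"
    return data_parallel_master_port + offset
```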
I could rename `data-parallel-address` to `data-parallel-ip-address`? or `data-parallel-head-ip`? I was also trying to avoid using the "master" term :)

Re the port: the specified port in this case is actually the zmq socket port; the torch process group is assigned an available port arbitrarily (which is propagated to the other nodes).
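(For illustration: a minimal sketch of the pattern just described; `find_free_port` is a stand-in, not vLLM's actual code.)

```python
import socket

def find_free_port() -> int:
    # Binding to port 0 asks the OS for any available port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

rpc_port = 13345            # fixed up front, e.g. via --data-parallel-rpc-port
pg_port = find_free_port()  # arbitrary; propagated to other nodes at handshake
```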
vllm/engine/arg_utils.py
Outdated
parser.add_argument('--data-parallel-size-local', | ||
'-dpl', | ||
type=int, | ||
default=EngineArgs.data_parallel_size_local, | ||
help='Number of data parallel replicas to run on ' | ||
'this node.') | ||
parser.add_argument('--data-parallel-start-rank', | ||
'-dpr', | ||
type=int, | ||
default=EngineArgs.data_parallel_start_rank, | ||
help='Starting data parallel rank for secondary ' | ||
'nodes.') |
these two args should go to `vllm/entrypoints/cli/serve.py` since they are for online serving only.
@youkaichao I moved the start-rank one, but the other one is used within the AsyncLLM impl, which in theory could be used in other contexts (e.g. a different front-end), so I think it makes sense for it to be in the config. I could move it from `EngineArgs` into `AsyncEngineArgs`, but then it would be on its own, so I thought it better to keep it with the other data parallel args.
So far, yes. I've been trying to open incremental PRs, and am working on multi-API-server support right now as the next one (on top of this PR). It won't change much in terms of the deployment semantics though - just maybe one additional arg to specify how many API server procs (applies only to the head node and so is mutually exclusive with …).
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Nick Hill <[email protected]>
Hey @njhill, just want to clarify: when you include "CI tests" in the TODO, does this mean you will be adding new unit tests? Maybe it would be valuable to unit-test the communication process in scaled-out scenarios, for example.
Hi @njhill, I'm probably going to spend a little more time reviewing this, but I left some initial nits.
vllm/v1/engine/core.py (outdated):

super().__init__(vllm_config, executor_class, log_stats)

self.step_fn = (self.step if self.batch_queue is None else
                self.step_with_batch_queue)

self.global_unfinished_reqs = False

# Send ready message.
assigning `on_head_node` to the `local` field suggests that `local` has an absolute meaning, i.e. "on the head node" as opposed to on any other node. However, I thought that "local" had a relative meaning, i.e. "on the same node as a particular engine core instance"?
@afeldman-nm in this context it will be running on the front-end server which is always on the head node.
I see. Maybe it's out of scope for this PR, but why not change the "local" field of the dict to "on_head_node" (and make equivalent terminological changes in other parts of the code)?
It may just be me, but I find I get confused when "local" is used in an absolute sense to refer to being on the head node.
    local_dp_rank: int = 0,
):
    self.index = index
def __init__(self, index: int = 0, local: bool = True):
Nit: Unclear what local means here; see my previous comment on this topic. Is it possible that we have overloaded the word "local"?
Local means that it's local to the front-end (same node), otherwise it's remote.
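(For illustration: a stand-in dataclass, not the actual vLLM class, showing this relative meaning of `local`.)

```python
from dataclasses import dataclass

@dataclass
class CoreEngineHandle:  # hypothetical stand-in for the real class
    index: int = 0
    local: bool = True   # True: same node as the front-end; False: remote

# DP=4 with two engines beside the front-end and two on a headless node:
engines = [CoreEngineHandle(i, local=(i < 2)) for i in range(4)]
```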
vllm/v1/engine/core_client.py (outdated):

# SPMD mode is where there is an LLM instance per DP rank and one
# core engine per LLM, see examples/offline_inference/data_parallel.py.
Nit: I checked `examples/offline_inference/data_parallel.py` on main and there appears to be no reference to SPMD, nor do I see a tweak of the example code in this PR. Would it make sense to add a comment mentioning SPMD in the example?
Yes, I plan to add tests to cover launching some engine(s) remotely, but wanted to get some feedback on the design first.
Signed-off-by: Nick Hill <[email protected]>
# Conflicts:
#   vllm/v1/engine/core_client.py
The security concerns are hopefully addressed now that we have #17490 merged.
@russellb @youkaichao can you please help with a final round of review?
Confirmed, yes -- thank you for working with me on this!
Fixing rebase issues
Signed-off-by: Nick Hill <[email protected]>
I've opened a separate PR for one of the new CI test failures, which is unrelated: #18007
This pull request has merge conflicts that must be resolved before it can be merged.
# Conflicts:
#   vllm/v1/engine/core.py
When will `vllm.entrypoints.openai.api_server` support `--headless` and `--data-parallel-start-rank`? @njhill
Hi, does this approach currently support IPv6 addresses?
I've been trying to fix IPv6 support anywhere I see it doesn't work correctly. If you see it not working anywhere, let me know. I looked at it in this PR and think it should be OK now, though I didn't try it myself.
This decouples the management of engine processes from the IPC, and adds support for a mix of local and/or remote engines (where remote engines run on a different node).
When there are any remote engines, TCP transport is used for the ZMQ sockets; otherwise IPC (Unix domain sockets) is used.
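A minimal sketch of that transport choice (naming assumed, not vLLM's actual helper):

```python
def make_zmq_address(any_remote_engines: bool, host: str, port: int,
                     identity: str) -> str:
    if any_remote_engines:
        return f"tcp://{host}:{port}"     # cross-node: TCP sockets
    return f"ipc:///tmp/{identity}.ipc"   # single node: Unix domain sockets
```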
Engines are bootstrapped with the input queue address and use this to perform a handshake with the front-end running on the head node, which provides the other necessary configuration details.
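A rough sketch of that handshake from the engine side; the socket type and message shapes here are assumptions, not the PR's actual protocol:

```python
import zmq  # pyzmq

def engine_handshake(input_address: str, dp_rank: int) -> dict:
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.DEALER)
    try:
        sock.setsockopt(zmq.IDENTITY, f"engine-{dp_rank}".encode())
        sock.connect(input_address)  # bootstrap address the engine was given
        sock.send_json({"status": "READY", "dp_rank": dp_rank})
        return sock.recv_json()  # e.g. output address, global DP size, etc.
    finally:
        sock.close()
```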
There is a new `--headless` option for `vllm serve` to run on secondary nodes, which launches one or more engines (data parallel ranks) without the front-end / API server.

Examples
This will run DP=4 with DP ranks 0 and 1 on the head node and ranks 2 and 3 on the second node:
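(The original example command wasn't captured in this snapshot; a plausible sketch using the flags discussed above, with a hypothetical model name and head-node IP:)

```bash
# Head node: API server plus DP ranks 0 and 1 (address/port are examples)
vllm serve $MODEL --data-parallel-size 4 --data-parallel-size-local 2 \
    --data-parallel-address 10.0.1.1 --data-parallel-rpc-port 13345

# Second node: headless, DP ranks 2 and 3
vllm serve $MODEL --headless --data-parallel-size 4 --data-parallel-size-local 2 \
    --data-parallel-start-rank 2 \
    --data-parallel-address 10.0.1.1 --data-parallel-rpc-port 13345
```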
This will run DP=4 with only the API server on the first node and all engines on the second node:
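(Again a plausible sketch under the same assumptions, not the original command:)

```bash
# First node: API server only, no local engines
vllm serve $MODEL --data-parallel-size 4 --data-parallel-size-local 0 \
    --data-parallel-address 10.0.1.1 --data-parallel-rpc-port 13345

# Second node: headless, all four engines
vllm serve $MODEL --headless --data-parallel-size 4 --data-parallel-size-local 4 \
    --data-parallel-address 10.0.1.1 --data-parallel-rpc-port 13345
```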
It's assumed that local engine ranks (if any) will always be lower than remote engine ranks. Note that it's not actually necessary to specify the (global) DP size on the secondary nodes, since this is obtained during the handshake. It would be straightforward to extend this to any other config that must be consistent across the ranks, so that you only need to specify it in the head node CLI args.
TODO (this PR):
- CI tests (see discussion above)
Next PR:
- Multiple API server processes (see discussion above)