Co-Locating vLLM Instances with Training Processes Via External Launcher #3105
This PR is closed as it is superseded by #3162.
What does this PR do?
Fixes #3064
Addresses:
#3114
#2971
#2922
#2887
Motivation
vLLM has introduced support for an external launcher, enabling vLLM processes to be co-located with other workloads, such as training.
Benefits of bringing the external launcher to GRPO:
To leverage this feature, I added an option in TRL to spawn vLLM processes per GPU using the external launcher.
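As a rough illustration of the mechanism (not the exact code added in this PR), the sketch below shows how a per-rank vLLM engine can be created with the external launcher backend. It assumes a vLLM version that accepts distributed_executor_backend="external_launcher" and that the process was started by torchrun or accelerate so the RANK/WORLD_SIZE environment is already set; model name and memory fraction are illustrative.

```python
# Minimal sketch: one vLLM engine per training process, reusing the launcher's
# process group instead of letting vLLM spawn its own workers.
import os

from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    distributed_executor_backend="external_launcher",  # co-locate with training
    tensor_parallel_size=1,        # one engine per GPU in this sketch
    gpu_memory_utilization=0.3,    # leave headroom for the training process
)

# Each rank generates for its own local batch of prompts.
prompts = [f"Rank {os.environ.get('RANK', '0')} prompt {i}" for i in range(4)]
outputs = llm.generate(prompts, SamplingParams(max_tokens=64, temperature=1.0))
print(outputs[0].outputs[0].text)
```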
Modifications in This PR:
This PR updates the GRPO trainer to support this co-located, per-GPU vLLM generation mode; a usage sketch is shown below.
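A hedged usage sketch, assuming TRL's GRPOTrainer/GRPOConfig API and the vllm_external_launcher option referenced in this PR (exact field names and signatures may differ from the final diff); the reward function and dataset are placeholders:

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Placeholder reward: longer completions score higher (illustration only).
def reward_len(completions, **kwargs):
    return [float(len(c)) for c in completions]

training_args = GRPOConfig(
    output_dir="grpo-colocated-vllm",
    per_device_train_batch_size=16,
    num_generations=16,
    use_vllm=True,
    vllm_external_launcher=True,  # option discussed in this PR: one vLLM engine per training GPU
)

trainer = GRPOTrainer(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=load_dataset("trl-lib/tldr", split="train"),  # example prompt dataset
)
trainer.train()
```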
Results:
Click to view YAML
To run an experiment with the config above, define ACCELERATE_CONFIG = recipes/accelerate_configs/zero2.yaml (from the open-R1 repo) and define GRPO_CONFIG as provided above, then run

ACCELERATE_LOG_LEVEL=info accelerate launch --config_file $ACCELERATE_CONFIG --num_processes=8 src/open_r1/grpo.py --config $GRPO_CONFIG

for the multi-vLLM scenario, or

ACCELERATE_LOG_LEVEL=info accelerate launch --config_file $ACCELERATE_CONFIG --num_processes=7 src/open_r1/grpo.py --config $GRPO_CONFIG

for the single-vLLM scenario (remember to set vllm_external_launcher: false and vllm_device: auto).

Discussions:
Why 2× speedup instead of 7–8×?
Previously, 7 GPUs were allocated for training and 1 GPU for generation. With this change, all GPUs are used for both training and generation. Given that vLLM instances now run on all 8 GPUs, one might expect a 7–8× speedup. However, benchmarking against a standalone vLLM instance showed similar behavior, as follows.
Setup 1 (single vLLM, the original TRL behavior): per_device_train_batch = 16, num_gen = 16, device_count = 7.
Each of the 7 devices processes 16 prompts, leading to a global total of 112 prompts. The main vLLM process selects every 16th prompt, meaning a single vLLM gets 7 prompts and generates 112 generations.
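As a hedged illustration of that bookkeeping (variable names are illustrative, not TRL's exact internals):

```python
# Setup 1: 7 processes x 16 prompts are gathered, and with num_generations = 16
# each unique prompt appears 16 consecutive times in the gathered list.
num_generations = 16
gathered = [f"prompt_{i // num_generations}" for i in range(7 * 16)]  # 112 entries

# The main process keeps every 16th entry to recover the unique prompts ...
unique_prompts = gathered[::num_generations]
assert len(unique_prompts) == 7

# ... and the single vLLM engine then produces num_generations completions per
# unique prompt: 7 x 16 = 112 generations in one large batch.
```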
Setup 2 (multi-vLLM, with external launcher): each vLLM instance processes a local batch of 16 prompts and generates one output per input, resulting in 16 generations per vLLM instance.
Setup 1 showed 79 s latency vs. 36 s for Setup 2 with the deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B model. The observed speedup was around 2×, aligning with the GRPO training results above; a plausible explanation is that a single engine already keeps its GPU well utilized on the large 112-generation batch, so spreading generation across 8 engines yields a sublinear speedup.
Setup script is below.
Click to view script
We tried two different models (Qwen/Qwen2.5-Math-7B and deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) and compared Setup 1 (no_of_prompts = 7, num_gen = 16) against Setup 2 (no_of_prompts = 16, num_gen = 1):

python script.py --model_name "Qwen/Qwen2.5-Math-7B" --no_of_prompts 16 --num_gen 1

vs.

python script.py --model_name "Qwen/Qwen2.5-Math-7B" --no_of_prompts 7 --num_gen 16
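The actual benchmark script is collapsed above and not reproduced here; a hedged sketch of such a comparison (argument names taken from the commands above, everything else illustrative) could look like:

```python
# Standalone latency comparison: many generations per prompt on one engine
# (Setup 1) vs. one generation per prompt on a co-located engine (Setup 2).
import argparse
import time

from vllm import LLM, SamplingParams

parser = argparse.ArgumentParser()
parser.add_argument("--model_name", default="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
parser.add_argument("--no_of_prompts", type=int, default=16)
parser.add_argument("--num_gen", type=int, default=1)
args = parser.parse_args()

llm = LLM(model=args.model_name, gpu_memory_utilization=0.9)
prompts = [f"Solve problem {i}: ..." for i in range(args.no_of_prompts)]  # illustrative prompts
sampling = SamplingParams(n=args.num_gen, max_tokens=512, temperature=1.0)

start = time.perf_counter()
outputs = llm.generate(prompts, sampling)
elapsed = time.perf_counter() - start

total = sum(len(o.outputs) for o in outputs)  # no_of_prompts x num_gen completions
print(f"{total} generations in {elapsed:.1f} s")
```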
CC @fabianlim
Before submitting
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.