-
Notifications
You must be signed in to change notification settings - Fork 3.5k
Single-process multi-node CPU training #9603
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Enable multi-node CPU-only training without spawning
Thanks for the PR. LGTM, just a few changes requested. |
Co-authored-by: Adrian Wälchli <[email protected]>
Co-authored-by: Adrian Wälchli <[email protected]>
Implemented the changes, thanks for the pointers @awaelchli :) |
Thanks @borchero <3 |
Don't think the failing test is caused by my changes. |
Codecov Report
@@ Coverage Diff @@
## master #9603 +/- ##
=======================================
- Coverage 93% 89% -4%
=======================================
Files 179 179
Lines 15805 15807 +2
=======================================
- Hits 14648 14029 -619
- Misses 1157 1778 +621 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice work!
Co-authored-by: Adrian Wälchli <[email protected]> Co-authored-by: thomas chaton <[email protected]>
What does this PR do?
Whenever a model is small, it is not necessarily useful to use a GPU. However, one might still benefit from distributing computations for model training across nodes -- especially when you consider that containers are also "nodes".
This PR sets
DistributedType.DDP
instead ofDistributedType.DDP_SPAWN
wheneverDistributedType.DDP_CPU
is set and only a single process or a single process per node is used.Does your PR introduce any breaking changes? If yes, please list them.
num_nodes > 1
and setsnum_processes = None
on theTrainer
,num_processes
now defaults to 1 andDistributedType.DDP
is used. Before,num_processes
was set toos.cpu_count()
andDistributedType.DDP_SPAWN
was used. However, it is unlikely that people rely on that behavior, especially sincenum_processes
defaults to1
.One could decide to keep the old behavior, however, I would argue that it is far more likely to only want a single process when distributing across nodes (think containers).
Fixes #9877
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list: