Single-process multi-node CPU training #9603

borchero · 2021-09-20T00:26:30Z

What does this PR do?

For small models, using a GPU is not necessarily worthwhile. Distributing training across nodes can still pay off, though, especially when you consider that containers are also "nodes".

This PR sets DistributedType.DDP instead of DistributedType.DDP_SPAWN whenever DistributedType.DDP_CPU is set and only a single process or a single process per node is used.

Does your PR introduce any breaking changes? If yes, please list them.

Minor breaking change: whenever num_nodes > 1 is requested and num_processes = None is set on the Trainer, num_processes now defaults to 1 and DistributedType.DDP is used. Previously, num_processes was set to os.cpu_count() and DistributedType.DDP_SPAWN was used. It is unlikely that anyone relies on the old behavior, especially since num_processes defaults to 1.

One could keep the old behavior; however, I would argue that a single process per node is far more likely to be what users want when distributing across nodes (think containers).
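To make the change concrete, here is a minimal sketch of the multi-node `ddp_cpu` resolution before and after this PR, written only from the description above. It is not the code in `accelerator_connector.py`: the `DistributedType` enum below is a stand-in, and `resolve_before` / `resolve_after` are hypothetical helper names. The sketch deliberately covers only `num_nodes > 1`, since that is the case the description spells out.

```python
from enum import Enum
from typing import Optional, Tuple


class DistributedType(Enum):
    # Stand-in for the two Lightning enum members named in this PR.
    DDP = "ddp"
    DDP_SPAWN = "ddp_spawn"


def _check_multi_node(num_nodes: int) -> None:
    # The sketch only models the multi-node case discussed in the PR text.
    if num_nodes < 2:
        raise ValueError("this sketch only covers num_nodes > 1")


def resolve_before(
    num_nodes: int, num_processes: Optional[int], cpu_count: int
) -> Tuple[int, DistributedType]:
    """Old behavior as described: None falls back to the CPU count, always spawn."""
    _check_multi_node(num_nodes)
    if num_processes is None:
        num_processes = cpu_count
    return num_processes, DistributedType.DDP_SPAWN


def resolve_after(
    num_nodes: int, num_processes: Optional[int], cpu_count: int
) -> Tuple[int, DistributedType]:
    """New behavior as described: None means 1; one process per node uses DDP."""
    _check_multi_node(num_nodes)
    if num_processes is None:
        num_processes = 1
    if num_processes == 1:
        # A single process per node needs no spawning.
        return num_processes, DistributedType.DDP
    # Assumption: several processes per node still go through spawning.
    return num_processes, DistributedType.DDP_SPAWN


if __name__ == "__main__":
    # Two nodes, num_processes left as None, 8 CPU cores per node (made-up value).
    print(resolve_before(2, None, cpu_count=8))
    print(resolve_after(2, None, cpu_count=8))
```

The only inputs whose result differs between the two functions are `num_processes=None` and `num_processes=1` with more than one node, which is the breaking change listed above.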

Fixes #9877

Before submitting

Was this discussed/approved via a GitHub issue? (not for typos and docs)
Did you read the contributor guideline, Pull Request section?
Did you make sure your PR does only one thing, instead of bundling different changes together?
Did you make sure to update the documentation with your changes? (if necessary)
Did you write any new necessary tests? (not for typos and docs)
Did you verify new and existing tests pass locally with your changes?
Did you list all the breaking changes introduced by this pull request?
Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the Review guidelines. In short, check the following:

Is this pull request ready for review? (if not, please submit in draft mode)
Check that all items from Before submitting are resolved
Make sure the title is self-explanatory and the description concisely explains the PR
Add labels and milestones (and optionally projects) to the PR so it can be classified

Enable multi-node CPU-only training without spawning

CHANGELOG.md

pytorch_lightning/trainer/connectors/accelerator_connector.py

pytorch_lightning/trainer/trainer.py

tests/accelerators/test_accelerator_connector.py

awaelchli · 2021-10-10T19:51:20Z

Thanks for the PR. LGTM, just a few changes requested.


borchero · 2021-10-10T20:35:26Z

Implemented the changes, thanks for the pointers @awaelchli :)

pytorch_lightning/trainer/connectors/accelerator_connector.py

tests/accelerators/test_accelerator_connector.py

docs/source/common/trainer.rst

awaelchli · 2021-10-10T21:00:15Z

Thanks @borchero <3

tests/accelerators/test_accelerator_connector.py


borchero · 2021-10-14T15:38:10Z

Don't think the failing test is caused by my changes.

codecov · 2021-10-14T15:55:59Z

Codecov Report

Merging #9603 (146123b) into master (6feda08) will decrease coverage by 4%.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master   #9603    +/-   ##
=======================================
- Coverage      93%     89%    -4%     
=======================================
  Files         179     179            
  Lines       15805   15807     +2     
=======================================
- Hits        14648   14029   -619     
- Misses       1157    1778   +621
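As a quick sanity check on the report, the rounded percentages can be recomputed from the Hits and Misses rows. This sketch assumes coverage is simply hits divided by hits plus misses, which matches the Lines row of the table; `coverage_percent` is a hypothetical helper, not part of Codecov.

```python
def coverage_percent(hits: int, misses: int) -> float:
    """Line coverage in percent, assuming lines = hits + misses."""
    return 100.0 * hits / (hits + misses)


# Numbers copied from the Codecov table above.
master = coverage_percent(hits=14648, misses=1157)  # 15805 lines
pr = coverage_percent(hits=14029, misses=1778)      # 15807 lines

print(f"master: {master:.2f}%")
print(f"PR:     {pr:.2f}%")
print(f"delta:  {pr - master:+.2f} percentage points")
```

The PR adds only 2 lines while 619 hits disappear, so the drop comes from lines outside the diff no longer being executed in this run. That is consistent with the reported 100% diff coverage.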

tests/accelerators/test_accelerator_connector.py

for more information, see https://pre-commit.ci

rohitgr7

nice work!

Co-authored-by: Adrian Wälchli <[email protected]> Co-authored-by: thomas chaton <[email protected]>

borchero added 3 commits September 20, 2021 02:09

Update accelerator_connector.py

7b88dec

Merge pull request #1 from borchero/feature/ddp_cpu_no_spawn

6fa563a

Enable multi-node CPU-only training without spawning

Update accelerator_connector.py

89ee1f2

borchero requested review from Borda, carmocca, SeanNaren and tchaton as code owners September 20, 2021 00:26

stale bot added the won't fix This will not be worked on label Oct 5, 2021

Lightning-AI deleted a comment from stale bot Oct 5, 2021

stale bot removed the won't fix This will not be worked on label Oct 5, 2021

Borda assigned awaelchli Oct 5, 2021

borchero mentioned this pull request Oct 9, 2021

Single-process multi-node CPU training #9877

Closed

Update changelog

ce8d5d3

borchero requested review from awaelchli, justusschock, kaushikb11, rohitgr7 and williamFalcon as code owners October 9, 2021 12:49

borchero added 4 commits October 9, 2021 14:49

Fix changelog formatting

750df22

Update documentation

097e527

Add documentation and fix test

c68d15e

Merge branch 'master' into patch-2

1b56b8c

awaelchli added distributed Generic distributed-related topic feature Is an improvement or enhancement labels Oct 9, 2021

awaelchli added this to the v1.5 milestone Oct 10, 2021

awaelchli suggested changes Oct 10, 2021

View reviewed changes

borchero and others added 3 commits October 10, 2021 22:22

Update CHANGELOG.md

86b1d2a

Co-authored-by: Adrian Wälchli <[email protected]>

Update changelog

f1fc767

Update pytorch_lightning/trainer/connectors/accelerator_connector.py

15a6f4e

Co-authored-by: Adrian Wälchli <[email protected]>

borchero added 2 commits October 10, 2021 22:34

Address comments

a80fd33

Merge branch 'patch-2' of github.com:borchero/pytorch-lightning into …

c3764ff

…patch-2

borchero requested a review from edenlightning as a code owner October 10, 2021 20:34

awaelchli approved these changes Oct 10, 2021

View reviewed changes

pytorch_lightning/trainer/connectors/accelerator_connector.py Outdated Show resolved Hide resolved

tests/accelerators/test_accelerator_connector.py Outdated Show resolved Hide resolved

docs/source/common/trainer.rst Outdated Show resolved Hide resolved

borchero added 2 commits October 10, 2021 23:57

Fix docs

2ed3450

Implement suggestions

efde395

mergify bot added the has conflicts label Oct 11, 2021

Merge branch 'master' into patch-2

bcc2431

mergify bot removed the has conflicts label Oct 11, 2021

rohitgr7 reviewed Oct 11, 2021

View reviewed changes

tests/accelerators/test_accelerator_connector.py Outdated Show resolved Hide resolved

SkafteNicki reviewed Oct 11, 2021

View reviewed changes

tests/accelerators/test_accelerator_connector.py Outdated Show resolved Hide resolved

borchero added 2 commits October 14, 2021 16:32

Fix tests

835ad77

Merge branch 'patch-2' of github.com:borchero/pytorch-lightning into …

749d67b

…patch-2

carmocca approved these changes Oct 14, 2021

View reviewed changes

mergify bot added the ready PRs ready to be merged label Oct 14, 2021

rohitgr7 reviewed Oct 14, 2021

View reviewed changes

tests/accelerators/test_accelerator_connector.py Outdated Show resolved Hide resolved

borchero and others added 5 commits October 14, 2021 19:48

Fix test

ed16c4c

Update accelerator_connector.py

d505be5

Merge master

b5a3e00

Add additional tests

41a87de

[pre-commit.ci] auto fixes from pre-commit.com hooks

146123b

for more information, see https://pre-commit.ci

rohitgr7 approved these changes Oct 14, 2021

View reviewed changes

carmocca merged commit afbf703 into Lightning-AI:master Oct 14, 2021

rohitgr7 pushed a commit to Tshimanga/pytorch-lightning that referenced this pull request Oct 18, 2021

Single-process multi-node CPU training (Lightning-AI#9603)

07ba0b9

Co-authored-by: Adrian Wälchli <[email protected]> Co-authored-by: thomas chaton <[email protected]>
