2/n Consolidate collective functions - collective base and subclasses #9414

Closed
wants to merge 11 commits into from

Conversation

four4fish
Contributor

@four4fish four4fish commented Sep 9, 2021

What does this PR do?

Implementing proposal here https://docs.google.com/document/d/1e83FcZHHHsTmiUmNpgmPTBYjugZaC3pcZhd4VV9AuIM/edit

Steps:

  1. [RFC] Create pytorch_lightning/utilities/collective_util.py and move collective-related utils from distributed.py to collective_util.py
  2. 2/n Consolidate collective functions - collective base and subclasses (this PR)
  3. Integrate with all training_type_plugins (DDP, DP, fully sharded, etc.) and Accelerators
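The "collective base and subclasses" structure the steps describe can be sketched roughly as below. All class and method names here are hypothetical illustrations; the actual interface is defined in the linked proposal doc.

```python
from abc import ABC, abstractmethod
from typing import Any, List


class Collective(ABC):
    """Hypothetical base interface that backend-specific subclasses
    (e.g. torch.distributed, XLA, Horovod) would implement."""

    @abstractmethod
    def barrier(self) -> None:
        """Block until every process reaches this point."""

    @abstractmethod
    def broadcast(self, obj: Any, src: int = 0) -> Any:
        """Send ``obj`` from rank ``src`` to all other ranks."""

    @abstractmethod
    def all_gather(self, obj: Any) -> List[Any]:
        """Collect one object per rank, in rank order."""


class SingleProcessCollective(Collective):
    """Trivial implementation for single-process plugins such as DP."""

    def barrier(self) -> None:
        pass  # nothing to synchronize with

    def broadcast(self, obj: Any, src: int = 0) -> Any:
        return obj  # the only rank already has the object

    def all_gather(self, obj: Any) -> List[Any]:
        return [obj]  # world size is 1
```

Each training type plugin would then pick the subclass matching its backend, which is what step 3 wires up.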

Fixes #7534

Does your PR introduce any breaking changes? If yes, please list them.

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@four4fish four4fish force-pushed the collective1 branch 2 times, most recently from 853970e to 8599121 on September 10, 2021 01:28

codecov bot commented Sep 10, 2021

Codecov Report

Merging #9414 (f293720) into master (b98ce0a) will decrease coverage by 4%.
The diff coverage is 47%.

❗ Current head f293720 differs from pull request most recent head 522ebf9. Consider uploading reports for the commit 522ebf9 to get more accurate results

@@           Coverage Diff           @@
##           master   #9414    +/-   ##
=======================================
- Coverage      93%     88%    -4%     
=======================================
  Files         179     186     +7     
  Lines       15305   15240    -65     
=======================================
- Hits        14197   13453   -744     
- Misses       1108    1787   +679     

Contributor

@ananthsub ananthsub left a comment


  • For getting something initially working, one PR makes sense, especially to clear up the API requirements. But once the API requirements are clear, I think this will be easier to review as a stack, especially for the implementations. That will make it easier so we don't miss anything during the carryover period.
  • From the gdoc (@yifuwang), for torch distributed the process group initialization could be part of the collectives. Do you think it makes sense to offer setup/teardown functions on the collective interface that allow for this? This way, the trainer can know whether it was the agent which initialized the global process group or not (cc @kaushikb11 as it potentially relates to Move init_ddp_connection to distributed utilities #9044)
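The setup/teardown idea in the second bullet could look something like the sketch below. The class and attribute names are hypothetical; the point is the ownership flag that tells the trainer whether this object created the global process group or merely found an existing one.

```python
class TorchCollective:
    """Hypothetical sketch: setup/teardown hooks on the collective interface.

    The trainer can check ``owns_group`` to learn whether this object was the
    agent that initialized the global process group (vs. it pre-existing).
    """

    def __init__(self) -> None:
        self.owns_group = False

    def setup(self, backend: str = "gloo", **init_kwargs) -> None:
        import torch.distributed as dist

        if not dist.is_initialized():
            dist.init_process_group(backend=backend, **init_kwargs)
            self.owns_group = True  # we created it, so teardown is our job

    def teardown(self) -> None:
        if not self.owns_group:
            return  # leave externally created process groups alone
        import torch.distributed as dist

        dist.destroy_process_group()
        self.owns_group = False
```

Only the group this object created gets destroyed on teardown, which avoids tearing down a process group some outer launcher set up.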

Comment on lines 54 to 56
@abstractmethod
def reduce_boolean_decision(self, decision: bool) -> bool:
"""Reduce the early stopping decision across all processes."""
Contributor


  • we shouldn't mention early stopping here
  • the world size is also needed

Contributor Author

@four4fish four4fish Sep 11, 2021


Good catch. How about adding world size as a parameter in torch_collective, setting it in DDP's setup_distributed(), and having DP go with the default of 1? Only the torch collective needs this, right?


+1, the comment looks quite specific.

What does reduce_boolean really mean here? What is the reduce op given a list of booleans? Probably add a sentence to explain, plus a concrete example.
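As a concrete illustration of the semantics being asked for: a common interpretation (an assumption here, not necessarily what this PR implements) is a logical AND across ranks, realized in torch.distributed as a SUM all-reduce over 0/1 tensors followed by a comparison against world size. A pure-Python sketch of those semantics:

```python
def reduce_boolean_decision(decisions):
    """Hypothetical reference semantics: combine per-rank boolean decisions
    with a logical AND.

    In a real torch.distributed implementation each rank holds one decision,
    converts it to a 0/1 tensor, calls all_reduce with ReduceOp.SUM, and
    returns ``reduced_sum == world_size``.
    """
    world_size = len(decisions)
    reduced = sum(int(d) for d in decisions)  # what all_reduce(SUM) computes
    return reduced == world_size  # True only if every rank said True


print(reduce_boolean_decision([True, True, True]))   # all ranks agree -> True
print(reduce_boolean_decision([True, False, True]))  # one dissent -> False
```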

Contributor


I don't think reduce_boolean_decision should be part of the collective interface. This is really higher order functionality for how the Training Type plugin can use the collective to reach consensus across ranks. So I think this should sit at the training type plugin interface instead of here
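To illustrate the layering this comment suggests (hypothetical names, not the PR's actual classes): the plugin owns a collective and builds the consensus helper on top of its primitives, so the collective interface itself stays minimal.

```python
class TrainingTypePluginSketch:
    """Hypothetical sketch: the plugin composes a collective and implements
    reduce_boolean_decision itself, instead of the collective exposing it."""

    def __init__(self, collective):
        self.collective = collective  # any object with an all_gather method

    def reduce_boolean_decision(self, decision: bool) -> bool:
        # Reach consensus across ranks with a gather followed by a logical
        # AND; this higher-order logic lives at the plugin level.
        gathered = self.collective.all_gather(decision)
        return all(gathered)
```

With this split, swapping the collective backend does not change how the plugin reaches consensus.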

Contributor

@awaelchli awaelchli left a comment


Can you add these classes to pyproject.toml for the mypy type checks?
We want newly added classes and files to be fully type checked.

Contributor

@awaelchli awaelchli left a comment


I like it very much.

3/n will be the integration?

Contributor Author

@four4fish four4fish left a comment


> I like it very much.
>
> 3/n will be the integration?

Yep, will integrate with ttps and accelerators.

@ananthsub
Contributor

@four4fish at this point, I don't think offering a new collective base class that unifies all these disparate backends is the right approach. It offers new API surface area in Lightning but can't guarantee that the behavior from the different underlying libraries won't diverge or have different semantics.

IMO this should be addressed at the torch distributed level, especially for XLA usage. In the meantime, Lightning can get by calling the existing utilities as-is inside of the corresponding strategy class.

@tchaton tchaton added the priority: 0 High priority task label Nov 15, 2021
@tchaton tchaton requested a review from ananthsub November 15, 2021 12:51
@carmocca
Contributor

@ananthsub are you advocating to drop this proposal entirely?

is there any specific reason for the change of direction?

@four4fish
Contributor Author

> @ananthsub are you advocating to drop this proposal entirely?
>
> is there any specific reason for the change of direction?

@carmocca @awaelchli I have synced up with Ananth offline. We are not changing direction or dropping this plan; we are thinking about switching the implementation order a bit:

  1. Unified collective behavior should be provided from the PyTorch side, and there is an ongoing effort in PyTorch to move in this direction. We will keep you updated.
  2. I think the precision plugin refactor and accelerator refactor in [Main Issue] Accelerator and Plugin refactor #10416 are higher priority towards the stable API in 1.6.

How about we pick this up after steps 2 and 3? (I'm working on the PR for #7324 right now.)
What do you think? I can also rebase and update this first.


stale bot commented Jun 6, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://pytorch-lightning.readthedocs.io/en/latest/generated/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Slack. Thank you for your contributions.

@stale stale bot added the won't fix This will not be worked on label Jun 6, 2022

stale bot commented Jun 13, 2022

This pull request is going to be closed. Please feel free to reopen it or create a new one from the current master.

@stale stale bot closed this Jun 13, 2022
Labels
won't fix This will not be worked on
Development

Successfully merging this pull request may close these issues.

Consolidate collective functions
9 participants