2/n Consolidate collective functions - collective base and subclasses #9414

Closed
wants to merge 11 commits into from

Conversation

four4fish
Contributor

@four4fish four4fish commented Sep 9, 2021

What does this PR do?

Implementing proposal here https://docs.google.com/document/d/1e83FcZHHHsTmiUmNpgmPTBYjugZaC3pcZhd4VV9AuIM/edit

Steps:

  1. [RFC] Create pytorch_lightning/utilities/collective_util.py and move collective-related utils from distributed.py to collective_util.py
  2. 2/n Consolidate collective functions - collective base and subclasses (this PR)
  3. Integrate with all training_type_plugins (DDP, DP, fully sharded, etc.) and Accelerators
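The "collective base and subclasses" structure the steps describe can be sketched roughly as below. All class and method names here are hypothetical illustrations; the actual interface is defined in the linked proposal doc.

```python
from abc import ABC, abstractmethod
from typing import Any, List


class Collective(ABC):
    """Hypothetical base interface that backend-specific subclasses
    (e.g. torch.distributed, XLA, Horovod) would implement."""

    @abstractmethod
    def barrier(self) -> None:
        """Block until every process reaches this point."""

    @abstractmethod
    def broadcast(self, obj: Any, src: int = 0) -> Any:
        """Send ``obj`` from rank ``src`` to all other ranks."""

    @abstractmethod
    def all_gather(self, obj: Any) -> List[Any]:
        """Collect one object per rank, in rank order."""


class SingleProcessCollective(Collective):
    """Trivial implementation for single-process plugins such as DP."""

    def barrier(self) -> None:
        pass  # nothing to synchronize with

    def broadcast(self, obj: Any, src: int = 0) -> Any:
        return obj  # the only rank already has the object

    def all_gather(self, obj: Any) -> List[Any]:
        return [obj]  # world size is 1
```

Each training type plugin would then pick the subclass matching its backend, which is what step 3 wires up.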

Fixes #7534

Does your PR introduce any breaking changes? If yes, please list them.

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@four4fish four4fish force-pushed the collective1 branch 2 times, most recently from 853970e to 8599121 on September 10, 2021 01:28

codecov bot commented Sep 10, 2021

Codecov Report

Merging #9414 (f293720) into master (b98ce0a) will decrease coverage by 4%.
The diff coverage is 47%.

❗ Current head f293720 differs from pull request most recent head 522ebf9. Consider uploading reports for the commit 522ebf9 to get more accurate results

@@           Coverage Diff           @@
##           master   #9414    +/-   ##
=======================================
- Coverage      93%     88%    -4%     
=======================================
  Files         179     186     +7     
  Lines       15305   15240    -65     
=======================================
- Hits        14197   13453   -744     
- Misses       1108    1787   +679     

Contributor

@ananthsub ananthsub left a comment


  • For getting something initially working, one PR makes sense, especially to clear up the API requirements. But once the API requirements are clear, I think this will be easier to review as a stack, especially for the implementations. That will make it easier so we don't miss anything during the carryover period.
  • From the gdoc (@yifuwang), for torch distributed the process group initialization could be part of the collectives. Do you think it makes sense to offer setup/teardown functions on the collective interface that allow for this? This way, the trainer can know whether it was the agent which initialized the global process group or not (cc @kaushikb11 as it potentially relates to Move init_ddp_connection to distributed utilities #9044)
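The setup/teardown idea in the second bullet could look something like the sketch below. The class and attribute names are hypothetical; the point is the ownership flag that tells the trainer whether this object created the global process group or merely found an existing one.

```python
class TorchCollective:
    """Hypothetical sketch: setup/teardown hooks on the collective interface.

    The trainer can check ``owns_group`` to learn whether this object was the
    agent that initialized the global process group (vs. it pre-existing).
    """

    def __init__(self) -> None:
        self.owns_group = False

    def setup(self, backend: str = "gloo", **init_kwargs) -> None:
        import torch.distributed as dist

        if not dist.is_initialized():
            dist.init_process_group(backend=backend, **init_kwargs)
            self.owns_group = True  # we created it, so teardown is our job

    def teardown(self) -> None:
        if not self.owns_group:
            return  # leave externally created process groups alone
        import torch.distributed as dist

        dist.destroy_process_group()
        self.owns_group = False
```

Only the group this object created gets destroyed on teardown, which avoids tearing down a process group some outer launcher set up.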

Comment on lines 54 to 56
@abstractmethod
def reduce_boolean_decision(self, decision: bool) -> bool:
"""Reduce the early stopping decision across all processes."""
Contributor


  • we shouldn't mention early stopping here
  • the world size is also needed

Contributor Author

@four4fish four4fish Sep 11, 2021


Good catch. How about adding world size as a parameter in torch_collective, setting it in DDP's setup_distributed(), and having DP go with the default of 1? Only the torch collective needs this, right?


+1, the comment looks quite specific.

What does reduce_boolean really mean here? What is the reduce op given a list of booleans? Probably add a sentence to explain, plus a concrete example.
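As a concrete illustration of the semantics being asked for: a common interpretation (an assumption here, not necessarily what this PR implements) is a logical AND across ranks, realized in torch.distributed as a SUM all-reduce over 0/1 tensors followed by a comparison against world size. A pure-Python sketch of those semantics:

```python
def reduce_boolean_decision(decisions):
    """Hypothetical reference semantics: combine per-rank boolean decisions
    with a logical AND.

    In a real torch.distributed implementation each rank holds one decision,
    converts it to a 0/1 tensor, calls all_reduce with ReduceOp.SUM, and
    returns ``reduced_sum == world_size``.
    """
    world_size = len(decisions)
    reduced = sum(int(d) for d in decisions)  # what all_reduce(SUM) computes
    return reduced == world_size  # True only if every rank said True


print(reduce_boolean_decision([True, True, True]))   # all ranks agree -> True
print(reduce_boolean_decision([True, False, True]))  # one dissent -> False
```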

Contributor


I don't think reduce_boolean_decision should be part of the collective interface. This is really higher order functionality for how the Training Type plugin can use the collective to reach consensus across ranks. So I think this should sit at the training type plugin interface instead of here
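To illustrate the layering this comment suggests (hypothetical names, not the PR's actual classes): the plugin owns a collective and builds the consensus helper on top of its primitives, so the collective interface itself stays minimal.

```python
class TrainingTypePluginSketch:
    """Hypothetical sketch: the plugin composes a collective and implements
    reduce_boolean_decision itself, instead of the collective exposing it."""

    def __init__(self, collective):
        self.collective = collective  # any object with an all_gather method

    def reduce_boolean_decision(self, decision: bool) -> bool:
        # Reach consensus across ranks with a gather followed by a logical
        # AND; this higher-order logic lives at the plugin level.
        gathered = self.collective.all_gather(decision)
        return all(gathered)
```

With this split, swapping the collective backend does not change how the plugin reaches consensus.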

Contributor

@awaelchli awaelchli left a comment


Can you add these classes to pyproject.toml for the mypy type checks?
We want newly added classes and files to be fully type checked.

Contributor

@awaelchli awaelchli left a comment


I like it very much.

3/n will be the integration?

Contributor Author

@four4fish four4fish left a comment


> I like it very much.
>
> 3/n will be the integration?

Yep, will integrate with ttps and accelerators.

@ananthsub
Contributor

@four4fish at this point, I don't think offering a new collective base class that unifies all these disparate backends is the right approach. It offers new API surface area in Lightning but can't guarantee that the behavior from the different underlying libraries won't diverge or have different semantics.

IMO this should be addressed at the torch distributed level, especially for XLA usage. In the meantime, Lightning can get by calling the existing utilities as-is inside of the corresponding strategy class.

@tchaton tchaton added the priority: 0 High priority task label Nov 15, 2021
@tchaton tchaton requested a review from ananthsub November 15, 2021 12:51
@carmocca
Contributor

@ananthsub are you advocating to drop this proposal entirely?

is there any specific reason for the change of direction?

@four4fish
Contributor Author

> @ananthsub are you advocating to drop this proposal entirely?
>
> is there any specific reason for the change of direction?

@carmocca @awaelchli I have synced up with Ananth offline. We are not changing direction or dropping this plan; we are thinking about switching the implementation order a bit:

  1. Unified collective behavior should be provided from the PyTorch side, and there is an ongoing effort in PyTorch to move in this direction. We will keep you updated.
  2. I think the precision plugin refactor and accelerator refactor in [Main Issue] Accelerator and Plugin refactor #10416 are higher priority towards the stable API in 1.6.

How about we pick this up after steps 2 and 3? (I'm working on the PR for #7324 right now.)
What do you think? I can also rebase and update this first.


stale bot commented Jun 6, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://pytorch-lightning.readthedocs.io/en/latest/generated/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Slack. Thank you for your contributions.

@stale stale bot added the won't fix This will not be worked on label Jun 6, 2022

stale bot commented Jun 13, 2022

This pull request is going to be closed. Please feel free to reopen it or create a new one from the current master.

@stale stale bot closed this Jun 13, 2022
Labels
won't fix This will not be worked on
Development

Successfully merging this pull request may close these issues.

Consolidate collective functions
9 participants