2/n Consolidate collective functions - collective base and subclasses #9414
Conversation
pytorch_lightning/plugins/collective/single_device_collective.py (review thread, outdated and resolved)
Codecov Report
@@            Coverage Diff            @@
##           master    #9414     +/-  ##
=========================================
- Coverage      93%      88%      -4%
=========================================
  Files         179      186       +7
  Lines       15305    15240      -65
=========================================
- Hits        14197    13453     -744
- Misses       1108     1787     +679
- For getting something initially working, one PR makes sense, especially to clear up the API requirements. But once the API requirements are clear, I think this will be easier to review as a stack, especially for the implementations. That will make it easier to ensure we don't miss anything during the carryover period.
- From the gdoc (@yifuwang), for torch distributed the process group initialization could be part of the collectives. Do you think it makes sense to offer setup/teardown functions on the collective interface that allow for this? That way, the trainer can know whether it was the agent that initialized the global process group or not (cc @kaushikb11, as it potentially relates to "Move init_ddp_connection to distributed utilities" #9044). A sketch of what such hooks could look like follows below.
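A minimal sketch of what such setup/teardown hooks might look like, assuming a torch.distributed backend. The class and method names are illustrative assumptions, not the API proposed in this PR:

```python
from abc import ABC, abstractmethod

import torch.distributed as dist


class Collective(ABC):
    @abstractmethod
    def setup(self) -> None:
        """Initialize the communication backend (e.g. the process group)."""

    @abstractmethod
    def teardown(self) -> None:
        """Release any backend resources."""


class TorchCollective(Collective):
    def __init__(self, backend: str = "nccl") -> None:
        self._backend = backend
        self._owns_group = False

    def setup(self) -> None:
        # Only initialize if nobody else (e.g. an external agent) already did,
        # so the trainer can tell whether it owns the global process group.
        if not dist.is_initialized():
            dist.init_process_group(backend=self._backend)
            self._owns_group = True

    def teardown(self) -> None:
        # Tear down only what we set up ourselves.
        if self._owns_group and dist.is_initialized():
            dist.destroy_process_group()
```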
@abstractmethod
def reduce_boolean_decision(self, decision: bool) -> bool:
    """Reduce the early stopping decision across all processes."""
- we shouldn't mention early stopping here
- the world size is also needed
Good catch. How about adding world size as a parameter of the torch collective, setting it in DDP's setup_distributed(), and having DP go with the default of 1? Only the torch collective needs this, right?
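A hypothetical illustration of that suggestion: world size becomes a constructor parameter of the torch collective, injected during DDP's distributed setup and left at the default of 1 elsewhere. All names here are assumed for illustration:

```python
class TorchCollective:
    def __init__(self, world_size: int = 1) -> None:
        # World size is injected by whoever sets up the distributed
        # environment; non-distributed plugins keep the default of 1.
        self.world_size = world_size


# Inside a hypothetical DDPPlugin.setup_distributed():
#     self.collective = TorchCollective(world_size=self.num_nodes * self.num_processes)
# DP keeps the default:
#     self.collective = TorchCollective()  # world_size == 1
```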
+1, the comment looks quite specific. What does reduce_boolean_decision really mean here? What is the reduce op, given a list of booleans? It would probably help to add a sentence explaining it, plus a concrete example.
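To illustrate the ambiguity: with torch.distributed alone there are at least two reasonable semantics for reducing a boolean decision across ranks. A sketch, assuming an already-initialized process group (e.g. gloo):

```python
import torch
import torch.distributed as dist


def reduce_decision_all(decision: bool) -> bool:
    """True only if every rank voted True (logical AND via MIN)."""
    tensor = torch.tensor(int(decision))
    dist.all_reduce(tensor, op=dist.ReduceOp.MIN)
    return bool(tensor.item())


def reduce_decision_any(decision: bool) -> bool:
    """True if at least one rank voted True (logical OR via MAX)."""
    tensor = torch.tensor(int(decision))
    dist.all_reduce(tensor, op=dist.ReduceOp.MAX)
    return bool(tensor.item())
```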
I don't think reduce_boolean_decision should be part of the collective interface. This is really higher-order functionality for how the Training Type plugin can use the collective to reach consensus across ranks. So I think this should sit at the training type plugin interface instead of here.
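A sketch of the layering suggested here: the collective exposes only generic primitives, and the training type plugin composes them into consensus helpers. All names are illustrative, not the final Lightning API:

```python
import torch


class TrainingTypePlugin:
    def __init__(self, collective: "CollectivePlugin") -> None:
        self.collective = collective  # assumed to provide a generic reduce()

    def reduce_boolean_decision(self, decision: bool) -> bool:
        # The consensus policy (unanimous agreement here) is a plugin-level
        # decision built on top of the collective's raw reduce primitive.
        tensor = torch.tensor(int(decision), dtype=torch.int32)
        reduced = self.collective.reduce(tensor, reduce_op="min")
        return bool(reduced.item())
```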
Can you add these classes to pyproject.toml for mypy type checking? We want newly added classes and files to be fully type-checked.
I like it very much.
3/n will be the integration?
> I like it very much. 3/n will be the integration?

Yep, will integrate with TTPs and accelerators.
@four4fish at this point, I don't think offering a new collective base class that unifies all these disparate backends is the right approach. It offers new API surface area in Lightning but can't guarantee that the behavior from the different underlying libraries won't diverge or have different semantics. IMO this should be addressed at the torch distributed level, especially for XLA usage. In the meantime, Lightning can get by calling the existing utilities as-is inside of the corresponding strategy class.
@ananthsub are you advocating dropping this proposal entirely? Is there any specific reason for the change of direction?
@carmocca @awaelchli I have synced up with Ananth offline. We are not changing direction or dropping this plan; we are thinking about switching the implementation order a bit. How about having this land after steps 2 and 3? (I'm working on the PR for #7324 right now.)
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help, see our docs: https://pytorch-lightning.readthedocs.io/en/latest/generated/CONTRIBUTING.html#pull-request or ask for the assistance of a core contributor here or on Slack. Thank you for your contributions.
This pull request is going to be closed. Please feel free to reopen it or create a new one from the current master.
What does this PR do?
Implements the proposal here: https://docs.google.com/document/d/1e83FcZHHHsTmiUmNpgmPTBYjugZaC3pcZhd4VV9AuIM/edit
Steps:
Fixes #7534
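For context, a minimal sketch of the shape the proposal describes, inferred from the files and comments in this PR. CollectivePlugin and the exact method set are assumptions; SingleDeviceCollective matches the file reviewed above:

```python
from abc import ABC, abstractmethod
from typing import Any, Optional


class CollectivePlugin(ABC):
    """Base interface for collective communication, implemented per backend
    (torch.distributed, Horovod, XLA, single device)."""

    @abstractmethod
    def barrier(self, name: Optional[str] = None) -> None:
        """Synchronize all processes."""

    @abstractmethod
    def broadcast(self, obj: Any, src: int = 0) -> Any:
        """Send obj from rank src to every process."""

    @abstractmethod
    def all_gather(self, tensor: Any) -> Any:
        """Gather a tensor from every process."""


class SingleDeviceCollective(CollectivePlugin):
    """Trivial single-process implementation: no-ops and identities."""

    def barrier(self, name: Optional[str] = None) -> None:
        pass

    def broadcast(self, obj: Any, src: int = 0) -> Any:
        return obj

    def all_gather(self, tensor: Any) -> Any:
        return tensor
```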
Does your PR introduce any breaking changes? If yes, please list them.
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:
Did you have fun?
Make sure you had fun coding 🙃