
Advanced GPU Documentation #7259


Merged

merged 11 commits into master on May 6, 2021

Conversation

SeanNaren
Contributor

@SeanNaren SeanNaren commented Apr 28, 2021

What does this PR do?

Introduces a new advanced multi-GPU section with more explanation and detail, cleans up old APIs, and adds Fully Sharded training and activation checkpointing.

Many of the high-level points may need actual data to back them up, but they are collated from the DeepSpeed/FairScale teams. For now it's more important to highlight the high-level points, and then trickle down to data points via visualizations if possible.

If anyone has any suggestions for a better name than Advanced GPU Optimized Training, let me know!

cc @ananthsub @shuyingsunshine21 @min-xu-ai

TODO:

  • Add DDP Communication Hooks section
  • Potentially come up with a better name than Memory Optimized Multi-GPU Training (how about Advanced Multi-GPU Training)?
  • Add FSDP requirement that you must use self.trainer.model when doing configure_optimizers (see the sketch below)
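For reference, a minimal sketch of what that requirement looks like in a LightningModule (the model and optimizer below are placeholder choices; the point is building the optimizer from `self.trainer.model` rather than `self`):

```python
import torch
import pytorch_lightning as pl


class MyModel(pl.LightningModule):
    def configure_optimizers(self):
        # Under Fully Sharded, parameters are flattened and sharded during setup,
        # so the optimizer must be built from the wrapped model held by the trainer,
        # not from `self.parameters()`.
        return torch.optim.AdamW(self.trainer.model.parameters(), lr=1e-3)
```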

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@SeanNaren SeanNaren added the docs Documentation related label Apr 28, 2021
@SeanNaren SeanNaren added this to the v1.4 milestone Apr 28, 2021
@SeanNaren SeanNaren self-assigned this Apr 28, 2021
@codecov

codecov bot commented Apr 28, 2021

Codecov Report

Merging #7259 (74e7fc3) into master (6b29211) will increase coverage by 0%.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #7259    +/-   ##
=======================================
  Coverage      91%     92%            
=======================================
  Files         199     200     +1     
  Lines       12779   12982   +203     
=======================================
+ Hits        11679   11896   +217     
+ Misses       1100    1086    -14     

@min-xu-ai min-xu-ai left a comment

fairscale part looks great to me. Thanks for adding this great doc!

.. code-block:: python

    # train using Sharded DDP
    trainer = Trainer(plugins='ddp_sharded')


I suggest adding a plugin alias of "sdp", it is kind of easier and fits the group of names like "ddp", "sdp" and "fsdp".


When not using Fully Sharded, these wrap functions are a no-op. This means that once the changes have been made, there is no need to remove them for other plugins.

This is a requirement for really large models, and it also saves on instantiation time, as modules are sharded immediately rather than after the entire model has been created in memory.
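For context, a rough sketch of the wrapping pattern the passage above describes, assuming FairScale's `wrap` utility (layer sizes are made up; outside a Fully Sharded wrapping context the `wrap` calls simply return the module unchanged):

```python
import torch
from fairscale.nn import wrap  # no-op outside a Fully Sharded wrapping context


class MyBigModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Under Fully Sharded, each wrapped layer is sharded as soon as it is built,
        # instead of waiting for the entire model to exist in memory first.
        self.block_1 = wrap(torch.nn.Linear(4096, 4096))
        self.block_2 = wrap(torch.nn.Linear(4096, 10))

    def forward(self, x):
        return self.block_2(self.block_1(x))
```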


I don't know if something needs to be said about model weight init here? Is that taken care of by lightning? If users get to control it, they need to make sure all workers init the same weights, or the shards will be from different weight init values at each worker.

Later we will try to add a way to sync params from rank 0, in that case, we can remove this restriction.

Contributor Author

Thanks @min-xu-ai, could you go into more detail about what is different here compared to DDP?


Sure. With DDP, you can have this:

rank 0           rank 1
m = model()      m = model()    <-- the two ranks may have different weights due to different random seeds
train(m)         train(m)       <-- weights are synced by DDP

With FSDP, since m is sharded, parts of the weights will come from rank 0 and parts from rank 1 when sharding happens. That can break weight init assumptions, like zero mean and unit stddev, etc.

Therefore, until FSDP can sync weights between ranks, weight init needs to be handled very carefully with FSDP.
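For illustration, a minimal sketch of one way to meet that requirement today: seed every rank identically before building the model, so each rank constructs the same initial weights and the resulting shards are consistent with a single global init (`seed_everything` is Lightning's helper; plain `torch.manual_seed` works the same way):

```python
import torch
from pytorch_lightning import seed_everything


def build_model() -> torch.nn.Module:
    # Same seed on every rank -> identical initial weights on every rank,
    # so the shards each rank keeps add up to one coherent weight init.
    seed_everything(42)
    return torch.nn.Sequential(
        torch.nn.Linear(512, 512),
        torch.nn.ReLU(),
        torch.nn.Linear(512, 10),
    )
```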

Contributor Author

What would be a potential solution here? Use torch.distributed communications to sync global stats across all shards when initializing the model? It would be good to have a solution in the docs for users to follow as an example!
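For illustration, one shape that sync could take with plain torch.distributed, as a sketch only (the helper name is hypothetical, and it assumes the default process group is already initialized):

```python
import torch
import torch.distributed as dist


def broadcast_init_from_rank_0(module: torch.nn.Module) -> None:
    # Rank 0's freshly initialized weights overwrite every other rank's,
    # so all ranks start from the same init before any sharding happens.
    for param in module.parameters():
        dist.broadcast(param.data, src=0)
```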

Contributor Author

Just following up here because it might be a solution @min-xu-ai: using SummonFullParams may be a way to init the model locally, if the model fits into memory, and then broadcast the results. I'll add this to the FSDP docs once we merge the feature in!

@SeanNaren
Contributor Author

Appreciate the comments @min-xu-ai, will address the last comments ASAP!

@williamFalcon
Contributor

maybe an advanced tutorials section?
and add this as the first bullet

@SeanNaren
Contributor Author

maybe an advanced tutorials section?
and add this as the first bullet

Agreed. I actually plan on doing something similar in terms of layout to this, which would be closer to an actual tutorial: https://pytorch.org/tutorials/intermediate/ddp_tutorial.html

Like, if I wanted to train a transformer model on my data, what are the steps I should take? What should the process look like?

Contributor

@ananthsub ananthsub left a comment

@SeanNaren another setting from DDP that enables memory savings is gradient_as_bucket_view=True: https://pytorch.org/docs/master/_modules/torch/nn/parallel/distributed.html#DistributedDataParallel

This should save an extra ~10-15% of peak memory usage and can be an intermediate option for users who don't need the sharded/fully sharded/DeepSpeed engines.

cc @zhaojuanmao
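For illustration, a sketch of how that flag could be passed through, assuming Lightning's DDPPlugin forwards extra keyword arguments to torch.nn.parallel.DistributedDataParallel (worth verifying against the plugin's signature):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DDPPlugin

# `gradient_as_bucket_view=True` lets gradients alias DDP's allreduce buckets,
# removing one full copy of the gradients from peak memory.
trainer = Trainer(
    gpus=4,
    accelerator="ddp",
    plugins=DDPPlugin(gradient_as_bucket_view=True),
)
```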

@SeanNaren
Contributor Author

I have added both the DDP comm hooks (need to test them myself) and gradient_as_bucket_view to the docs! Let me know what you think @ananthsub :)

Once FSDP is merged, we can merge this PR
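As a reference for what a comm hook looks like at the PyTorch level, a minimal sketch using the built-in FP16 compression hook (how Lightning exposes this in the docs may differ; this only shows the underlying DDP API):

```python
import torch
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks
from torch.nn.parallel import DistributedDataParallel

# Assumes torch.distributed.init_process_group(...) has already been called
# and each process has been assigned its own CUDA device.
model = DistributedDataParallel(torch.nn.Linear(32, 32).cuda())

# Compress gradients to FP16 for the allreduce to roughly halve communication
# volume; they are cast back to the original dtype after the reduction.
model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)
```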

@ananthsub
Contributor

ananthsub commented May 5, 2021

@SeanNaren these are fantastic docs!!! these would be super useful to even merge now (minus FSDP) and then we can add back the FSDP section once #6152 is merged. What do you think?

@SeanNaren
Contributor Author

@SeanNaren these are fantastic docs!!! these would be super useful to even merge now (minus FSDP) and then we can add back the FSDP section once #6152 is merged. What do you think?

Thanks @ananthsub, let me remove the FSDP stuff and merge :) I was hoping we'd get the FSDP stuff merged first, but I'll separate it out to get this in ASAP.

@SeanNaren SeanNaren changed the title [WIP] Separate Optimized Multi-GPU Plugins Documentation Separate Advanced GPU Documentation May 6, 2021
@SeanNaren SeanNaren requested a review from a team May 6, 2021 09:38
@SeanNaren
Contributor Author

SeanNaren commented May 6, 2021

I have as followups to this PR:

  • Once FSDP merged, re-add FSDP docs
  • Create advanced tutorial section and add this as a bullet point
  • Once DeepSpeed infinity is merged, add this to the docs

EDIT:

I also dropped the Sequential RPC Plugin, as it will be removed entirely once FSDP is merged (which should happen soonish).

@SeanNaren SeanNaren marked this pull request as ready for review May 6, 2021 09:40
@SeanNaren SeanNaren changed the title Separate Advanced GPU Documentation Advanced GPU Documentation May 6, 2021
@awaelchli
Contributor

Great docs @SeanNaren, very high quality like everything you do.

Some sections have larger separation than others. Maybe it was intentional but I don't see the pattern.

[screenshot of the rendered docs showing the uneven section spacing]

Member

@justusschock justusschock left a comment

Great Work!

Contributor

@tchaton tchaton left a comment

Awesome work!

Sean Naren and others added 3 commits May 6, 2021 12:16
@SeanNaren
Contributor Author

Thanks so much guys, should've addressed all points (thanks @carmocca for cleaning up!)

@SeanNaren SeanNaren merged commit 94f6c3e into master May 6, 2021
@SeanNaren SeanNaren deleted the docs/advanced_multigpu branch May 6, 2021 12:53
@Borda Borda modified the milestones: v1.4, v1.3 May 6, 2021
@SeanNaren SeanNaren mentioned this pull request May 14, 2021