Advanced GPU Documentation #7259
Conversation
# Conflicts:
#   docs/source/advanced/multi_gpu.rst
Codecov Report
@@           Coverage Diff            @@
##           master   #7259   +/-   ##
=======================================
  Coverage      91%     92%
=======================================
  Files         199     200      +1
  Lines       12779   12982    +203
=======================================
+ Hits        11679   11896    +217
+ Misses       1100    1086     -14
fairscale part looks great to me. Thanks for adding this great doc!
.. code-block:: python

    # train using Sharded DDP
    trainer = Trainer(plugins='ddp_sharded')
I suggest adding a plugin alias of "sdp"; it is a bit easier to type and fits the group of names like "ddp", "sdp" and "fsdp".
When not using Fully Sharded, these wrap functions are a no-op. This means that once the changes have been made, there is no need to remove the changes for other plugins.
This is a requirement for really large models and also saves on instantiation time as modules are sharded instantly, rather than after the entire model is created in memory. |
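For illustration, a minimal sketch of what this could look like in a LightningModule, assuming the plugin activates fairscale's ``wrap``/``auto_wrap`` inside ``configure_sharded_model`` (the module names and layer sizes here are made up):

.. code-block:: python

    import torch.nn as nn
    from pytorch_lightning import LightningModule
    from fairscale.nn import wrap, auto_wrap  # assumes fairscale is installed

    class MyModel(LightningModule):
        def configure_sharded_model(self):
            # layers created here are sharded as soon as they are wrapped,
            # rather than after the entire model exists in memory; when not
            # using Fully Sharded, these wrap calls are a no-op
            self.backbone = auto_wrap(nn.Sequential(nn.Linear(32, 32), nn.ReLU()))
            self.head = wrap(nn.Linear(32, 2))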
I don't know if something needs to be said about model weight init here? Is that taken care of by lightning? If users get to control it, they need to make sure all workers init the same weights, or the shards will be from different weight init values at each worker.
Later we will try to add a way to sync params from rank 0; in that case, we can remove this restriction.
Thanks @min-xu-ai, could you go into more details as to what is different here compared to DDP?
Sure. With DDP, you can have this:

    rank 0               rank 1
    m = model()          m = model()   <------ two ranks may have different weights due to different random seeds
    train(m)             train(m)      <------ weights are synced by ddp

With FSDP, since m is sharded, parts of the weights will be from rank 0 and parts of the weights will be from rank 1 when sharding happens. That can break the weight init assumptions, like zero mean and unit stddev, etc.
Therefore, until FSDP can sync weights between ranks, weight init needs to be very careful with FSDP.
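As an aside, one simple way to "be careful" here, sketched below purely for illustration, is to seed every rank identically before the module is instantiated, so each rank starts from the same weights and the assembled shards reflect one consistent initialization (``MyModel`` is a hypothetical LightningModule):

    import pytorch_lightning as pl

    # identical seed on every rank -> identical weight init before sharding,
    # so the shards gathered from different ranks agree with each other
    pl.seed_everything(42)
    model = MyModel()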
What would be a potential solution here? Use torch.distributed communications to sync global stats across all shards when initializing the model? It would be good to have a solution in place in the docs for users to have an example!
Just following up here because it might be a solution @min-xu-ai, but using SummonFullParams may be a way to init the model locally if the model would fit into memory, and then broadcast results. I'll add this into the FSDP docs in time when we merge the feature in!
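A rough, hypothetical sketch of that idea, assuming fairscale's FSDP exposes a ``summon_full_params`` context manager that materializes the full parameters on each rank and persists in-place changes back to the shards (``fsdp_model`` stands for an already-wrapped FullyShardedDataParallel module):

    import torch
    import torch.distributed as dist

    # hypothetical: re-initialize the full weights on rank 0, then broadcast
    # so every rank leaves the context with the same values
    with fsdp_model.summon_full_params():
        if dist.get_rank() == 0:
            for m in fsdp_model.modules():
                if isinstance(m, torch.nn.Linear):
                    torch.nn.init.xavier_uniform_(m.weight)
                    if m.bias is not None:
                        torch.nn.init.zeros_(m.bias)
        for p in fsdp_model.parameters():
            dist.broadcast(p.data, src=0)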
Appreciate the comments @min-xu-ai, will address the last comments ASAP!
Maybe an advanced tutorials section?
Agreed, I actually plan on doing something similar in terms of layout to this, which would be closer to an actual tutorial: https://pytorch.org/tutorials/intermediate/ddp_tutorial.html For example, if I wanted to train a transformer model on my data, what are the steps I should take? What should the process look like?
@SeanNaren another setting from DDP to enable memory savings is to set gradient_as_bucket_view=True: https://pytorch.org/docs/master/_modules/torch/nn/parallel/distributed.html#DistributedDataParallel
This should save an extra ~10-15% of peak memory usage and can be an intermediary option for users who don't need the sharded/fully sharded/deepspeed engines.
cc @zhaojuanmao
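For reference, a small sketch of how this could be wired up in Lightning, assuming DDPPlugin forwards extra keyword arguments straight through to torch's DistributedDataParallel (otherwise the flag can be set on DistributedDataParallel directly):

    from pytorch_lightning import Trainer
    from pytorch_lightning.plugins import DDPPlugin

    # gradients share storage with DDP's communication buckets,
    # trimming peak memory as described in the comment above
    trainer = Trainer(
        gpus=2,
        accelerator="ddp",
        plugins=[DDPPlugin(gradient_as_bucket_view=True)],
    )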
I have added both the DDP Comm hooks (need to test it myself) + the suggestion above. Once FSDP is merged, we can merge this PR.
@SeanNaren these are fantastic docs!!! These would be super useful to even merge now (minus FSDP), and then we can add back the FSDP section once #6152 is merged. What do you think?
Thanks @ananthsub, let me remove the FSDP stuff and merge :) Was hoping we'd get the FSDP stuff merged first, but I'll separate it out to get this in ASAP.
I have as follow-ups to this PR:
EDIT: I also dropped the Sequential RPC Plugin, as this will be removed entirely once FSDP is merged (which should be merged soonish).
Great docs @SeanNaren, very high quality like everything you do. Some sections have larger separation than others. Maybe it was intentional, but I don't see the pattern.
Great Work!
Awesome work!
Co-authored-by: Carlos Mocholí <[email protected]>
Co-authored-by: Justus Schock <[email protected]>
Thanks so much guys, should've addressed all points (thanks @carmocca for cleaning up!)
What does this PR do?
Introduces a new advanced multi-GPU section, with more explanation and details. Cleanup of old APIs + addition of Fully Sharded and activation checkpointing.
A lot of the high-level points may need actual data to back them up, but they are collated from the DeepSpeed/FairScale teams. It's more important right now to highlight the high-level points, and then trickle down to data points via visualizations if possible.
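As a flavour of the activation checkpointing mentioned above, a minimal sketch using fairscale's checkpoint_wrapper (the class name and layer sizes are made up for illustration):

    import torch.nn as nn
    from pytorch_lightning import LightningModule
    from fairscale.nn import checkpoint_wrapper  # assumes fairscale is installed

    class CheckpointedModel(LightningModule):
        def __init__(self):
            super().__init__()
            # activations of this block are recomputed during the backward pass
            # instead of being stored, trading compute for memory
            self.block = checkpoint_wrapper(nn.Sequential(nn.Linear(32, 32), nn.ReLU()))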
If anyone has any suggestions on better naming than Advanced GPU Optimized Training, let me know!

cc @ananthsub @shuyingsunshine21 @min-xu-ai

TODO:
- Memory Optimized Multi-GPU Training (how about Advanced Multi-GPU Training)?
- self.trainer.model when doing configure_optimizers
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:
Did you have fun?
Make sure you had fun coding 🙃