@@ -6,25 +6,44 @@ Wright <https://github.com/lessw2020>`__, `Rohan Varma
<https://github.com/rohan-varma/>`__, `Yanli Zhao
<https://github.com/zhaojuanmao>`__

+ .. grid:: 2
+
+    .. grid-item-card:: :octicon:`mortar-board;1em;` What you will learn
+       :class-card: card-prerequisites
+
+       * PyTorch's Fully Sharded Data Parallel Module: A wrapper for sharding module parameters across
+         data parallel workers.
+
+
+
+
+    .. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites
+       :class-card: card-prerequisites
+
+       * PyTorch 1.12 or later
+       * Read about the `FSDP API <https://pytorch.org/docs/main/fsdp.html>`__.
+

This tutorial introduces more advanced features of Fully Sharded Data Parallel
(FSDP) as part of the PyTorch 1.12 release. To get familiar with FSDP, please
refer to the `FSDP getting started tutorial
<https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html>`__.

In this tutorial, we fine-tune a HuggingFace (HF) T5 model with FSDP for text
- summarization as a working example.
+ summarization as a working example.

The example uses Wikihow and for simplicity, we will showcase the training on a
- single node, P4dn instance with 8 A100 GPUs. We will soon have a blog post on
- large scale FSDP training on a multi-node cluster, please stay tuned for that on
- the PyTorch medium channel.
+ single node, P4dn instance with 8 A100 GPUs. We now have several blog posts
+ (`link1 <https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/>`__,
+ `link2 <https://engineering.fb.com/2021/07/15/open-source/fsdp/>`__)
+ and a `paper <https://arxiv.org/abs/2304.11277>`__ on
+ large scale FSDP training on a multi-node cluster.

FSDP is a production-ready package with a focus on ease of use, performance, and
long-term support. One of the main benefits of FSDP is reducing the memory
footprint on each GPU. This enables training of larger models with lower total
memory vs DDP, and leverages the overlap of computation and communication to
- train models efficiently.
+ train models efficiently.

This reduced memory pressure can be leveraged to either train larger models or
increase batch size, potentially helping overall training throughput. You can
read more about PyTorch FSDP `here
@@ -47,21 +66,21 @@ Recap on How FSDP Works

At a high level FSDP works as follows:

- *In constructor*
+ *In the constructor*

* Shard model parameters and each rank only keeps its own shard

- *In forward pass*
+ *In the forward pass*

* Run `all_gather` to collect all shards from all ranks to recover the full
-   parameter for this FSDP unit Run forward computation
- * Discard non-owned parameter shards it has just collected to free memory
+   parameter for this FSDP unit and run the forward computation
+ * Discard the non-owned parameter shards it has just collected to free memory

- *In backward pass*
+ *In the backward pass*

* Run `all_gather` to collect all shards from all ranks to recover the full
-   parameter in this FSDP unit Run backward computation
- * Discard non-owned parameters to free memory.
+   parameter in this FSDP unit and run backward computation
+ * Discard non-owned parameters to free memory.
* Run reduce_scatter to sync gradients

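As a rough illustration of this flow, the following minimal sketch (assuming a
``torchrun`` launch that sets ``LOCAL_RANK``, and a toy linear layer rather than
the T5 model used later in this tutorial) wraps a model in FSDP and runs one
training step:

.. code-block:: python

    import os

    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # constructor: parameters are sharded across ranks; each rank keeps one shard
    model = FSDP(torch.nn.Linear(1024, 1024).to(local_rank))
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    x = torch.randn(8, 1024, device=local_rank)
    loss = model(x).sum()   # forward: all_gather full params per FSDP unit, then free non-owned shards
    loss.backward()         # backward: all_gather again, compute grads, then reduce_scatter to sync them
    optimizer.step()        # each rank updates only its own parameter shard

    dist.destroy_process_group()
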
@@ -80,15 +99,11 @@ examples

*Setup*

- 1.1 Install PyTorch Nightlies
-
- We will install PyTorch nightlies, as some of the features such as activation
- checkpointing is available in nightlies and will be added in next PyTorch
- release after 1.12.
+ 1.1 Install the latest PyTorch

- .. code-block:: bash
+ .. code-block:: bash

-    pip3 install --pre torch torchvision torchaudio -f https://download.pytorch.org/whl/nightly/cu113/torch_nightly.html
+    pip3 install torch torchvision torchaudio

1.2 Dataset Setup

@@ -154,7 +169,7 @@ Next, we add the following code snippets to a Python script “T5_training.py”

    import tqdm
    from datetime import datetime

- 1.4 Distributed training setup.
+ 1.4 Distributed training setup.
Here we use two helper functions to initialize the processes for distributed
training, and then to clean up after training completion. In this tutorial, we
are going to use torch elastic, using `torchrun
@@ -191,13 +206,13 @@ metrics.

        date_of_run = datetime.now().strftime("%Y-%m-%d-%I:%M:%S_%p")
        print(f"--> current date and time of run = {date_of_run}")
        return date_of_run
-
+

    def format_metrics_to_gb(item):
        """quick function to format numbers to gigabyte and round to 4 digit precision"""
        metric_num = item / g_gigabyte
        metric_num = round(metric_num, ndigits=4)
        return metric_num
-
+

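These helpers are handy for logging; for example, a hypothetical way to report
current GPU memory with ``format_metrics_to_gb`` could look like:

.. code-block:: python

    # hypothetical logging snippet using the helper above
    allocated_gb = format_metrics_to_gb(torch.cuda.memory_allocated())
    reserved_gb = format_metrics_to_gb(torch.cuda.memory_reserved())
    print(f"--> GPU memory allocated: {allocated_gb} GB, reserved: {reserved_gb} GB")
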
2.2 Define a train function:

@@ -275,7 +290,7 @@ metrics.

.. code-block:: python

-
+

    def fsdp_main(args):

        model, tokenizer = setup_model("t5-base")
@@ -292,7 +307,7 @@ metrics.

        # wikihow(tokenizer, type_path, num_samples, input_length, output_length, print_text=False)
-        train_dataset = wikihow(tokenizer, 'train', 1500, 512, 150, False)
+        train_dataset = wikihow(tokenizer, 'train', 1500, 512, 150, False)
        val_dataset = wikihow(tokenizer, 'validation', 300, 512, 150, False)

        sampler1 = DistributedSampler(train_dataset, rank=rank, num_replicas=world_size, shuffle=True)
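        # With shuffle=True, DistributedSampler expects sampler1.set_epoch(epoch) to be
        # called at the start of each epoch so that shuffling differs across epochs.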
@@ -430,7 +445,7 @@ metrics.

.. code-block:: python

-
+

    if __name__ == '__main__':
        # Training settings
        parser = argparse.ArgumentParser(description='PyTorch T5 FSDP Example')
@@ -463,7 +478,7 @@ metrics.

To run the training using torchrun:

- .. code-block:: bash
+ .. code-block:: bash

    torchrun --nnodes 1 --nproc_per_node 4 T5_training.py

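The same script can be scaled to multiple nodes by pointing ``torchrun`` at a
shared rendezvous endpoint; a hypothetical two-node launch (where ``NODE0_IP``
is a placeholder for the rank-0 host address) might look like:

.. code-block:: bash

    # hypothetical two-node launch; NODE0_IP stands in for the rank-0 host address
    torchrun --nnodes 2 --nproc_per_node 8 \
        --rdzv_backend c10d --rdzv_endpoint NODE0_IP:29500 \
        T5_training.py
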
@@ -487,7 +502,7 @@ communication efficient. In PyTorch 1.12, FSDP added this support and now we
have a wrapping policy for transformers.

It can be created as follows, where the T5Block represents the T5 transformer
- layer class (holding MHSA and FFN).
+ layer class (holding MHSA and FFN).

.. code-block:: python
@@ -499,7 +514,7 @@ layer class (holding MHSA and FFN).
        },
    )
    torch.cuda.set_device(local_rank)
-
+

    model = FSDP(model,
        auto_wrap_policy=t5_auto_wrap_policy)
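For reference, the ``t5_auto_wrap_policy`` used above is typically built from
FSDP's ``transformer_auto_wrap_policy`` helper; a sketch, assuming the HF
``transformers`` import path for ``T5Block``:

.. code-block:: python

    # sketch of the wrapping policy referenced above
    import functools

    from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
    from transformers.models.t5.modeling_t5 import T5Block

    t5_auto_wrap_policy = functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={T5Block},
    )
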
@@ -513,22 +528,22 @@ Mixed Precision

FSDP supports flexible mixed precision training allowing for arbitrary reduced
precision types (such as fp16 or bfloat16). Currently BFloat16 is only available
on Ampere GPUs, so you need to confirm native support before you use it. On
- V100s for example, BFloat16 can still be run but due to it running non-natively,
+ V100s for example, BFloat16 can still be run but because it runs non-natively,
it can result in significant slowdowns.

To check if BFloat16 is natively supported, you can use the following:

.. code-block:: python

-
+

    bf16_ready = (
        torch.version.cuda
-        and torch.cuda.is_bf16_supported()
+        and torch.cuda.is_bf16_supported()
        and LooseVersion(torch.version.cuda) >= "11.0"
        and dist.is_nccl_available()
        and nccl.version() >= (2, 10)
    )

- One of the advantages of mixed percision in FSDP is providing granular control
+ One of the advantages of mixed precision in FSDP is providing granular control
over different precision levels for parameters, gradients, and buffers as
follows:

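As a sketch, a bfloat16 policy (using the ``bfSixteen`` name that appears in the
FSDP wrapper below) and its selection based on the ``bf16_ready`` check above
could look like:

.. code-block:: python

    from torch.distributed.fsdp import MixedPrecision

    # sketch: cast parameters, gradient reduction, and buffers to bfloat16
    bfSixteen = MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    )

    # fall back to full precision if native bfloat16 support is missing
    mp_policy = bfSixteen if bf16_ready else None
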
@@ -571,7 +586,7 @@ with the following policy:

.. code-block:: python

    grad_bf16 = MixedPrecision(reduce_dtype=torch.bfloat16)
-
+

In 2.4 we just add the relevant mixed precision policy to the FSDP wrapper:

@@ -604,9 +619,9 @@ CPU-based initialization:
        auto_wrap_policy=t5_auto_wrap_policy,
        mixed_precision=bfSixteen,
        device_id=torch.cuda.current_device())
-
-
+
+

Sharding Strategy
-----------------
FSDP sharding strategy by default is set to fully shard the model parameters,
@@ -627,7 +642,7 @@ instead of "ShardingStrategy.FULL_SHARD" to the FSDP initialization as follows:

        sharding_strategy=ShardingStrategy.SHARD_GRAD_OP # ZERO2)

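In context, the full wrapper call would pass this argument alongside the options
from 2.4; a sketch, reusing the ``t5_auto_wrap_policy`` and ``bfSixteen`` names
from earlier sections:

.. code-block:: python

    from torch.distributed.fsdp import ShardingStrategy

    # sketch: ZeRO-2 style sharding, reusing names defined earlier in the tutorial
    model = FSDP(model,
        auto_wrap_policy=t5_auto_wrap_policy,
        mixed_precision=bfSixteen,
        device_id=torch.cuda.current_device(),
        sharding_strategy=ShardingStrategy.SHARD_GRAD_OP)  # ZERO2
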
This will reduce the communication overhead in FSDP; in this case, it holds full
- parameters after forward and through the backwards pass.
+ parameters after forward and through the backwards pass.

This saves an all_gather during backwards so there is less communication at the
cost of a higher memory footprint. Note that full model params are freed at the
@@ -652,12 +667,12 @@ wrapper in 2.4 as follows:
        mixed_precision=bfSixteen,
        device_id=torch.cuda.current_device(),
        backward_prefetch=BackwardPrefetch.BACKWARD_PRE)
-
+

`backward_prefetch` has two modes, `BACKWARD_PRE` and `BACKWARD_POST`.
`BACKWARD_POST` means that the next FSDP unit's params will not be requested
until the current FSDP unit processing is complete, thus minimizing memory
overhead. In some cases, using `BACKWARD_PRE` can increase model training speed
- up to 2-10%, with even higher speed improvements noted for larger models.
+ up to 2-10%, with even higher speed improvements noted for larger models.

Model Checkpoint Saving, by streaming to the Rank0 CPU
------------------------------------------------------
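As a sketch of this approach (assuming ``rank`` holds the process's global rank;
the filename below is a placeholder), a full state dict can be gathered to CPU
and materialized only on rank 0 with ``FullStateDictConfig``:

.. code-block:: python

    from torch.distributed.fsdp import FullStateDictConfig, StateDictType

    # gather the full state dict, offloaded to CPU and kept only on rank 0
    save_policy = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, save_policy):
        cpu_state = model.state_dict()

    if rank == 0:
        torch.save(cpu_state, "t5_checkpoint.pt")  # placeholder filename
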
@@ -696,7 +711,7 @@ Pytorch 1.12 and used HF T5 as the running example. Using the proper wrapping
policy especially for transformer models, along with mixed precision and
backward prefetch should speed up your training runs. Also, features such as
initializing the model on device, and checkpoint saving via streaming to CPU
- should help to avoid OOM errors in dealing with large models.
+ should help to avoid OOM errors in dealing with large models.

We are actively working to add new features to FSDP for the next release. If
you have feedback, feature requests, questions or are encountering issues