Commit 7b0b6c3

Author: Sean Naren
Update DeepSpeed docs after single saving PR (#7036)
1 parent f852a4f commit 7b0b6c3

1 file changed: +1 -13 lines changed

docs/source/advanced/multi_gpu.rst

Lines changed: 1 addition & 13 deletions
@@ -813,12 +813,6 @@ Below we describe how to enable all of these to see benefit. **With all these im
 
 Also please have a look at our :ref:`deepspeed-zero-stage-3-tips` which contains a lot of helpful information when configuring your own models.
 
-.. note::
-    Currently we only support non-elastic checkpointing. This means saving the model across GPUs will save shards of the model on all processes, which will then require the same number of GPUs to load.
-    This additionally means for inference you must use the ``Trainer.test`` or ``Trainer.predict`` functionality as described below, to ensure we set up the distributed environment correctly.
-
-    This limitation is actively being worked on and will be resolved in the near future.
-
 .. code-block:: python
 
     from pytorch_lightning import Trainer
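
For context on the note removed above: it directed inference through the ``Trainer`` so the distributed environment is set up correctly. A minimal sketch of what that looks like with the DeepSpeed plugin (``MyModel`` is a hypothetical LightningModule, and the ``Trainer`` arguments are illustrative, not part of this commit):

.. code-block:: python

    from pytorch_lightning import Trainer
    from pytorch_lightning.plugins import DeepSpeedPlugin

    # MyModel is a placeholder for your own LightningModule.
    model = MyModel()

    # Running inference through the Trainer sets up the distributed
    # environment the same way it was set up for training.
    trainer = Trainer(gpus=4, plugins=DeepSpeedPlugin(stage=3), precision=16)
    trainer.test(model)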
@@ -848,12 +842,6 @@ We expose a hook that layers initialized within the hook will be sharded instant
 
 This reduces the time taken to initialize very large models, as well as ensuring we do not run out of memory when instantiating larger models. For more information you can refer to the DeepSpeed docs for `Constructing Massive Models <https://deepspeed.readthedocs.io/en/latest/zero3.html>`_.
 
-.. note::
-    When using the ``configure_sharded_model`` hook to shard models, note that ``LightningModule.load_from_checkpoint`` may not work for loading saved checkpoints. If you've trained on one GPU, you can manually instantiate the model and call the hook;
-    however, when using multiple GPUs, this will not work, as ``LightningModule.load_from_checkpoint`` doesn't support sharded checkpoints.
-
-    We recommend using ``Trainer.test`` or ``Trainer.predict`` for inference.
-
 .. code-block:: python
 
     from pytorch_lightning import Trainer
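
The ``configure_sharded_model`` hook mentioned in the removed note is defined on the LightningModule; layers created inside it are sharded as soon as they are instantiated. A minimal sketch, assuming a trivial model (the layer sizes are arbitrary):

.. code-block:: python

    import torch
    from pytorch_lightning import LightningModule

    class MyModel(LightningModule):
        def configure_sharded_model(self):
            # Layers created here are sharded across processes as they are
            # instantiated, so the full model never has to fit on one device.
            self.block = torch.nn.Sequential(
                torch.nn.Linear(32, 32),
                torch.nn.ReLU(),
            )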
@@ -950,7 +938,7 @@ Here is some helpful information when setting up DeepSpeed ZeRO Stage 3 with Lig
 * If you're using Adam or AdamW, be sure to use FusedAdam or DeepSpeedCPUAdam (for CPU offloading) rather than the default torch optimizers, as they come with large speed benefits
 * Treat your GPU/CPU memory as one large pool. In some cases, you may not want to offload certain things (like activations) to provide even more space to offload model parameters
 * When offloading to the CPU, make sure to bump up the batch size, as GPU memory will be freed
-
+* We also support sharded checkpointing. By passing ``save_full_weights=False`` to the ``DeepSpeedPlugin``, we'll save shards of the model, which allows you to save extremely large models. However, to load the model and run test/validation/predict you must use the Trainer object.
 
 Custom DeepSpeed Config
 """""""""""""""""""""""
