diff --git a/.azure-pipelines/gpu-tests.yml b/.azure-pipelines/gpu-tests.yml index f009ea1b9bb0b..8f49b3346009d 100644 --- a/.azure-pipelines/gpu-tests.yml +++ b/.azure-pipelines/gpu-tests.yml @@ -51,7 +51,7 @@ jobs: - bash: | python -c "fname = 'requirements/extra.txt' ; lines = [line for line in open(fname).readlines() if 'horovod' not in line] ; open(fname, 'w').writelines(lines)" pip install fairscale>=0.3.4 - pip install "deepspeed==0.4.3" # FIXME: bug with >= 0.4.4 + pip install deepspeed==0.5.4 pip install . --requirement requirements/devel.txt pip list displayName: 'Install dependencies' @@ -106,10 +106,10 @@ jobs: set -e python -m pytest pl_examples -v --maxfail=2 --durations=0 bash pl_examples/run_examples.sh --trainer.gpus=1 - bash pl_examples/run_examples.sh --trainer.gpus=2 --trainer.accelerator=ddp - bash pl_examples/run_examples.sh --trainer.gpus=2 --trainer.accelerator=ddp --trainer.precision=16 - bash pl_examples/run_examples.sh --trainer.gpus=2 --trainer.accelerator=dp - bash pl_examples/run_examples.sh --trainer.gpus=2 --trainer.accelerator=dp --trainer.precision=16 + bash pl_examples/run_examples.sh --trainer.gpus=2 --trainer.strategy=ddp + bash pl_examples/run_examples.sh --trainer.gpus=2 --trainer.strategy=ddp --trainer.precision=16 + bash pl_examples/run_examples.sh --trainer.gpus=2 --trainer.strategy=dp + bash pl_examples/run_examples.sh --trainer.gpus=2 --trainer.strategy=dp --trainer.precision=16 env: PL_USE_MOCKED_MNIST: "1" displayName: 'Testing: examples' diff --git a/.gitignore b/.gitignore index 6ad0671fb3306..7b1247433e7b4 100644 --- a/.gitignore +++ b/.gitignore @@ -156,3 +156,4 @@ cifar-10-batches-py *.pt # ctags tags +.tags diff --git a/CHANGELOG.md b/CHANGELOG.md index 70044b87791f6..06d3f824e8470 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -5,11 +5,14 @@ All notable changes to this project will be documented in this file. The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/). -## [unReleased] - 2021-MM-DD +## [1.5.0] - 2021-MM-DD ### Added +- Added support for monitoring the learning rate without schedulers in `LearningRateMonitor` ([#9786](https://github.com/PyTorchLightning/pytorch-lightning/issues/9786)) + + - Register `ShardedTensor` state dict hooks in `LightningModule.__init__` if the pytorch version supports `ShardedTensor` ([#8944](https://github.com/PyTorchLightning/pytorch-lightning/pull/8944)) @@ -163,6 +166,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/). - Added a warning when an unknown key is encountered in optimizer configuration, and when `OneCycleLR` is used with `"interval": "epoch"` ([#9666](https://github.com/PyTorchLightning/pytorch-lightning/pull/9666)) +- Added `DeviceStatsMonitor` callback ([#9712](https://github.com/PyTorchLightning/pytorch-lightning/pull/9712)) + + - Added `enable_progress_bar` to Trainer constructor ([#9664](https://github.com/PyTorchLightning/pytorch-lightning/pull/9664)) @@ -175,13 +181,36 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Enabled automatic parameters tying for TPUs ([#9525](https://github.com/PyTorchLightning/pytorch-lightning/pull/9525)) +- Raise a `MisconfigurationException` when trainer functions are called with `ckpt_path="best"` but `checkpoint_callback` isn't configured ([#9841](https://github.com/PyTorchLightning/pytorch-lightning/pull/9841)) + + - Added support for `torch.autograd.set_detect_anomaly` through `Trainer` constructor argument `detect_anomaly` ([#9848](https://github.com/PyTorchLightning/pytorch-lightning/pull/9848)) +- Added a `len` method to `LightningDataModule` ([#9895](https://github.com/PyTorchLightning/pytorch-lightning/pull/9895)) + + +- Added `enable_model_summary` flag to Trainer ([#9699](https://github.com/PyTorchLightning/pytorch-lightning/pull/9699)) + + +- Added `strategy` argument to Trainer ([#8597](https://github.com/PyTorchLightning/pytorch-lightning/pull/8597)) + + +- Added `kfold` example for loop customization ([#9965](https://github.com/PyTorchLightning/pytorch-lightning/pull/9965)) + + +- LightningLite: + * Added `PrecisionPlugin.forward_context`, making it the default implementation for all `{train,val,test,predict}_step_context()` methods ([#9988](https://github.com/PyTorchLightning/pytorch-lightning/pull/9988)) + + ### Changed +- Setting `Trainer(accelerator="ddp_cpu")` now does not spawn a subprocess if `num_processes` is kept `1` along with `num_nodes > 1` ([#9603](https://github.com/PyTorchLightning/pytorch-lightning/pull/9603)). + + - Module imports are now catching `ModuleNotFoundError` instead of `ImportError` ([#9867](https://github.com/PyTorchLightning/pytorch-lightning/pull/9867)) + - `pytorch_lightning.loggers.neptune.NeptuneLogger` is now consistent with new [neptune-client](https://github.com/neptune-ai/neptune-client) API ([#6867](https://github.com/PyTorchLightning/pytorch-lightning/pull/6867)). Old [neptune-client](https://github.com/neptune-ai/neptune-client) API is supported by `NeptuneClient` from [neptune-contrib](https://github.com/neptune-ai/neptune-contrib) repo. @@ -257,6 +286,10 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/). - Changed `HorovodPlugin.all_gather` to return a `torch.Tensor` instead of a list ([#9696](https://github.com/PyTorchLightning/pytorch-lightning/pull/9696)) +- Changed Trainer connectors to be protected attributes: + * Configuration Validator ([#9779](https://github.com/PyTorchLightning/pytorch-lightning/pull/9779)) + + - Restore `current_epoch` and `global_step` irrespective of trainer task ([#9413](https://github.com/PyTorchLightning/pytorch-lightning/pull/9413)) @@ -269,8 +302,25 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/). 
- Update the logic to check for accumulation steps with deepspeed ([#9826](https://github.com/PyTorchLightning/pytorch-lightning/pull/9826)) +- `pytorch_lightning.utilities.grads.grad_norm` now raises an exception if parameter `norm_type <= 0` ([#9765](https://github.com/PyTorchLightning/pytorch-lightning/pull/9765)) + + + +- Updated error message for interactive incompatible plugins ([#9896](https://github.com/PyTorchLightning/pytorch-lightning/pull/9896)) + + +- Updated several places in the loops and trainer to access `training_type_plugin` directly instead of `accelerator` ([#9901](https://github.com/PyTorchLightning/pytorch-lightning/pull/9901)) + + + ### Deprecated +- Deprecated trainer argument `terminate_on_nan` in favour of `detect_anomaly`([#9175](https://github.com/PyTorchLightning/pytorch-lightning/pull/9175)) + + +- Deprecated `Trainer.terminate_on_nan` public attribute access ([#9849](https://github.com/PyTorchLightning/pytorch-lightning/pull/9849)) + + - Deprecated `LightningModule.summarize()` in favor of `pytorch_lightning.utilities.model_summary.summarize()` @@ -310,7 +360,7 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/). - Deprecated passing `progress_bar_refresh_rate` to the `Trainer` constructor in favor of adding the `ProgressBar` callback with `refresh_rate` directly to the list of callbacks, or passing `enable_progress_bar=False` to disable the progress bar ([#9616](https://github.com/PyTorchLightning/pytorch-lightning/pull/9616)) -- Deprecate `LightningDistributed` and move the broadcast logic to `DDPPlugin` and `DDPSpawnPlugin` directly ([#9691](https://github.com/PyTorchLightning/pytorch-lightning/pull/9691)) +- Deprecated `LightningDistributed` and move the broadcast logic to `DDPPlugin` and `DDPSpawnPlugin` directly ([#9691](https://github.com/PyTorchLightning/pytorch-lightning/pull/9691)) - Deprecated passing `stochastic_weight_avg` from the `Trainer` constructor in favor of adding the `StochasticWeightAveraging` callback directly to the list of callbacks ([#8989](https://github.com/PyTorchLightning/pytorch-lightning/pull/8989)) @@ -319,12 +369,23 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/). 
- Deprecated Accelerator collective API `barrier`, `broadcast`, and `all_gather`, call `TrainingTypePlugin` collective API directly ([#9677](https://github.com/PyTorchLightning/pytorch-lightning/pull/9677)) +- Deprecated `checkpoint_callback` from the `Trainer` constructor in favour of `enable_checkpointing` ([#9754](https://github.com/PyTorchLightning/pytorch-lightning/pull/9754)) + + - Deprecated the `LightningModule.on_post_move_to_device` method ([#9525](https://github.com/PyTorchLightning/pytorch-lightning/pull/9525)) - Deprecated `pytorch_lightning.core.decorators.parameter_validation` in favor of `pytorch_lightning.utilities.parameter_tying.set_shared_parameters` ([#9525](https://github.com/PyTorchLightning/pytorch-lightning/pull/9525)) +- Deprecated passing `weights_summary` to the `Trainer` constructor in favor of adding the `ModelSummary` callback with `max_depth` directly to the list of callbacks ([#9699](https://github.com/PyTorchLightning/pytorch-lightning/pull/9699)) + + +- Deprecated `log_gpu_memory`, `gpu_metrics`, and util funcs in favor of `DeviceStatsMonitor` callback ([#9921](https://github.com/PyTorchLightning/pytorch-lightning/pull/9921)) + + +- Deprecated `GPUStatsMonitor` and `XLAStatsMonitor` in favor of `DeviceStatsMonitor` callback ([#9924](https://github.com/PyTorchLightning/pytorch-lightning/pull/9924)) + ### Removed - Removed deprecated `metrics` ([#8586](https://github.com/PyTorchLightning/pytorch-lightning/pull/8586/)) @@ -423,9 +484,24 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/). - Removed `call_configure_sharded_model_hook` property from `Accelerator` and `TrainingTypePlugin` ([#9612](https://github.com/PyTorchLightning/pytorch-lightning/pull/9612)) +- Removed deprecated trainer flag `Trainer.distributed_backend` in favor of `Trainer.accelerator` ([#9246](https://github.com/PyTorchLightning/pytorch-lightning/pull/9246)) + + - Removed `TrainerProperties` mixin and moved property definitions directly into `Trainer` ([#9495](https://github.com/PyTorchLightning/pytorch-lightning/pull/9495)) +- Removed a redundant warning with `ModelCheckpoint(monitor=None)` callback ([#9875](https://github.com/PyTorchLightning/pytorch-lightning/pull/9875)) + + +- Removed `epoch` from `trainer.logged_metrics` ([#9904](https://github.com/PyTorchLightning/pytorch-lightning/pull/9904)) + + +- Removed `should_rank_save_checkpoint` property from Trainer ([#9433](https://github.com/PyTorchLightning/pytorch-lightning/pull/9433)) + + ### Fixed @@ -450,6 +526,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/). - Fixed `BasePredictionWriter` not returning the batch_indices in a non-distributed setting ([#9432](https://github.com/PyTorchLightning/pytorch-lightning/pull/9432)) +- Fixed an error when running in XLA environments with no TPU attached ([#9572](https://github.com/PyTorchLightning/pytorch-lightning/pull/9572)) + + - Fixed check on torchmetrics logged whose `compute()` output is a multielement tensor ([#9582](https://github.com/PyTorchLightning/pytorch-lightning/pull/9582)) @@ -468,17 +547,35 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Fixed `broadcast` in `DDPPlugin` and ``DDPSpawnPlugin` to respect the `src` input ([#9691](https://github.com/PyTorchLightning/pytorch-lightning/pull/9691)) +- Fixed `self.log(on_epoch=True)` for the `on_batch_start` and `on_train_batch_start` hooks ([#9780](https://github.com/PyTorchLightning/pytorch-lightning/pull/9780)) + + - Fixed restoring training state during `trainer.fit` only ([#9413](https://github.com/PyTorchLightning/pytorch-lightning/pull/9413)) - Fixed DeepSpeed and Lightning both calling the scheduler ([#9788](https://github.com/PyTorchLightning/pytorch-lightning/pull/9788)) + - Fixed missing arguments when saving hyperparameters from the parent class but not from the child class ([#9800](https://github.com/PyTorchLightning/pytorch-lightning/pull/9800)) +- Fixed DeepSpeed GPU device IDs ([#9847](https://github.com/PyTorchLightning/pytorch-lightning/pull/9847)) + + - Reset `val_dataloader` in `tuner/batch_size_scaling` ([#9857](https://github.com/PyTorchLightning/pytorch-lightning/pull/9857)) +- Fixed use of `LightningCLI` in computer_vision_fine_tuning.py example ([#9934](https://github.com/PyTorchLightning/pytorch-lightning/pull/9934)) + + +- Fixed issue with non-init dataclass fields in `apply_to_collection` ([#9963](https://github.com/PyTorchLightning/pytorch-lightning/issues/9963)) + +- Reset `val_dataloader` in `tuner/batch_size_scaling` for binsearch ([#9975](https://github.com/PyTorchLightning/pytorch-lightning/pull/9975)) + + +- Fixed logic to check for spawn in dataloader `TrainerDataLoadingMixin._worker_check` ([#9902](https://github.com/PyTorchLightning/pytorch-lightning/pull/9902)) + + ## [1.4.9] - 2021-09-30 - Fixed `lr_find` to generate same results on multiple calls ([#9704](https://github.com/PyTorchLightning/pytorch-lightning/pull/9704)) diff --git a/benchmarks/test_basic_parity.py b/benchmarks/test_basic_parity.py index e9442dd26e65b..2144be39394cb 100644 --- a/benchmarks/test_basic_parity.py +++ b/benchmarks/test_basic_parity.py @@ -159,7 +159,7 @@ def lightning_loop(cls_model, idx, device_type: str = "cuda", num_epochs=10): # as the first run is skipped, no need to run it long max_epochs=num_epochs if idx > 0 else 1, enable_progress_bar=False, - weights_summary=None, + enable_model_summary=False, gpus=1 if device_type == "cuda" else 0, checkpoint_callback=False, logger=False, diff --git a/benchmarks/test_sharded_parity.py b/benchmarks/test_sharded_parity.py index b6bcb658dcde9..ade0a055d27c2 100644 --- a/benchmarks/test_sharded_parity.py +++ b/benchmarks/test_sharded_parity.py @@ -137,7 +137,7 @@ def plugin_parity_test( ddp_model = model_cls() use_cuda = gpus > 0 - trainer = Trainer(fast_dev_run=True, max_epochs=1, gpus=gpus, precision=precision, accelerator="ddp_spawn") + trainer = Trainer(fast_dev_run=True, max_epochs=1, gpus=gpus, precision=precision, strategy="ddp_spawn") max_memory_ddp, ddp_time = record_ddp_fit_model_stats(trainer=trainer, model=ddp_model, use_cuda=use_cuda) @@ -145,7 +145,7 @@ def plugin_parity_test( seed_everything(seed) custom_plugin_model = model_cls() - trainer = Trainer(fast_dev_run=True, max_epochs=1, gpus=gpus, precision=precision, accelerator="ddp_sharded_spawn") + trainer = Trainer(fast_dev_run=True, max_epochs=1, gpus=gpus, precision=precision, strategy="ddp_sharded_spawn") assert isinstance(trainer.training_type_plugin, DDPSpawnShardedPlugin) max_memory_custom, custom_model_time = record_ddp_fit_model_stats( diff --git a/dockers/tpu-tests/tpu_test_cases.jsonnet b/dockers/tpu-tests/tpu_test_cases.jsonnet index 
4a3b9728221a7..55454e7cac0a2 100644 --- a/dockers/tpu-tests/tpu_test_cases.jsonnet +++ b/dockers/tpu-tests/tpu_test_cases.jsonnet @@ -35,7 +35,8 @@ local tputests = base.BaseTest { coverage run --source=pytorch_lightning -m pytest -v --capture=no \ tests/profiler/test_xla_profiler.py \ pytorch_lightning/utilities/xla_device.py \ - tests/accelerators/test_tpu_backend.py \ + tests/accelerators/test_tpu.py \ + tests/callbacks/test_device_stats_monitor.py \ tests/models/test_tpu.py test_exit_code=$? echo "\n||| END PYTEST LOGS |||\n" diff --git a/docs/source/advanced/multi_gpu.rst b/docs/source/advanced/multi_gpu.rst index ee689e16112c1..653906d4fb68b 100644 --- a/docs/source/advanced/multi_gpu.rst +++ b/docs/source/advanced/multi_gpu.rst @@ -611,28 +611,34 @@ Let's say you have a batch size of 7 in your dataloader. def train_dataloader(self): return Dataset(..., batch_size=7) -In DDP or Horovod your effective batch size will be 7 * gpus * num_nodes. +In DDP, DDP_SPAWN, Deepspeed, DDP_SHARDED, or Horovod, your effective batch size will be 7 * gpus * num_nodes. .. code-block:: python # effective batch size = 7 * 8 Trainer(gpus=8, accelerator="ddp") + Trainer(gpus=8, accelerator="ddp_spawn") + Trainer(gpus=8, accelerator="ddp_sharded") Trainer(gpus=8, accelerator="horovod") # effective batch size = 7 * 8 * 10 Trainer(gpus=8, num_nodes=10, accelerator="ddp") + Trainer(gpus=8, num_nodes=10, accelerator="ddp_spawn") + Trainer(gpus=8, num_nodes=10, accelerator="ddp_sharded") Trainer(gpus=8, num_nodes=10, accelerator="horovod") -In DDP2, your effective batch size will be 7 * num_nodes. +In DDP2 or DP, your effective batch size will be 7 * num_nodes. The reason is that the full batch is visible to all GPUs on the node when using DDP2. .. code-block:: python # effective batch size = 7 Trainer(gpus=8, accelerator="ddp2") + Trainer(gpus=8, accelerator="dp") # effective batch size = 7 * 10 Trainer(gpus=8, num_nodes=10, accelerator="ddp2") + Trainer(gpus=8, num_nodes=10, accelerator="dp") .. note:: Huge batch sizes are actually really bad for convergence. Check out: diff --git a/docs/source/advanced/sequences.rst b/docs/source/advanced/sequences.rst index 8e50de49933eb..2d8d770cbb850 100644 --- a/docs/source/advanced/sequences.rst +++ b/docs/source/advanced/sequences.rst @@ -1,6 +1,6 @@ Sequential Data -================ +=============== Truncated Backpropagation Through Time -------------------------------------- diff --git a/docs/source/api_references.rst b/docs/source/api_references.rst index df70b2b0a3944..7bc4d8b460e8d 100644 --- a/docs/source/api_references.rst +++ b/docs/source/api_references.rst @@ -67,6 +67,71 @@ Loggers API test_tube wandb +Loop API +-------- + +Base Classes +^^^^^^^^^^^^ + +.. currentmodule:: pytorch_lightning.loops + +.. autosummary:: + :toctree: api + :nosignatures: + :template: classtemplate.rst + + ~base.Loop + ~dataloader.dataloader_loop.DataLoaderLoop + + +Default Loop Implementations +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Training +"""""""" + +.. currentmodule:: pytorch_lightning.loops + +.. autosummary:: + :toctree: api + :nosignatures: + :template: classtemplate.rst + + FitLoop + ~epoch.TrainingEpochLoop + ~batch.TrainingBatchLoop + ~optimization.OptimizerLoop + ~optimization.ManualOptimization + + +Validation and Testing +"""""""""""""""""""""" + +.. currentmodule:: pytorch_lightning.loops + +.. autosummary:: + :toctree: api + :nosignatures: + :template: classtemplate.rst + + ~dataloader.EvaluationLoop + ~epoch.EvaluationEpochLoop + + +Prediction +"""""""""" + +..
currentmodule:: pytorch_lightning.loops + +.. autosummary:: + :toctree: api + :nosignatures: + :template: classtemplate.rst + + ~dataloader.PredictionLoop + ~epoch.PredictionEpochLoop + + Plugins API ----------- diff --git a/docs/source/clouds/cluster.rst b/docs/source/clouds/cluster.rst index c7a8b71f26d0b..f75a735bb809f 100644 --- a/docs/source/clouds/cluster.rst +++ b/docs/source/clouds/cluster.rst @@ -11,11 +11,13 @@ In this guide, we cover 1. General purpose cluster (not managed) -2. SLURM cluster +2. Using `Torch Distributed Run `__ -3. Custom cluster environment +3. SLURM cluster -4. General tips for multi-node training +4. Custom cluster environment + +5. General tips for multi-node training -------- @@ -39,6 +41,7 @@ PyTorch Lightning follows the design of `PyTorch distributed communication packa - *WORLD_SIZE* - required; how many nodes are in the cluster - *NODE_RANK* - required; id of the node in the cluster +.. _training_script_setup: Training script setup --------------------- @@ -66,12 +69,45 @@ This means that you need to: 3. Run the script on each node. --------- +---------- + +.. _torch_distributed_run: + +2. Torch Distributed Run +======================== + +`Torch Distributed Run `__ provides helper functions to set up distributed environment variables from the `PyTorch distributed communication package `__ that need to be defined on each node. + +Once the script is set up as described in :ref:`training_script_setup`, you can run the command below across your nodes to start multi-node training. + +Like a custom cluster, you have to ensure that there is network connectivity between the nodes with firewall rules that allow traffic flow on a specified *MASTER_PORT*. + +Finally, you'll need to decide which node you'd like to be the master node (*MASTER_ADDR*), and the ranks of each node (*NODE_RANK*). + +For example: + +* *MASTER_ADDR* 10.10.10.16 +* *MASTER_PORT* 29500 +* *NODE_RANK* 0 for the first node, 1 for the second node + +Run the command below with the appropriate variables set on each node. + +.. code-block:: bash + + python -m torch.distributed.run + --nnodes=2 # number of nodes you'd like to run with + --master_addr <MASTER_ADDR> + --master_port <MASTER_PORT> + --node_rank <NODE_RANK> + train.py (--arg1 ... train script args...) + +.. note:: + ``torch.distributed.run`` assumes that you'd like to spawn a process per GPU if GPU devices are found on the node. This can be adjusted with ``--nproc_per_node``. .. _slurm: -2. SLURM managed cluster +3. SLURM managed cluster ======================== Lightning automates the details behind training on a SLURM-powered cluster. In contrast to the general purpose @@ -239,7 +275,7 @@ The other option is that you generate scripts on your own via a bash command or .. _custom-cluster: -3. Custom cluster +4. Custom cluster ================= Lightning provides an interface for providing your own definition of a cluster environment. It mainly consists of @@ -282,7 +318,7 @@ and node rank (node id). Here is an example of a custom ---------- -4. General tips for multi-node training +5.
General tips for multi-node training ======================================= Debugging flags diff --git a/docs/source/common/debugging.rst b/docs/source/common/debugging.rst index 7a11863c0e1bf..6e5a721dd092a 100644 --- a/docs/source/common/debugging.rst +++ b/docs/source/common/debugging.rst @@ -95,11 +95,14 @@ Print a summary of your LightningModule --------------------------------------- Whenever the ``.fit()`` function gets called, the Trainer will print the weights summary for the LightningModule. By default it only prints the top-level modules. If you want to show all submodules in your network, use the -`'full'` option: +``max_depth`` option: .. testcode:: - trainer = Trainer(weights_summary="full") + from pytorch_lightning.callbacks import ModelSummary + + trainer = Trainer(callbacks=[ModelSummary(max_depth=-1)]) + You can also display the intermediate input- and output sizes of all your layers by setting the ``example_input_array`` attribute in your LightningModule. It will print a table like this @@ -115,8 +118,9 @@ You can also display the intermediate input- and output sizes of all your layers when you call ``.fit()`` on the Trainer. This can help you find bugs in the composition of your layers. See Also: - - :paramref:`~pytorch_lightning.trainer.trainer.Trainer.weights_summary` Trainer argument - - :class:`~pytorch_lightning.core.memory.ModelSummary` + - :class:`~pytorch_lightning.callbacks.model_summary.ModelSummary` + - :func:`~pytorch_lightning.utilities.model_summary.summarize` + - :class:`~pytorch_lightning.utilities.model_summary.ModelSummary` ---------------- diff --git a/docs/source/common/hyperparameters.rst b/docs/source/common/hyperparameters.rst index 1781a26a9189f..41a99e022ae95 100644 --- a/docs/source/common/hyperparameters.rst +++ b/docs/source/common/hyperparameters.rst @@ -201,7 +201,7 @@ To recap, add ALL possible trainer flags to the argparser and init the ``Trainer trainer = Trainer.from_argparse_args(hparams) # or if you need to pass in callbacks - trainer = Trainer.from_argparse_args(hparams, checkpoint_callback=..., callbacks=[...]) + trainer = Trainer.from_argparse_args(hparams, enable_checkpointing=..., callbacks=[...]) ---------- diff --git a/docs/source/common/lightning_module.rst b/docs/source/common/lightning_module.rst index ba2694286739e..6ee0ebe7b1110 100644 --- a/docs/source/common/lightning_module.rst +++ b/docs/source/common/lightning_module.rst @@ -1195,6 +1195,7 @@ for more information. on_after_backward() on_before_optimizer_step() + configure_gradient_clipping() optimizer_step() on_train_batch_end() @@ -1452,6 +1453,12 @@ on_before_optimizer_step .. automethod:: pytorch_lightning.core.hooks.ModelHooks.on_before_optimizer_step :noindex: +configure_gradient_clipping +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. automethod:: pytorch_lightning.core.lightning.LightningModule.configure_gradient_clipping + :noindex: + optimizer_step ~~~~~~~~~~~~~~ diff --git a/docs/source/common/optimizers.rst b/docs/source/common/optimizers.rst index 39a583d9c94d8..0405b9a4365af 100644 --- a/docs/source/common/optimizers.rst +++ b/docs/source/common/optimizers.rst @@ -69,7 +69,7 @@ Here is a minimal example of manual optimization. Gradient accumulation --------------------- You can accumulate gradients over batches similarly to -:attr:`~pytorch_lightning.trainer.Trainer.accumulate_grad_batches` of automatic optimization. +:attr:`~pytorch_lightning.trainer.trainer.Trainer.accumulate_grad_batches` of automatic optimization. 
To perform gradient accumulation with one optimizer, you can do as such. .. testcode:: python @@ -516,3 +516,47 @@ to perform a step, Lightning won't be able to support accelerators and precision ): optimizer = optimizer.optimizer optimizer.step(closure=optimizer_closure) + +----- + +Configure gradient clipping +--------------------------- +To configure custom gradient clipping, consider overriding +the :meth:`~pytorch_lightning.core.lightning.LightningModule.configure_gradient_clipping` method. +Attributes :attr:`~pytorch_lightning.trainer.trainer.Trainer.gradient_clip_val` and +:attr:`~pytorch_lightning.trainer.trainer.Trainer.gradient_clip_algorithm` will be passed in the respective +arguments here and Lightning will handle gradient clipping for you. In case you want to set +different values for your arguments of your choice and let Lightning handle the gradient clipping, you can +use the inbuilt :meth:`~pytorch_lightning.core.lightning.LightningModule.clip_gradients` method and pass +the arguments along with your optimizer. + +.. note:: + Make sure to not override :meth:`~pytorch_lightning.core.lightning.LightningModule.clip_gradients` + method. If you want to customize gradient clipping, consider using + :meth:`~pytorch_lightning.core.lightning.LightningModule.configure_gradient_clipping` method. + +For example, here we will apply gradient clipping only to the gradients associated with optimizer A. + +.. testcode:: python + + def configure_gradient_clipping(self, optimizer, optimizer_idx, gradient_clip_val, gradient_clip_algorithm): + if optimizer_idx == 0: + # Lightning will handle the gradient clipping + self.clip_gradients( + optimizer, gradient_clip_val=gradient_clip_val, gradient_clip_algorithm=gradient_clip_algorithm + ) + +Here we configure gradient clipping differently for optimizer B. + +.. testcode:: python + + def configure_gradient_clipping(self, optimizer, optimizer_idx, gradient_clip_val, gradient_clip_algorithm): + if optimizer_idx == 0: + # Lightning will handle the gradient clipping + self.clip_gradients( + optimizer, gradient_clip_val=gradient_clip_val, gradient_clip_algorithm=gradient_clip_algorithm + ) + elif optimizer_idx == 1: + self.clip_gradients( + optimizer, gradient_clip_val=gradient_clip_val * 2, gradient_clip_algorithm=gradient_clip_algorithm + ) diff --git a/docs/source/common/trainer.rst b/docs/source/common/trainer.rst index e8f78864b1ddf..f8d815432a41c 100644 --- a/docs/source/common/trainer.rst +++ b/docs/source/common/trainer.rst @@ -216,7 +216,7 @@ accelerator | -The accelerator backend to use (previously known as distributed_backend). +The accelerator backend to use: - (``'dp'``) is DataParallel (split batch among GPUs of same machine) - (``'ddp'``) is DistributedDataParallel (each gpu on each node trains, and syncs grads) @@ -528,6 +528,34 @@ Example:: checkpoint_callback ^^^^^^^^^^^^^^^^^^^ +Deprecated: This has been deprecated in v1.5 and will be removed in v1.7. Please use ``enable_checkpointing`` instead. + +default_root_dir +^^^^^^^^^^^^^^^^ + +.. raw:: html + + + +| + +Default path for logs and weights when no logger or +:class:`pytorch_lightning.callbacks.ModelCheckpoint` callback passed. On +certain clusters you might want to separate where logs and checkpoints are +stored. If you don't then use this argument for convenience. Paths can be local +paths or remote paths such as `s3://bucket/path` or 'hdfs://path/'. Credentials +will need to be set up to use remote filepaths. + +.. 
testcode:: + + # default used by the Trainer + trainer = Trainer(default_root_dir=os.getcwd()) + +enable_checkpointing +^^^^^^^^^^^^^^^^^^^^ + .. raw:: html
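As a minimal usage sketch (illustrative only, not part of the diff, assuming PyTorch Lightning 1.5), the renamed Trainer arguments touched above fit together like this; the values shown are placeholders:

.. code-block:: python

    from pytorch_lightning import Trainer
    from pytorch_lightning.callbacks import DeviceStatsMonitor, ModelSummary

    # strategy replaces accelerator="ddp"/"dp" for selecting the distributed mode,
    # enable_checkpointing replaces the deprecated checkpoint_callback flag,
    # ModelSummary(max_depth=...) replaces weights_summary,
    # detect_anomaly replaces the deprecated terminate_on_nan flag, and
    # DeviceStatsMonitor replaces GPUStatsMonitor / XLAStatsMonitor.
    trainer = Trainer(
        gpus=2,
        strategy="ddp",
        enable_checkpointing=True,
        detect_anomaly=True,
        callbacks=[DeviceStatsMonitor(), ModelSummary(max_depth=-1)],
    )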