
Commit 3d490cd

Merge branch 'master' into ci/minimize-docker
committed (2 parents: 4193f24 + 5262b63)

File tree

25 files changed: +397 / -163 lines

.azure-pipelines/gpu-tests.yml

Lines changed: 1 addition & 1 deletion

@@ -50,7 +50,7 @@ jobs:

   - bash: |
       python -c "fname = 'requirements/extra.txt' ; lines = [line for line in open(fname).readlines() if 'horovod' not in line] ; open(fname, 'w').writelines(lines)"
-      pip install fairscale>=0.3.4
+      pip install fairscale==0.4.0
       pip install deepspeed==0.5.4
       pip install . --requirement requirements/devel.txt
       pip list
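
The `python -c` one-liner above is dense; unrolled into ordinary Python it does nothing more than strip the `horovod` entry from the extras file before installation. This is a readability sketch only, not part of the diff:

    # Equivalent of the CI one-liner: drop any `horovod` line from requirements/extra.txt.
    fname = "requirements/extra.txt"
    with open(fname) as f:
        lines = [line for line in f.readlines() if "horovod" not in line]
    with open(fname, "w") as f:
        f.writelines(lines)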

CHANGELOG.md

Lines changed: 15 additions & 2 deletions

@@ -220,7 +220,8 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 * Implemented `DeepSpeedPlugin._setup_model_and_optimizers` ([#10009](https://github.com/PyTorchLightning/pytorch-lightning/pull/10009), [#10064](https://github.com/PyTorchLightning/pytorch-lightning/pull/10064))
 * Implemented `{DDPShardedPlugin,DDPShardedSpawnPlugin}._setup_model_and_optimizers` ([#10028](https://github.com/PyTorchLightning/pytorch-lightning/pull/10028), [#10064](https://github.com/PyTorchLightning/pytorch-lightning/pull/10064))
 * Added optional `model` argument to the `optimizer_step` methods in accelerators and plugins ([#10023](https://github.com/PyTorchLightning/pytorch-lightning/pull/10023))
-
+* Updated precision attributes in `DeepSpeedPlugin` ([#10164](https://github.com/PyTorchLightning/pytorch-lightning/pull/10164))
+* Added the ability to return a result from rank 0 in `DDPSpawnPlugin.spawn` ([#10162](https://github.com/PyTorchLightning/pytorch-lightning/pull/10162))


 - Added `XLACheckpointIO` plugin ([#9972](https://github.com/PyTorchLightning/pytorch-lightning/pull/9972))

@@ -343,6 +344,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Moved the `optimizer_step` and `clip_gradients` hook from the `Accelerator` and `TrainingTypePlugin` into the `PrecisionPlugin` ([#10143](https://github.com/PyTorchLightning/pytorch-lightning/pull/10143), [#10029](https://github.com/PyTorchLightning/pytorch-lightning/pull/10029))


+- `NativeMixedPrecisionPlugin` and its subclasses now take an optional `GradScaler` instance ([#10055](https://github.com/PyTorchLightning/pytorch-lightning/pull/10055))
+
+
 - Updated several places in the loops and trainer to access `training_type_plugin` directly instead of `accelerator` ([#9901](https://github.com/PyTorchLightning/pytorch-lightning/pull/9901))


@@ -444,10 +448,12 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Deprecated `ClusterEnvironment.creates_children()` in favor of `ClusterEnvironment.creates_processes_externally` (property) ([#10106](https://github.com/PyTorchLightning/pytorch-lightning/pull/10106))


-
 - Deprecated `PrecisionPlugin.master_params()` in favor of `PrecisionPlugin.main_params()` ([#10105](https://github.com/PyTorchLightning/pytorch-lightning/pull/10105))


+- Deprecated `lr_sch_names` from `LearningRateMonitor` ([#10066](https://github.com/PyTorchLightning/pytorch-lightning/pull/10066))
+
+
 ### Removed

 - Removed deprecated `metrics` ([#8586](https://github.com/PyTorchLightning/pytorch-lightning/pull/8586/))

@@ -656,9 +662,16 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Fixed undesired side effects being caused by `Trainer` patching dataloader methods on the `LightningModule` ([#9764](https://github.com/PyTorchLightning/pytorch-lightning/pull/9764))


+- Fixed monitor value in `ModelCheckpoint` getting moved to the wrong device in a special case where it becomes NaN ([#10118](https://github.com/PyTorchLightning/pytorch-lightning/pull/10118))
+
+
 - Fixed creation of `dirpath` in `BaseProfiler` if it doesn't exist ([#10073](https://github.com/PyTorchLightning/pytorch-lightning/pull/10073))


+- Fixed an issue with `pl.utilities.seed.reset_seed` converting the `PL_SEED_WORKERS` environment variable to `bool` ([#10099](https://github.com/PyTorchLightning/pytorch-lightning/pull/10099))
+
+
+
 ## [1.4.9] - 2021-09-30

 - Fixed `lr_find` to generate same results on multiple calls ([#9704](https://github.com/PyTorchLightning/pytorch-lightning/pull/9704))

dockers/base-cuda/Dockerfile

Lines changed: 2 additions & 2 deletions

@@ -108,11 +108,11 @@ RUN \

 RUN \
     # install FairScale
-    pip install fairscale>=0.3.4
+    pip install fairscale==0.4.0

 RUN \
     # install DeepSpeed
-    pip install deepspeed==0.4.0
+    pip install deepspeed==0.5.4

 RUN \
     # Show what we have
docs/source/advanced/plugins_registry.rst (new file)

Lines changed: 49 additions & 0 deletions

@@ -0,0 +1,49 @@
+Training Type Plugins Registry
+==============================
+
+.. warning:: The Plugins Registry is experimental and subject to change.
+
+Lightning includes a registry that holds information about Training Type plugins and allows for the registration of new custom plugins.
+
+The Plugins are assigned strings that identify them, such as "ddp", "deepspeed_stage_2_offload", and so on.
+It also returns the optional description and parameters for initialising the Plugin that were defined during registration.
+
+
+.. code-block:: python
+
+    # Training with the DDP Plugin with `find_unused_parameters` as False
+    trainer = Trainer(strategy="ddp_find_unused_parameters_false", accelerator="gpu", devices=4)
+
+    # Training with DeepSpeed ZeRO Stage 3 and CPU Offload
+    trainer = Trainer(strategy="deepspeed_stage_3_offload", accelerator="gpu", devices=3)
+
+    # Training with the TPU Spawn Plugin with `debug` as True
+    trainer = Trainer(strategy="tpu_spawn_debug", accelerator="tpu", devices=8)
+
+
+Additionally, you can pass your custom registered training type plugins to the ``strategy`` argument.
+
+.. code-block:: python
+
+    from pytorch_lightning.plugins import DDPPlugin, TrainingTypePluginsRegistry, CheckpointIO
+
+
+    class CustomCheckpointIO(CheckpointIO):
+        def save_checkpoint(self, checkpoint: Dict[str, Any], path: Union[str, Path]) -> None:
+            ...
+
+        def load_checkpoint(self, path: Union[str, Path]) -> Dict[str, Any]:
+            ...
+
+
+    custom_checkpoint_io = CustomCheckpointIO()
+
+    # Register the DDP Plugin with your custom CheckpointIO plugin
+    TrainingTypePluginsRegistry.register(
+        "ddp_custom_checkpoint_io",
+        DDPPlugin,
+        description="DDP Plugin with custom checkpoint io plugin",
+        checkpoint_io=custom_checkpoint_io,
+    )
+
+    trainer = Trainer(strategy="ddp_custom_checkpoint_io", accelerator="gpu", devices=2)
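
As a rough companion to the new file above (not part of the diff): once a name has been registered it can be inspected and resolved through the registry object itself. This is a minimal sketch that assumes the registry exposes `available_plugins()` and a `get()` that instantiates the plugin from the stored class and init parameters.

    from pytorch_lightning.plugins import TrainingTypePluginsRegistry

    # Assumed API: list every registered name, including the custom one added above.
    print(TrainingTypePluginsRegistry.available_plugins())

    # Assumed API: `get()` builds the plugin from the stored class and parameters,
    # so the returned DDPPlugin already carries the custom `checkpoint_io`.
    plugin = TrainingTypePluginsRegistry.get("ddp_custom_checkpoint_io")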

docs/source/extensions/accelerators.rst

Lines changed: 1 addition & 1 deletion

@@ -26,7 +26,7 @@ One to handle differences from the training routine and one to handle different
     from pytorch_lightning.plugins import NativeMixedPrecisionPlugin, DDPPlugin

     accelerator = GPUAccelerator(
-        precision_plugin=NativeMixedPrecisionPlugin(),
+        precision_plugin=NativeMixedPrecisionPlugin(16, "cuda"),
         training_type_plugin=DDPPlugin(),
     )
     trainer = Trainer(accelerator=accelerator)

docs/source/index.rst

Lines changed: 1 addition & 0 deletions

@@ -66,6 +66,7 @@ PyTorch Lightning
    advanced/checkpoint_io
    common/optimizers
    advanced/profiler
+   advanced/plugins_registry
    advanced/sequences
    common/single_gpu
    advanced/training_tricks

docs/source/starter/new-project.rst

Lines changed: 38 additions & 16 deletions

@@ -134,6 +134,7 @@ Under the hood a LightningModule is still just a :class:`torch.nn.Module` that g
 - The Train loop
 - The Validation loop
 - The Test loop
+- The Prediction loop
 - The Model or system of Models
 - The Optimizer

@@ -181,7 +182,7 @@ More details in :doc:`lightning module <../common/lightning_module>` docs.
 Step 2: Fit with Lightning Trainer
 **********************************

-First, define the data however you want. Lightning just needs a :class:`~torch.utils.data.DataLoader` for the train/val/test splits.
+First, define the data however you want. Lightning just needs a :class:`~torch.utils.data.DataLoader` for the train/val/test/predict splits.

 .. code-block:: python

@@ -258,7 +259,8 @@ Turn off automatic optimization and you control the train loop!


     def training_step(self, batch, batch_idx):
-        # access your optimizers with use_pl_optimizer=False. Default is True
+        # access your optimizers with use_pl_optimizer=False. Default is True,
+        # setting use_pl_optimizer=True will maintain plugin/precision support
         opt_a, opt_b = self.optimizers(use_pl_optimizer=True)

         loss_a = self.generator(batch)

@@ -321,7 +323,7 @@ You can also add a forward method to do predictions however you want.


     autoencoder = LitAutoEncoder()
-    autoencoder = autoencoder(torch.rand(1, 28 * 28))
+    embedding = autoencoder(torch.rand(1, 28 * 28))


 .. code-block:: python

@@ -371,9 +373,9 @@ a forward method or trace only the sub-models you need.

 --------------------

-Using CPUs/GPUs/TPUs
-====================
-It's trivial to use CPUs, GPUs or TPUs in Lightning. There's **NO NEED** to change your code, simply change the :class:`~pytorch_lightning.trainer.Trainer` options.
+Using CPUs/GPUs/TPUs/IPUs
+=========================
+It's trivial to use CPUs, GPUs, TPUs or IPUs in Lightning. There's **NO NEED** to change your code, simply change the :class:`~pytorch_lightning.trainer.Trainer` options.

 .. testcode::

@@ -423,6 +425,11 @@ Without changing a SINGLE line of your code, you can now do the following with t
     # using only half the training data and checking validation every quarter of a training epoch
     trainer = pl.Trainer(tpu_cores=8, precision=16, limit_train_batches=0.5, val_check_interval=0.25)

+.. code-block:: python
+
+    # Train on IPUs
+    trainer = pl.Trainer(ipus=8)
+
 -----------

 Checkpoints

@@ -449,7 +456,7 @@ If you prefer to do it manually, here's the equivalent

 Data flow
 =========
-Each loop (training, validation, test) has three hooks you can implement:
+Each loop (training, validation, test, predict) has three hooks you can implement:

 - x_step
 - x_step_end

@@ -474,8 +481,8 @@ The equivalent in Lightning is:
         return prediction


-    def training_epoch_end(self, training_step_outputs):
-        for prediction in predictions:
+    def training_epoch_end(self, outs):
+        for out in outs:
             ...

 In the event that you use DP or DDP2 distributed modes (ie: split a batch across GPUs),

@@ -508,9 +515,9 @@ The lightning equivalent is:
     def training_step_end(self, losses):
         gpu_0_loss = losses[0]
         gpu_1_loss = losses[1]
-        return (gpu_0_loss + gpu_1_loss) * 1 / 2
+        return (gpu_0_loss + gpu_1_loss) / 2

-.. tip:: The validation and test loops have the same structure.
+.. tip:: The validation, test and prediction loops have the same structure.

 -----------------

@@ -648,8 +655,10 @@ Make your data code reusable by organizing it into a :class:`~pytorch_lightning.
         if stage in (None, "fit"):
             mnist_train = MNIST(os.getcwd(), train=True, transform=transform)
             self.mnist_train, self.mnist_val = random_split(mnist_train, [55000, 5000])
-        if stage == (None, "test"):
+        if stage == "test":
             self.mnist_test = MNIST(os.getcwd(), train=False, transform=transform)
+        if stage == "predict":
+            self.mnist_predict = MNIST(os.getcwd(), train=False, transform=transform)

     # return the dataloader for each split
     def train_dataloader(self):

@@ -664,6 +673,10 @@ Make your data code reusable by organizing it into a :class:`~pytorch_lightning.
         mnist_test = DataLoader(self.mnist_test, batch_size=self.batch_size)
         return mnist_test

+    def predict_dataloader(self):
+        mnist_predict = DataLoader(self.mnist_predict, batch_size=self.batch_size)
+        return mnist_predict
+
 :class:`~pytorch_lightning.core.datamodule.LightningDataModule` is designed to enable sharing and reusing data splits
 and transforms across different projects. It encapsulates all the steps needed to process data: downloading,
 tokenizing, processing etc.

@@ -681,11 +694,17 @@ the :class:`~pytorch_lightning.trainer.Trainer`:

     # train
     trainer = pl.Trainer()
-    trainer.fit(model, dm)
+    trainer.fit(model, datamodule=dm)
+
+    # validate
+    trainer.validate(datamodule=dm)

     # test
     trainer.test(datamodule=dm)

+    # predict
+    predictions = trainer.predict(datamodule=dm)
+
 DataModules are specifically useful for building models based on data. Read more on :doc:`datamodules <../extensions/datamodules>`.

 ------

@@ -701,15 +720,18 @@ Lightning has many tools for debugging. Here is an example of just a few of them

 .. testcode::

-    # Automatically overfit the sane batch of your model for a sanity test
+    # Automatically overfit the same batch of your model for a sanity test
     trainer = Trainer(overfit_batches=1)

 .. testcode::

-    # unit test all the code- hits every line of your code once to see if you have bugs,
+    # unit test all the code - hits every line of your code once to see if you have bugs,
     # instead of waiting hours to crash on validation
     trainer = Trainer(fast_dev_run=True)

+    # unit test all the code - hits every line of your code with 4 batches
+    trainer = Trainer(fast_dev_run=4)
+
 .. testcode::

     # train only 20% of an epoch

@@ -739,7 +761,7 @@ Once you define and train your first Lightning model, you might want to try othe
 - :doc:`Automatically find a good learning rate <../advanced/lr_finder>`
 - :ref:`Load checkpoints directly from S3 <common/weights_loading:Checkpoint Loading>`
 - :doc:`Scale to massive compute clusters <../clouds/cluster>`
-- :doc:`Use multiple dataloaders per train/val/test loop <../guides/data>`
+- :doc:`Use multiple dataloaders per train/val/test/predict loop <../guides/data>`
 - :ref:`Use multiple optimizers to do reinforcement learning or even GANs <common/optimizers:Use multiple optimizers (like GANs)>`

 Or read our :doc:`Guide <../starter/introduction_guide>` to learn more!
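
The hunks above thread the new prediction loop through the starter guide: a predict split in `setup`, a `predict_dataloader`, and `trainer.predict`. The loop itself drives the `predict_step` hook on the LightningModule; below is a minimal, self-contained sketch of that flow. `TinyModel`, its layer sizes, and the random data are invented for illustration and are not part of this diff.

    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset

    import pytorch_lightning as pl


    class TinyModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = nn.Linear(32, 2)

        def forward(self, x):
            return self.layer(x)

        def predict_step(self, batch, batch_idx, dataloader_idx=None):
            # the prediction loop calls this hook for every batch of the predict dataloader
            (x,) = batch
            return self(x)


    predict_loader = DataLoader(TensorDataset(torch.randn(8, 32)), batch_size=4)
    trainer = pl.Trainer()
    # returns a list with one output tensor per batch
    predictions = trainer.predict(TinyModel(), dataloaders=predict_loader)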

pytorch_lightning/callbacks/lr_monitor.py

Lines changed: 13 additions & 2 deletions

@@ -28,6 +28,7 @@
 import pytorch_lightning as pl
 from pytorch_lightning.callbacks.base import Callback
 from pytorch_lightning.utilities import rank_zero_warn
+from pytorch_lightning.utilities.distributed import rank_zero_deprecation
 from pytorch_lightning.utilities.exceptions import MisconfigurationException


@@ -93,7 +94,7 @@ def __init__(self, logging_interval: Optional[str] = None, log_momentum: bool =
         self.logging_interval = logging_interval
         self.log_momentum = log_momentum
         self.lrs: Dict[str, List[float]] = {}
-        self.lr_sch_names: List[str] = []
+        self._lr_sch_names: List[str] = []

     def on_train_start(self, trainer: "pl.Trainer", *args: Any, **kwargs: Any) -> None:
         """Called before training, determines unique names for all lr schedulers in the case of multiple of the

@@ -334,6 +335,16 @@ def _check_duplicates_and_update_name(
         name_list = [self._add_suffix(name, param_groups, i) for i in range(len(param_groups))]

         if add_lr_sch_names:
-            self.lr_sch_names.append(name)
+            self._lr_sch_names.append(name)

         return name_list
+
+    @property
+    def lr_sch_names(self) -> List[str]:
+        # TODO remove `lr_sch_names` and `add_lr_sch_names` argument in v1.7.0
+        rank_zero_deprecation(
+            "`LearningRateMonitor.lr_sch_names` has been deprecated in v1.5 and will be removed in 1.7."
+            " Consider accessing them using `LearningRateMonitor.lrs.keys()` which will return"
+            " the names of all the optimizers, even those without a scheduler."
+        )
+        return self._lr_sch_names
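
For user code, the deprecation above amounts to a one-line migration; a minimal sketch (note that `lrs` is only populated once the monitor starts logging learning rates during training):

    from pytorch_lightning.callbacks import LearningRateMonitor

    lr_monitor = LearningRateMonitor(logging_interval="epoch")

    # Deprecated by this change: still works, but now emits a deprecation warning.
    names = lr_monitor.lr_sch_names

    # Replacement suggested by the warning text: the keys of the tracked learning rates.
    names = list(lr_monitor.lrs.keys())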

pytorch_lightning/callbacks/model_checkpoint.py

Lines changed: 1 addition & 1 deletion

@@ -700,7 +700,7 @@ def _update_best_and_save(

         # do not save nan, replace with +/- inf
         if isinstance(current, torch.Tensor) and torch.isnan(current):
-            current = torch.tensor(float("inf" if self.mode == "min" else "-inf"))
+            current = torch.tensor(float("inf" if self.mode == "min" else "-inf"), device=current.device)

         filepath = self._get_metric_interpolated_filepath_name(monitor_candidates, trainer, del_filepath)
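A small illustration of why the `device=` argument matters here (standalone sketch, not library code): without it, the replacement tensor is created on the CPU even when the monitored value lives on a GPU, so later comparisons against other checkpoint candidates would mix devices.

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    current = torch.tensor(float("nan"), device=device)  # stand-in for a NaN monitor value

    if torch.isnan(current):
        # mirrors the fix: keep the +inf replacement on the same device as the original value
        current = torch.tensor(float("inf"), device=current.device)

    assert current.device.type == device
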
pytorch_lightning/plugins/precision/deepspeed_precision.py

Lines changed: 5 additions & 1 deletion

@@ -21,9 +21,13 @@
 from pytorch_lightning.plugins.precision.precision_plugin import PrecisionPlugin
 from pytorch_lightning.utilities import GradClipAlgorithmType
 from pytorch_lightning.utilities.exceptions import MisconfigurationException
+from pytorch_lightning.utilities.imports import _DEEPSPEED_AVAILABLE
 from pytorch_lightning.utilities.model_helpers import is_overridden
 from pytorch_lightning.utilities.warnings import WarningCache

+if _DEEPSPEED_AVAILABLE:
+    from deepspeed import DeepSpeedEngine
+
 warning_cache = WarningCache()


@@ -40,7 +44,7 @@ def backward(self, model: "pl.LightningModule", closure_loss: Tensor, *args: Any
                 "You have overridden the `LightningModule.backward` hook but it will be ignored since DeepSpeed handles"
                 " the backward logic internally."
             )
-        deepspeed_engine = model.trainer.model
+        deepspeed_engine: DeepSpeedEngine = model.trainer.model
        deepspeed_engine.backward(closure_loss, *args, **kwargs)

     def _run_backward(self, tensor: Tensor, model: Module, *args: Any, **kwargs: Any) -> None:
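
The guarded import plus annotation above is a common optional-dependency pattern; the sketch below restates it outside the library (the helper and function names are invented for illustration). The key point is that a local variable annotation is never evaluated at runtime (PEP 526), so annotating with `DeepSpeedEngine` stays safe even when DeepSpeed is not installed and the import is skipped.

    from importlib.util import find_spec

    # simplified stand-in for pytorch_lightning.utilities.imports._DEEPSPEED_AVAILABLE
    _DEEPSPEED_AVAILABLE = find_spec("deepspeed") is not None

    if _DEEPSPEED_AVAILABLE:
        from deepspeed import DeepSpeedEngine


    def run_backward(engine_like, loss):
        # a local annotation is not evaluated, so this line cannot raise NameError
        # when `DeepSpeedEngine` was never imported; it only documents the expected type
        engine: DeepSpeedEngine = engine_like
        engine.backward(loss)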

pytorch_lightning/plugins/precision/fully_sharded_native_amp.py

Lines changed: 1 addition & 3 deletions

@@ -18,9 +18,7 @@


 class FullyShardedNativeMixedPrecisionPlugin(ShardedNativeMixedPrecisionPlugin):
-    """Mixed Precision for Full Sharded Training."""
-
-    precision = "mixed"
+    """Native AMP for Fully Sharded Training."""

     def clip_grad_by_norm(self, *_: Any, **__: Any) -> None:
         # see https://fairscale.readthedocs.io/en/latest/api/nn/fsdp_tips.html