Remove deprecated trainer flag Trainer.distributed_backend in favor of Trainer.accelerator #9246
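For context, the change this PR completes, shown as a minimal sketch (assumes PyTorch Lightning 1.5 and a machine with 2 GPUs; `strategy` is the newer Trainer argument added in #8597):

from pytorch_lightning import Trainer

# Removed by this PR: the deprecated `distributed_backend` flag is no longer accepted.
# trainer = Trainer(distributed_backend="ddp", gpus=2)

# Pass the training type via `accelerator` instead ...
trainer = Trainer(accelerator="ddp", gpus=2)

# ... or via the new `strategy` argument.
trainer = Trainer(strategy="ddp", gpus=2)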

Closed
Changes from all commits
67 commits
cc2b274
scheduled removal of Trainer.distributed_backend
Tshimanga Sep 1, 2021
21becf0
update CHANGELOG.md
Tshimanga Sep 1, 2021
d7a9c45
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 1, 2021
e71ad3c
fix issue in data_loading
Tshimanga Sep 1, 2021
f1726d0
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 28, 2021
cf70a69
ci
rohitgr7 Sep 28, 2021
65ff016
distributed_backend
rohitgr7 Oct 11, 2021
783fe89
Merge branch 'master' into removal/1.5-distributed-backend-trainer-flag
rohitgr7 Oct 11, 2021
0ce93b9
disable_logger (#9837)
rohitgr7 Oct 11, 2021
1e1c862
Updated quantization imports in PyTorch 1.10 (#9878)
theory-in-progress Oct 11, 2021
1cb0ce9
[docs] Add Torch Distributed Run (#9890)
Oct 11, 2021
85c56bf
use existing logic to configure optimizers in lr_finder (#9789)
rohitgr7 Oct 11, 2021
001ae19
Removed a redundant warning with `ModelCheckpoint(monitor=None)` call…
Programmer-RD-AI Oct 11, 2021
026514f
remove redundant accumulation normalization in manual optimization (#…
awaelchli Oct 11, 2021
d24064e
Deprecate `terminate_on_nan` Trainer argument in favor of `detect_ano…
yopknopixx Oct 11, 2021
494d481
Prepare v1.5.0rc0 (#9893)
kaushikb11 Oct 11, 2021
eb160df
fix qconfig import for pytorch 1.10 (#9899)
awaelchli Oct 11, 2021
7386dc8
Update DeepSpeed version, fix failing tests (#9898)
Oct 11, 2021
cead45b
Clarify lr scheduler frequency (#9843)
cowwoc Oct 12, 2021
fb8e8e8
Fix deprecation test version for accelerator collective (#9892)
kaushikb11 Oct 12, 2021
e974572
Deprecate `checkpoint_callback` from the `Trainer` constructor in fav…
rohitgr7 Oct 12, 2021
40106cf
Mark `trainer.config_validator` as protected (#9779)
ananthsub Oct 12, 2021
23e0eb9
DeepSpeed support for device IDs (#9847)
Oct 12, 2021
909a27e
Update error message for interactive incompatible plugins (#9896)
awaelchli Oct 12, 2021
bec1788
Raise a `MisconfigurationException` when trainer functions are called…
rohitgr7 Oct 12, 2021
979b748
update docs (#9903)
rohitgr7 Oct 12, 2021
69f2f84
Update docs for `GradientAccumulationScheduler` (#9891)
rohitgr7 Oct 12, 2021
29581b5
update tests to not rely on patched dataloaders (#9905)
awaelchli Oct 12, 2021
487473b
guard against None in pytorch get_xla_supported_devices (#9572)
ckchow Oct 12, 2021
6354f21
CombinedLoader example fix (#9906)
kainoj Oct 12, 2021
7a1e967
Remove type error handling in _configure_checkpoint_callbacks (#9823)
daniellepintz Oct 12, 2021
2dd6b97
Mark `Trainer.terminate_on_nan` protected and deprecate public proper…
ananthsub Oct 12, 2021
8bc2593
Remove epoch from `trainer.logged_metrics` (#9904)
rohitgr7 Oct 13, 2021
a6d1cc3
Remove `should_rank_save_checkpoint` property from Trainer (#9433)
kaushikb11 Oct 13, 2021
ee63840
Add `enable_model_summary` flag and deprecate `weights_summary` (#9699)
ananthsub Oct 13, 2021
7c8c7ce
Add `strategy` argument to Trainer (#8597)
kaushikb11 Oct 13, 2021
b10ab54
Add `configure_gradient_clipping` hook in `LightningModule` (#9584)
rohitgr7 Oct 13, 2021
f03147b
[2/4] Add DeviceStatsMonitor callback (#9712)
daniellepintz Oct 13, 2021
24556e6
Log LR using LearningRateMonitor even when LR Scheduler is not define…
VirajBagal Oct 14, 2021
3020822
[2/n] Directly call TrainingTypePlugin APIs instead of going through …
four4fish Oct 14, 2021
8eb832b
Deprecate `GPUStatsMonitor` and `XLAStatsMonitor` in favor of `Device…
daniellepintz Oct 14, 2021
830839d
Refactor tests for TPU Accelerator (#9718)
kaushikb11 Oct 14, 2021
07ba0b9
Single-process multi-node CPU training (#9603)
borchero Oct 14, 2021
ba2efa2
Deprecate `log_gpu_memory`, `gpu_metrics`, and util funcs in favor of…
daniellepintz Oct 14, 2021
12ac06b
Add support for `len(datamodule)` (#9895)
kingyiusuen Oct 15, 2021
52221c0
Validate the precision input earlier (#9763)
carmocca Oct 15, 2021
7938922
Use non-deprecated options in tests (#9949)
carmocca Oct 15, 2021
99af0c5
(1/n) tests: Use strategy flag instead of accelerator for training st…
kaushikb11 Oct 16, 2021
bca1b66
Fixed use of LightningCLI in computer_vision_fine_tuning.py example (…
mauvilsa Oct 16, 2021
b66ecfc
Avoid deprecation warning after #9901 (#9951)
carmocca Oct 16, 2021
2c5e330
Fix issue with no-init dataclass fields in move_to_device (#9963)
ronif Oct 17, 2021
e6ec14e
Fix `LightningOptimizer` step and toggling logic (#9958)
carmocca Oct 18, 2021
0e36a1c
Update accelerator connector messages after the addition of strategy …
carmocca Oct 18, 2021
84dd799
loop customization docs (#9609)
awaelchli Oct 18, 2021
da126ba
reset val dataloader for binsearch (#9975)
eladsegal Oct 18, 2021
db8470a
Fix `self.log(on_epoch=True)` on_batch_start (#9780)
carmocca Oct 18, 2021
206d6a0
Remove deprecated `DataModule.dims` usage in tests (#9948)
carmocca Oct 18, 2021
4dcc078
Update `resume_from_checkpoint` docs (#9952)
carmocca Oct 18, 2021
73e0a57
Remove manual tracking of optimizer steps (#9957)
carmocca Oct 18, 2021
6a385be
Introduce `PrecisionPlugin.forward_context()` (#9988)
awaelchli Oct 18, 2021
13fdba7
Fix logic to check for spawn in worker_check (#9902)
rohitgr7 Oct 18, 2021
220fea7
Add unit tests for `pl.utilities.grads` (#9765)
awaelchli Oct 18, 2021
431919f
Add KFold Loop example (#9965)
tchaton Oct 18, 2021
3c303e9
Add typing for `LightningOptimizer` (#9990)
carmocca Oct 18, 2021
3100dab
scheduled removal of Trainer.distributed_backend
Tshimanga Sep 1, 2021
4a0c479
update CHANGELOG.md
Tshimanga Sep 1, 2021
a2e2e52
update
rohitgr7 Oct 18, 2021
10 changes: 5 additions & 5 deletions .azure-pipelines/gpu-tests.yml
@@ -51,7 +51,7 @@ jobs:
- bash: |
python -c "fname = 'requirements/extra.txt' ; lines = [line for line in open(fname).readlines() if 'horovod' not in line] ; open(fname, 'w').writelines(lines)"
pip install fairscale>=0.3.4
pip install "deepspeed==0.4.3" # FIXME: bug with >= 0.4.4
pip install deepspeed==0.5.4
pip install . --requirement requirements/devel.txt
pip list
displayName: 'Install dependencies'
@@ -106,10 +106,10 @@ jobs:
set -e
python -m pytest pl_examples -v --maxfail=2 --durations=0
bash pl_examples/run_examples.sh --trainer.gpus=1
bash pl_examples/run_examples.sh --trainer.gpus=2 --trainer.accelerator=ddp
bash pl_examples/run_examples.sh --trainer.gpus=2 --trainer.accelerator=ddp --trainer.precision=16
bash pl_examples/run_examples.sh --trainer.gpus=2 --trainer.accelerator=dp
bash pl_examples/run_examples.sh --trainer.gpus=2 --trainer.accelerator=dp --trainer.precision=16
bash pl_examples/run_examples.sh --trainer.gpus=2 --trainer.strategy=ddp
bash pl_examples/run_examples.sh --trainer.gpus=2 --trainer.strategy=ddp --trainer.precision=16
bash pl_examples/run_examples.sh --trainer.gpus=2 --trainer.strategy=dp
bash pl_examples/run_examples.sh --trainer.gpus=2 --trainer.strategy=dp --trainer.precision=16
env:
PL_USE_MOCKED_MNIST: "1"
displayName: 'Testing: examples'
1 change: 1 addition & 0 deletions .gitignore
@@ -156,3 +156,4 @@ cifar-10-batches-py
*.pt
# ctags
tags
.tags
101 changes: 99 additions & 2 deletions CHANGELOG.md
@@ -5,11 +5,14 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).


## [unReleased] - 2021-MM-DD
## [1.5.0] - 2021-MM-DD

### Added


- Added support for monitoring the learning rate without schedulers in `LearningRateMonitor` ([#9786](https://github.com/PyTorchLightning/pytorch-lightning/issues/9786))


- Register `ShardedTensor` state dict hooks in `LightningModule.__init__` if the pytorch version supports `ShardedTensor` ([#8944](https://github.com/PyTorchLightning/pytorch-lightning/pull/8944))


@@ -163,6 +166,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Added a warning when an unknown key is encountered in optimizer configuration, and when `OneCycleLR` is used with `"interval": "epoch"` ([#9666](https://github.com/PyTorchLightning/pytorch-lightning/pull/9666))


- Added `DeviceStatsMonitor` callback ([#9712](https://github.com/PyTorchLightning/pytorch-lightning/pull/9712))


- Added `enable_progress_bar` to Trainer constructor ([#9664](https://github.com/PyTorchLightning/pytorch-lightning/pull/9664))


@@ -175,13 +181,36 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Enabled automatic parameters tying for TPUs ([#9525](https://github.com/PyTorchLightning/pytorch-lightning/pull/9525))


- Raise a `MisconfigurationException` when trainer functions are called with `ckpt_path="best"` but `checkpoint_callback` isn't configured ([#9841](https://github.com/PyTorchLightning/pytorch-lightning/pull/9841))


- Added support for `torch.autograd.set_detect_anomaly` through `Trainer` constructor argument `detect_anomaly` ([#9848](https://github.com/PyTorchLightning/pytorch-lightning/pull/9848))


- Added a `len` method to `LightningDataModule` ([#9895](https://github.com/PyTorchLightning/pytorch-lightning/pull/9895))


- Added `enable_model_summary` flag to Trainer ([#9699](https://github.com/PyTorchLightning/pytorch-lightning/pull/9699))


- Added `strategy` argument to Trainer ([#8597](https://github.com/PyTorchLightning/pytorch-lightning/pull/8597))


- Added `kfold` example for loop customization ([#9965](https://github.com/PyTorchLightning/pytorch-lightning/pull/9965))


- LightningLite:
* Added `PrecisionPlugin.forward_context`, making it the default implementation for all `{train,val,test,predict}_step_context()` methods ([#9988](https://github.com/PyTorchLightning/pytorch-lightning/pull/9988))


### Changed

- Setting `Trainer(accelerator="ddp_cpu")` now does not spawn a subprocess if `num_processes` is kept `1` along with `num_nodes > 1` ([#9603](https://github.com/PyTorchLightning/pytorch-lightning/pull/9603)).


- Module imports are now catching `ModuleNotFoundError` instead of `ImportError` ([#9867](https://github.com/PyTorchLightning/pytorch-lightning/pull/9867))


- `pytorch_lightning.loggers.neptune.NeptuneLogger` is now consistent with new [neptune-client](https://github.com/neptune-ai/neptune-client) API ([#6867](https://github.com/PyTorchLightning/pytorch-lightning/pull/6867)).

Old [neptune-client](https://github.com/neptune-ai/neptune-client) API is supported by `NeptuneClient` from [neptune-contrib](https://github.com/neptune-ai/neptune-contrib) repo.
@@ -257,6 +286,10 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Changed `HorovodPlugin.all_gather` to return a `torch.Tensor` instead of a list ([#9696](https://github.com/PyTorchLightning/pytorch-lightning/pull/9696))


- Changed Trainer connectors to be protected attributes:
* Configuration Validator ([#9779](https://github.com/PyTorchLightning/pytorch-lightning/pull/9779))


- Restore `current_epoch` and `global_step` irrespective of trainer task ([#9413](https://github.com/PyTorchLightning/pytorch-lightning/pull/9413))


@@ -269,8 +302,25 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Update the logic to check for accumulation steps with deepspeed ([#9826](https://github.com/PyTorchLightning/pytorch-lightning/pull/9826))


- `pytorch_lightning.utilities.grads.grad_norm` now raises an exception if parameter `norm_type <= 0` ([#9765](https://github.com/PyTorchLightning/pytorch-lightning/pull/9765))



- Updated error message for interactive incompatible plugins ([#9896](https://github.com/PyTorchLightning/pytorch-lightning/pull/9896))


- Updated several places in the loops and trainer to access `training_type_plugin` directly instead of `accelerator` ([#9901](https://github.com/PyTorchLightning/pytorch-lightning/pull/9901))



### Deprecated

- Deprecated trainer argument `terminate_on_nan` in favour of `detect_anomaly` ([#9175](https://github.com/PyTorchLightning/pytorch-lightning/pull/9175))


- Deprecated `Trainer.terminate_on_nan` public attribute access ([#9849](https://github.com/PyTorchLightning/pytorch-lightning/pull/9849))


- Deprecated `LightningModule.summarize()` in favor of `pytorch_lightning.utilities.model_summary.summarize()`


@@ -310,7 +360,7 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Deprecated passing `progress_bar_refresh_rate` to the `Trainer` constructor in favor of adding the `ProgressBar` callback with `refresh_rate` directly to the list of callbacks, or passing `enable_progress_bar=False` to disable the progress bar ([#9616](https://github.com/PyTorchLightning/pytorch-lightning/pull/9616))


- Deprecate `LightningDistributed` and move the broadcast logic to `DDPPlugin` and `DDPSpawnPlugin` directly ([#9691](https://github.com/PyTorchLightning/pytorch-lightning/pull/9691))
- Deprecated `LightningDistributed` and move the broadcast logic to `DDPPlugin` and `DDPSpawnPlugin` directly ([#9691](https://github.com/PyTorchLightning/pytorch-lightning/pull/9691))


- Deprecated passing `stochastic_weight_avg` from the `Trainer` constructor in favor of adding the `StochasticWeightAveraging` callback directly to the list of callbacks ([#8989](https://github.com/PyTorchLightning/pytorch-lightning/pull/8989))
@@ -319,12 +369,23 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Deprecated Accelerator collective API `barrier`, `broadcast`, and `all_gather`, call `TrainingTypePlugin` collective API directly ([#9677](https://github.com/PyTorchLightning/pytorch-lightning/pull/9677))


- Deprecated `checkpoint_callback` from the `Trainer` constructor in favour of `enable_checkpointing` ([#9754](https://github.com/PyTorchLightning/pytorch-lightning/pull/9754))


- Deprecated the `LightningModule.on_post_move_to_device` method ([#9525](https://github.com/PyTorchLightning/pytorch-lightning/pull/9525))


- Deprecated `pytorch_lightning.core.decorators.parameter_validation` in favor of `pytorch_lightning.utilities.parameter_tying.set_shared_parameters` ([#9525](https://github.com/PyTorchLightning/pytorch-lightning/pull/9525))


- Deprecated passing `weights_summary` to the `Trainer` constructor in favor of adding the `ModelSummary` callback with `max_depth` directly to the list of callbacks ([#9699](https://github.com/PyTorchLightning/pytorch-lightning/pull/9699))


- Deprecated `log_gpu_memory`, `gpu_metrics`, and util funcs in favor of `DeviceStatsMonitor` callback ([#9921](https://github.com/PyTorchLightning/pytorch-lightning/pull/9921))


- Deprecated `GPUStatsMonitor` and `XLAStatsMonitor` in favor of `DeviceStatsMonitor` callback ([#9924](https://github.com/PyTorchLightning/pytorch-lightning/pull/9924))

### Removed

- Removed deprecated `metrics` ([#8586](https://github.com/PyTorchLightning/pytorch-lightning/pull/8586/))
@@ -423,9 +484,24 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Removed `call_configure_sharded_model_hook` property from `Accelerator` and `TrainingTypePlugin` ([#9612](https://github.com/PyTorchLightning/pytorch-lightning/pull/9612))


- Removed deprecated trainer flag `Trainer.distributed_backend` in favor of `Trainer.accelerator` ([#9246](https://github.com/PyTorchLightning/pytorch-lightning/pull/9246))


- Removed `TrainerProperties` mixin and moved property definitions directly into `Trainer` ([#9495](https://github.com/PyTorchLightning/pytorch-lightning/pull/9495))


- Removed a redundant warning with `ModelCheckpoint(monitor=None)` callback ([#9875](https://github.com/PyTorchLightning/pytorch-lightning/pull/9875))


- Removed `epoch` from `trainer.logged_metrics` ([#9904](https://github.com/PyTorchLightning/pytorch-lightning/pull/9904))


- Removed `should_rank_save_checkpoint` property from Trainer ([#9433](https://github.com/PyTorchLightning/pytorch-lightning/pull/9433))


### Fixed


@@ -450,6 +526,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Fixed `BasePredictionWriter` not returning the batch_indices in a non-distributed setting ([#9432](https://github.com/PyTorchLightning/pytorch-lightning/pull/9432))


- Fixed an error when running in XLA environments with no TPU attached ([#9572](https://github.com/PyTorchLightning/pytorch-lightning/pull/9572))


- Fixed check on torchmetrics logged whose `compute()` output is a multielement tensor ([#9582](https://github.com/PyTorchLightning/pytorch-lightning/pull/9582))


@@ -468,17 +547,35 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Fixed `broadcast` in `DDPPlugin` and `DDPSpawnPlugin` to respect the `src` input ([#9691](https://github.com/PyTorchLightning/pytorch-lightning/pull/9691))


- Fixed `self.log(on_epoch=True)` for the `on_batch_start` and `on_train_batch_start` hooks ([#9780](https://github.com/PyTorchLightning/pytorch-lightning/pull/9780))


- Fixed restoring training state during `trainer.fit` only ([#9413](https://github.com/PyTorchLightning/pytorch-lightning/pull/9413))


- Fixed DeepSpeed and Lightning both calling the scheduler ([#9788](https://github.com/PyTorchLightning/pytorch-lightning/pull/9788))


- Fixed missing arguments when saving hyperparameters from the parent class but not from the child class ([#9800](https://github.com/PyTorchLightning/pytorch-lightning/pull/9800))


- Fixed DeepSpeed GPU device IDs ([#9847](https://github.com/PyTorchLightning/pytorch-lightning/pull/9847))


- Reset `val_dataloader` in `tuner/batch_size_scaling` ([#9857](https://github.com/PyTorchLightning/pytorch-lightning/pull/9857))


- Fixed use of `LightningCLI` in computer_vision_fine_tuning.py example ([#9934](https://github.com/PyTorchLightning/pytorch-lightning/pull/9934))


- Fixed issue with non-init dataclass fields in `apply_to_collection` ([#9963](https://github.com/PyTorchLightning/pytorch-lightning/issues/9963))

- Reset `val_dataloader` in `tuner/batch_size_scaling` for binsearch ([#9975](https://github.com/PyTorchLightning/pytorch-lightning/pull/9975))


- Fixed logic to check for spawn in dataloader `TrainerDataLoadingMixin._worker_check` ([#9902](https://github.com/PyTorchLightning/pytorch-lightning/pull/9902))


## [1.4.9] - 2021-09-30

- Fixed `lr_find` to generate same results on multiple calls ([#9704](https://github.com/PyTorchLightning/pytorch-lightning/pull/9704))
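The deprecations collected in the changelog above follow one pattern: a loosely typed Trainer flag is replaced by a more explicit argument. A minimal sketch of the 1.5-style constructor, using only flags documented in the entries above (illustrative, not exhaustive):

from pytorch_lightning import Trainer

trainer = Trainer(
    detect_anomaly=True,         # replaces terminate_on_nan=True
    enable_checkpointing=False,  # replaces checkpoint_callback=False
    enable_model_summary=False,  # replaces weights_summary=None
    enable_progress_bar=False,   # replaces progress_bar_refresh_rate=0
)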
2 changes: 1 addition & 1 deletion benchmarks/test_basic_parity.py
@@ -159,7 +159,7 @@ def lightning_loop(cls_model, idx, device_type: str = "cuda", num_epochs=10):
# as the first run is skipped, no need to run it long
max_epochs=num_epochs if idx > 0 else 1,
enable_progress_bar=False,
weights_summary=None,
enable_model_summary=False,
gpus=1 if device_type == "cuda" else 0,
checkpoint_callback=False,
logger=False,
4 changes: 2 additions & 2 deletions benchmarks/test_sharded_parity.py
@@ -137,15 +137,15 @@ def plugin_parity_test(
ddp_model = model_cls()
use_cuda = gpus > 0

trainer = Trainer(fast_dev_run=True, max_epochs=1, gpus=gpus, precision=precision, accelerator="ddp_spawn")
trainer = Trainer(fast_dev_run=True, max_epochs=1, gpus=gpus, precision=precision, strategy="ddp_spawn")

max_memory_ddp, ddp_time = record_ddp_fit_model_stats(trainer=trainer, model=ddp_model, use_cuda=use_cuda)

# Reset and train Custom DDP
seed_everything(seed)
custom_plugin_model = model_cls()

trainer = Trainer(fast_dev_run=True, max_epochs=1, gpus=gpus, precision=precision, accelerator="ddp_sharded_spawn")
trainer = Trainer(fast_dev_run=True, max_epochs=1, gpus=gpus, precision=precision, strategy="ddp_sharded_spawn")
assert isinstance(trainer.training_type_plugin, DDPSpawnShardedPlugin)

max_memory_custom, custom_model_time = record_ddp_fit_model_stats(
3 changes: 2 additions & 1 deletion dockers/tpu-tests/tpu_test_cases.jsonnet
@@ -35,7 +35,8 @@ local tputests = base.BaseTest {
coverage run --source=pytorch_lightning -m pytest -v --capture=no \
tests/profiler/test_xla_profiler.py \
pytorch_lightning/utilities/xla_device.py \
tests/accelerators/test_tpu_backend.py \
tests/accelerators/test_tpu.py \
tests/callbacks/test_device_stats_monitor.py \
tests/models/test_tpu.py
test_exit_code=$?
echo "\n||| END PYTEST LOGS |||\n"
10 changes: 8 additions & 2 deletions docs/source/advanced/multi_gpu.rst
@@ -611,28 +611,34 @@ Let's say you have a batch size of 7 in your dataloader.
def train_dataloader(self):
return Dataset(..., batch_size=7)

In DDP or Horovod your effective batch size will be 7 * gpus * num_nodes.
In DDP, DDP_SPAWN, Deepspeed, DDP_SHARDED, or Horovod your effective batch size will be 7 * gpus * num_nodes.

.. code-block:: python

# effective batch size = 7 * 8
Trainer(gpus=8, accelerator="ddp")
Trainer(gpus=8, accelerator="ddp_spawn")
Trainer(gpus=8, accelerator="ddp_sharded")
Trainer(gpus=8, accelerator="horovod")

# effective batch size = 7 * 8 * 10
Trainer(gpus=8, num_nodes=10, accelerator="ddp")
Trainer(gpus=8, num_nodes=10, accelerator="ddp_spawn")
Trainer(gpus=8, num_nodes=10, accelerator="ddp_sharded")
Trainer(gpus=8, num_nodes=10, accelerator="horovod")

In DDP2, your effective batch size will be 7 * num_nodes.
In DDP2 or DP, your effective batch size will be 7 * num_nodes.
The reason is that the full batch is visible to all GPUs on the node when using DDP2.

.. code-block:: python

# effective batch size = 7
Trainer(gpus=8, accelerator="ddp2")
Trainer(gpus=8, accelerator="dp")

# effective batch size = 7 * 10
Trainer(gpus=8, num_nodes=10, accelerator="ddp2")
Trainer(gpus=8, accelerator="dp")


.. note:: Huge batch sizes are actually really bad for convergence. Check out:
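The effective-batch-size rule described in the hunk above reduces to simple arithmetic; a small illustrative helper (an assumption for exposition, not a Lightning API):

def effective_batch_size(batch_size: int, gpus: int, num_nodes: int, strategy: str) -> int:
    # DDP-style strategies: every GPU on every node loads its own batch
    if strategy in ("ddp", "ddp_spawn", "ddp_sharded", "deepspeed", "horovod"):
        return batch_size * gpus * num_nodes
    # DDP2 / DP: the full batch is shared by all GPUs on a node, so only nodes multiply it
    if strategy in ("ddp2", "dp"):
        return batch_size * num_nodes
    raise ValueError(f"unhandled strategy: {strategy}")

assert effective_batch_size(7, gpus=8, num_nodes=10, strategy="ddp") == 560  # 7 * 8 * 10
assert effective_batch_size(7, gpus=8, num_nodes=10, strategy="ddp2") == 70  # 7 * 10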
2 changes: 1 addition & 1 deletion docs/source/advanced/sequences.rst
@@ -1,6 +1,6 @@

Sequential Data
================
===============

Truncated Backpropagation Through Time
--------------------------------------
65 changes: 65 additions & 0 deletions docs/source/api_references.rst
@@ -67,6 +67,71 @@ Loggers API
test_tube
wandb

Loop API
--------

Base Classes
^^^^^^^^^^^^

.. currentmodule:: pytorch_lightning.loops

.. autosummary::
:toctree: api
:nosignatures:
:template: classtemplate.rst

~base.Loop
~dataloader.dataloader_loop.DataLoaderLoop


Default Loop Implementations
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Training
""""""""

.. currentmodule:: pytorch_lightning.loops

.. autosummary::
:toctree: api
:nosignatures:
:template: classtemplate.rst

FitLoop
~epoch.TrainingEpochLoop
~batch.TrainingBatchLoop
~optimization.OptimizerLoop
~optimization.ManualOptimization


Validation and Testing
""""""""""""""""""""""

.. currentmodule:: pytorch_lightning.loops

.. autosummary::
:toctree: api
:nosignatures:
:template: classtemplate.rst

~dataloader.EvaluationLoop
~epoch.EvaluationEpochLoop


Prediction
""""""""""

.. currentmodule:: pytorch_lightning.loops

.. autosummary::
:toctree: api
:nosignatures:
:template: classtemplate.rst

~dataloader.PredictionLoop
~epoch.PredictionEpochLoop


Plugins API
-----------
