[on fork, easier to review] 3/n Consolidate collective functions - Integrate with TTPs #1

Closed
wants to merge 28 commits
Changes from all commits
28 commits
160e7e1
Deprecate LightningModule.get_progress_bar_dict (#8985)
daniellepintz Sep 9, 2021
58de08d
Fix name order in CITATION.cff (#9423)
aphedges Sep 9, 2021
c963bf6
[loops] Reset reference to dataloader iterator on run end (#9386)
ananthsub Sep 10, 2021
d028e36
Add remove_checkpoint to CheckpointIO plugin to simplify ModelCheckpo…
kaushikb11 Sep 10, 2021
3118480
Disable benchmark ci on PRs (#9430)
kaushikb11 Sep 10, 2021
e0f2e04
Share the training step output data via `ClosureResult` (#9349)
carmocca Sep 10, 2021
4f8c3ba
Type the Loop base class as generic (#9418)
carmocca Sep 10, 2021
d773407
feat: Add ModelSummary Callback (#9344)
kaushikb11 Sep 10, 2021
9eccb31
Loop and test restructuring (#9383)
carmocca Sep 10, 2021
81687aa
[docs] Clear up default logging, showing you don't need to pass a log…
Sep 10, 2021
ffd275f
[Refactor] Improve auto-encoder example (#9402)
tchaton Sep 10, 2021
ee37872
Adapt `NeptuneLogger` to new `neptune-client` api (#6867)
shnela Sep 10, 2021
7ca038b
Merge pull request #9438 from PyTorchLightning/feature/neptune-code-o…
awaelchli Sep 10, 2021
6ff43cb
fix resuming from checkpoint for fault-tolerant in case of no failure…
awaelchli Sep 10, 2021
15434a9
Update torch_xla wheels installation link (#9436)
kaushikb11 Sep 10, 2021
cc2ac02
Move add_to_queue/get_from_queue to DDPSpawnPlugin (#9118)
daniellepintz Sep 10, 2021
d2def36
[bugfix] Revert inference mode support from #8813 (#9443)
ananthsub Sep 10, 2021
b294c57
Fix type hint for filepath (#9434)
kaushikb11 Sep 10, 2021
83bff01
Add on_exception hook to documentation (#9365)
daniellepintz Sep 11, 2021
8d255b2
update rank_zero condition for logging summary (#9461)
awaelchli Sep 12, 2021
ed43ad6
2/n Consolidate collective functions - collective base and subclasses
four4fish Sep 9, 2021
387432a
2/n Consolidate collective functions - collective base and subclasses
four4fish Sep 10, 2021
9f26bf5
2/n Consolidate collective functions - collective base and subclasses
four4fish Sep 10, 2021
218188c
2/n Consolidate collective functions - collective base and subclasses
four4fish Sep 11, 2021
e4ac511
2/n Consolidate collective functions - collective base and subclasses
four4fish Sep 11, 2021
8574b00
Apply suggestions from code review
four4fish Sep 12, 2021
2379aea
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 12, 2021
8fc38bf
3/n Consolidate collective functions - Integrate with TTPs
four4fish Sep 13, 2021
17 changes: 17 additions & 0 deletions .azure-pipelines/gpu-benchmark.yml
@@ -1,3 +1,20 @@
# Python package
# Create and test a Python package on multiple Python versions.
# Add steps that analyze code, save the dist with the build record, publish to a PyPI-compatible index, and more:
# https://docs.microsoft.com/azure/devops/pipelines/languages/python

trigger:
  tags:
    include:
      - '*'
  branches:
    include:
      - "master"
      - "release/*"
      - "refs/tags/*"

pr: none

schedules:
  - cron: "0 0 * * *" # At the end of every day
    displayName: Daily midnight benchmark
1 change: 1 addition & 0 deletions .github/CODEOWNERS
@@ -25,6 +25,7 @@
/pytorch_lightning/distributed @williamfalcon @tchaton @awaelchli @kaushikb11
/pytorch_lightning/loggers @tchaton @awaelchli @borda
/pytorch_lightning/loggers/wandb.py @borisdayma
/pytorch_lightning/loggers/neptune.py @shnela @HubertJaworski @pkasprzyk @pitercl @Raalsky @aniezurawski @kamil-kaczmarek
/pytorch_lightning/loops @tchaton @awaelchli @justusschock @carmocca
/pytorch_lightning/overrides @tchaton @SeanNaren @borda
/pytorch_lightning/plugins @tchaton @SeanNaren @awaelchli @justusschock
1 change: 1 addition & 0 deletions .gitignore
@@ -147,6 +147,7 @@ wandb
.forked/
*.prof
*.tar.gz
.neptune/

# dataset generated from bolts in examples.
cifar-10-batches-py
32 changes: 28 additions & 4 deletions CHANGELOG.md
@@ -23,8 +23,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).


- Progress tracking
    * Integrate `TrainingEpochLoop.total_batch_idx` ([#8598](https://github.com/PyTorchLightning/pytorch-lightning/pull/8598)
    * Avoid optional `Tracker` attributes ([#9320](https://github.com/PyTorchLightning/pytorch-lightning/pull/9320)
    * Integrate `TrainingEpochLoop.total_batch_idx` ([#8598](https://github.com/PyTorchLightning/pytorch-lightning/pull/8598))
    * Avoid optional `Tracker` attributes ([#9320](https://github.com/PyTorchLightning/pytorch-lightning/pull/9320))
    * Reset `current` progress counters when restarting an epoch loop that had already finished ([#9371](https://github.com/PyTorchLightning/pytorch-lightning/pull/9371))


- Added `batch_size` and `rank_zero_only` arguments for `log_dict` to match `log` ([#8628](https://github.com/PyTorchLightning/pytorch-lightning/pull/8628))
@@ -110,14 +111,25 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Added `on_exception` callback hook ([#9183](https://github.com/PyTorchLightning/pytorch-lightning/pull/9183))


- Add a warning to deepspeed when inferring batch size ([#9221](https://github.com/PyTorchLightning/pytorch-lightning/pull/9221))
- Added a warning to deepspeed when inferring batch size ([#9221](https://github.com/PyTorchLightning/pytorch-lightning/pull/9221))


- Added `inference_mode` for evaluation and prediction ([8813](https://github.com/PyTorchLightning/pytorch-lightning/pull/8813))
- Added `remove_checkpoint` to `CheckpointIO` plugin by moving the responsibility from `ModelCheckpoint` Callback ([#9373](https://github.com/PyTorchLightning/pytorch-lightning/pull/9373))


- Added `ModelSummary` callback ([#9344](https://github.com/PyTorchLightning/pytorch-lightning/pull/9344))


- Add collective base class and subclasses ([#9414](https://github.com/PyTorchLightning/pytorch-lightning/pull/9414))


### Changed

- `pytorch_lightning.loggers.neptune.NeptuneLogger` is now consistent with new [neptune-client](https://github.com/neptune-ai/neptune-client) API ([#6867](https://github.com/PyTorchLightning/pytorch-lightning/pull/6867)).

The old [neptune-client](https://github.com/neptune-ai/neptune-client) API is supported by `NeptuneClient` from the [neptune-contrib](https://github.com/neptune-ai/neptune-contrib) repo.


- Parsing of the `gpus` Trainer argument has changed: `gpus="n"` (str) no longer selects the GPU index n and instead selects the first n devices. ([#8770](https://github.com/PyTorchLightning/pytorch-lightning/pull/8770))


@@ -179,6 +191,12 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Deprecated `DataModule` properties: `train_transforms`, `val_transforms`, `test_transforms`, `size`, `dims` ([#8851](https://github.com/PyTorchLightning/pytorch-lightning/pull/8851))


- Deprecated `add_to_queue`, `get_from_queue` from `LightningModule` in favor of corresponding methods in the `DDPSpawnPlugin` ([9118](https://github.com/PyTorchLightning/pytorch-lightning/pull/9118))


- Deprecated `LightningModule.get_progress_bar_dict` and `Trainer.progress_bar_dict` in favor of `pytorch_lightning.callbacks.progress.base.get_standard_metrics` and `ProgressBarBase.get_metrics` ([#8985](https://github.com/PyTorchLightning/pytorch-lightning/pull/8985))


- Deprecated `prepare_data_per_node` flag on Trainer and set it as a property of `DataHooks`, accessible in the `LightningModule` and `LightningDataModule` ([#8958](https://github.com/PyTorchLightning/pytorch-lightning/pull/8958))


@@ -331,6 +349,12 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Fixed `replace_sampler` missing the batch size under specific conditions ([#9367](https://github.com/PyTorchLightning/pytorch-lightning/pull/9367))


- Fixed bug where the training step output needed to be `deepcopy`-ed ([#9349](https://github.com/PyTorchLightning/pytorch-lightning/pull/9349))


- Fixed freeing data iterators in loop `on_run_end` ([#9386](https://github.com/PyTorchLightning/pytorch-lightning/pull/9386))


## [1.4.5] - 2021-08-31

- Fixed reduction using `self.log(sync_dict=True, reduce_fx={mean,max})` ([#9142](https://github.com/PyTorchLightning/pytorch-lightning/pull/9142))
4 changes: 2 additions & 2 deletions CITATION.cff
@@ -4,8 +4,8 @@ title: "PyTorch Lightning"
abstract: "The lightweight PyTorch wrapper for high-performance AI research. Scale your models, not the boilerplate."
date-released: 2019-03-30
authors:
  - family-names: "William"
    given-names: "Falcon"
  - family-names: "Falcon"
    given-names: "William"
  - name: "The PyTorch Lightning team"
version: 1.4
doi: 10.5281/zenodo.3828935
6 changes: 3 additions & 3 deletions docs/source/advanced/tpu.rst
@@ -23,7 +23,7 @@ A TPU is a Tensor processing unit. Each TPU has 8 cores where each
core is optimized for 128x128 matrix multiplies. In general, a single
TPU is about as fast as 5 V100 GPUs!

A TPU pod hosts many TPUs on it. Currently, TPU pod v2 has 2048 cores!
A TPU pod hosts many TPUs on it. Currently, TPU v3 Pod has up to 2048 TPU cores and 32 TiB of memory!
You can request a full pod from Google cloud or a "slice" which gives you
some subset of those 2048 cores.

@@ -64,9 +64,9 @@ To get a TPU on colab, follow these steps:

.. code-block::

!pip install cloud-tpu-client==0.10 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.8-cp37-cp37m-linux_x86_64.whl
!pip install cloud-tpu-client==0.10 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl

5. Once the above is done, install PyTorch Lightning (v 0.7.0+).
5. Once the above is done, install PyTorch Lightning.

.. code-block::

16 changes: 5 additions & 11 deletions docs/source/common/lightning_module.rst
@@ -80,7 +80,7 @@ Notice a few things.
out = net(x)

Thus, to use Lightning, you just need to organize your code which takes about 30 minutes,
(and let's be real, you probably should do anyhow).
(and let's be real, you probably should do anyway).

------------

@@ -267,8 +267,8 @@ The matching pseudocode is:

Training with DataParallel
~~~~~~~~~~~~~~~~~~~~~~~~~~
When training using a `accelerator` that splits data from each batch across GPUs, sometimes you might
need to aggregate them on the master GPU for processing (dp, or ddp2).
When training using an `accelerator` that splits data from each batch across GPUs, sometimes you might
need to aggregate them on the main GPU for processing (dp, or ddp2).

In this case, implement the `training_step_end` method

@@ -379,8 +379,8 @@ If you need to do something with all the outputs of each `validation_step`, over

Validating with DataParallel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When training using a `accelerator` that splits data from each batch across GPUs, sometimes you might
need to aggregate them on the master GPU for processing (dp, or ddp2).
When training using an `accelerator` that splits data from each batch across GPUs, sometimes you might
need to aggregate them on the main GPU for processing (dp, or ddp2).

In this case, implement the `validation_step_end` method

@@ -1242,12 +1242,6 @@ backward
.. automethod:: pytorch_lightning.core.lightning.LightningModule.backward
:noindex:

get_progress_bar_dict
~~~~~~~~~~~~~~~~~~~~~

.. automethod:: pytorch_lightning.core.lightning.LightningModule.get_progress_bar_dict
:noindex:

on_before_backward
~~~~~~~~~~~~~~~~~~

36 changes: 26 additions & 10 deletions docs/source/common/loggers.rst
@@ -9,7 +9,7 @@
Loggers
*******

Lightning supports the most popular logging frameworks (TensorBoard, Comet, etc...). TensorBoard is used by default,
Lightning supports the most popular logging frameworks (TensorBoard, Comet, Neptune, etc...). TensorBoard is used by default,
but you can pass to the :class:`~pytorch_lightning.trainer.trainer.Trainer` any combination of the following loggers.

.. note::
@@ -107,34 +107,50 @@ First, install the package:

pip install neptune-client

or with conda:

.. code-block:: bash

conda install -c conda-forge neptune-client

Then configure the logger and pass it to the :class:`~pytorch_lightning.trainer.trainer.Trainer`:

.. testcode::
.. code-block:: python

    from pytorch_lightning.loggers import NeptuneLogger

    neptune_logger = NeptuneLogger(
        api_key="ANONYMOUS", # replace with your own
        project_name="shared/pytorch-lightning-integration",
        experiment_name="default", # Optional,
        params={"max_epochs": 10}, # Optional,
        tags=["pytorch-lightning", "mlp"], # Optional,
        project="common/pytorch-lightning-integration", # format "<WORKSPACE/PROJECT>"
        tags=["training", "resnet"], # optional
    )
    trainer = Trainer(logger=neptune_logger)

The :class:`~pytorch_lightning.loggers.NeptuneLogger` is available anywhere except ``__init__`` in your
:class:`~pytorch_lightning.core.lightning.LightningModule`.

.. testcode::
.. code-block:: python

    class MyModule(LightningModule):
        def any_lightning_module_function_or_hook(self):
            some_img = fake_image()
            self.logger.experiment.add_image("generated_images", some_img, 0)
            # generic recipe for logging custom metadata (neptune specific)
            metadata = ...
            self.logger.experiment["your/metadata/structure"].log(metadata)

Note that the syntax ``self.logger.experiment["your/metadata/structure"].log(metadata)``
is specific to Neptune and extends the logger's capabilities.
Specifically, it allows you to log various types of metadata like scores, files,
images, interactive visuals, CSVs, etc. Refer to the
`Neptune docs <https://docs.neptune.ai/you-should-know/logging-metadata#essential-logging-methods>`_
for more detailed explanations.

You can always use the regular logger methods ``log_metrics()`` and ``log_hyperparams()``, as these are also supported.
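
As a minimal sketch (illustrative, not part of this diff), the generic logger interface works the same way with Neptune:

.. code-block:: python

    # generic LightningLoggerBase methods, used here with the neptune_logger created above
    neptune_logger.log_hyperparams({"learning_rate": 1e-3, "batch_size": 32})
    neptune_logger.log_metrics({"acc": 0.92}, step=10)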

.. seealso::
:class:`~pytorch_lightning.loggers.NeptuneLogger` docs.

Logger `user guide <https://docs.neptune.ai/integrations-and-supported-tools/model-training/pytorch-lightning>`_.

----------------

Tensorboard
@@ -227,7 +243,7 @@ Then configure the logger and pass it to the :class:`~pytorch_lightning.trainer.
The :class:`~pytorch_lightning.loggers.WandbLogger` is available anywhere except ``__init__`` in your
:class:`~pytorch_lightning.core.lightning.LightningModule`.

.. testcode::
.. code-block:: python

    class MyModule(LightningModule):
        def any_lightning_module_function_or_hook(self):
6 changes: 6 additions & 0 deletions docs/source/extensions/callbacks.rst
@@ -395,6 +395,12 @@ on_keyboard_interrupt
.. automethod:: pytorch_lightning.callbacks.Callback.on_keyboard_interrupt
:noindex:

on_exception
^^^^^^^^^^^^

.. automethod:: pytorch_lightning.callbacks.Callback.on_exception
:noindex:
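
A minimal sketch of a callback using this hook (illustrative, not part of this diff; assumes the hook receives the trainer, the LightningModule, and the raised exception):

.. code-block:: python

    from pytorch_lightning.callbacks import Callback


    class LogExceptions(Callback):
        def on_exception(self, trainer, pl_module, exception):
            # record which exception interrupted the run before it propagates
            print(f"Run interrupted at global step {trainer.global_step}: {exception!r}")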

on_save_checkpoint
^^^^^^^^^^^^^^^^^^

26 changes: 21 additions & 5 deletions docs/source/extensions/logging.rst
@@ -15,8 +15,24 @@ Logging
#######

Lightning supports the most popular logging frameworks (TensorBoard, Comet, etc...).
To use a logger, simply pass it into the :class:`~pytorch_lightning.trainer.trainer.Trainer`.
Lightning uses TensorBoard by default.

By default, Lightning uses `PyTorch TensorBoard <https://pytorch.org/docs/stable/tensorboard.html>`__ logging under the hood, and stores the logs to a directory (by default in ``lightning_logs/``).

.. testcode::

from pytorch_lightning import Trainer

# Automatically logs to a directory
# (by default ``lightning_logs/``)
trainer = Trainer()

To see your logs:

.. code-block:: bash

tensorboard --logdir=lightning_logs/

You can also pass a custom Logger to the :class:`~pytorch_lightning.trainer.trainer.Trainer`.

.. testcode::

@@ -245,13 +261,13 @@ Modifying the progress bar

The progress bar by default already includes the training loss and version number of the experiment
if you are using a logger. These defaults can be customized by overriding the
:func:`~pytorch_lightning.core.lightning.LightningModule.get_progress_bar_dict` hook in your module.
:func:`~pytorch_lightning.callbacks.base.ProgressBarBase.get_metrics` hook in your module.

.. code-block:: python

    def get_progress_bar_dict(self):
    def get_metrics(self):
        # don't show the version number
        items = super().get_progress_bar_dict()
        items = super().get_metrics()
        items.pop("v_num", None)
        return items
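
For completeness, a minimal sketch of wiring such an override into the ``Trainer`` (illustrative, not part of this diff; assumes the tqdm-based ``ProgressBar`` callback exported from ``pytorch_lightning.callbacks``):

.. code-block:: python

    from pytorch_lightning import Trainer
    from pytorch_lightning.callbacks import ProgressBar


    class NoVersionProgressBar(ProgressBar):
        def get_metrics(self, *args, **kwargs):
            items = super().get_metrics(*args, **kwargs)
            items.pop("v_num", None)  # hide the experiment version number
            return items


    trainer = Trainer(callbacks=[NoVersionProgressBar()])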
