[on fork, easier to review] 3/n Consolidate collective functions - Integrate with TTPs #1

Closed
wants to merge 28 commits
Changes from all commits
28 commits
160e7e1
Deprecate LightningModule.get_progress_bar_dict (#8985)
daniellepintz Sep 9, 2021
58de08d
Fix name order in CITATION.cff (#9423)
aphedges Sep 9, 2021
c963bf6
[loops] Reset reference to dataloader iterator on run end (#9386)
ananthsub Sep 10, 2021
d028e36
Add remove_checkpoint to CheckpointIO plugin to simplify ModelCheckpo…
kaushikb11 Sep 10, 2021
3118480
Disable benchmark ci on PRs (#9430)
kaushikb11 Sep 10, 2021
e0f2e04
Share the training step output data via `ClosureResult` (#9349)
carmocca Sep 10, 2021
4f8c3ba
Type the Loop base class as generic (#9418)
carmocca Sep 10, 2021
d773407
feat: Add ModelSummary Callback (#9344)
kaushikb11 Sep 10, 2021
9eccb31
Loop and test restructuring (#9383)
carmocca Sep 10, 2021
81687aa
[docs] Clear up default logging, showing you don't need to pass a log…
Sep 10, 2021
ffd275f
[Refactor] Improve auto-encoder example (#9402)
tchaton Sep 10, 2021
ee37872
Adapt `NeptuneLogger` to new `neptune-client` api (#6867)
shnela Sep 10, 2021
7ca038b
Merge pull request #9438 from PyTorchLightning/feature/neptune-code-o…
awaelchli Sep 10, 2021
6ff43cb
fix resuming from checkpoint for fault-tolerant in case of no failure…
awaelchli Sep 10, 2021
15434a9
Update torch_xla wheels installation link (#9436)
kaushikb11 Sep 10, 2021
cc2ac02
Move add_to_queue/get_from_queue to DDPSpawnPlugin (#9118)
daniellepintz Sep 10, 2021
d2def36
[bugfix] Revert inference mode support from #8813 (#9443)
ananthsub Sep 10, 2021
b294c57
Fix type hint for filepath (#9434)
kaushikb11 Sep 10, 2021
83bff01
Add on_exception hook to documentation (#9365)
daniellepintz Sep 11, 2021
8d255b2
update rank_zero condition for logging summary (#9461)
awaelchli Sep 12, 2021
ed43ad6
2/n Consolidate collective functions - collective base and subclasses
four4fish Sep 9, 2021
387432a
2/n Consolidate collective functions - collective base and subclasses
four4fish Sep 10, 2021
9f26bf5
2/n Consolidate collective functions - collective base and subclasses
four4fish Sep 10, 2021
218188c
2/n Consolidate collective functions - collective base and subclasses
four4fish Sep 11, 2021
e4ac511
2/n Consolidate collective functions - collective base and subclasses
four4fish Sep 11, 2021
8574b00
Apply suggestions from code review
four4fish Sep 12, 2021
2379aea
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 12, 2021
8fc38bf
3/n Consolidate collective functions - Integrate with TTPs
four4fish Sep 13, 2021
17 changes: 17 additions & 0 deletions .azure-pipelines/gpu-benchmark.yml
@@ -1,3 +1,20 @@
# Python package
# Create and test a Python package on multiple Python versions.
# Add steps that analyze code, save the dist with the build record, publish to a PyPI-compatible index, and more:
# https://docs.microsoft.com/azure/devops/pipelines/languages/python

trigger:
  tags:
    include:
      - '*'
  branches:
    include:
      - "master"
      - "release/*"
      - "refs/tags/*"

pr: none

schedules:
  - cron: "0 0 * * *" # At the end of every day
    displayName: Daily midnight benchmark
1 change: 1 addition & 0 deletions .github/CODEOWNERS
@@ -25,6 +25,7 @@
/pytorch_lightning/distributed @williamfalcon @tchaton @awaelchli @kaushikb11
/pytorch_lightning/loggers @tchaton @awaelchli @borda
/pytorch_lightning/loggers/wandb.py @borisdayma
/pytorch_lightning/loggers/neptune.py @shnela @HubertJaworski @pkasprzyk @pitercl @Raalsky @aniezurawski @kamil-kaczmarek
/pytorch_lightning/loops @tchaton @awaelchli @justusschock @carmocca
/pytorch_lightning/overrides @tchaton @SeanNaren @borda
/pytorch_lightning/plugins @tchaton @SeanNaren @awaelchli @justusschock
1 change: 1 addition & 0 deletions .gitignore
@@ -147,6 +147,7 @@ wandb
.forked/
*.prof
*.tar.gz
.neptune/

# dataset generated from bolts in examples.
cifar-10-batches-py
32 changes: 28 additions & 4 deletions CHANGELOG.md
@@ -23,8 +23,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).


- Progress tracking
    * Integrate `TrainingEpochLoop.total_batch_idx` ([#8598](https://github.com/PyTorchLightning/pytorch-lightning/pull/8598)
    * Avoid optional `Tracker` attributes ([#9320](https://github.com/PyTorchLightning/pytorch-lightning/pull/9320)
    * Integrate `TrainingEpochLoop.total_batch_idx` ([#8598](https://github.com/PyTorchLightning/pytorch-lightning/pull/8598))
    * Avoid optional `Tracker` attributes ([#9320](https://github.com/PyTorchLightning/pytorch-lightning/pull/9320))
    * Reset `current` progress counters when restarting an epoch loop that had already finished ([#9371](https://github.com/PyTorchLightning/pytorch-lightning/pull/9371))


- Added `batch_size` and `rank_zero_only` arguments for `log_dict` to match `log` ([#8628](https://github.com/PyTorchLightning/pytorch-lightning/pull/8628))
@@ -110,14 +111,25 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Added `on_exception` callback hook ([#9183](https://github.com/PyTorchLightning/pytorch-lightning/pull/9183))


- Add a warning to deepspeed when inferring batch size ([#9221](https://github.com/PyTorchLightning/pytorch-lightning/pull/9221))
- Added a warning to deepspeed when inferring batch size ([#9221](https://github.com/PyTorchLightning/pytorch-lightning/pull/9221))


- Added `inference_mode` for evaluation and prediction ([8813](https://github.com/PyTorchLightning/pytorch-lightning/pull/8813))
- Added `remove_checkpoint` to `CheckpointIO` plugin by moving the responsibility from `ModelCheckpoint` Callback ([#9373](https://github.com/PyTorchLightning/pytorch-lightning/pull/9373))


- Added `ModelSummary` callback ([#9344](https://github.com/PyTorchLightning/pytorch-lightning/pull/9344))


- Add collective base class and subclasses ([#9414](https://github.com/PyTorchLightning/pytorch-lightning/pull/9414))


### Changed

- `pytorch_lightning.loggers.neptune.NeptuneLogger` is now consistent with new [neptune-client](https://github.com/neptune-ai/neptune-client) API ([#6867](https://github.com/PyTorchLightning/pytorch-lightning/pull/6867)).

The old [neptune-client](https://github.com/neptune-ai/neptune-client) API is supported by `NeptuneClient` from the [neptune-contrib](https://github.com/neptune-ai/neptune-contrib) repo.


- Parsing of the `gpus` Trainer argument has changed: `gpus="n"` (str) no longer selects the GPU index n and instead selects the first n devices. ([#8770](https://github.com/PyTorchLightning/pytorch-lightning/pull/8770))


@@ -179,6 +191,12 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Deprecated `DataModule` properties: `train_transforms`, `val_transforms`, `test_transforms`, `size`, `dims` ([#8851](https://github.com/PyTorchLightning/pytorch-lightning/pull/8851))


- Deprecated `add_to_queue`, `get_from_queue` from `LightningModule` in favor of corresponding methods in the `DDPSpawnPlugin` ([9118](https://github.com/PyTorchLightning/pytorch-lightning/pull/9118))


- Deprecated `LightningModule.get_progress_bar_dict` and `Trainer.progress_bar_dict` in favor of `pytorch_lightning.callbacks.progress.base.get_standard_metrics` and `ProgressBarBase.get_metrics` ([#8985](https://github.com/PyTorchLightning/pytorch-lightning/pull/8985))


- Deprecated `prepare_data_per_node` flag on Trainer and set it as a property of `DataHooks`, accessible in the `LightningModule` and `LightningDataModule` ([#8958](https://github.com/PyTorchLightning/pytorch-lightning/pull/8958))


@@ -331,6 +349,12 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Fixed `replace_sampler` missing the batch size under specific conditions ([#9367](https://github.com/PyTorchLightning/pytorch-lightning/pull/9367))


- Fixed bug where the training step output needed to be `deepcopy`-ed ([#9349](https://github.com/PyTorchLightning/pytorch-lightning/pull/9349))


- Fixed freeing data iterators in loop `on_run_end` ([#9386](https://github.com/PyTorchLightning/pytorch-lightning/pull/9386))


## [1.4.5] - 2021-08-31

- Fixed reduction using `self.log(sync_dict=True, reduce_fx={mean,max})` ([#9142](https://github.com/PyTorchLightning/pytorch-lightning/pull/9142))
4 changes: 2 additions & 2 deletions CITATION.cff
@@ -4,8 +4,8 @@ title: "PyTorch Lightning"
abstract: "The lightweight PyTorch wrapper for high-performance AI research. Scale your models, not the boilerplate."
date-released: 2019-03-30
authors:
  - family-names: "William"
    given-names: "Falcon"
  - family-names: "Falcon"
    given-names: "William"
  - name: "The PyTorch Lightning team"
version: 1.4
doi: 10.5281/zenodo.3828935
6 changes: 3 additions & 3 deletions docs/source/advanced/tpu.rst
@@ -23,7 +23,7 @@ A TPU is a Tensor processing unit. Each TPU has 8 cores where each
core is optimized for 128x128 matrix multiplies. In general, a single
TPU is about as fast as 5 V100 GPUs!

A TPU pod hosts many TPUs on it. Currently, TPU pod v2 has 2048 cores!
A TPU pod hosts many TPUs on it. Currently, TPU v3 Pod has up to 2048 TPU cores and 32 TiB of memory!
You can request a full pod from Google cloud or a "slice" which gives you
some subset of those 2048 cores.

@@ -64,9 +64,9 @@ To get a TPU on colab, follow these steps:

.. code-block::

!pip install cloud-tpu-client==0.10 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.8-cp37-cp37m-linux_x86_64.whl
!pip install cloud-tpu-client==0.10 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl

5. Once the above is done, install PyTorch Lightning (v 0.7.0+).
5. Once the above is done, install PyTorch Lightning.

.. code-block::

16 changes: 5 additions & 11 deletions docs/source/common/lightning_module.rst
@@ -80,7 +80,7 @@ Notice a few things.
out = net(x)

Thus, to use Lightning, you just need to organize your code which takes about 30 minutes,
(and let's be real, you probably should do anyhow).
(and let's be real, you probably should do anyway).

------------

@@ -267,8 +267,8 @@ The matching pseudocode is:

Training with DataParallel
~~~~~~~~~~~~~~~~~~~~~~~~~~
When training using a `accelerator` that splits data from each batch across GPUs, sometimes you might
need to aggregate them on the master GPU for processing (dp, or ddp2).
When training using an `accelerator` that splits data from each batch across GPUs, sometimes you might
need to aggregate them on the main GPU for processing (dp, or ddp2).

In this case, implement the `training_step_end` method

@@ -379,8 +379,8 @@ If you need to do something with all the outputs of each `validation_step`, over

Validating with DataParallel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When training using a `accelerator` that splits data from each batch across GPUs, sometimes you might
need to aggregate them on the master GPU for processing (dp, or ddp2).
When training using an `accelerator` that splits data from each batch across GPUs, sometimes you might
need to aggregate them on the main GPU for processing (dp, or ddp2).

In this case, implement the `validation_step_end` method

@@ -1242,12 +1242,6 @@ backward
.. automethod:: pytorch_lightning.core.lightning.LightningModule.backward
:noindex:

get_progress_bar_dict
~~~~~~~~~~~~~~~~~~~~~

.. automethod:: pytorch_lightning.core.lightning.LightningModule.get_progress_bar_dict
:noindex:

on_before_backward
~~~~~~~~~~~~~~~~~~

36 changes: 26 additions & 10 deletions docs/source/common/loggers.rst
@@ -9,7 +9,7 @@
Loggers
*******

Lightning supports the most popular logging frameworks (TensorBoard, Comet, etc...). TensorBoard is used by default,
Lightning supports the most popular logging frameworks (TensorBoard, Comet, Neptune, etc...). TensorBoard is used by default,
but you can pass to the :class:`~pytorch_lightning.trainer.trainer.Trainer` any combination of the following loggers.

.. note::
@@ -107,34 +107,50 @@ First, install the package:

pip install neptune-client

or with conda:

.. code-block:: bash

conda install -c conda-forge neptune-client

Then configure the logger and pass it to the :class:`~pytorch_lightning.trainer.trainer.Trainer`:

.. testcode::
.. code-block:: python

    from pytorch_lightning.loggers import NeptuneLogger

    neptune_logger = NeptuneLogger(
        api_key="ANONYMOUS", # replace with your own
        project_name="shared/pytorch-lightning-integration",
        experiment_name="default", # Optional,
        params={"max_epochs": 10}, # Optional,
        tags=["pytorch-lightning", "mlp"], # Optional,
        project="common/pytorch-lightning-integration", # format "<WORKSPACE/PROJECT>"
        tags=["training", "resnet"], # optional
    )
    trainer = Trainer(logger=neptune_logger)

The :class:`~pytorch_lightning.loggers.NeptuneLogger` is available anywhere except ``__init__`` in your
:class:`~pytorch_lightning.core.lightning.LightningModule`.

.. testcode::
.. code-block:: python

    class MyModule(LightningModule):
        def any_lightning_module_function_or_hook(self):
            some_img = fake_image()
            self.logger.experiment.add_image("generated_images", some_img, 0)
            # generic recipe for logging custom metadata (neptune specific)
            metadata = ...
            self.logger.experiment["your/metadata/structure"].log(metadata)

Note that the syntax ``self.logger.experiment["your/metadata/structure"].log(metadata)``
is specific to Neptune and extends the logger's capabilities.
Specifically, it allows you to log various types of metadata like scores, files,
images, interactive visuals, CSVs, etc. Refer to the
`Neptune docs <https://docs.neptune.ai/you-should-know/logging-metadata#essential-logging-methods>`_
for more detailed explanations.

You can always use the regular logger methods ``log_metrics()`` and ``log_hyperparams()``, as these are also supported.
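
As a minimal sketch (illustrative, not part of this diff), the generic logger interface works the same way with Neptune:

.. code-block:: python

    # generic LightningLoggerBase methods, used here with the neptune_logger created above
    neptune_logger.log_hyperparams({"learning_rate": 1e-3, "batch_size": 32})
    neptune_logger.log_metrics({"acc": 0.92}, step=10)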

.. seealso::
:class:`~pytorch_lightning.loggers.NeptuneLogger` docs.

Logger `user guide <https://docs.neptune.ai/integrations-and-supported-tools/model-training/pytorch-lightning>`_.

----------------

Tensorboard
@@ -227,7 +243,7 @@ Then configure the logger and pass it to the :class:`~pytorch_lightning.trainer.
The :class:`~pytorch_lightning.loggers.WandbLogger` is available anywhere except ``__init__`` in your
:class:`~pytorch_lightning.core.lightning.LightningModule`.

.. testcode::
.. code-block:: python

    class MyModule(LightningModule):
        def any_lightning_module_function_or_hook(self):
6 changes: 6 additions & 0 deletions docs/source/extensions/callbacks.rst
@@ -395,6 +395,12 @@ on_keyboard_interrupt
.. automethod:: pytorch_lightning.callbacks.Callback.on_keyboard_interrupt
:noindex:

on_exception
^^^^^^^^^^^^

.. automethod:: pytorch_lightning.callbacks.Callback.on_exception
:noindex:
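
A minimal sketch of a callback using this hook (illustrative, not part of this diff; assumes the hook receives the trainer, the LightningModule, and the raised exception):

.. code-block:: python

    from pytorch_lightning.callbacks import Callback


    class LogExceptions(Callback):
        def on_exception(self, trainer, pl_module, exception):
            # record which exception interrupted the run before it propagates
            print(f"Run interrupted at global step {trainer.global_step}: {exception!r}")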

on_save_checkpoint
^^^^^^^^^^^^^^^^^^

26 changes: 21 additions & 5 deletions docs/source/extensions/logging.rst
@@ -15,8 +15,24 @@ Logging
#######

Lightning supports the most popular logging frameworks (TensorBoard, Comet, etc...).
To use a logger, simply pass it into the :class:`~pytorch_lightning.trainer.trainer.Trainer`.
Lightning uses TensorBoard by default.

By default, Lightning uses `PyTorch TensorBoard <https://pytorch.org/docs/stable/tensorboard.html>`__ logging under the hood, and stores the logs to a directory (by default in ``lightning_logs/``).

.. testcode::

from pytorch_lightning import Trainer

# Automatically logs to a directory
# (by default ``lightning_logs/``)
trainer = Trainer()

To see your logs:

.. code-block:: bash

tensorboard --logdir=lightning_logs/

You can also pass a custom Logger to the :class:`~pytorch_lightning.trainer.trainer.Trainer`.

.. testcode::

@@ -245,13 +261,13 @@ Modifying the progress bar

The progress bar by default already includes the training loss and version number of the experiment
if you are using a logger. These defaults can be customized by overriding the
:func:`~pytorch_lightning.core.lightning.LightningModule.get_progress_bar_dict` hook in your module.
:func:`~pytorch_lightning.callbacks.base.ProgressBarBase.get_metrics` hook in your module.

.. code-block:: python

    def get_progress_bar_dict(self):
    def get_metrics(self):
        # don't show the version number
        items = super().get_progress_bar_dict()
        items = super().get_metrics()
        items.pop("v_num", None)
        return items
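
For completeness, a minimal sketch of wiring such an override into the ``Trainer`` (illustrative, not part of this diff; assumes the tqdm-based ``ProgressBar`` callback exported from ``pytorch_lightning.callbacks``):

.. code-block:: python

    from pytorch_lightning import Trainer
    from pytorch_lightning.callbacks import ProgressBar


    class NoVersionProgressBar(ProgressBar):
        def get_metrics(self, *args, **kwargs):
            items = super().get_metrics(*args, **kwargs)
            items.pop("v_num", None)  # hide the experiment version number
            return items


    trainer = Trainer(callbacks=[NoVersionProgressBar()])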
