Deprecate LightningDistributed and keep broadcast in ddp/ddpSpawn directly #9692

Closed
four4fish opened this issue Sep 24, 2021 · 2 comments · Fixed by #9691
Labels

  • deprecation: Includes a deprecation
  • distributed: Generic distributed-related topic
  • feature: Is an improvement or enhancement
  • refactor

Comments

@four4fish
Contributor

four4fish commented Sep 24, 2021

Proposed refactoring or deprecation

The LightningDistributed class is used only by DDP and DDPSpawn, and it has only one broadcast function for torch collectives. It is unnecessary to keep.
Also, we have to set its rank and device during the setup steps. If a subclass extends DDP or DDPSpawn and overrides the function where LightningDistributed.rank and device are set, this can cause silent failures (see the sketch after the list below).

  1. Currently, the src argument is not respected in the torch broadcast
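
A hypothetical sketch of that failure mode (the class and hook names below are illustrative, not the exact Lightning internals): a subclass overrides the setup hook without calling super(), so LightningDistributed.rank is never assigned and the broadcast later misbehaves without raising any error.

    class LightningDistributed:
        def __init__(self, rank=None, device=None):
            self.rank = rank      # stays None until a setup step assigns it
            self.device = device

    class DDPPlugin:
        def __init__(self, global_rank, root_device):
            self.global_rank = global_rank
            self.root_device = root_device
            self.dist = LightningDistributed()

        def pre_configure_ddp(self):
            # Lightning assigns rank/device on the helper during setup.
            self.dist.rank = self.global_rank
            self.dist.device = self.root_device

    class CustomDDP(DDPPlugin):
        def pre_configure_ddp(self):
            # Forgets to call super().pre_configure_ddp(): self.dist.rank
            # stays None on every process, so rank-dependent logic inside
            # LightningDistributed.broadcast silently goes wrong.
            pass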

Motivation

Simplify the code structure and reduce the possibility of silent failures.

Pitch

Deprecate LightningDistributed. It has only one function:

    def broadcast(self, obj: Any, group=_group.WORLD):
        # Always wrap into a list so it can be broadcast.
        obj = [obj]

        # Non-source ranks only provide placeholders to be filled in.
        if self.rank != 0:
            obj = [None] * len(obj)

        # The source rank is hard-coded to 0, so a caller cannot choose src.
        broadcast_object_list(obj, 0, group=group or _group.WORLD)

        return obj[0]

Move it into DDP and DDPSpawn directly:

    def broadcast(self, obj: object, src: int = 0) -> object:
        if not distributed_available():
            raise RuntimeError(
                "DDP is not initialized and torch.distributed is not available, cannot broadcast object"
            )
        obj = [obj]
        # Compare against src (not a hard-coded 0) so that a non-default
        # source rank keeps its payload instead of broadcasting None.
        if self.global_rank != src:
            obj = [None] * len(obj)
        broadcast_object_list(obj, src, group=_group.WORLD)
        return obj[0]
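
For reference, here is a minimal standalone sketch of the broadcast_object_list pattern the method relies on (the script name and payload are made up). It can be launched with, e.g., torchrun --nproc_per_node=2 demo.py:

    import torch.distributed as dist

    def main():
        dist.init_process_group("gloo")  # gloo also works on CPU-only machines
        rank = dist.get_rank()
        src = 0
        # Only the source rank provides the real payload; the rest pass a placeholder.
        obj = [{"answer": 42}] if rank == src else [None]
        dist.broadcast_object_list(obj, src=src)
        print(f"rank {rank} received {obj[0]}")
        dist.destroy_process_group()

    if __name__ == "__main__":
        main()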

Additional context

Related to #7534



four4fish added the refactor, distributed, deprecation and feature labels Sep 24, 2021
@awaelchli
Contributor

The chosen sprint is in the past, updating it.

@awaelchli
Contributor

awaelchli commented Sep 24, 2021

I'm ok with this change. It is reasonable now that the majority of methods have disappeared from LightningDistributed, and after our accelerator rework there is no longer a need for this standalone class. In that sense, it will be replaced by the collective plugin.
