
[WIP] Add support for HPU accelerator #10404


Closed
jerome-habana wants to merge 2 commits

Conversation

@jerome-habana (Contributor) commented on Nov 8, 2021:

  • Added a new accelerator to support Gaudi (HPU) devices
  • Enabled DDP support for multi-card runs
  • Added an HPU precision plugin for bf16
  • Updated the distributed overrides to support HPU tensor types

Signed-off-by: Jerome [email protected]

What does this PR do?

Fixes #10214

Does your PR introduce any breaking changes? If yes, please list them.

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

jerome-habana and others added 2 commits November 8, 2021 07:30
 - A new accelerator to support Gaudi devices added
 - DDP support for multi-card runs enabled
 - HPU precision plugin for bf16 added
 - Distributed overrides updated for HPU tensor types support

Signed-off-by: Jerome <[email protected]>
@justusschock (Member) left a comment:

Hey, already looking good! I left you some early comments and suggestions, but it seems like the integration is going well :)

@@ -174,6 +176,7 @@ def __init__(
plugins: Optional[Union[PLUGIN_INPUT, List[PLUGIN_INPUT]]] = None,
amp_backend: str = "native",
amp_level: Optional[str] = None,
hmp_params: ["level", "verbose", "bf16_ops", "fp32_ops"] = None,
Member:

How often will people change these? Is it sensible to have these as defaults and ask people to instantiate the plugin class themselves if they want to override? We have this for other backends like FSDP or DeepSpeed already.

Also: please no mutable defaults, this is one of Python's quirks :)
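
For context, a minimal sketch of the mutable-default pitfall the comment refers to (generic Python, not code from this PR):

# Anti-pattern: the default dict is created once and shared across every call.
def register_op(op, ops={}):
    ops[op] = True
    return ops

register_op("mm")    # {'mm': True}
register_op("conv")  # {'mm': True, 'conv': True}  <- state leaked from the first call

# Safer: default to None and create the container inside the function.
def register_op_safe(op, ops=None):
    if ops is None:
        ops = {}
    ops[op] = True
    return ops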

Contributor:

Yes, these parameters shouldn't be added to the Trainer. amp_level could be mapped to hmp_params, but ultimately this should be passed only via the HPUPlugin.
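
As a rough sketch of that direction (the HPUPlugin import path, its constructor keywords, and the "O1" level value are assumptions based on this WIP PR, not a finalized API):

from pytorch_lightning import Trainer
from pytorch_lightning.plugins import HPUPlugin  # assumed import path for the plugin added here

# Hypothetical HMP configuration, mirroring the keys used elsewhere in this PR.
hmp_params = {
    "level": "O1",
    "verbose": False,
    "bf16_ops": "ops_bf16_mnist.txt",
    "fp32_ops": "ops_fp32_mnist.txt",
}

# Configure mixed precision on the plugin itself and hand the plugin to the Trainer,
# instead of introducing a new Trainer-level flag.
trainer = Trainer(plugins=[HPUPlugin(device=1, hmp_params=hmp_params)], max_epochs=1)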

Contributor Author:

Using amp doesn't seem to be the right way. We might need to generalize this.

@@ -130,6 +131,7 @@ def __init__(
devices: Optional[Union[List[int], str, int]] = None,
gpus: Optional[Union[List[int], str, int]] = None,
auto_select_gpus: bool = False,
hpus: Optional[int] = None,
Member:

@jerome-habana Would you be OK with not having a separate flag for this and integrating it with devices and accelerator?

We are currently in the process of reducing and re-evaluating the trainer arguments as they've grown quite large.
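
For illustration, the generic selection the comment points toward would look roughly like this (the "hpu" accelerator string is an assumption for this in-progress integration):

from pytorch_lightning import Trainer

# Select the backend and device count through the existing generic flags
# rather than a dedicated `hpus=...` Trainer argument.
trainer = Trainer(accelerator="hpu", devices=8, max_epochs=1)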

Contributor:

Yes, we are currently deprecating those flags for readability. Let's not add hpus there.

Contributor:

@justusschock Has this been decided? We still have the IPU flag, for example. Is there an issue to track this development?

Contributor:

> Yes, we are currently deprecating those flags for readability.

We haven't decided yet that we will be deprecating those flags.

Contributor Author:

What is the final decision? Shall I proceed with the existing change?

Comment on lines +839 to +843
if self.logger is not None:
# save exp to get started (this is where the first experiment logs are written)
# self.logger.log_hyperparams(self.lightning_module.hparams_initial)
self.logger.log_graph(self.lightning_module)
self.logger.save()
Member:

Is this a hard requirement for you? Because this also changes behaviour for all other backends, which we usually want to avoid.

Comment on lines +392 to +403

# local rank mapping for device open is needed for hpu devices
if torch_distributed_backend == "hcl" or torch_distributed_backend == "hccl":
try:
import habana_frameworks.torch.core.hccl
except Exception:
print("hccl backend is not supported, using hcl backend")
torch_distributed_backend = "hcl"
os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = "hcl"

os.environ["ID"] = str(cluster_environment.local_rank())

Member:

Sorry, but I don't think this can live here; we try to keep everything outside the accelerators device-agnostic whenever possible, so special-casing should be avoided. Even though it feels like overhead, could you maybe subclass this?
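
One way to read that suggestion, sketched under assumptions (the base class and hook names are illustrative; the actual Lightning internals at the time may differ):

import os

from pytorch_lightning.plugins import DDPPlugin  # assumed base class for an HPU strategy


class HPUDDPPlugin(DDPPlugin):
    """Keeps the HPU-specific environment setup inside the HPU plugin
    instead of special-casing the shared init_dist_connection utility."""

    def setup_environment(self) -> None:
        # HPU device open needs the local-rank mapping exposed via the "ID" variable.
        os.environ["ID"] = str(self.cluster_environment.local_rank())
        super().setup_environment()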

Contributor Author:

Makes sense. Will make this special case redundant and try it out.

Comment on lines +112 to +114
from habana_frameworks.torch.utils.library_loader import is_habana_avaialble

_HPU_AVAILABLE = is_habana_avaialble()
Member:

Suggested change
from habana_frameworks.torch.utils.library_loader import is_habana_avaialble
_HPU_AVAILABLE = is_habana_avaialble()
try:
    from habana_frameworks.torch.utils.library_loader import is_habana_avaialble

    _HPU_AVAILABLE = is_habana_avaialble()
except ImportError:
    _HPU_AVAILABLE = False

Contributor:

Here, we could check availability of two things

  1. Habana framework
  2. Habana Devices

You could rather check this via

_HABANA_AVAILABLE = _module_available("habana")
if _HABANA_AVAILABLE:
    from habana_frameworks.torch.utils.library_loader import is_habana_avaialble

    _HPU_AVAILABLE = is_habana_avaialble()
else:
    _HPU_AVAILABLE = False

Contributor Author:

I'm already doing it within the function.

Comment on lines +52 to +54
@property
def is_distributed(self) -> bool:
return False
Member:

Isn't this always the case? If not, it is likely an issue beyond this PR; if yes, there's no need to override :)

Contributor Author:

I was thinking the same plugin could handle both 1x and 8x and be generalized, so I added it.

Comment on lines +57 to +60
import os

dist_backend = os.environ.get("PL_TORCH_DISTRIBUTED_BACKEND")
is_hcl_backend = group_backend == torch.distributed.Backend(str(dist_backend))
@justusschock (Member) commented Nov 8, 2021:

All the code in this file will be removed in #10390 together with the support for PyTorch 1.6

):

device = torch.device("hpu")
checkpoint_io = checkpoint_io
Member:

Suggested change
checkpoint_io = checkpoint_io

checkpoint_io = checkpoint_io
super().__init__(device, checkpoint_io=checkpoint_io)

self.debug = debug
Member:

What is this flag needed for?

Contributor Author:

Not needed. Will remove it.

Comment on lines 53 to +54
tpu_local_core_rank: int
hpu_local_core_rank: int
Member:

Can we just combine those into a single local core rank? I don't think we need two attributes that do basically the same thing and are likely only used mutually exclusively.

debug: bool = False,
):

device = torch.device("hpu")
Contributor:

We can pass them directly to super().__init__().

return False

def setup(self) -> None:
shared_params = find_shared_parameters(self.model)
Contributor:

I don't believe this logic is required for HPU. This is quite specific to TPUs, which don't support weight tying.
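
For readers unfamiliar with the term, "tying" here means sharing one parameter tensor between modules, e.g. (generic PyTorch, not code from this PR):

import torch.nn as nn


class TiedLM(nn.Module):
    def __init__(self, vocab_size: int = 1000, hidden: int = 128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.Linear(hidden, vocab_size, bias=False)
        # Weight tying: the decoder reuses the embedding's parameter tensor.
        self.decoder.weight = self.embedding.weight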

Contributor Author:

This is common for all sharding use cases, if I'm not wrong.

self.device = torch.device(self.device)

def on_save(self, checkpoint: dict) -> dict:
return move_data_to_device(checkpoint, torch.device("cpu"))
Contributor:

I believe this should use the checkpoint_io plugin.
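
A rough sketch of what delegating to the CheckpointIO plugin could look like (SingleDevicePlugin as base and the exact hook are assumptions; this is not the PR's implementation):

import torch

from pytorch_lightning.plugins import SingleDevicePlugin  # assumed base for the single-HPU plugin
from pytorch_lightning.utilities.apply_func import move_data_to_device


class HPUPlugin(SingleDevicePlugin):
    def save_checkpoint(self, checkpoint: dict, filepath: str) -> None:
        # Move HPU tensors to CPU, then let the configured CheckpointIO handle the write
        # rather than doing the transfer in on_save.
        checkpoint = move_data_to_device(checkpoint, torch.device("cpu"))
        self.checkpoint_io.save_checkpoint(checkpoint, filepath)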

hmp_params["fp32_ops"] = "./pytorch-lightning-fork/pl_examples/hpu_examples/simple_mnist/ops_fp32_mnist.txt"

# Initialize a trainer
trainer = pl.Trainer(hpus=1, max_epochs=1, precision=16, hmp_params=hmp_params)
Contributor:

We won't support adding a new flag within the Trainer such as hmp_params. This would have to rely directly on existing flags or be passed directly by creating the plugin.

Contributor Author:

@tchaton Do you have a recommendation? It would be nice to generalize this as a common param based on the backend, but that would involve modifying amp.

device: int,
checkpoint_io: Optional[CheckpointIO] = None,
debug: bool = False,
):
Contributor:

Docstring?

elif self.has_hpu:
self._device_type = DeviceType.HPU

def update_device_type_if_hpu_plugin(self) -> None:
Contributor:

Is this function used?

Contributor Author:

Will check and remove it if not required, but there was an issue with the DeviceType.

@@ -208,6 +213,9 @@ def forward(

gathered_tensor = [torch.zeros_like(tensor) for _ in range(torch.distributed.get_world_size())]

if _HPU_AVAILABLE:
# HPU distributed backend doesn't support int64 tensors
tensor = tensor.int()
Contributor:

Should we check that the tensor wasn't a float or boolean, etc., and then apply the re-conversion? This could break user code.
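
For illustration, a dtype-preserving variant of that conversion might look like this (a sketch of the reviewer's suggestion, not code from the PR):

import torch
import torch.distributed


def all_gather_int_safe(tensor: torch.Tensor) -> torch.Tensor:
    """All-gather on a backend without int64 support, restoring the original dtype afterwards."""
    original_dtype = tensor.dtype
    if original_dtype == torch.int64:
        tensor = tensor.int()  # downcast only 64-bit integer tensors

    gathered = [torch.zeros_like(tensor) for _ in range(torch.distributed.get_world_size())]
    torch.distributed.all_gather(gathered, tensor)

    if original_dtype == torch.int64:
        gathered = [t.to(original_dtype) for t in gathered]
    return torch.stack(gathered)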

Contributor Author:

These are all indices during the initial broadcast.

@@ -381,6 +389,18 @@ def init_dist_connection(
world_size = world_size if world_size is not None else cluster_environment.world_size()
os.environ["MASTER_ADDR"] = cluster_environment.master_address()
os.environ["MASTER_PORT"] = str(cluster_environment.master_port())

# local rank mapping for device open is needed for hpu devices
Contributor:

Could this be done directly within the HPU Accelerator?

Contributor Author:

I don't see a way to do it.

@@ -381,6 +389,18 @@ def init_dist_connection(
world_size = world_size if world_size is not None else cluster_environment.world_size()
os.environ["MASTER_ADDR"] = cluster_environment.master_address()
os.environ["MASTER_PORT"] = str(cluster_environment.master_port())

# local rank mapping for device open is needed for hpu devices
if torch_distributed_backend == "hcl" or torch_distributed_backend == "hccl":
Contributor:

Side note: mind explaining the difference between hccl and hcl in the code?

Contributor Author:

I'll deprecate hcl.

@Borda (Member) left a comment:

Do we have any testing within this PR for HPU? Shall we first add CI for it and then add this integration?

Comment on lines +44 to +45
hmp_params["bf16_ops"] = "./pytorch-lightning-fork/pl_examples/hpu_examples/simple_mnist/ops_bf16_mnist.txt"
hmp_params["fp32_ops"] = "./pytorch-lightning-fork/pl_examples/hpu_examples/simple_mnist/ops_fp32_mnist.txt"
Member:

Can we keep them inside this repo? In fact, they are here, but the path is not updated...

def test_accelerator_selected(tmpdir):
trainer = Trainer(default_root_dir=tmpdir, hpus=1)
assert isinstance(trainer.accelerator, HPUAccelerator)
trainer = Trainer(default_root_dir=tmpdir, hpus=1, accelerator="hpu")
Member:

Why auto? I would force it to ask for HPU explicitly.

stale bot commented Nov 22, 2021:

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://pytorch-lightning.readthedocs.io/en/latest/generated/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Slack. Thank you for your contributions.

stale bot added the "won't fix" label on Nov 22, 2021
@kaushikb11 added the "accelerator" and "feature" labels and removed the "won't fix" label on Nov 22, 2021
@jerome-habana (Contributor Author) left a comment:

When is 1.6 expected?

@tchaton (Contributor) commented Nov 23, 2021:

> When is 1.6 expected?

Hey @jerome-habana,

1.6 would be expected 1 week after PyTorch 1.11. It usually takes around 1 quarter.

stale bot commented Dec 7, 2021:

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://pytorch-lightning.readthedocs.io/en/latest/generated/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Slack. Thank you for your contributions.

stale bot added the "won't fix" label on Dec 7, 2021
stale bot commented Dec 15, 2021:

This pull request is going to be closed. Please feel free to reopen it or create a new one from the current master.

stale bot closed this on Dec 15, 2021
Labels
accelerator, feature (Is an improvement or enhancement), won't fix (This will not be worked on)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support Habanas' Gaudi Accelerator
6 participants