
[WIP] Add support for HPU accelerator #10404


Closed
jerome-habana wants to merge 2 commits

Conversation

@jerome-habana (Contributor) commented on Nov 8, 2021:

  • Added a new accelerator to support Gaudi (HPU) devices
  • Enabled DDP support for multi-card runs
  • Added an HPU precision plugin for bf16
  • Updated the distributed overrides to support HPU tensor types

Signed-off-by: Jerome [email protected]

What does this PR do?

Fixes #10214

Does your PR introduce any breaking changes? If yes, please list them.

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

jerome-habana and others added 2 commits November 8, 2021 07:30
 - A new accelerator to support Gaudi devices added
 - DDP support for multi-card runs enabled
 - HPU precision plugin for bf16 added
 - Distributed overrides updated for HPU tensor types support

Signed-off-by: Jerome <[email protected]>
@justusschock (Member) left a comment:

Hey, already looking good! I left you some early comments and suggestions, but it seems like the integration is going well :)

@@ -174,6 +176,7 @@ def __init__(
plugins: Optional[Union[PLUGIN_INPUT, List[PLUGIN_INPUT]]] = None,
amp_backend: str = "native",
amp_level: Optional[str] = None,
hmp_params: ["level", "verbose", "bf16_ops", "fp32_ops"] = None,
Member:

How often will people change these? Is it sensible to have these as defaults and ask people to instantiate the plugin class themselves if they want to override? We have this for other backends like FSDP or DeepSpeed already.

Also: please no mutable defaults, this is one of Python's quirks :)
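
For context, a minimal sketch of the mutable-default pitfall the comment refers to (generic Python, not code from this PR):

# Anti-pattern: the default dict is created once and shared across every call.
def register_op(op, ops={}):
    ops[op] = True
    return ops

register_op("mm")    # {'mm': True}
register_op("conv")  # {'mm': True, 'conv': True}  <- state leaked from the first call

# Safer: default to None and create the container inside the function.
def register_op_safe(op, ops=None):
    if ops is None:
        ops = {}
    ops[op] = True
    return ops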

Contributor:

Yes, these parameters shouldn't be added to the Trainer. amp_level could be mapped to hmp_params, but ultimately this should be passed only via the HPUPlugin.
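
As a rough sketch of that direction (the HPUPlugin import path, its constructor keywords, and the "O1" level value are assumptions based on this WIP PR, not a finalized API):

from pytorch_lightning import Trainer
from pytorch_lightning.plugins import HPUPlugin  # assumed import path for the plugin added here

# Hypothetical HMP configuration, mirroring the keys used elsewhere in this PR.
hmp_params = {
    "level": "O1",
    "verbose": False,
    "bf16_ops": "ops_bf16_mnist.txt",
    "fp32_ops": "ops_fp32_mnist.txt",
}

# Configure mixed precision on the plugin itself and hand the plugin to the Trainer,
# instead of introducing a new Trainer-level flag.
trainer = Trainer(plugins=[HPUPlugin(device=1, hmp_params=hmp_params)], max_epochs=1)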

Contributor Author:

Using amp doesn't seem to be the right way. We might need to generalize this.

@@ -130,6 +131,7 @@ def __init__(
devices: Optional[Union[List[int], str, int]] = None,
gpus: Optional[Union[List[int], str, int]] = None,
auto_select_gpus: bool = False,
hpus: Optional[int] = None,
Member:

@jerome-habana Would you be OK with not having a separate flag for this and integrating it with devices and accelerator?

We are currently in the process of reducing and re-evaluating the trainer arguments as they've grown quite large.
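
For illustration, the generic selection the comment points toward would look roughly like this (the "hpu" accelerator string is an assumption for this in-progress integration):

from pytorch_lightning import Trainer

# Select the backend and device count through the existing generic flags
# rather than a dedicated `hpus=...` Trainer argument.
trainer = Trainer(accelerator="hpu", devices=8, max_epochs=1)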

Contributor:

Yes, we are currently deprecating those flags for readability. Let's not add hpus there.

Contributor:

@justusschock Has this been decided? We still have the IPU flag, for example. Is there an issue to track this development?

Contributor:

> Yes, we are currently deprecating those flags for readability.

We haven't decided yet that we will be deprecating those flags.

Contributor Author:

What is the final decision? Shall I proceed with the existing change?

Comment on lines +839 to +843
if self.logger is not None:
# save exp to get started (this is where the first experiment logs are written)
# self.logger.log_hyperparams(self.lightning_module.hparams_initial)
self.logger.log_graph(self.lightning_module)
self.logger.save()
Member:

Is this a hard requirement for you? Because this also changes behaviour for all other backends, which we usually want to avoid.

Comment on lines +392 to +403

# local rank mapping for device open is needed for hpu devices
if torch_distributed_backend == "hcl" or torch_distributed_backend == "hccl":
try:
import habana_frameworks.torch.core.hccl
except Exception:
print("hccl backend is not supported, using hcl backend")
torch_distributed_backend = "hcl"
os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = "hcl"

os.environ["ID"] = str(cluster_environment.local_rank())

Member:

Sorry, but I don't think this can live here; we try to keep everything outside the accelerators device-agnostic whenever possible, so special-casing should be avoided. Even though it feels like overhead, could you maybe subclass this?
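
One way to read that suggestion, sketched under assumptions (the base class and hook names are illustrative; the actual Lightning internals at the time may differ):

import os

from pytorch_lightning.plugins import DDPPlugin  # assumed base class for an HPU strategy


class HPUDDPPlugin(DDPPlugin):
    """Keeps the HPU-specific environment setup inside the HPU plugin
    instead of special-casing the shared init_dist_connection utility."""

    def setup_environment(self) -> None:
        # HPU device open needs the local-rank mapping exposed via the "ID" variable.
        os.environ["ID"] = str(self.cluster_environment.local_rank())
        super().setup_environment()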

Contributor Author:

Makes sense. Will make this special case redundant and try it out.

Comment on lines +112 to +114
from habana_frameworks.torch.utils.library_loader import is_habana_avaialble

_HPU_AVAILABLE = is_habana_avaialble()
Member:

Suggested change
from habana_frameworks.torch.utils.library_loader import is_habana_avaialble
_HPU_AVAILABLE = is_habana_avaialble()
try:
    from habana_frameworks.torch.utils.library_loader import is_habana_avaialble

    _HPU_AVAILABLE = is_habana_avaialble()
except ImportError:
    _HPU_AVAILABLE = False

Contributor:

Here, we could check availability of two things

  1. Habana framework
  2. Habana Devices

You could rather check this via

_HABANA_AVAILABLE = _module_available("habana")
if _HABANA_AVAILABLE:
    from habana_frameworks.torch.utils.library_loader import is_habana_avaialble

    _HPU_AVAILABLE = is_habana_avaialble()
else:
    _HPU_AVAILABLE = False

Contributor Author:

I'm already doing it within the function.

Comment on lines +52 to +54
@property
def is_distributed(self) -> bool:
return False
Member:

Isn't this always the case? If not, it is likely an issue beyond this PR; if yes, there's no need to override :)

Contributor Author:

I was thinking the same plugin could handle both 1x and 8x and be generalized, so I added it.

Comment on lines +57 to +60
import os

dist_backend = os.environ.get("PL_TORCH_DISTRIBUTED_BACKEND")
is_hcl_backend = group_backend == torch.distributed.Backend(str(dist_backend))
@justusschock (Member) commented Nov 8, 2021:

All the code in this file will be removed in #10390 together with the support for PyTorch 1.6

):

device = torch.device("hpu")
checkpoint_io = checkpoint_io
Member:

Suggested change
checkpoint_io = checkpoint_io

checkpoint_io = checkpoint_io
super().__init__(device, checkpoint_io=checkpoint_io)

self.debug = debug
Member:

What is this flag needed for?

Contributor Author:

Not needed. Will remove it.

Comment on lines 53 to +54
tpu_local_core_rank: int
hpu_local_core_rank: int
Member:

Can we just combine those into a single local core rank? I don't think we need two attributes that do basically the same thing and are likely only used mutually exclusively.

debug: bool = False,
):

device = torch.device("hpu")
Contributor:

We can pass them directly to super().__init__().

return False

def setup(self) -> None:
shared_params = find_shared_parameters(self.model)
Contributor:

I don't believe this logic is required for HPU. This is quite specific to TPUs, which don't support weight tying.
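
For readers unfamiliar with the term, "tying" here means sharing one parameter tensor between modules, e.g. (generic PyTorch, not code from this PR):

import torch.nn as nn


class TiedLM(nn.Module):
    def __init__(self, vocab_size: int = 1000, hidden: int = 128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.Linear(hidden, vocab_size, bias=False)
        # Weight tying: the decoder reuses the embedding's parameter tensor.
        self.decoder.weight = self.embedding.weight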

Contributor Author:

This is common for all sharding use cases, if I'm not wrong.

self.device = torch.device(self.device)

def on_save(self, checkpoint: dict) -> dict:
return move_data_to_device(checkpoint, torch.device("cpu"))
Contributor:

I believe this should use the checkpoint_io plugin.
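
A rough sketch of what delegating to the CheckpointIO plugin could look like (SingleDevicePlugin as base and the exact hook are assumptions; this is not the PR's implementation):

import torch

from pytorch_lightning.plugins import SingleDevicePlugin  # assumed base for the single-HPU plugin
from pytorch_lightning.utilities.apply_func import move_data_to_device


class HPUPlugin(SingleDevicePlugin):
    def save_checkpoint(self, checkpoint: dict, filepath: str) -> None:
        # Move HPU tensors to CPU, then let the configured CheckpointIO handle the write
        # rather than doing the transfer in on_save.
        checkpoint = move_data_to_device(checkpoint, torch.device("cpu"))
        self.checkpoint_io.save_checkpoint(checkpoint, filepath)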

hmp_params["fp32_ops"] = "./pytorch-lightning-fork/pl_examples/hpu_examples/simple_mnist/ops_fp32_mnist.txt"

# Initialize a trainer
trainer = pl.Trainer(hpus=1, max_epochs=1, precision=16, hmp_params=hmp_params)
Contributor:

We won't support adding a new flag within the Trainer such as hmp_params. This would have to rely directly on existing flags or be passed directly by creating the plugin.

Contributor Author:

@tchaton Do you have a recommendation? It would be nice to generalize this as a common param based on the backend, but that would involve modifying amp.

device: int,
checkpoint_io: Optional[CheckpointIO] = None,
debug: bool = False,
):
Contributor:

Docstring?

elif self.has_hpu:
self._device_type = DeviceType.HPU

def update_device_type_if_hpu_plugin(self) -> None:
Contributor:

Is this function used?

Contributor Author:

Will check and remove it if not required, but there was an issue with the DeviceType.

@@ -208,6 +213,9 @@ def forward(

gathered_tensor = [torch.zeros_like(tensor) for _ in range(torch.distributed.get_world_size())]

if _HPU_AVAILABLE:
# HPU distributed backend doesn't support int64 tensors
tensor = tensor.int()
Contributor:

Should we check that the tensor wasn't a float or boolean, etc., and then apply the re-conversion? This could break user code.
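
For illustration, a dtype-preserving variant of that conversion might look like this (a sketch of the reviewer's suggestion, not code from the PR):

import torch
import torch.distributed


def all_gather_int_safe(tensor: torch.Tensor) -> torch.Tensor:
    """All-gather on a backend without int64 support, restoring the original dtype afterwards."""
    original_dtype = tensor.dtype
    if original_dtype == torch.int64:
        tensor = tensor.int()  # downcast only 64-bit integer tensors

    gathered = [torch.zeros_like(tensor) for _ in range(torch.distributed.get_world_size())]
    torch.distributed.all_gather(gathered, tensor)

    if original_dtype == torch.int64:
        gathered = [t.to(original_dtype) for t in gathered]
    return torch.stack(gathered)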

Contributor Author:

These are all indices during the initial broadcast.

@@ -381,6 +389,18 @@ def init_dist_connection(
world_size = world_size if world_size is not None else cluster_environment.world_size()
os.environ["MASTER_ADDR"] = cluster_environment.master_address()
os.environ["MASTER_PORT"] = str(cluster_environment.master_port())

# local rank mapping for device open is needed for hpu devices
Contributor:

Could this be done directly within the HPU Accelerator?

Contributor Author:

I don't see a way to do it.

@@ -381,6 +389,18 @@ def init_dist_connection(
world_size = world_size if world_size is not None else cluster_environment.world_size()
os.environ["MASTER_ADDR"] = cluster_environment.master_address()
os.environ["MASTER_PORT"] = str(cluster_environment.master_port())

# local rank mapping for device open is needed for hpu devices
if torch_distributed_backend == "hcl" or torch_distributed_backend == "hccl":
Contributor:

Side note: mind explaining the difference between hccl and hcl in the code?

Contributor Author:

I'll deprecate hcl.

@Borda (Member) left a comment:

Do we have any testing within this PR for HPU? Shall we first add CI for it and then add this integration?

Comment on lines +44 to +45
hmp_params["bf16_ops"] = "./pytorch-lightning-fork/pl_examples/hpu_examples/simple_mnist/ops_bf16_mnist.txt"
hmp_params["fp32_ops"] = "./pytorch-lightning-fork/pl_examples/hpu_examples/simple_mnist/ops_fp32_mnist.txt"
Member:

Can we keep them inside this repo? In fact, they are here, but the path is not updated...

def test_accelerator_selected(tmpdir):
trainer = Trainer(default_root_dir=tmpdir, hpus=1)
assert isinstance(trainer.accelerator, HPUAccelerator)
trainer = Trainer(default_root_dir=tmpdir, hpus=1, accelerator="hpu")
Member:

Why auto? I would force it to ask for HPU explicitly.

stale bot commented Nov 22, 2021:

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://pytorch-lightning.readthedocs.io/en/latest/generated/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Slack. Thank you for your contributions.

stale bot added the "won't fix" label on Nov 22, 2021
@kaushikb11 added the "accelerator" and "feature" labels and removed the "won't fix" label on Nov 22, 2021
@jerome-habana (Contributor Author) left a comment:

When is 1.6 expected?

@tchaton (Contributor) commented Nov 23, 2021:

> When is 1.6 expected?

Hey @jerome-habana,

1.6 would be expected 1 week after PyTorch 1.11. It usually takes around 1 quarter.

stale bot commented Dec 7, 2021:

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://pytorch-lightning.readthedocs.io/en/latest/generated/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Slack. Thank you for your contributions.

stale bot added the "won't fix" label on Dec 7, 2021
stale bot commented Dec 15, 2021:

This pull request is going to be closed. Please feel free to reopen it or create a new one from the current master.

stale bot closed this on Dec 15, 2021
Labels
accelerator, feature (Is an improvement or enhancement), won't fix (This will not be worked on)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support Habanas' Gaudi Accelerator
6 participants