[WIP] Add support for HPU accelerator #10404
Conversation
- A new accelerator to support Gaudi devices added
- DDP support for multi-card runs enabled
- HPU precision plugin for bf16 added
- Distributed overrides updated for HPU tensor types support

Signed-off-by: Jerome <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
Hey, already looking good! I left you some early comments and suggestions, but it seems like the integration is going well :)
@@ -174,6 +176,7 @@ def __init__(
    plugins: Optional[Union[PLUGIN_INPUT, List[PLUGIN_INPUT]]] = None,
    amp_backend: str = "native",
    amp_level: Optional[str] = None,
    hmp_params: ["level", "verbose", "bf16_ops", "fp32_ops"] = None,
How often will people change these? Is it sensible to have these as defaults and ask people to instantiate the plugin class themselves if they want to override? We already have this for other backends like FSDP or DeepSpeed.
Also: please no mutable defaults, this is one of Python's quirks :)
Yes, these parameters shouldn't be added to the Trainer. amp_level could be used to pass the hmp_params, but ultimately this should be passed only via the HPUPlugin.
Using amp doesn't seem to be the right way. We might need to generalize this.
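For illustration, the plugin-first pattern mentioned above might look roughly like the sketch below; HPUPrecisionPlugin and its keyword names are only assumptions taken from this PR's hmp_params list, not a finalized API.

# Rough sketch only: the plugin class and its keyword arguments are assumed from
# this PR's hmp_params list ("level", "verbose", "bf16_ops", "fp32_ops") and are
# not a finalized API. Defaults live in the plugin; users override by
# instantiating it themselves, as with FSDP or DeepSpeed.
import pytorch_lightning as pl
from pytorch_lightning.plugins import HPUPrecisionPlugin  # assumed import path

hmp_plugin = HPUPrecisionPlugin(
    level="O1",                     # hypothetical default
    verbose=False,
    bf16_ops="ops_bf16_mnist.txt",  # op lists shipped with the example
    fp32_ops="ops_fp32_mnist.txt",
)
trainer = pl.Trainer(plugins=[hmp_plugin], precision=16, max_epochs=1)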
@@ -130,6 +131,7 @@ def __init__(
    devices: Optional[Union[List[int], str, int]] = None,
    gpus: Optional[Union[List[int], str, int]] = None,
    auto_select_gpus: bool = False,
    hpus: Optional[int] = None,
@jerome-habana Would you be ok with not having a separate flag for this and integrating it with devices and accelerator? We are currently in the process of reducing and re-evaluating the Trainer arguments, as they've grown quite large.
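For illustration, the suggested spelling would look roughly like the following; the "hpu" accelerator string is assumed from this PR's direction rather than an existing release.

# Hedged sketch: select Gaudi cards through the generic accelerator/devices
# arguments instead of a dedicated `hpus` Trainer flag. The "hpu" accelerator
# string is an assumption based on this PR, not a released API.
import pytorch_lightning as pl

trainer = pl.Trainer(accelerator="hpu", devices=8, max_epochs=1)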
Yes, we are currently deprecating those flags for readability. Let's not add hpus there.
@justusschock has this been decided? We still have the IPU flag, for example. Is there an issue to track this development?
> Yes, we are currently deprecating those flags for readability.

We haven't decided yet that we will be deprecating those flags.
What is the final decision? Shall I proceed with the existing change?
if self.logger is not None:
    # save exp to get started (this is where the first experiment logs are written)
    # self.logger.log_hyperparams(self.lightning_module.hparams_initial)
    self.logger.log_graph(self.lightning_module)
    self.logger.save()
Is this a hard requirement for you? Because this also changes behaviour for all other backends, which we usually want to avoid.
# local rank mapping for device open is needed for hpu devices
if torch_distributed_backend == "hcl" or torch_distributed_backend == "hccl":
    try:
        import habana_frameworks.torch.core.hccl
    except Exception:
        print("hccl backend is not supported, using hcl backend")
        torch_distributed_backend = "hcl"
        os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = "hcl"

    os.environ["ID"] = str(cluster_environment.local_rank())
Sorry, but I don't think this can live here; we are trying to keep everything outside the accelerators device-agnostic whenever possible, so special-casing should be avoided. Even though it feels like overhead, could you maybe subclass this?
Makes sense. Will make it redundant here and try it out.
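For illustration, a subclassing approach might look roughly like the sketch below; the base class and the setup_environment hook are assumptions and may not match the final plugin hierarchy.

# Minimal sketch, assuming an HPU training-type plugin can subclass DDPPlugin and
# override an environment-setup hook; the class name and hook placement are
# assumptions, not verified against this PR.
import os

from pytorch_lightning.plugins import DDPPlugin


class HPUParallelPlugin(DDPPlugin):  # hypothetical name
    def setup_environment(self) -> None:
        # Export the local rank needed by Habana's device-open logic here,
        # instead of special-casing it in the shared init_dist_connection utility.
        os.environ["ID"] = str(self.cluster_environment.local_rank())
        super().setup_environment()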
from habana_frameworks.torch.utils.library_loader import is_habana_avaialble

_HPU_AVAILABLE = is_habana_avaialble()
-from habana_frameworks.torch.utils.library_loader import is_habana_avaialble
-_HPU_AVAILABLE = is_habana_avaialble()
+try:
+    from habana_frameworks.torch.utils.library_loader import is_habana_avaialble
+    _HPU_AVAILABLE = is_habana_avaialble()
+except ImportError:
+    _HPU_AVAILABLE = False
Here, we could check the availability of two things:
- the Habana framework
- Habana devices

You could rather check this via:

_HABANA_AVAILABLE = _module_available("habana")

if _HABANA_AVAILABLE:
    from habana_frameworks.torch.utils.library_loader import is_habana_avaialble

    _HPU_AVAILABLE = is_habana_avaialble()
else:
    _HPU_AVAILABLE = False
I'm already doing it within the function.
@property
def is_distributed(self) -> bool:
    return False
Isn't this always the case? If not, it is likely an issue beyond this PR; if yes, there is no need to override :)
I was thinking the same plugin could handle both 1x and 8x and be generalized, so I added it.
import os

dist_backend = os.environ.get("PL_TORCH_DISTRIBUTED_BACKEND")
is_hcl_backend = group_backend == torch.distributed.Backend(str(dist_backend))
All the code in this file will be removed in #10390 together with the support for PyTorch 1.6
):
    device = torch.device("hpu")
    checkpoint_io = checkpoint_io
-    checkpoint_io = checkpoint_io
    checkpoint_io = checkpoint_io
    super().__init__(device, checkpoint_io=checkpoint_io)

    self.debug = debug
What is this flag needed for?
Not needed. Will remove it
tpu_local_core_rank: int
hpu_local_core_rank: int
Can we just combine these into a single local core rank? I don't think we need two attributes here that do basically the same thing but are likely only used mutually exclusively.
    debug: bool = False,
):
    device = torch.device("hpu")
We can pass these directly to super().__init__().
    return False

def setup(self) -> None:
    shared_params = find_shared_parameters(self.model)
I don't believe this logic is required for HPU. This is quite specific to TPUs, which don't support parameter tying.
This is common to all sharding use cases, if I'm not wrong.
    self.device = torch.device(self.device)

def on_save(self, checkpoint: dict) -> dict:
    return move_data_to_device(checkpoint, torch.device("cpu"))
I believe this should use the checkpoint_io plugin.
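For illustration, one way that could look is the sketch below; the class name is illustrative and the override point is an assumption.

# Minimal sketch: let a CheckpointIO plugin own the CPU transfer rather than an
# on_save override. The class name is illustrative only.
import torch
from pytorch_lightning.plugins.io import TorchCheckpointIO
from pytorch_lightning.utilities.apply_func import move_data_to_device


class HPUCheckpointIO(TorchCheckpointIO):  # hypothetical helper
    def save_checkpoint(self, checkpoint, path, *args, **kwargs):
        # Move HPU tensors to CPU before the state dict is serialized.
        checkpoint = move_data_to_device(checkpoint, torch.device("cpu"))
        super().save_checkpoint(checkpoint, path, *args, **kwargs)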
hmp_params["fp32_ops"] = "./pytorch-lightning-fork/pl_examples/hpu_examples/simple_mnist/ops_fp32_mnist.txt" | ||
|
||
# Initialize a trainer | ||
trainer = pl.Trainer(hpus=1, max_epochs=1, precision=16, hmp_params=hmp_params) |
We won't support adding a new flag such as hmp_params within the Trainer. This would have to rely directly on existing flags or be passed directly by creating the plugin.
@tchaton Do you have a recommendation? It would be nice to generalize this as a common param based on the backend, but that would involve modifying amp.
    device: int,
    checkpoint_io: Optional[CheckpointIO] = None,
    debug: bool = False,
):
Docstring?
    elif self.has_hpu:
        self._device_type = DeviceType.HPU

def update_device_type_if_hpu_plugin(self) -> None:
Is this function used?
Will check and remove it if not required. But there was an issue with DeviceType.
@@ -208,6 +213,9 @@ def forward(
    gathered_tensor = [torch.zeros_like(tensor) for _ in range(torch.distributed.get_world_size())]

    if _HPU_AVAILABLE:
        # HPU distributed backend doesn't support int64 tensors
        tensor = tensor.int()
Should we check that the tensor wasn't a float or boolean, etc., and then apply the re-conversion afterwards? This could break user code.
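For illustration, such a guard could look like the sketch below; the helper name is hypothetical.

# Sketch of a dtype-preserving guard: only down-cast int64 tensors and restore
# the caller's dtype after the gather, so float/bool inputs pass through
# untouched. Helper name is hypothetical.
import torch
import torch.distributed as dist


def _all_gather_hpu_safe(tensor: torch.Tensor) -> torch.Tensor:
    original_dtype = tensor.dtype
    if original_dtype == torch.long:
        tensor = tensor.int()  # the HPU collective backend lacks int64 support
    gathered = [torch.zeros_like(tensor) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, tensor)
    stacked = torch.stack(gathered)
    # Re-apply the original dtype so downstream user code sees what it sent.
    return stacked.to(original_dtype) if stacked.dtype != original_dtype else stacked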
These are all indices during the initial broadcast.
@@ -381,6 +389,18 @@ def init_dist_connection(
    world_size = world_size if world_size is not None else cluster_environment.world_size()
    os.environ["MASTER_ADDR"] = cluster_environment.master_address()
    os.environ["MASTER_PORT"] = str(cluster_environment.master_port())

    # local rank mapping for device open is needed for hpu devices
Could this be done directly within the HPU Accelerator?
I don't see a way to do it.
@@ -381,6 +389,18 @@ def init_dist_connection(
    world_size = world_size if world_size is not None else cluster_environment.world_size()
    os.environ["MASTER_ADDR"] = cluster_environment.master_address()
    os.environ["MASTER_PORT"] = str(cluster_environment.master_port())

    # local rank mapping for device open is needed for hpu devices
    if torch_distributed_backend == "hcl" or torch_distributed_backend == "hccl":
Side note: mind explaining the difference between hccl and hcl in the code?
I'll deprecate hcl.
Do we have any testing within this PR for HPU? Shall we first add CI for it and then add this integration?
hmp_params["bf16_ops"] = "./pytorch-lightning-fork/pl_examples/hpu_examples/simple_mnist/ops_bf16_mnist.txt" | ||
hmp_params["fp32_ops"] = "./pytorch-lightning-fork/pl_examples/hpu_examples/simple_mnist/ops_fp32_mnist.txt" |
Can we keep them inside this repo? In fact, they are here, but the path is not updated...
def test_accelerator_selected(tmpdir):
    trainer = Trainer(default_root_dir=tmpdir, hpus=1)
    assert isinstance(trainer.accelerator, HPUAccelerator)
    trainer = Trainer(default_root_dir=tmpdir, hpus=1, accelerator="hpu")
Why auto? I would force it to ask for HPU.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://pytorch-lightning.readthedocs.io/en/latest/generated/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Slack. Thank you for your contributions.
When is 1.6 expected?
Hey @jerome-habana, 1.6 would be expected 1 week after PyTorch 1.11. It usually takes around 1 quarter.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://pytorch-lightning.readthedocs.io/en/latest/generated/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Slack. Thank you for your contributions.
This pull request is going to be closed. Please feel free to reopen it or create a new one from the actual master.
Signed-off-by: Jerome [email protected]
What does this PR do?
Fixes #10214
Does your PR introduce any breaking changes? If yes, please list them.
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet-list:
Did you have fun?
Make sure you had fun coding 🙃