Commit 812c2dc

jerome-habana, carmocca, awaelchli, akihironitta, and tchaton authored
Add support for Habana accelerator (HPU) (#11808)
Co-authored-by: Carlos Mocholí <[email protected]>
Co-authored-by: Adrian Wälchli <[email protected]>
Co-authored-by: Carlos Mocholi <[email protected]>
Co-authored-by: Aki Nitta <[email protected]>
Co-authored-by: thomas chaton <[email protected]>
Co-authored-by: Justus Schock <[email protected]>
Co-authored-by: four4fish <[email protected]>
Co-authored-by: Rohit Gupta <[email protected]>
Co-authored-by: ananthsub <[email protected]>
Co-authored-by: Jirka Borovec <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kaushik B <[email protected]>
Co-authored-by: Kaushik B <[email protected]>
Co-authored-by: jjenniferdai <[email protected]>
Co-authored-by: Kushashwa Ravi Shrimali <[email protected]>
Co-authored-by: Akarsha Rao <[email protected]>
Co-authored-by: Jirka <[email protected]>
Co-authored-by: Carlos Mocholí <[email protected]>
1 parent 089fcb9 commit 812c2dc

40 files changed: +1433 -16 lines

.azure-pipelines/hpu-tests.yml

Lines changed: 20 additions & 0 deletions

@@ -31,3 +31,23 @@ jobs:
       apt-get install -y hwinfo
       hwinfo --short
     displayName: 'Instance HW info'
+
+  - bash: |
+      pip install . --requirement requirements/test.txt
+    displayName: 'Install dependencies'
+
+  - bash: |
+      python ".azure-pipelines/run_hpu_tests.py"
+    displayName: 'HPU Tests in parallel'
+
+  - bash: |
+      export PYTHONPATH="${PYTHONPATH}:$(pwd)"
+      python "pl_examples/hpu_examples/simple_mnist/mnist.py"
+    displayName: 'Testing: HPU examples'
+
+  - task: PublishTestResults@2
+    inputs:
+      testResultsFiles: 'hpu*_test-results.xml'
+      testRunTitle: '$(Agent.OS) - $(Build.DefinitionName) - Python $(python.version)'
+    condition: succeededOrFailed()
+    displayName: 'Publish test results'
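
The two test steps above can be reproduced outside Azure Pipelines. The snippet below is a minimal sketch that mirrors them with plain subprocess calls; it assumes the repository root is the current working directory and a Gaudi host with the SynapseAI PyTorch bridge is available, and it is not part of the pipeline itself.

    # Hedged local reproduction of the "HPU Tests in parallel" and "Testing: HPU examples" steps.
    import os
    import subprocess
    import sys

    repo_root = os.getcwd()  # assumption: this is run from the repository root
    env = dict(os.environ, PYTHONPATH=f"{os.environ.get('PYTHONPATH', '')}:{repo_root}")

    # Run the parallel HPU test driver added in this commit.
    subprocess.run([sys.executable, ".azure-pipelines/run_hpu_tests.py"], check=True)

    # Run the MNIST example on a Gaudi device, mirroring the PYTHONPATH export above.
    subprocess.run(
        [sys.executable, "pl_examples/hpu_examples/simple_mnist/mnist.py"],
        check=True,
        env=env,
    )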

.azure-pipelines/run_hpu_tests.py

Lines changed: 148 additions & 0 deletions

@@ -0,0 +1,148 @@
"""This file is called from the hpu-tests.yml pipeline.

The following script runs the HPU tests in parallel.
Tests run are:
1. test_inference_only is run on four cards
2. test_all_stages on two cards
3. complete hpu tests using one card
4. complete hpu tests using eight cards.
"""
import itertools
import subprocess
import sys

HPU_TESTS_DICTIONARY = {
    "hpu1_test": "python -m coverage run --source pytorch_lightning -m pytest -sv tests/accelerators/test_hpu.py \
        --forked \
        --junitxml=hpu1_test-results.xml",
    "hpu2_test": "python -m coverage run --source pytorch_lightning -m pytest -sv tests/accelerators/test_hpu.py \
        -k test_all_stages \
        --hpus 2 \
        --verbose \
        --capture=no \
        --forked \
        --junitxml=hpu2_test-results.xml",
    "hpu4_test": "python -m coverage run --source pytorch_lightning -m pytest -sv tests/accelerators/test_hpu.py \
        -k test_inference_only \
        --hpus 4 \
        --capture=no \
        --verbose \
        --forked \
        --junitxml=hpu4_test-results.xml",
    "hpu8_test": "python -m coverage run --source pytorch_lightning -m pytest -sv tests/accelerators/test_hpu.py \
        --forked \
        --hpus 8 \
        --junitxml=hpu8_test-results.xml",
    "hpu1_precision_test": "python -m coverage run --source pytorch_lightning -m pytest -sv tests/plugins/precision/hpu/test_hpu.py \
        --hmp-bf16 'tests/plugins/precision/hpu/ops_bf16.txt' \
        --hmp-fp32 'tests/plugins/precision/hpu/ops_fp32.txt' \
        --forked \
        --junitxml=hpu1_precision_test-results.xml",
}

HPU1_TEST = HPU_TESTS_DICTIONARY["hpu1_test"]
HPU2_TEST = HPU_TESTS_DICTIONARY["hpu2_test"]
HPU4_TEST = HPU_TESTS_DICTIONARY["hpu4_test"]
HPU8_TEST = HPU_TESTS_DICTIONARY["hpu8_test"]
HPU1_PRECISION_TEST = HPU_TESTS_DICTIONARY["hpu1_precision_test"]

PARALLEL_HPU_TESTS_EXECUTION = [[HPU4_TEST, HPU1_TEST], [HPU2_TEST, HPU1_TEST], [HPU8_TEST], [HPU1_PRECISION_TEST]]
TIMEOUT = 60  # seconds
TIMEOUT_EXIT_CODE = -9


def run_hpu_tests_parallel(timeout=TIMEOUT):
    """This function is called to run the HPU tests in parallel.

    We run the tests in subprocesses to utilize all eight cards available in the DL1 instance.
    Considering the maximum time taken to run the HPU tests as 60 seconds, we kill a process if that time is exceeded.

    Args:
        timeout: The threshold time to run the HPU tests in parallel.
            If the timeout expires, the exception is logged and TIMEOUT_EXIT_CODE (-9) is recorded
            instead of the process exit code.

    Return:
        The list of exit statuses of the HPU tests that were run in the subprocesses.
        Here, exit_status 0 means the test run succeeded and exit_status 1 means the test run failed.
    """
    exit_status = []
    with open("stdout_log.txt", "w") as stdout_log, open("error_log.txt", "w") as error_log:
        for hpu_tests in PARALLEL_HPU_TESTS_EXECUTION:
            process_list = [
                subprocess.Popen(
                    each_hpu_test, shell=True, stdout=stdout_log, stderr=error_log, universal_newlines=True
                )
                for each_hpu_test in hpu_tests
            ]
            for process in process_list:
                try:
                    exit_status.append(process.wait(timeout=timeout))
                except subprocess.TimeoutExpired as e:
                    print(e)
                    print("Killing the process....")
                    process.kill()
                    exit_status.append(TIMEOUT_EXIT_CODE)
    return exit_status


def zip_cmd_exitcode(exit_status):
    """This function is called to zip the tests that were executed with the exit status of each test.

    Args:
        exit_status: The returned exit_status after executing run_hpu_tests_parallel().

    Return:
        A list of the HPU test commands that were run and their exit statuses.
    """
    status_list = list(zip(list(itertools.chain(*PARALLEL_HPU_TESTS_EXECUTION)), exit_status))
    return status_list


def print_logs(filename):
    """This function is called to read the file and print the logs.

    Args:
        filename: The log filename that needs to be printed on the console.
    """
    with open(filename) as f:
        print(f.read())


def print_subprocess_logs_and_return_status(exit_status):
    """This function is called to print the logs of subprocess stdout and stderr and return the status of the test
    execution.

    Args:
        exit_status: The returned exit_status after executing run_hpu_tests_parallel().

    Return:
        Based on the exit statuses of the HPU tests, we return success (0) or failure (1) to the main method.
    """
    if all(v == 0 for v in exit_status):
        print("All HPU tests passed")
        file_name = "stdout_log.txt"
        print_logs(file_name)
        return 0
    else:
        print("HPU tests are failing")
        print("Printing stdout_log.txt...")
        file_name = "stdout_log.txt"
        print_logs(file_name)
        print("Printing error_log.txt...")
        file_name = "error_log.txt"
        print_logs(file_name)
        return 1


def main():
    exit_status = run_hpu_tests_parallel(timeout=TIMEOUT)
    status_list = zip_cmd_exitcode(exit_status)
    print("HPU Tests executed and their exit status:", status_list)
    return print_subprocess_logs_and_return_status(exit_status)


if __name__ == "__main__":
    sys.exit(main())
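
To make the driver's reporting concrete, the sketch below shows the shape of the data it prints: zip_cmd_exitcode() pairs each command from PARALLEL_HPU_TESTS_EXECUTION with its exit status, where 0 means success, a non-zero pytest code means failure, and -9 marks a process killed after the timeout. The commands and codes here are illustrative values only, not real output.

    # Illustrative only: the kind of status list the driver prints for one hypothetical run.
    example_status_list = [
        ("python -m coverage run ... test_hpu.py -k test_inference_only --hpus 4 ...", 0),   # passed
        ("python -m coverage run ... test_hpu.py ... hpu1_test ...", 0),                     # passed
        ("python -m coverage run ... test_hpu.py -k test_all_stages --hpus 2 ...", 1),       # a test failed
        ("python -m coverage run ... test_hpu.py --hpus 8 ...", -9),                         # killed after the timeout
    ]
    all_passed = all(code == 0 for _, code in example_status_list)
    print("CI would report success" if all_passed else "CI would report failure")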

CHANGELOG.md

Lines changed: 3 additions & 0 deletions

@@ -167,6 +167,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Added `AcceleratorRegistry` ([#12180](https://github.com/PyTorchLightning/pytorch-lightning/pull/12180))
 
 
+- Added support for Habana Accelerator (HPU) ([#11808](https://github.com/PyTorchLightning/pytorch-lightning/pull/11808))
+
+
 ### Changed
 
 - Drop PyTorch 1.7 support ([#12191](https://github.com/PyTorchLightning/pytorch-lightning/pull/12191))

docs/source/accelerators/hpu.rst

Lines changed: 124 additions & 0 deletions

@@ -0,0 +1,124 @@
.. _hpu:

Habana Gaudi AI Processor (HPU)
===============================

Lightning supports the `Habana Gaudi AI Processor (HPU) <https://habana.ai/>`__ for accelerating deep learning training workloads.

HPU Terminology
---------------

Habana® Gaudi® AI training processors are built on a heterogeneous architecture with a cluster of fully programmable Tensor Processing Cores (TPC), a configurable Matrix Math engine, and the associated development tools and libraries.

The TPC core is a VLIW SIMD processor with an instruction set and hardware tailored to serve training workloads efficiently.
The Gaudi memory architecture includes on-die SRAM and local memories in each TPC, and
Gaudi is the first DL training processor with integrated RDMA over Converged Ethernet (RoCE v2) engines on-chip.

On the software side, the PyTorch Habana bridge interfaces between the framework and the SynapseAI software stack to enable the execution of deep learning models on the Habana Gaudi device.

Gaudi offers a substantial price/performance advantage -- so you get to do more deep learning training while spending less.

For more information, check out `Gaudi Architecture <https://docs.habana.ai/en/latest/Gaudi_Overview/Gaudi_Overview.html#gaudi-architecture>`__ and `Gaudi Developer Docs <https://developer.habana.ai>`__.

How to access HPUs
------------------

To use HPUs, you must have access to a system with HPU devices.
You can use either `Gaudi-based AWS EC2 DL1 instances <https://aws.amazon.com/ec2/instance-types/dl1/>`__ or a `Supermicro X12 Gaudi server <https://www.supermicro.com/en/solutions/habana-gaudi>`__ to get access to HPUs.

Check out the `Getting Started Guide with AWS and Habana <https://docs.habana.ai/en/latest/AWS_EC2_Getting_Started/AWS_EC2_Getting_Started.html>`__.

Training with HPUs
------------------

To enable PyTorch Lightning to utilize the HPU accelerator, simply pass ``accelerator="hpu"`` to the Trainer class.

.. code-block:: python

    trainer = Trainer(accelerator="hpu")

Passing ``devices=1`` and ``accelerator="hpu"`` to the Trainer class enables the Habana accelerator for single Gaudi training.

.. code-block:: python

    trainer = Trainer(devices=1, accelerator="hpu")

Passing ``devices=8`` and ``accelerator="hpu"`` to the Trainer class enables the Habana accelerator for distributed training with 8 Gaudis.
It uses :class:`~pytorch_lightning.strategies.hpu_parallel.HPUParallelStrategy` internally, which is based on the DDP strategy with the addition of Habana's collective communication library (HCCL) to support scale-up within a node and scale-out across multiple nodes.

.. code-block:: python

    trainer = Trainer(devices=8, accelerator="hpu")

.. note::
    If the ``devices`` flag is not defined, it defaults to ``"auto"``, which selects 8 Gaudi devices for :class:`~pytorch_lightning.accelerators.hpu.HPUAccelerator`.


Mixed Precision Plugin
----------------------

Lightning also allows mixed precision training with HPUs.
By default, HPU training uses 32-bit precision. To enable mixed precision, set the ``precision`` flag.

.. code-block:: python

    trainer = Trainer(devices=1, accelerator="hpu", precision=16)


Enabling Mixed Precision Options
--------------------------------

Internally, :class:`~pytorch_lightning.plugins.precision.hpu.HPUPrecisionPlugin` uses the Habana Mixed Precision (HMP) package to enable mixed precision training.

You can execute the ops in FP32 or BF16 precision. The HMP package modifies the Python operators to add the appropriate cast operations for the arguments before execution.
The default settings let users enable mixed precision training with minimal code changes.

In addition to the default settings in HMP, users also have the option of overriding these defaults and providing their own
BF16 and FP32 operator lists by passing them as parameters to :class:`~pytorch_lightning.plugins.precision.hpu.HPUPrecisionPlugin`.

The snippet below shows an example model using MNIST with a single Habana Gaudi device, making use of HMP by overriding the default parameters.
This enables advanced users to provide their own BF16 and FP32 operator lists instead of using the HMP defaults.

.. code-block:: python

    import pytorch_lightning as pl
    from pytorch_lightning.plugins import HPUPrecisionPlugin

    # Initialize a trainer with the HPU accelerator for a single device,
    # with mixed precision using overridden HMP settings
    trainer = pl.Trainer(
        accelerator="hpu",
        devices=1,
        # Optional Habana mixed precision params to be set
        # Check out `pl_examples/hpu_examples/simple_mnist/ops_bf16_mnist.txt` for the format
        plugins=[
            HPUPrecisionPlugin(
                precision=16,
                opt_level="O1",
                verbose=False,
                bf16_file_path="ops_bf16_mnist.txt",
                fp32_file_path="ops_fp32_mnist.txt",
            )
        ],
    )

    # Init our model
    model = LitClassifier()
    # Init the data
    dm = MNISTDataModule(batch_size=batch_size)

    # Train the model ⚡
    trainer.fit(model, datamodule=dm)

For more details, please refer to `PyTorch Mixed Precision Training on Gaudi <https://docs.habana.ai/en/latest/PyTorch_User_Guide/PyTorch_User_Guide.html#pytorch-mixed-precision-training-on-gaudi>`__.

----------------

.. _known-limitations_hpu:

Known limitations
-----------------

* Multiple optimizers are not supported.
* `Habana dataloader <https://docs.habana.ai/en/latest/PyTorch_User_Guide/PyTorch_User_Guide.html#habana-data-loader>`__ is not supported.
* :class:`~pytorch_lightning.callbacks.device_stats_monitor.DeviceStatsMonitor` is not supported.
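
One pattern the new document does not spell out is guarding the accelerator choice so the same script also runs on machines without Gaudi hardware. A minimal sketch, assuming the HPUAccelerator added in this PR exposes the standard accelerator is_available() hook:

    import pytorch_lightning as pl
    from pytorch_lightning.accelerators import HPUAccelerator

    # Fall back to CPU when no Gaudi device is present (assumes is_available() is exposed).
    accelerator = "hpu" if HPUAccelerator.is_available() else "cpu"
    trainer = pl.Trainer(accelerator=accelerator, devices=1)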

docs/source/api_references.rst

Lines changed: 5 additions & 0 deletions

@@ -16,6 +16,7 @@ Accelerator API
     Accelerator
     CPUAccelerator
     GPUAccelerator
+    HPUAccelerator
     IPUAccelerator
     TPUAccelerator
 
@@ -59,9 +60,11 @@ Strategy API
     DataParallelStrategy
     DeepSpeedStrategy
     HorovodStrategy
+    HPUParallelStrategy
     IPUStrategy
     ParallelStrategy
     SingleDeviceStrategy
+    SingleHPUStrategy
     SingleTPUStrategy
     Strategy
     TPUSpawnStrategy
@@ -198,6 +201,7 @@ Precision Plugins
     DeepSpeedPrecisionPlugin
     DoublePrecisionPlugin
     FullyShardedNativeMixedPrecisionPlugin
+    HPUPrecisionPlugin
     IPUPrecisionPlugin
     MixedPrecisionPlugin
     NativeMixedPrecisionPlugin
@@ -234,6 +238,7 @@ Checkpoint IO Plugins
     :template: classtemplate.rst
 
     CheckpointIO
+    HPUCheckpointIO
     TorchCheckpointIO
     XLACheckpointIO

docs/source/extensions/accelerator.rst

Lines changed: 3 additions & 1 deletion

@@ -15,6 +15,7 @@ Currently there are accelerators for:
 - GPU
 - TPU
 - IPU
+- HPU
 
 Each Accelerator gets two plugins upon initialization:
 One to handle differences from the training routine and one to handle different precisions.
@@ -58,5 +59,6 @@ Accelerator API
     Accelerator
     CPUAccelerator
     GPUAccelerator
-    TPUAccelerator
+    HPUAccelerator
     IPUAccelerator
+    TPUAccelerator

docs/source/extensions/plugins.rst

Lines changed: 7 additions & 6 deletions

@@ -61,17 +61,18 @@ Precision Plugins
     :nosignatures:
     :template: classtemplate.rst
 
-    PrecisionPlugin
-    MixedPrecisionPlugin
-    NativeMixedPrecisionPlugin
-    ShardedNativeMixedPrecisionPlugin
     ApexMixedPrecisionPlugin
     DeepSpeedPrecisionPlugin
-    TPUPrecisionPlugin
-    TPUBf16PrecisionPlugin
     DoublePrecisionPlugin
     FullyShardedNativeMixedPrecisionPlugin
+    HPUPrecisionPlugin
     IPUPrecisionPlugin
+    MixedPrecisionPlugin
+    NativeMixedPrecisionPlugin
+    PrecisionPlugin
+    ShardedNativeMixedPrecisionPlugin
+    TPUBf16PrecisionPlugin
+    TPUPrecisionPlugin
 
 
 Cluster Environments

docs/source/extensions/strategy.rst

Lines changed: 2 additions & 0 deletions

@@ -108,9 +108,11 @@ Built-In Training Strategies
     DataParallelStrategy
     DeepSpeedStrategy
     HorovodStrategy
+    HPUParallelStrategy
     IPUStrategy
     ParallelStrategy
     SingleDeviceStrategy
+    SingleHPUStrategy
     SingleTPUStrategy
     Strategy
     TPUSpawnStrategy
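
Taken together, the classes registered across these docs (HPUAccelerator, SingleHPUStrategy, HPUParallelStrategy, HPUPrecisionPlugin, HPUCheckpointIO) can also be passed to the Trainer explicitly instead of relying on the accelerator="hpu" shorthand, which picks the strategy automatically. The snippet below is a hedged sketch of that wiring; the constructor arguments shown are assumptions, not taken from this diff.

    import pytorch_lightning as pl
    from pytorch_lightning.accelerators import HPUAccelerator
    from pytorch_lightning.plugins import HPUCheckpointIO, HPUPrecisionPlugin

    # Explicit single-device HPU setup; roughly equivalent to
    # Trainer(accelerator="hpu", devices=1, precision=16). The single- vs
    # multi-device strategy is still selected by the Trainer from `devices`.
    trainer = pl.Trainer(
        accelerator=HPUAccelerator(),
        devices=1,
        plugins=[HPUPrecisionPlugin(precision=16), HPUCheckpointIO()],
    )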
