
Add support for Habana accelerator (HPU) #11808


Merged
merged 183 commits on Mar 25, 2022
Changes from 109 commits
Commits
183 commits
f7175c4
Add hpu accelerator support
jerome-habana Feb 8, 2022
7fb871b
Update strategy for optimizer usage
jerome-habana Feb 8, 2022
a1a1ca9
Add checkpointing support
jerome-habana Feb 8, 2022
9a6da43
Fix distributed support with hpu
jerome-habana Feb 8, 2022
3e76db9
Enable usage of static_graph with hpu
jerome-habana Feb 8, 2022
b43d226
Add HPU tests
jerome-habana Feb 8, 2022
992093d
Add basic hpu_stats monitor
jerome-habana Feb 8, 2022
943be49
Code cleanup
jerome-habana Feb 8, 2022
3015972
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 8, 2022
257d644
Update tests
jerome-habana Feb 9, 2022
f1867cd
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 9, 2022
c61d68b
Add configurable params for tests
jerome-habana Feb 10, 2022
f74a898
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 10, 2022
963cd1e
Enable inference test
jerome-habana Feb 11, 2022
53a5416
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 11, 2022
2de04e8
Resolve issue with hmp params type and load hpu
jerome-habana Feb 15, 2022
0197b9c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 15, 2022
b412638
Move hmp_params to HPUPrecision plugin
jerome-habana Feb 17, 2022
e549434
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 17, 2022
1cc0a37
Update habana distributed with ddp subclass
jerome-habana Feb 18, 2022
aeda681
Add hpu backend, datatype checks
jerome-habana Feb 18, 2022
fe32865
Merge branch 'master' into hpu_accelerator
jerome-habana Feb 23, 2022
f9b0c5f
Merge branch 'master' into hpu_accelerator
jerome-habana Feb 23, 2022
123112d
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 23, 2022
ede68eb
Remove unused param for 'on_train_batch_end' in hpu test
jerome-habana Feb 23, 2022
262343a
Merge branch 'master' into hpu_accelerator
jerome-habana Mar 3, 2022
3a029c1
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 3, 2022
0a959f0
Addres review comments
jerome-habana Mar 3, 2022
1434299
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 3, 2022
400ea77
Address review comments
jerome-habana Mar 4, 2022
4146bab
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 4, 2022
f5cb696
remove deprecated logging
jerome-habana Mar 4, 2022
d3cd6b1
Merge branch 'master' into hpu_accelerator
jerome-habana Mar 7, 2022
448ed77
Fix imports for failing CI
kaushikb11 Mar 9, 2022
10b190f
fix str to_device section in converting.rst (#12243)
awaelchli Mar 7, 2022
c17c62b
Disable tuner with distributed strategies (#12179)
rohitgr7 Mar 7, 2022
28bc4f0
Add callout items to the Docs landing page (#12196)
kaushikb11 Mar 7, 2022
97e1d28
Integrate global step with progress tracking (#11805)
carmocca Mar 7, 2022
5aecf65
Deprecate `LightningDataModule.on_save/load_checkpoint` (#11893)
jjenniferdai Mar 8, 2022
0949599
add Azure HPU agent (#12258)
Borda Mar 8, 2022
4bd5034
Add `LightningCLI(auto_registry)` (#12108)
carmocca Mar 8, 2022
bd76456
Drop PyTorch 1.7 testing from the CI (#12191)
krshrimali Mar 8, 2022
80b8d01
Have the outputs match the loops format (#12182)
carmocca Mar 8, 2022
c168db5
Address review comments
jerome-habana Mar 9, 2022
831a672
Review comment :Make use of Boring model
jerome-habana Mar 9, 2022
328329e
Update stats example trainer params
jerome-habana Mar 9, 2022
c8e331e
Correct flake8 errors
jerome-habana Mar 9, 2022
9a71bdc
Remove docstring examples
jerome-habana Mar 9, 2022
8efed0b
Update hpu-tests.yml
raoakarsha Mar 3, 2022
90409a2
prune
Borda Mar 7, 2022
5bbc6dc
Update hpu-tests.yml
Borda Mar 8, 2022
85f535b
Apply suggestions from code review
Borda Mar 9, 2022
75227d9
hwinfo
Borda Mar 9, 2022
711bbf3
Override mypy warnings
jerome-habana Mar 10, 2022
bc174f6
Update test and requirements file
jerome-habana Mar 10, 2022
b28c0ce
Remove hpu stats monitor and deprecated API's
jerome-habana Mar 10, 2022
3c08bf5
Update non-hpu tests
jerome-habana Mar 10, 2022
f857721
Add hpu-tests.yml and run_hpu_tests.py to support HPU Testing
Borda Mar 10, 2022
a2b2cb1
Merge branch 'master' into hpu_accelerator
jerome-habana Mar 10, 2022
7cb34bc
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 10, 2022
f6baf69
Add exception for non-hpu tests
jerome-habana Mar 10, 2022
21fc9a4
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 10, 2022
3665ffc
Throw exception when accelerator is not present
jerome-habana Mar 10, 2022
e0b4611
Resolve mypy and error message
jerome-habana Mar 10, 2022
545ab6a
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 10, 2022
96ed1cd
Disable hpu pl examples on CPU
jerome-habana Mar 10, 2022
c44b017
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 10, 2022
410875c
Address review comments
jerome-habana Mar 14, 2022
8efe56f
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 14, 2022
073b170
Add documentation for habana gaudi accelerator (HPU)
jerome-habana Mar 15, 2022
7bdcaf6
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 15, 2022
da1037a
Update test code syntax
jerome-habana Mar 15, 2022
5e7af01
Mitigate duplicate label error
jerome-habana Mar 15, 2022
70d6993
Add hpu to toctree
jerome-habana Mar 16, 2022
5061d71
Update pytorch_lightning/plugins/precision/hpu_precision.py
kaushikb11 Mar 16, 2022
f6c36ce
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 16, 2022
798f137
Update _broadvast_object_list
kaushikb11 Mar 16, 2022
5e098cb
Update broadcast for HPUParallelStrategy
kaushikb11 Mar 16, 2022
093056c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 16, 2022
0563310
Update reference links
kaushikb11 Mar 17, 2022
65886ba
Update Strategies
kaushikb11 Mar 17, 2022
d837ef3
Address reviews
kaushikb11 Mar 17, 2022
37e0000
Address reviews
kaushikb11 Mar 17, 2022
07c60b4
Address reviews
jerome-habana Mar 18, 2022
394d9e2
Merge branch 'master' into hpu_accelerator
jerome-habana Mar 18, 2022
12dc3ca
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 18, 2022
3064544
Remove too many sections from sidebar
akihironitta Mar 19, 2022
7c7721d
Fix invalid formatting and links
akihironitta Mar 19, 2022
cc71c7a
Merge branch 'master' into hpu_accelerator
kaushikb11 Mar 21, 2022
e6eaa9f
Address reviews for HPUCHeckpointIO
kaushikb11 Mar 21, 2022
33beabd
Address reviews for HPU + AcceleratorConnector
kaushikb11 Mar 21, 2022
759804e
Fix tests
kaushikb11 Mar 21, 2022
bda7e36
Address reviews
kaushikb11 Mar 21, 2022
bdc19be
Remove setting hpu accelerator by just strategy
kaushikb11 Mar 21, 2022
2d34cc5
Remove unnecessary properties for HPU
kaushikb11 Mar 21, 2022
c32601a
Fix HPU tests
kaushikb11 Mar 21, 2022
f43750e
Move tests
kaushikb11 Mar 21, 2022
4e09286
Improve docs
kaushikb11 Mar 21, 2022
ab2f595
Improve tests
kaushikb11 Mar 21, 2022
549d784
Update Changelog
kaushikb11 Mar 21, 2022
ec929df
Fix test for the rigth device type
kaushikb11 Mar 21, 2022
c55a82f
Fix tests
kaushikb11 Mar 21, 2022
05dcc1c
Fix tests
kaushikb11 Mar 21, 2022
150e667
Merge branch 'master' into hpu_accelerator
kaushikb11 Mar 21, 2022
f5a333b
Address reviews
kaushikb11 Mar 21, 2022
57b9c24
Update plugins
kaushikb11 Mar 21, 2022
3dd763c
Update docs/source/accelerators/hpu.rst
kaushikb11 Mar 22, 2022
773a7a0
Update HPU mnist example
kaushikb11 Mar 22, 2022
9378c87
Update strategy
kaushikb11 Mar 22, 2022
9aefcd2
Address reviews
jerome-habana Mar 22, 2022
1f0b187
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 22, 2022
1d30ef9
Add precision tests to azure pipeline
jerome-habana Mar 22, 2022
fd9488f
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 22, 2022
a4f79fb
Add comments
kaushikb11 Mar 22, 2022
a6a336d
Fix argparse
kaushikb11 Mar 22, 2022
dca30ee
Remove unnecessary use of PL_TORCH_DISTRIBUTED_BACKEND env variable
kaushikb11 Mar 22, 2022
bb8984f
Update pytorch_lightning/strategies/hpu_parallel.py
kaushikb11 Mar 22, 2022
4ab35db
Update pytorch_lightning/utilities/distributed.py
kaushikb11 Mar 22, 2022
e65a3fb
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 22, 2022
a517942
Address review
jerome-habana Mar 23, 2022
d89815d
Address reviews
kaushikb11 Mar 23, 2022
0238b45
Update document
jerome-habana Mar 23, 2022
4f44ea9
Improve Habana doc
kaushikb11 Mar 23, 2022
f332e1c
Improve Habana doc
kaushikb11 Mar 23, 2022
81202c6
Improve Habana doc
kaushikb11 Mar 23, 2022
503df4e
Update pytorch_lightning/trainer/connectors/accelerator_connector.py
kaushikb11 Mar 23, 2022
e6af417
Update links
kaushikb11 Mar 23, 2022
2bd4a66
Merge branch 'hpu_accelerator' of https://github.com/jerome-habana/py…
kaushikb11 Mar 23, 2022
67e710e
Update precision sections
kaushikb11 Mar 23, 2022
1df801b
Update doc
kaushikb11 Mar 23, 2022
9152114
Add defaults to hmp_params for Precision Plugin
kaushikb11 Mar 23, 2022
9846b6a
Update .azure-pipelines/run_hpu_tests.py
kaushikb11 Mar 24, 2022
e86becf
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 24, 2022
d165c44
Apply suggestions from code review
kaushikb11 Mar 24, 2022
c76b95f
Update docs/source/accelerators/hpu.rst
kaushikb11 Mar 24, 2022
bafcb8d
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 24, 2022
2d6c6dd
Apply suggestions from code review
kaushikb11 Mar 24, 2022
75728b6
Apply suggestions from code review
kaushikb11 Mar 24, 2022
68c5281
Update docs/source/accelerators/hpu.rst
kaushikb11 Mar 24, 2022
600e1bd
Address reviews
kaushikb11 Mar 24, 2022
b03d079
Apply suggestions from code review
kaushikb11 Mar 24, 2022
6e4474e
Update API references
kaushikb11 Mar 24, 2022
efd9f65
Address reviews regarding precision
kaushikb11 Mar 24, 2022
22827f0
Address reviews regarding docs and precision
kaushikb11 Mar 24, 2022
e82544c
Update docs/source/accelerators/hpu.rst
kaushikb11 Mar 24, 2022
4500a7e
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 24, 2022
98ba21f
Apply suggestions from code review
kaushikb11 Mar 24, 2022
3c10359
Address reviews & update tests
kaushikb11 Mar 24, 2022
6c0dd88
Merge branch 'hpu_accelerator' of https://github.com/jerome-habana/py…
kaushikb11 Mar 24, 2022
e137f19
Update testing pipeline & conftest
kaushikb11 Mar 24, 2022
a62cfa1
Fix ci
kaushikb11 Mar 24, 2022
1078a69
Add device parsing logic for HPUs
kaushikb11 Mar 24, 2022
a9dfcf3
Fix device parsing
kaushikb11 Mar 24, 2022
4665101
Use the CLI in the example
Mar 24, 2022
2ee4bbf
Docs
Mar 24, 2022
e9ae312
Merge branch 'master' into hpu_accelerator
kaushikb11 Mar 24, 2022
dc3eca7
Update docs/source/accelerators/hpu.rst
kaushikb11 Mar 24, 2022
6952125
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 24, 2022
91cced3
Update hmp_params
kaushikb11 Mar 24, 2022
0671d2c
Support passing amp_level to HPUPrecision
kaushikb11 Mar 24, 2022
522106e
Update HPUAccelerator
kaushikb11 Mar 24, 2022
c8b89ea
Update tests
kaushikb11 Mar 25, 2022
7d028b1
Fix precision tests
kaushikb11 Mar 25, 2022
3c86aff
Update device parsing logic
kaushikb11 Mar 25, 2022
3c8e321
Fix tests & address reviews
kaushikb11 Mar 25, 2022
dcda0ac
Update run_hpu_tests
kaushikb11 Mar 25, 2022
e254cd0
Update CLI test
jerome-habana Mar 25, 2022
c452bd2
Fix typing
kaushikb11 Mar 25, 2022
4c51b33
Merge branch 'hpu_accelerator' of https://github.com/jerome-habana/py…
kaushikb11 Mar 25, 2022
b66c867
Merge branch 'master' into hpu_accelerator
jerome-habana Mar 25, 2022
dca6b0f
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 25, 2022
98e901d
Enable example test in pipeline
jerome-habana Mar 25, 2022
2860a4e
export path of modules
jerome-habana Mar 25, 2022
a297593
Fix test
kaushikb11 Mar 25, 2022
9c1fff7
Merge branch 'hpu_accelerator' of https://github.com/jerome-habana/py…
kaushikb11 Mar 25, 2022
65f1fb9
Update torch distributed
kaushikb11 Mar 25, 2022
2380887
Update strategy
kaushikb11 Mar 25, 2022
59ef6fd
Update example
kaushikb11 Mar 25, 2022
c02c1ed
Apply suggestions from code review
kaushikb11 Mar 25, 2022
beda30c
Address reviews
kaushikb11 Mar 25, 2022
eb99e52
Merge branch 'hpu_accelerator' of https://github.com/jerome-habana/py…
kaushikb11 Mar 25, 2022
c465a06
Update backend env variable for strategy
kaushikb11 Mar 25, 2022
60f2da4
Update backend env variable for strategy
kaushikb11 Mar 25, 2022
15 changes: 15 additions & 0 deletions .azure-pipelines/hpu-tests.yml
@@ -31,3 +31,18 @@ jobs:
      apt-get install -y hwinfo
      hwinfo --short
    displayName: 'Instance HW info'

  - bash: |
      pip install . --requirement requirements/test.txt
    displayName: 'Install dependencies'

  - bash: |
      python ".azure-pipelines/run_hpu_tests.py"
    displayName: 'HPU Tests in parallel'

  - task: PublishTestResults@2
    inputs:
      testResultsFiles: 'hpu*_test-results.xml'
      testRunTitle: '$(Agent.OS) - $(Build.DefinitionName) - Python $(python.version)'
    condition: succeededOrFailed()
    displayName: 'Publish test results'
142 changes: 142 additions & 0 deletions .azure-pipelines/run_hpu_tests.py
@@ -0,0 +1,142 @@
"""This file is called from the hpu-tests.yml pipeline.

The following script run the hpu tests in parallel.
Tests run are:
1. test_inference_only is run on four cards
2. test_all_stages on two cards
3. complete hpu tests using one card
4. complete hpu tests using eight cards.
"""
import itertools
import subprocess
import sys

HPU_TESTS_DICTIONARY = {
    "hpu1_test": "python -m coverage run --source pytorch_lightning -m pytest -sv tests/accelerators/test_hpu.py \
        --hmp-bf16 'tests/accelerators/ops_bf16_mnist.txt' \
        --hmp-fp32 'tests/accelerators/ops_fp32_mnist.txt' \
        --forked \
        --junitxml=hpu1_test-results.xml",
    "hpu2_test": "python -m coverage run --source pytorch_lightning -m pytest -sv tests/accelerators/test_hpu.py \
        -k test_all_stages \
        --hpus 2 \
        --verbose \
        --capture=no \
        --forked \
        --junitxml=hpu2_test-results.xml",
    "hpu4_test": "python -m coverage run --source pytorch_lightning -m pytest -sv tests/accelerators/test_hpu.py \
        -k test_inference_only \
        --hpus 4 \
        --capture=no \
        --verbose \
        --forked \
        --junitxml=hpu4_test-results.xml",
    "hpu8_test": "python -m coverage run --source pytorch_lightning -m pytest -sv tests/accelerators/test_hpu.py \
        --hmp-bf16 'tests/accelerators/ops_bf16_mnist.txt' \
        --hmp-fp32 'tests/accelerators/ops_fp32_mnist.txt' \
        --forked \
        --hpus 8 \
        --junitxml=hpu8_test-results.xml",
}

HPU1_TEST = HPU_TESTS_DICTIONARY["hpu1_test"]
HPU2_TEST = HPU_TESTS_DICTIONARY["hpu2_test"]
HPU4_TEST = HPU_TESTS_DICTIONARY["hpu4_test"]
HPU8_TEST = HPU_TESTS_DICTIONARY["hpu8_test"]

PARALLEL_HPU_TESTS_EXECUTION = [[HPU4_TEST, HPU1_TEST], [HPU2_TEST, HPU1_TEST], [HPU8_TEST]]
TIMEOUT = 60
TIMEOUT_EXIT_CODE = -9


def run_hpu_tests_parallel(timeout=TIMEOUT):
    """This function is called to run the HPU tests in parallel.

    We run the tests in subprocesses to utilize all eight cards available in the DL1 instance.
    Since the maximum time expected for an HPU test run is 60 seconds, we kill a process if its run time exceeds the threshold.
    The return value of this function is the list of exit statuses of the HPU tests that were run in the subprocesses.
    Here, an exit status of 0 means the test run succeeded and a non-zero exit status means it failed.

    Args:
        timeout: The threshold time to run the HPU tests in parallel.
            An exception is logged if the threshold timeout expires.
            TIMEOUT_EXIT_CODE (-9) is recorded in case of a timeout, 0 in case of success and a non-zero value in case of a failure.
    """
    exit_status = []
    with open("stdout_log.txt", "w") as stdout_log, open("error_log.txt", "w") as error_log:
        for hpu_tests in PARALLEL_HPU_TESTS_EXECUTION:
            process_list = [
                subprocess.Popen(
                    each_hpu_test, shell=True, stdout=stdout_log, stderr=error_log, universal_newlines=True
                )
                for each_hpu_test in hpu_tests
            ]
            for process in process_list:
                try:
                    # wait up to `timeout` seconds for each test command to finish
                    exit_status.append(process.wait(timeout=timeout))
                except subprocess.TimeoutExpired as e:
                    print(e)
                    print("Killing the process....")
                    process.kill()
                    exit_status.append(TIMEOUT_EXIT_CODE)
    return exit_status


def zip_cmd_exitcode(exit_status):
    """This function is called to zip the tests that were executed with the exit status of each test.

    The return value of this function is the list of HPU tests called and their exit statuses.

    Args:
        exit_status: The exit statuses returned by run_hpu_tests_parallel().
    """
    status_list = []
    hpu_tests_called = []
    for hpu_tests in PARALLEL_HPU_TESTS_EXECUTION:
        hpu_tests_called.append(hpu_tests)
    status_list = list(zip(list(itertools.chain(*hpu_tests_called)), exit_status))
    return status_list


def print_logs(filename):
    """This function is called to read a log file and print its contents to the console.

    Args:
        filename: The log file that needs to be printed.
    """
    with open(filename) as f:
        print(f.read())


def print_subprocess_logs_and_return_status(exit_status):
    """This function is called to print the subprocess stdout and stderr logs and return the status of the test
    execution.

    Args:
        exit_status: The exit statuses returned by run_hpu_tests_parallel().

    Based on the exit statuses of the HPU tests, success or failure is returned to the main method.
    """
    if all(v == 0 for v in exit_status):
        print("All HPU tests passed")
        file_name = "stdout_log.txt"
        print_logs(file_name)
        return 0
    else:
        print("HPU tests are failing")
        print("Printing stdout_log.txt...")
        file_name = "stdout_log.txt"
        print_logs(file_name)
        print("Printing error_log.txt...")
        file_name = "error_log.txt"
        print_logs(file_name)
        return 1


def main():
    exit_status = run_hpu_tests_parallel(timeout=TIMEOUT)
    status_list = zip_cmd_exitcode(exit_status)
    print("HPU Tests executed and their exit status:", status_list)
    return print_subprocess_logs_and_return_status(exit_status)


if __name__ == "__main__":
    sys.exit(main())
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -158,6 +158,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Added `Callback.state_dict()` and `Callback.load_state_dict()` methods ([#12232](https://github.com/PyTorchLightning/pytorch-lightning/pull/12232))


- Added support for Habana Accelerator (HPU) ([#11808](https://github.com/PyTorchLightning/pytorch-lightning/pull/11808))


### Changed

- Drop PyTorch 1.7 support ([#12191](https://github.com/PyTorchLightning/pytorch-lightning/pull/12191))
212 changes: 212 additions & 0 deletions docs/source/accelerators/hpu.rst
@@ -0,0 +1,212 @@
.. _hpu:

Habana Gaudi AI Processor (HPU)
===============================

Habana® Gaudi® AI training processors have been architected from the ground up and optimized for deep learning training efficiency.
Gaudi offers a substantial price/performance advantage -- so you get to do more deep learning training while spending less.

You can use either `the Gaudi-based AWS EC2 DL1 instances <https://aws.amazon.com/ec2/instance-types/dl1/>`_ or `the Supermicro X12 Gaudi server <https://www.supermicro.com/en/solutions/habana-gaudi>`_.

Habana’s SynapseAI® software suite is optimized for building and training deep learning models using TensorFlow and PyTorch frameworks. Gaudi is referred to as the Habana Processing Unit (HPU).
With SynapseAI, we aim to make training workloads on Gaudi easy, whether you're developing from scratch or migrating existing workloads.

For more information, check out `<https://developer.habana.ai>`_ and `<https://habana.ai/>`_.

----------------

PyTorch Lightning With Gaudi HPU
--------------------------------

Lightning supports training on a single HPU device or 8 HPU devices with the plugins described in the following sections.


----------------

.. _hpu_accelerator:

HPU accelerator
---------------

To enable PyTorch Lightning to use the HPU accelerator, simply pass ``accelerator="hpu"`` to the ``Trainer``.
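
For illustration, a minimal sketch of this option is shown below; ``MyLightningModule`` is a placeholder for any user-defined ``LightningModule``.

.. code-block:: python

    import pytorch_lightning as pl

    # MyLightningModule is assumed to be a user-defined LightningModule
    model = MyLightningModule()

    # select the HPU accelerator; devices=1 is just an illustrative choice
    trainer = pl.Trainer(accelerator="hpu", devices=1)
    trainer.fit(model)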


----------------

.. _single_device_strategy:

Training on Single HPU
----------------------

Passing ``devices=1`` and ``accelerator="hpu"`` together with ``strategy=SingleHPUStrategy(device=torch.device("hpu"))`` to the ``Trainer`` enables the Habana backend for single-Gaudi training.
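
For illustration, a minimal sketch of this configuration follows; it assumes ``SingleHPUStrategy`` is importable from ``pytorch_lightning.strategies`` and uses a placeholder ``LightningModule``.

.. code-block:: python

    import torch

    import pytorch_lightning as pl
    from pytorch_lightning.strategies import SingleHPUStrategy  # assumed import path

    # single-Gaudi training with the Habana backend
    trainer = pl.Trainer(
        accelerator="hpu",
        devices=1,
        strategy=SingleHPUStrategy(device=torch.device("hpu")),
    )
    trainer.fit(MyLightningModule())  # placeholder LightningModule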


----------------

.. _parallel_device_strategy:

Distributed Training
---------------------


Passing ``devices=8`` and ``accelerator="hpu"`` together with ``strategy=HPUParallelStrategy(parallel_devices=[torch.device("hpu")]*devices)`` to the ``Trainer`` enables the Habana backend for distributed training with 8 Gaudis.

The Habana parallel device strategy is based on the DDP strategy, with the addition of Habana's collective communication library (HCCL) to support scale-up within a node and scale-out across multiple nodes.
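
For illustration, a minimal sketch of the 8-card configuration follows; it assumes ``HPUParallelStrategy`` is importable from ``pytorch_lightning.strategies`` and uses a placeholder ``LightningModule``.

.. code-block:: python

    import torch

    import pytorch_lightning as pl
    from pytorch_lightning.strategies import HPUParallelStrategy  # assumed import path

    devices = 8

    # DDP-style training across 8 Gaudi cards, using HCCL for the collectives
    trainer = pl.Trainer(
        accelerator="hpu",
        devices=devices,
        strategy=HPUParallelStrategy(parallel_devices=[torch.device("hpu")] * devices),
    )
    trainer.fit(MyLightningModule())  # placeholder LightningModule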


----------------

.. _mixed_precision_plugin:

Mixed Precision Plugin
----------------------

Passing ``precision=16`` together with an ``hmp_params`` argument to the ``HPUPrecisionPlugin`` enables mixed precision using the Habana Mixed Precision (HMP) package.

You can execute the ops in FP32 or BF16 precision. The HMP package modifies the Python operators to add the appropriate cast operations for the arguments before execution.
The default settings let users enable mixed precision training with minimal code.

In addition to the default settings in HMP, users also have the option of overriding these defaults and providing their own BF16 and FP32 operator lists.

For more details, please refer to `PyTorch Mixed Precision Training on Gaudi <https://docs.habana.ai/en/latest/PyTorch_User_Guide/PyTorch_User_Guide.html#pytorch-mixed-precision-training-on-gaudi>`_.
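
As a minimal sketch, assuming ``HPUPrecisionPlugin`` is importable from ``pytorch_lightning.plugins`` and using placeholder operator-list file names, overriding the defaults looks like this (a complete MNIST example appears in the Enabling Mixed Precision Options section below):

.. code-block:: python

    import pytorch_lightning as pl
    from pytorch_lightning.plugins import HPUPrecisionPlugin  # assumed import path

    # optional HMP overrides; the .txt file names are placeholders
    hmp_params = {
        "level": "O1",
        "verbose": False,
        "bf16_ops": "ops_bf16.txt",
        "fp32_ops": "ops_fp32.txt",
    }

    trainer = pl.Trainer(
        accelerator="hpu",
        devices=1,
        plugins=[HPUPrecisionPlugin(precision=16, hmp_params=hmp_params)],
    )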


----------------

.. _pytorch_lightning_examples:

Getting Started with Lightning on Gaudi
---------------------------------------

This section describes how to train models using PyTorch Lightning with Habana Gaudi.

More Lightning HPU examples can be found in pl_examples (`<https://github.com/PyTorchLightning/pytorch-lightning/tree/master/pl_examples/hpu_examples>`_).

----------------

Enabling Lightning with Single Gaudi HPU
----------------------------------------

The below snippet shows an example model using MNIST with a single Habana Gaudi device:

.. code-block:: python

    import habana_frameworks.torch.core as htcore


    class LitClassifier(pl.LightningModule):
        def __init__(self):
            super(LitClassifier, self).__init__()

            ...


    # Init our model
    model = LitClassifier()

    # Init DataLoader from MNIST Dataset
    dm = MNISTDataModule(batch_size=batch_size)

    ...

    num_hpus = 1

    # enable HPU strategy for single device, with mixed precision using default HMP settings
    hpu_strategy = SingleHPUStrategy(device=torch.device("hpu"), precision_plugin=HPUPrecisionPlugin(precision=16))

    # Initialize a trainer with 1 HPU accelerator
    trainer = pl.Trainer(accelerator="hpu", devices=num_hpus, strategy=hpu_strategy)

    # Train the model ⚡
    trainer.fit(model, datamodule=dm)


----------------

Enabling Lightning with 8 Gaudi HPUs (distributed)
--------------------------------------------------

The below snippet shows an example model using MNIST with 8 Habana Gaudi devices:

.. code-block:: python

    import habana_frameworks.torch.core as htcore


    class LitClassifier(pl.LightningModule):
        def __init__(self):
            super(LitClassifier, self).__init__()

            ...


    # Init our model
    model = LitClassifier()

    # Init DataLoader from MNIST Dataset
    dm = MNISTDataModule(batch_size=batch_size)

    ...

    # Initialize a trainer with HPU accelerator with 8 devices
    trainer = pl.Trainer(accelerator="hpu", devices=8, plugins=[HPUPrecisionPlugin(precision=16)])

    # Train the model ⚡
    trainer.fit(model, datamodule=dm)


----------------

Enabling Mixed Precision Options
--------------------------------

The below snippet shows an example model using MNIST with a single Habana Gaudi device, making use of HMP by overriding the default parameters.
This enables advanced users to provide their own BF16 and FP32 operator list instead of using the HMP defaults.

.. code-block:: python

    import habana_frameworks.torch.core as htcore


    class LitClassifier(pl.LightningModule):
        def __init__(self):
            super(LitClassifier, self).__init__()

            ...


    # Init our model
    model = LitClassifier()

    # Init DataLoader from MNIST Dataset
    dm = MNISTDataModule(batch_size=batch_size)

    ...

    num_hpus = 1

    # Optional Habana mixed precision params to be set
    hmp_keys = ["level", "verbose", "bf16_ops", "fp32_ops"]
    hmp_params = dict.fromkeys(hmp_keys)
    hmp_params["level"] = "O1"
    hmp_params["verbose"] = False
    hmp_params["bf16_ops"] = "ops_bf16_mnist.txt"
    hmp_params["fp32_ops"] = "ops_fp32_mnist.txt"

    # Initialize a trainer with the HPU accelerator for single-device training,
    # with mixed precision using overridden HMP settings
    trainer = pl.Trainer(accelerator="hpu", devices=1, plugins=[HPUPrecisionPlugin(precision=16, hmp_params=hmp_params)])

    # Train the model ⚡
    trainer.fit(model, datamodule=dm)


----------------

.. _known-limitations_hpu:

Known limitations
-----------------

* Habana dataloader is not supported.
* Device stats monitoring is not supported.
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -88,6 +88,7 @@ Welcome to PyTorch Lightning
   accelerators/gpu
   accelerators/tpu
   accelerators/ipu
   accelerators/hpu

.. toctree::
   :maxdepth: 1