Add support for Habana accelerator (HPU) #11808
Merged
kaushikb11 merged 183 commits into Lightning-AI:master from jerome-habana:hpu_accelerator on Mar 25, 2022
Changes from 109 commits (183 commits in total)
f7175c4
Add hpu accelerator support
jerome-habana 7fb871b
Update strategy for optimizer usage
jerome-habana a1a1ca9
Add checkpointing support
jerome-habana 9a6da43
Fix distributed support with hpu
jerome-habana 3e76db9
Enable usage of static_graph with hpu
jerome-habana b43d226
Add HPU tests
jerome-habana 992093d
Add basic hpu_stats monitor
jerome-habana 943be49
Code cleanup
jerome-habana 3015972
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 257d644
Update tests
jerome-habana f1867cd
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] c61d68b
Add configurable params for tests
jerome-habana f74a898
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 963cd1e
Enable inference test
jerome-habana 53a5416
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 2de04e8
Resolve issue with hmp params type and load hpu
jerome-habana 0197b9c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] b412638
Move hmp_params to HPUPrecision plugin
jerome-habana e549434
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 1cc0a37
Update habana distributed with ddp subclass
jerome-habana aeda681
Add hpu backend, datatype checks
jerome-habana fe32865
Merge branch 'master' into hpu_accelerator
jerome-habana f9b0c5f
Merge branch 'master' into hpu_accelerator
jerome-habana 123112d
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] ede68eb
Remove unused param for 'on_train_batch_end' in hpu test
jerome-habana 262343a
Merge branch 'master' into hpu_accelerator
jerome-habana 3a029c1
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 0a959f0
Addres review comments
jerome-habana 1434299
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 400ea77
Address review comments
jerome-habana 4146bab
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] f5cb696
remove deprecated logging
jerome-habana d3cd6b1
Merge branch 'master' into hpu_accelerator
jerome-habana 448ed77
Fix imports for failing CI
kaushikb11 10b190f
fix str to_device section in converting.rst (#12243)
awaelchli c17c62b
Disable tuner with distributed strategies (#12179)
rohitgr7 28bc4f0
Add callout items to the Docs landing page (#12196)
kaushikb11 97e1d28
Integrate global step with progress tracking (#11805)
carmocca 5aecf65
Deprecate `LightningDataModule.on_save/load_checkpoint` (#11893)
jjenniferdai 0949599
add Azure HPU agent (#12258)
Borda 4bd5034
Add `LightningCLI(auto_registry)` (#12108)
carmocca bd76456
Drop PyTorch 1.7 testing from the CI (#12191)
krshrimali 80b8d01
Have the outputs match the loops format (#12182)
carmocca c168db5
Address review comments
jerome-habana 831a672
Review comment :Make use of Boring model
jerome-habana 328329e
Update stats example trainer params
jerome-habana c8e331e
Correct flake8 errors
jerome-habana 9a71bdc
Remove docstring examples
jerome-habana 8efed0b
Update hpu-tests.yml
raoakarsha 90409a2
prune
Borda 5bbc6dc
Update hpu-tests.yml
Borda 85f535b
Apply suggestions from code review
Borda 75227d9
hwinfo
Borda 711bbf3
Override mypy warnings
jerome-habana bc174f6
Update test and requirements file
jerome-habana b28c0ce
Remove hpu stats monitor and deprecated API's
jerome-habana 3c08bf5
Update non-hpu tests
jerome-habana f857721
Add hpu-tests.yml and run_hpu_tests.py to support HPU Testing
Borda a2b2cb1
Merge branch 'master' into hpu_accelerator
jerome-habana 7cb34bc
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] f6baf69
Add exception for non-hpu tests
jerome-habana 21fc9a4
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 3665ffc
Throw exception when accelerator is not present
jerome-habana e0b4611
Resolve mypy and error message
jerome-habana 545ab6a
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 96ed1cd
Disable hpu pl examples on CPU
jerome-habana c44b017
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 410875c
Address review comments
jerome-habana 8efe56f
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 073b170
Add documentation for habana gaudi accelerator (HPU)
jerome-habana 7bdcaf6
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] da1037a
Update test code syntax
jerome-habana 5e7af01
Mitigate duplicate label error
jerome-habana 70d6993
Add hpu to toctree
jerome-habana 5061d71
Update pytorch_lightning/plugins/precision/hpu_precision.py
kaushikb11 f6c36ce
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 798f137
Update _broadvast_object_list
kaushikb11 5e098cb
Update broadcast for HPUParallelStrategy
kaushikb11 093056c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 0563310
Update reference links
kaushikb11 65886ba
Update Strategies
kaushikb11 d837ef3
Address reviews
kaushikb11 37e0000
Address reviews
kaushikb11 07c60b4
Address reviews
jerome-habana 394d9e2
Merge branch 'master' into hpu_accelerator
jerome-habana 12dc3ca
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 3064544
Remove too many sections from sidebar
akihironitta 7c7721d
Fix invalid formatting and links
akihironitta cc71c7a
Merge branch 'master' into hpu_accelerator
kaushikb11 e6eaa9f
Address reviews for HPUCHeckpointIO
kaushikb11 33beabd
Address reviews for HPU + AcceleratorConnector
kaushikb11 759804e
Fix tests
kaushikb11 bda7e36
Address reviews
kaushikb11 bdc19be
Remove setting hpu accelerator by just strategy
kaushikb11 2d34cc5
Remove unnecessary properties for HPU
kaushikb11 c32601a
Fix HPU tests
kaushikb11 f43750e
Move tests
kaushikb11 4e09286
Improve docs
kaushikb11 ab2f595
Improve tests
kaushikb11 549d784
Update Changelog
kaushikb11 ec929df
Fix test for the rigth device type
kaushikb11 c55a82f
Fix tests
kaushikb11 05dcc1c
Fix tests
kaushikb11 150e667
Merge branch 'master' into hpu_accelerator
kaushikb11 f5a333b
Address reviews
kaushikb11 57b9c24
Update plugins
kaushikb11 3dd763c
Update docs/source/accelerators/hpu.rst
kaushikb11 773a7a0
Update HPU mnist example
kaushikb11 9378c87
Update strategy
kaushikb11 9aefcd2
Address reviews
jerome-habana 1f0b187
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 1d30ef9
Add precision tests to azure pipeline
jerome-habana fd9488f
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] a4f79fb
Add comments
kaushikb11 a6a336d
Fix argparse
kaushikb11 dca30ee
Remove unnecessary use of PL_TORCH_DISTRIBUTED_BACKEND env variable
kaushikb11 bb8984f
Update pytorch_lightning/strategies/hpu_parallel.py
kaushikb11 4ab35db
Update pytorch_lightning/utilities/distributed.py
kaushikb11 e65a3fb
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] a517942
Address review
jerome-habana d89815d
Address reviews
kaushikb11 0238b45
Update document
jerome-habana 4f44ea9
Improve Habana doc
kaushikb11 f332e1c
Improve Habana doc
kaushikb11 81202c6
Improve Habana doc
kaushikb11 503df4e
Update pytorch_lightning/trainer/connectors/accelerator_connector.py
kaushikb11 e6af417
Update links
kaushikb11 2bd4a66
Merge branch 'hpu_accelerator' of https://github.com/jerome-habana/py…
kaushikb11 67e710e
Update precision sections
kaushikb11 1df801b
Update doc
kaushikb11 9152114
Add defaults to hmp_params for Precision Plugin
kaushikb11 9846b6a
Update .azure-pipelines/run_hpu_tests.py
kaushikb11 e86becf
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] d165c44
Apply suggestions from code review
kaushikb11 c76b95f
Update docs/source/accelerators/hpu.rst
kaushikb11 bafcb8d
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 2d6c6dd
Apply suggestions from code review
kaushikb11 75728b6
Apply suggestions from code review
kaushikb11 68c5281
Update docs/source/accelerators/hpu.rst
kaushikb11 600e1bd
Address reviews
kaushikb11 b03d079
Apply suggestions from code review
kaushikb11 6e4474e
Update API references
kaushikb11 efd9f65
Address reviews regarding precision
kaushikb11 22827f0
Address reviews regarding docs and precision
kaushikb11 e82544c
Update docs/source/accelerators/hpu.rst
kaushikb11 4500a7e
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 98ba21f
Apply suggestions from code review
kaushikb11 3c10359
Address reviews & update tests
kaushikb11 6c0dd88
Merge branch 'hpu_accelerator' of https://github.com/jerome-habana/py…
kaushikb11 e137f19
Update testing pipeline & conftest
kaushikb11 a62cfa1
Fix ci
kaushikb11 1078a69
Add device parsing logic for HPUs
kaushikb11 a9dfcf3
Fix device parsing
kaushikb11 4665101
Use the CLI in the example
2ee4bbf
Docs
e9ae312
Merge branch 'master' into hpu_accelerator
kaushikb11 dc3eca7
Update docs/source/accelerators/hpu.rst
kaushikb11 6952125
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 91cced3
Update hmp_params
kaushikb11 0671d2c
Support passing amp_level to HPUPrecision
kaushikb11 522106e
Update HPUAccelerator
kaushikb11 c8b89ea
Update tests
kaushikb11 7d028b1
Fix precision tests
kaushikb11 3c86aff
Update device parsing logic
kaushikb11 3c8e321
Fix tests & address reviews
kaushikb11 dcda0ac
Update run_hpu_tests
kaushikb11 e254cd0
Update CLI test
jerome-habana c452bd2
Fix typing
kaushikb11 4c51b33
Merge branch 'hpu_accelerator' of https://github.com/jerome-habana/py…
kaushikb11 b66c867
Merge branch 'master' into hpu_accelerator
jerome-habana dca6b0f
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] 98e901d
Enable example test in pipeline
jerome-habana 2860a4e
export path of modules
jerome-habana a297593
Fix test
kaushikb11 9c1fff7
Merge branch 'hpu_accelerator' of https://github.com/jerome-habana/py…
kaushikb11 65f1fb9
Update torch distributed
kaushikb11 2380887
Update strategy
kaushikb11 59ef6fd
Update example
kaushikb11 c02c1ed
Apply suggestions from code review
kaushikb11 beda30c
Address reviews
kaushikb11 eb99e52
Merge branch 'hpu_accelerator' of https://github.com/jerome-habana/py…
kaushikb11 c465a06
Update backend env variable for strategy
kaushikb11 60f2da4
Update backend env variable for strategy
kaushikb11
.azure-pipelines/run_hpu_tests.py (new file, +142 lines)
"""This file is called from the hpu-tests.yml pipeline. | ||
kaushikb11 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
The following script run the hpu tests in parallel. | ||
Tests run are: | ||
1. test_inference_only is run on four cards | ||
2. test_all_stages on two cards | ||
3. complete hpu tests using one card | ||
4. complete hpu tests using eight cards. | ||
""" | ||
import itertools | ||
import subprocess | ||
import sys | ||
|
||
HPU_TESTS_DICTIONARY = { | ||
kaushikb11 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
"hpu1_test": "python -m coverage run --source pytorch_lightning -m pytest -sv tests/accelerators/test_hpu.py \ | ||
--hmp-bf16 'tests/accelerators/ops_bf16_mnist.txt' \ | ||
--hmp-fp32 'tests/accelerators/ops_fp32_mnist.txt' \ | ||
--forked \ | ||
--junitxml=hpu1_test-results.xml", | ||
"hpu2_test": "python -m coverage run --source pytorch_lightning -m pytest -sv tests/accelerators/test_hpu.py \ | ||
-k test_all_stages \ | ||
--hpus 2 \ | ||
--verbose \ | ||
--capture=no \ | ||
--forked \ | ||
--junitxml=hpu2_test-results.xml", | ||
"hpu4_test": "python -m coverage run --source pytorch_lightning -m pytest -sv tests/accelerators/test_hpu.py \ | ||
-k test_inference_only \ | ||
--hpus 4 \ | ||
--capture=no \ | ||
--verbose \ | ||
--forked \ | ||
--junitxml=hpu4_test-results.xml", | ||
"hpu8_test": "python -m coverage run --source pytorch_lightning -m pytest -sv tests/accelerators/test_hpu.py \ | ||
--hmp-bf16 'tests/accelerators/ops_bf16_mnist.txt' \ | ||
--hmp-fp32 'tests/accelerators/ops_fp32_mnist.txt' \ | ||
--forked \ | ||
--hpus 8 \ | ||
--junitxml=hpu8_test-results.xml", | ||
} | ||
|
||
HPU1_TEST = HPU_TESTS_DICTIONARY["hpu1_test"] | ||
HPU2_TEST = HPU_TESTS_DICTIONARY["hpu2_test"] | ||
HPU4_TEST = HPU_TESTS_DICTIONARY["hpu4_test"] | ||
HPU8_TEST = HPU_TESTS_DICTIONARY["hpu8_test"] | ||
|
||
PARALLEL_HPU_TESTS_EXECUTION = [[HPU4_TEST, HPU1_TEST], [HPU2_TEST, HPU1_TEST], [HPU8_TEST]] | ||
TIMEOUT = 60 | ||
kaushikb11 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
TIMEOUT_EXIT_CODE = -9 | ||
|
||
|
||
def run_hpu_tests_parallel(timeout=TIMEOUT): | ||
"""This function is called to run the HPU tests in parallel. | ||
|
||
We run the tests in sub process to utilize all the eight cards available in the DL1 instance | ||
Considering the max time taken to run the HPU tests as 60 seconds, we kill the process if the time taken exceeds. | ||
Return of this function will be the list of exit status of the HPU tests that were run in the subprocess. | ||
Here, the exit_status 0 means the test run is successful. exit_status 1 means the test run is failed. | ||
Args: | ||
timeout: The threshold time to run the HPU tests in parallel. | ||
Exception is logged if the threshold timeout gets expired. | ||
TIMEOUT_EXIT_CODE will be returned as -9 in case of timeout, 0 in case of success and 4 in case of a failure. | ||
kaushikb11 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
""" | ||
exit_status = [] | ||
with open("stdout_log.txt", "w") as stdout_log, open("error_log.txt", "w") as error_log: | ||
for hpu_tests in PARALLEL_HPU_TESTS_EXECUTION: | ||
process_list = [ | ||
subprocess.Popen( | ||
each_hpu_test, shell=True, stdout=stdout_log, stderr=error_log, universal_newlines=True | ||
) | ||
for each_hpu_test in hpu_tests | ||
] | ||
for process in process_list: | ||
try: | ||
exit_status.append(process.wait(timeout=TIMEOUT)) | ||
except subprocess.TimeoutExpired as e: | ||
print(e) | ||
print("Killing the process....") | ||
kaushikb11 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
process.kill() | ||
exit_status.append(TIMEOUT_EXIT_CODE) | ||
return exit_status | ||
|
||
|
||
def zip_cmd_exitcode(exit_status): | ||
"""This function is called to zip the tests that were executed with the exit status of the test. | ||
|
||
Return of this function will be list of hpu tests called and their exit status. | ||
Args: | ||
exit_status: The returned exit_status after executing run_hpu_tests_parallel(). | ||
kaushikb11 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
""" | ||
status_list = [] | ||
hpu_tests_called = [] | ||
for hpu_tests in PARALLEL_HPU_TESTS_EXECUTION: | ||
hpu_tests_called.append(hpu_tests) | ||
kaushikb11 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
status_list = list(zip(list(itertools.chain(*hpu_tests_called)), exit_status)) | ||
return status_list | ||
|
||
|
||
def print_logs(filename): | ||
"""This function is called to read the file and print the logs. | ||
|
||
Args: | ||
filename: Provide the log filename that need to be print on the console. | ||
""" | ||
with open(filename) as f: | ||
print(f.read()) | ||
|
||
|
||
def print_subprocess_logs_and_return_status(exit_status): | ||
"""This function is called to print the logs of subprocess stdout and stderror and return the status of test | ||
execution. | ||
|
||
Args: | ||
exit_status: The returned exit_status after executing run_hpu_tests_parallel(). | ||
Return of this function will be the return to main(). | ||
Based on the exit status of the HPU tests, we return success or failure to the main method. | ||
kaushikb11 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
""" | ||
if all(v == 0 for v in exit_status): | ||
print("All HPU tests passed") | ||
file_name = "stdout_log.txt" | ||
print_logs(file_name) | ||
return 0 | ||
else: | ||
print("HPU tests are failing") | ||
print("Printing stdout_log.txt...") | ||
file_name = "stdout_log.txt" | ||
print_logs(file_name) | ||
print("Printing error_log.txt...") | ||
file_name = "error_log.txt" | ||
print_logs(file_name) | ||
return 1 | ||
|
||
|
||
def main(): | ||
exit_status = run_hpu_tests_parallel(timeout=TIMEOUT) | ||
kaushikb11 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
status_list = zip_cmd_exitcode(exit_status) | ||
print("HPU Tests executed and their exit status:", status_list) | ||
return print_subprocess_logs_and_return_status(exit_status) | ||
|
||
|
||
if __name__ == "__main__": | ||
jerome-habana marked this conversation as resolved.
Show resolved
Hide resolved
|
||
sys.exit(main()) |
docs/source/accelerators/hpu.rst (new file, +212 lines)
.. _hpu:

Habana Gaudi AI Processor (HPU)
===============================

Habana® Gaudi® AI training processors have been architected from the ground up and optimized for deep learning training efficiency.
Gaudi offers a substantial price/performance advantage, so you get to do more deep learning training while spending less.

You can use either `the Gaudi-based AWS EC2 DL1 instances <https://aws.amazon.com/ec2/instance-types/dl1/>`_ or `the Supermicro X12 Gaudi server <https://www.supermicro.com/en/solutions/habana-gaudi>`_.

Habana’s SynapseAI® software suite is optimized for building and training deep learning models using TensorFlow and PyTorch frameworks. Gaudi is referred to as the Habana Processing Unit (HPU).
With SynapseAI, we aim to make training workloads on Gaudi easy, whether you're developing from scratch or migrating existing workloads.

For more information, check out `<https://developer.habana.ai>`_ and `<https://habana.ai/>`_.
----------------

PyTorch Lightning With Gaudi HPU
--------------------------------

Lightning supports training on a single HPU device or on 8 HPU devices with the accelerator, strategies, and plugins described in the following sections.
----------------

.. _hpu_accelerator:

HPU accelerator
---------------

To enable PyTorch Lightning to utilize the HPU accelerator, simply pass ``accelerator="hpu"`` to the ``Trainer``.
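For illustration, a minimal configuration might look like the following (a sketch only; an HPU-enabled PyTorch installation is assumed):

.. code-block:: python

    import pytorch_lightning as pl

    # Select the Habana HPU accelerator with a single device.
    trainer = pl.Trainer(accelerator="hpu", devices=1)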
----------------

.. _single_device_strategy:

Training on Single HPU
----------------------

Passing ``devices=1`` and ``accelerator="hpu"`` together with ``strategy=SingleHPUStrategy(device=torch.device("hpu"))`` to the ``Trainer`` class enables the Habana backend for single Gaudi training.
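For example, the single-device strategy can be stated explicitly as below (a sketch; the import location follows this PR's layout and an HPU-enabled environment is assumed):

.. code-block:: python

    import torch
    import pytorch_lightning as pl
    from pytorch_lightning.strategies import SingleHPUStrategy

    # One Gaudi device, with the single-HPU strategy spelled out explicitly.
    trainer = pl.Trainer(
        accelerator="hpu",
        devices=1,
        strategy=SingleHPUStrategy(device=torch.device("hpu")),
    )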
----------------

.. _parallel_device_strategy:

Distributed Training
--------------------

Passing ``devices=8`` and ``accelerator="hpu"`` together with ``strategy=HPUParallelStrategy(parallel_devices=[torch.device("hpu")] * devices)`` to the ``Trainer`` class enables the Habana backend for distributed training with 8 Gaudis.

The Habana parallel device strategy is based on the DDP strategy, with the addition of Habana's collective communication library (HCCL) to support scale-up within a node and scale-out across multiple nodes.
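A sketch of the distributed setup is shown below (import location assumed from this PR; 8 Gaudi cards and HCCL are assumed to be available):

.. code-block:: python

    import torch
    import pytorch_lightning as pl
    from pytorch_lightning.strategies import HPUParallelStrategy

    devices = 8
    # Eight Gaudi devices; the strategy is DDP-based, with HCCL used for collectives.
    trainer = pl.Trainer(
        accelerator="hpu",
        devices=devices,
        strategy=HPUParallelStrategy(parallel_devices=[torch.device("hpu")] * devices),
    )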
----------------

.. _mixed_precision_plugin:

Mixed Precision Plugin
----------------------

Passing ``precision=16`` and an ``hmp_params`` parameter to the ``Trainer`` class enables the Habana plugin for mixed precision, using the Habana Mixed Precision (HMP) package.

You can execute the ops in FP32 or BF16 precision. The HMP package modifies the Python operators to add the appropriate cast operations for the arguments before execution.
The default settings make it easy to enable mixed precision training with minimal code changes.

In addition to the default settings in HMP, users also have the option of overriding these defaults and providing their own BF16 and FP32 operator lists.

For more details, please refer to `PyTorch Mixed Precision Training on Gaudi <https://docs.habana.ai/en/latest/PyTorch_User_Guide/PyTorch_User_Guide.html#pytorch-mixed-precision-training-on-gaudi>`_.
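A short sketch using the default HMP operator lists is given below (the full example with overridden lists appears in a later section; the plugin import path is assumed from this PR):

.. code-block:: python

    import pytorch_lightning as pl
    from pytorch_lightning.plugins import HPUPrecisionPlugin

    # BF16/FP32 mixed precision on HPU using the default HMP settings.
    trainer = pl.Trainer(accelerator="hpu", devices=1, plugins=[HPUPrecisionPlugin(precision=16)])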
----------------

.. _pytorch_lightning_examples:

Getting Started with Lightning on Gaudi
---------------------------------------

This section describes how to train models using PyTorch Lightning with Habana Gaudi.

More Lightning HPU examples can be found in `pl_examples <https://github.com/PyTorchLightning/pytorch-lightning/tree/master/pl_examples/hpu_examples>`_.
----------------

Enabling Lightning with Single Gaudi HPU
----------------------------------------

The snippet below shows an example model using MNIST with a single Habana Gaudi device:

.. code-block:: python

    import habana_frameworks.torch.core as htcore


    class LitClassifier(pl.LightningModule):
        def __init__(self):
            super().__init__()

        ...


    # Init our model
    model = LitClassifier()

    # Init DataLoader from MNIST Dataset
    dm = MNISTDataModule(batch_size=batch_size)

    ...

    num_hpus = 1

    # Enable HPU strategy for single device, with mixed precision using default HMP settings
    hpu_strategy = SingleHPUStrategy(device=torch.device("hpu"), precision_plugin=HPUPrecisionPlugin(precision=16))

    # Initialize a trainer with 1 HPU accelerator
    trainer = pl.Trainer(accelerator="hpu", devices=num_hpus, strategy=hpu_strategy)

    # Train the model ⚡
    trainer.fit(model, datamodule=dm)
----------------

Enabling Lightning with 8 Gaudi HPUs (distributed)
--------------------------------------------------

The snippet below shows an example model using MNIST with 8 Habana Gaudi devices:

.. code-block:: python

    import habana_frameworks.torch.core as htcore


    class LitClassifier(pl.LightningModule):
        def __init__(self):
            super().__init__()

        ...


    # Init our model
    model = LitClassifier()

    # Init DataLoader from MNIST Dataset
    dm = MNISTDataModule(batch_size=batch_size)

    ...

    # Initialize a trainer with HPU accelerator with 8 devices
    trainer = pl.Trainer(accelerator="hpu", devices=8, plugins=[HPUPrecisionPlugin(precision=16)])

    # Train the model ⚡
    trainer.fit(model, datamodule=dm)
----------------

Enabling Mixed Precision Options
--------------------------------

The snippet below shows an example model using MNIST with a single Habana Gaudi device and making use of HMP by overriding the default parameters.
This enables advanced users to provide their own BF16 and FP32 operator lists instead of using the HMP defaults.

.. code-block:: python

    import habana_frameworks.torch.core as htcore


    class LitClassifier(pl.LightningModule):
        def __init__(self):
            super().__init__()

        ...


    # Init our model
    model = LitClassifier()

    # Init DataLoader from MNIST Dataset
    dm = MNISTDataModule(batch_size=batch_size)

    ...

    num_hpus = 1

    # Optional Habana mixed precision params to be set
    hmp_keys = ["level", "verbose", "bf16_ops", "fp32_ops"]
    hmp_params = dict.fromkeys(hmp_keys)
    hmp_params["level"] = "O1"
    hmp_params["verbose"] = False
    hmp_params["bf16_ops"] = "ops_bf16_mnist.txt"
    hmp_params["fp32_ops"] = "ops_fp32_mnist.txt"

    # Initialize a trainer with HPU accelerator for a single device,
    # with mixed precision using overridden HMP settings
    trainer = pl.Trainer(accelerator="hpu", devices=1, plugins=[HPUPrecisionPlugin(precision=16, hmp_params=hmp_params)])

    # Train the model ⚡
    trainer.fit(model, datamodule=dm)
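The ``ops_bf16_mnist.txt`` and ``ops_fp32_mnist.txt`` files referenced above are plain-text operator lists with one operator name per line; the snippet below is purely illustrative (consult the HMP user guide for the exact operator names your model needs):

.. code-block:: text

    linear
    relu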
----------------

.. _known-limitations_hpu:

Known limitations
-----------------

* Habana dataloader is not supported.
* Device stats monitoring is not supported.