
Add support for Habana accelerator (HPU) #11808


Merged: 183 commits from hpu_accelerator into master on Mar 25, 2022.

Commits (183)
f7175c4
Add hpu accelerator support
jerome-habana Feb 8, 2022
7fb871b
Update strategy for optimizer usage
jerome-habana Feb 8, 2022
a1a1ca9
Add checkpointing support
jerome-habana Feb 8, 2022
9a6da43
Fix distributed support with hpu
jerome-habana Feb 8, 2022
3e76db9
Enable usage of static_graph with hpu
jerome-habana Feb 8, 2022
b43d226
Add HPU tests
jerome-habana Feb 8, 2022
992093d
Add basic hpu_stats monitor
jerome-habana Feb 8, 2022
943be49
Code cleanup
jerome-habana Feb 8, 2022
3015972
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 8, 2022
257d644
Update tests
jerome-habana Feb 9, 2022
f1867cd
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 9, 2022
c61d68b
Add configurable params for tests
jerome-habana Feb 10, 2022
f74a898
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 10, 2022
963cd1e
Enable inference test
jerome-habana Feb 11, 2022
53a5416
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 11, 2022
2de04e8
Resolve issue with hmp params type and load hpu
jerome-habana Feb 15, 2022
0197b9c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 15, 2022
b412638
Move hmp_params to HPUPrecision plugin
jerome-habana Feb 17, 2022
e549434
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 17, 2022
1cc0a37
Update habana distributed with ddp subclass
jerome-habana Feb 18, 2022
aeda681
Add hpu backend, datatype checks
jerome-habana Feb 18, 2022
fe32865
Merge branch 'master' into hpu_accelerator
jerome-habana Feb 23, 2022
f9b0c5f
Merge branch 'master' into hpu_accelerator
jerome-habana Feb 23, 2022
123112d
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 23, 2022
ede68eb
Remove unused param for 'on_train_batch_end' in hpu test
jerome-habana Feb 23, 2022
262343a
Merge branch 'master' into hpu_accelerator
jerome-habana Mar 3, 2022
3a029c1
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 3, 2022
0a959f0
Address review comments
jerome-habana Mar 3, 2022
1434299
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 3, 2022
400ea77
Address review comments
jerome-habana Mar 4, 2022
4146bab
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 4, 2022
f5cb696
remove deprecated logging
jerome-habana Mar 4, 2022
d3cd6b1
Merge branch 'master' into hpu_accelerator
jerome-habana Mar 7, 2022
448ed77
Fix imports for failing CI
kaushikb11 Mar 9, 2022
10b190f
fix str to_device section in converting.rst (#12243)
awaelchli Mar 7, 2022
c17c62b
Disable tuner with distributed strategies (#12179)
rohitgr7 Mar 7, 2022
28bc4f0
Add callout items to the Docs landing page (#12196)
kaushikb11 Mar 7, 2022
97e1d28
Integrate global step with progress tracking (#11805)
carmocca Mar 7, 2022
5aecf65
Deprecate `LightningDataModule.on_save/load_checkpoint` (#11893)
jjenniferdai Mar 8, 2022
0949599
add Azure HPU agent (#12258)
Borda Mar 8, 2022
4bd5034
Add `LightningCLI(auto_registry)` (#12108)
carmocca Mar 8, 2022
bd76456
Drop PyTorch 1.7 testing from the CI (#12191)
krshrimali Mar 8, 2022
80b8d01
Have the outputs match the loops format (#12182)
carmocca Mar 8, 2022
c168db5
Address review comments
jerome-habana Mar 9, 2022
831a672
Review comment: Make use of BoringModel
jerome-habana Mar 9, 2022
328329e
Update stats example trainer params
jerome-habana Mar 9, 2022
c8e331e
Correct flake8 errors
jerome-habana Mar 9, 2022
9a71bdc
Remove docstring examples
jerome-habana Mar 9, 2022
8efed0b
Update hpu-tests.yml
raoakarsha Mar 3, 2022
90409a2
prune
Borda Mar 7, 2022
5bbc6dc
Update hpu-tests.yml
Borda Mar 8, 2022
85f535b
Apply suggestions from code review
Borda Mar 9, 2022
75227d9
hwinfo
Borda Mar 9, 2022
711bbf3
Override mypy warnings
jerome-habana Mar 10, 2022
bc174f6
Update test and requirements file
jerome-habana Mar 10, 2022
b28c0ce
Remove hpu stats monitor and deprecated APIs
jerome-habana Mar 10, 2022
3c08bf5
Update non-hpu tests
jerome-habana Mar 10, 2022
f857721
Add hpu-tests.yml and run_hpu_tests.py to support HPU Testing
Borda Mar 10, 2022
a2b2cb1
Merge branch 'master' into hpu_accelerator
jerome-habana Mar 10, 2022
7cb34bc
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 10, 2022
f6baf69
Add exception for non-hpu tests
jerome-habana Mar 10, 2022
21fc9a4
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 10, 2022
3665ffc
Throw exception when accelerator is not present
jerome-habana Mar 10, 2022
e0b4611
Resolve mypy and error message
jerome-habana Mar 10, 2022
545ab6a
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 10, 2022
96ed1cd
Disable hpu pl examples on CPU
jerome-habana Mar 10, 2022
c44b017
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 10, 2022
410875c
Address review comments
jerome-habana Mar 14, 2022
8efe56f
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 14, 2022
073b170
Add documentation for habana gaudi accelerator (HPU)
jerome-habana Mar 15, 2022
7bdcaf6
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 15, 2022
da1037a
Update test code syntax
jerome-habana Mar 15, 2022
5e7af01
Mitigate duplicate label error
jerome-habana Mar 15, 2022
70d6993
Add hpu to toctree
jerome-habana Mar 16, 2022
5061d71
Update pytorch_lightning/plugins/precision/hpu_precision.py
kaushikb11 Mar 16, 2022
f6c36ce
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 16, 2022
798f137
Update _broadcast_object_list
kaushikb11 Mar 16, 2022
5e098cb
Update broadcast for HPUParallelStrategy
kaushikb11 Mar 16, 2022
093056c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 16, 2022
0563310
Update reference links
kaushikb11 Mar 17, 2022
65886ba
Update Strategies
kaushikb11 Mar 17, 2022
d837ef3
Address reviews
kaushikb11 Mar 17, 2022
37e0000
Address reviews
kaushikb11 Mar 17, 2022
07c60b4
Address reviews
jerome-habana Mar 18, 2022
394d9e2
Merge branch 'master' into hpu_accelerator
jerome-habana Mar 18, 2022
12dc3ca
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 18, 2022
3064544
Remove too many sections from sidebar
akihironitta Mar 19, 2022
7c7721d
Fix invalid formatting and links
akihironitta Mar 19, 2022
cc71c7a
Merge branch 'master' into hpu_accelerator
kaushikb11 Mar 21, 2022
e6eaa9f
Address reviews for HPUCheckpointIO
kaushikb11 Mar 21, 2022
33beabd
Address reviews for HPU + AcceleratorConnector
kaushikb11 Mar 21, 2022
759804e
Fix tests
kaushikb11 Mar 21, 2022
bda7e36
Address reviews
kaushikb11 Mar 21, 2022
bdc19be
Remove setting hpu accelerator by just strategy
kaushikb11 Mar 21, 2022
2d34cc5
Remove unnecessary properties for HPU
kaushikb11 Mar 21, 2022
c32601a
Fix HPU tests
kaushikb11 Mar 21, 2022
f43750e
Move tests
kaushikb11 Mar 21, 2022
4e09286
Improve docs
kaushikb11 Mar 21, 2022
ab2f595
Improve tests
kaushikb11 Mar 21, 2022
549d784
Update Changelog
kaushikb11 Mar 21, 2022
ec929df
Fix test for the right device type
kaushikb11 Mar 21, 2022
c55a82f
Fix tests
kaushikb11 Mar 21, 2022
05dcc1c
Fix tests
kaushikb11 Mar 21, 2022
150e667
Merge branch 'master' into hpu_accelerator
kaushikb11 Mar 21, 2022
f5a333b
Address reviews
kaushikb11 Mar 21, 2022
57b9c24
Update plugins
kaushikb11 Mar 21, 2022
3dd763c
Update docs/source/accelerators/hpu.rst
kaushikb11 Mar 22, 2022
773a7a0
Update HPU mnist example
kaushikb11 Mar 22, 2022
9378c87
Update strategy
kaushikb11 Mar 22, 2022
9aefcd2
Address reviews
jerome-habana Mar 22, 2022
1f0b187
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 22, 2022
1d30ef9
Add precision tests to azure pipeline
jerome-habana Mar 22, 2022
fd9488f
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 22, 2022
a4f79fb
Add comments
kaushikb11 Mar 22, 2022
a6a336d
Fix argparse
kaushikb11 Mar 22, 2022
dca30ee
Remove unnecessary use of PL_TORCH_DISTRIBUTED_BACKEND env variable
kaushikb11 Mar 22, 2022
bb8984f
Update pytorch_lightning/strategies/hpu_parallel.py
kaushikb11 Mar 22, 2022
4ab35db
Update pytorch_lightning/utilities/distributed.py
kaushikb11 Mar 22, 2022
e65a3fb
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 22, 2022
a517942
Address review
jerome-habana Mar 23, 2022
d89815d
Address reviews
kaushikb11 Mar 23, 2022
0238b45
Update document
jerome-habana Mar 23, 2022
4f44ea9
Improve Habana doc
kaushikb11 Mar 23, 2022
f332e1c
Improve Habana doc
kaushikb11 Mar 23, 2022
81202c6
Improve Habana doc
kaushikb11 Mar 23, 2022
503df4e
Update pytorch_lightning/trainer/connectors/accelerator_connector.py
kaushikb11 Mar 23, 2022
e6af417
Update links
kaushikb11 Mar 23, 2022
2bd4a66
Merge branch 'hpu_accelerator' of https://github.com/jerome-habana/py…
kaushikb11 Mar 23, 2022
67e710e
Update precision sections
kaushikb11 Mar 23, 2022
1df801b
Update doc
kaushikb11 Mar 23, 2022
9152114
Add defaults to hmp_params for Precision Plugin
kaushikb11 Mar 23, 2022
9846b6a
Update .azure-pipelines/run_hpu_tests.py
kaushikb11 Mar 24, 2022
e86becf
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 24, 2022
d165c44
Apply suggestions from code review
kaushikb11 Mar 24, 2022
c76b95f
Update docs/source/accelerators/hpu.rst
kaushikb11 Mar 24, 2022
bafcb8d
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 24, 2022
2d6c6dd
Apply suggestions from code review
kaushikb11 Mar 24, 2022
75728b6
Apply suggestions from code review
kaushikb11 Mar 24, 2022
68c5281
Update docs/source/accelerators/hpu.rst
kaushikb11 Mar 24, 2022
600e1bd
Address reviews
kaushikb11 Mar 24, 2022
b03d079
Apply suggestions from code review
kaushikb11 Mar 24, 2022
6e4474e
Update API references
kaushikb11 Mar 24, 2022
efd9f65
Address reviews regarding precision
kaushikb11 Mar 24, 2022
22827f0
Address reviews regarding docs and precision
kaushikb11 Mar 24, 2022
e82544c
Update docs/source/accelerators/hpu.rst
kaushikb11 Mar 24, 2022
4500a7e
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 24, 2022
98ba21f
Apply suggestions from code review
kaushikb11 Mar 24, 2022
3c10359
Address reviews & update tests
kaushikb11 Mar 24, 2022
6c0dd88
Merge branch 'hpu_accelerator' of https://github.com/jerome-habana/py…
kaushikb11 Mar 24, 2022
e137f19
Update testing pipeline & conftest
kaushikb11 Mar 24, 2022
a62cfa1
Fix ci
kaushikb11 Mar 24, 2022
1078a69
Add device parsing logic for HPUs
kaushikb11 Mar 24, 2022
a9dfcf3
Fix device parsing
kaushikb11 Mar 24, 2022
4665101
Use the CLI in the example
Mar 24, 2022
2ee4bbf
Docs
Mar 24, 2022
e9ae312
Merge branch 'master' into hpu_accelerator
kaushikb11 Mar 24, 2022
dc3eca7
Update docs/source/accelerators/hpu.rst
kaushikb11 Mar 24, 2022
6952125
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 24, 2022
91cced3
Update hmp_params
kaushikb11 Mar 24, 2022
0671d2c
Support passing amp_level to HPUPrecision
kaushikb11 Mar 24, 2022
522106e
Update HPUAccelerator
kaushikb11 Mar 24, 2022
c8b89ea
Update tests
kaushikb11 Mar 25, 2022
7d028b1
Fix precision tests
kaushikb11 Mar 25, 2022
3c86aff
Update device parsing logic
kaushikb11 Mar 25, 2022
3c8e321
Fix tests & address reviews
kaushikb11 Mar 25, 2022
dcda0ac
Update run_hpu_tests
kaushikb11 Mar 25, 2022
e254cd0
Update CLI test
jerome-habana Mar 25, 2022
c452bd2
Fix typing
kaushikb11 Mar 25, 2022
4c51b33
Merge branch 'hpu_accelerator' of https://github.com/jerome-habana/py…
kaushikb11 Mar 25, 2022
b66c867
Merge branch 'master' into hpu_accelerator
jerome-habana Mar 25, 2022
dca6b0f
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 25, 2022
98e901d
Enable example test in pipeline
jerome-habana Mar 25, 2022
2860a4e
export path of modules
jerome-habana Mar 25, 2022
a297593
Fix test
kaushikb11 Mar 25, 2022
9c1fff7
Merge branch 'hpu_accelerator' of https://github.com/jerome-habana/py…
kaushikb11 Mar 25, 2022
65f1fb9
Update torch distributed
kaushikb11 Mar 25, 2022
2380887
Update strategy
kaushikb11 Mar 25, 2022
59ef6fd
Update example
kaushikb11 Mar 25, 2022
c02c1ed
Apply suggestions from code review
kaushikb11 Mar 25, 2022
beda30c
Address reviews
kaushikb11 Mar 25, 2022
eb99e52
Merge branch 'hpu_accelerator' of https://github.com/jerome-habana/py…
kaushikb11 Mar 25, 2022
c465a06
Update backend env variable for strategy
kaushikb11 Mar 25, 2022
60f2da4
Update backend env variable for strategy
kaushikb11 Mar 25, 2022

Files changed

16 changes: 16 additions & 0 deletions .azure-pipelines/hpu-tests.yml
@@ -28,5 +28,21 @@ jobs:

steps:
- bash: |
    apt-get install hwinfo
    hwinfo --short
  displayName: 'Instance HW info'

- bash: |
    pip install . --requirement requirements/test.txt
  displayName: 'Install dependencies'

- bash: |
    python ".azure-pipelines/run_hpu_tests.py"
  displayName: 'HPU Tests in parallel'

- task: PublishTestResults@2
  inputs:
    testResultsFiles: 'hpu*_test-results.xml'
    testRunTitle: '$(Agent.OS) - $(Build.DefinitionName) - Python $(python.version)'
  condition: succeededOrFailed()
  displayName: 'Publish test results'
142 changes: 142 additions & 0 deletions .azure-pipelines/run_hpu_tests.py
@@ -0,0 +1,142 @@
"""This file is called from the hpu-tests.yml pipeline.

The script runs the HPU tests in parallel. The tests run are:
1. test_inference_only on four cards
2. test_all_stages on two cards
3. the complete HPU test suite on one card
4. the complete HPU test suite on eight cards.
"""
import itertools
import subprocess
import sys

HPU_TESTS_DICTIONARY = {
    "hpu1_test": "python -m coverage run --source pytorch_lightning -m pytest -sv tests/accelerators/test_hpu.py \
        --hmp-bf16 'tests/accelerators/ops_bf16_mnist.txt' \
        --hmp-fp32 'tests/accelerators/ops_fp32_mnist.txt' \
        --forked \
        --junitxml=hpu1_test-results.xml",
    "hpu2_test": "python -m coverage run --source pytorch_lightning -m pytest -sv tests/accelerators/test_hpu.py \
        -k test_all_stages \
        --hpus 2 \
        --verbose \
        --capture=no \
        --forked \
        --junitxml=hpu2_test-results.xml",
    "hpu4_test": "python -m coverage run --source pytorch_lightning -m pytest -sv tests/accelerators/test_hpu.py \
        -k test_inference_only \
        --hpus 4 \
        --capture=no \
        --verbose \
        --forked \
        --junitxml=hpu4_test-results.xml",
    "hpu8_test": "python -m coverage run --source pytorch_lightning -m pytest -sv tests/accelerators/test_hpu.py \
        --hmp-bf16 'tests/accelerators/ops_bf16_mnist.txt' \
        --hmp-fp32 'tests/accelerators/ops_fp32_mnist.txt' \
        --forked \
        --hpus 8 \
        --junitxml=hpu8_test-results.xml",
}

HPU1_TEST = HPU_TESTS_DICTIONARY["hpu1_test"]
HPU2_TEST = HPU_TESTS_DICTIONARY["hpu2_test"]
HPU4_TEST = HPU_TESTS_DICTIONARY["hpu4_test"]
HPU8_TEST = HPU_TESTS_DICTIONARY["hpu8_test"]

PARALLEL_HPU_TESTS_EXECUTION = [[HPU4_TEST, HPU1_TEST], [HPU2_TEST, HPU1_TEST], [HPU8_TEST]]
TIMEOUT = 60
TIMEOUT_EXIT_CODE = -9


def run_hpu_tests_parallel(timeout=TIMEOUT):
    """Run the HPU tests in parallel.

    The tests are run in subprocesses to utilize all eight cards available in the DL1 instance.
    Since the HPU tests are expected to take at most 60 seconds, a process is killed if it exceeds the timeout.
    Returns the list of exit statuses of the HPU tests that were run in the subprocesses:
    0 means a test run succeeded, a non-zero status means it failed,
    and TIMEOUT_EXIT_CODE (-9) is recorded when a run timed out.

    Args:
        timeout: The threshold time, in seconds, allowed for each HPU test subprocess.
            The exception is logged if the timeout expires.
    """
    exit_status = []
    with open("stdout_log.txt", "w") as stdout_log, open("error_log.txt", "w") as error_log:
        for hpu_tests in PARALLEL_HPU_TESTS_EXECUTION:
            process_list = [
                subprocess.Popen(
                    each_hpu_test, shell=True, stdout=stdout_log, stderr=error_log, universal_newlines=True
                )
                for each_hpu_test in hpu_tests
            ]
            for process in process_list:
                try:
                    exit_status.append(process.wait(timeout=timeout))
                except subprocess.TimeoutExpired as e:
                    print(e)
                    print("Killing the process....")
                    process.kill()
                    exit_status.append(TIMEOUT_EXIT_CODE)
    return exit_status


def zip_cmd_exitcode(exit_status):
    """Pair each executed HPU test command with its exit status.

    Returns a list of (test command, exit status) tuples.

    Args:
        exit_status: The exit statuses returned by run_hpu_tests_parallel().
    """
    # Flatten the parallel batches into a single ordered list of commands.
    hpu_tests_called = list(itertools.chain(*PARALLEL_HPU_TESTS_EXECUTION))
    return list(zip(hpu_tests_called, exit_status))


def print_logs(filename):
    """Read the given file and print its contents.

    Args:
        filename: The log filename to print to the console.
    """
    with open(filename) as f:
        print(f.read())


def print_subprocess_logs_and_return_status(exit_status):
    """Print the subprocess stdout and stderr logs and return the overall test status.

    Based on the exit statuses of the HPU tests, returns 0 (success) or 1 (failure) to main().

    Args:
        exit_status: The exit statuses returned by run_hpu_tests_parallel().
    """
    if all(v == 0 for v in exit_status):
        print("All HPU tests passed")
        print_logs("stdout_log.txt")
        return 0
    print("HPU tests are failing")
    print("Printing stdout_log.txt...")
    print_logs("stdout_log.txt")
    print("Printing error_log.txt...")
    print_logs("error_log.txt")
    return 1


def main():
    exit_status = run_hpu_tests_parallel(timeout=TIMEOUT)
    status_list = zip_cmd_exitcode(exit_status)
    print("HPU Tests executed and their exit status:", status_list)
    return print_subprocess_logs_and_return_status(exit_status)


if __name__ == "__main__":
    sys.exit(main())
133 changes: 133 additions & 0 deletions pl_examples/hpu_examples/simple_mnist/mnist.py
@@ -0,0 +1,133 @@
# Copyright The PyTorch Lightning team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os

import torch
from torch.nn import functional as F

import pytorch_lightning as pl
from pl_examples.basic_examples.mnist_datamodule import MNISTDataModule
from pytorch_lightning.plugins import HPUPrecisionPlugin
from pytorch_lightning.strategies.hpu import HPUStrategy
from pytorch_lightning.strategies.hpu_parallel import HPUParallelStrategy
from pytorch_lightning.utilities.imports import _HPU_AVAILABLE


def parse_args():
    import argparse

    parser = argparse.ArgumentParser(description="PyTorch Classification Training")

    parser.add_argument("-b", "--batch-size", default=32, type=int)
    parser.add_argument("--epochs", default=1, type=int, metavar="N", help="number of total epochs to run")
    parser.add_argument(
        "--hpus", default=1, type=int, metavar="N", help="number of Habana accelerators for training (default: 1)"
    )
    parser.add_argument("--hmp", dest="is_hmp", action="store_true", help="enable Habana mixed precision mode")
    parser.add_argument("--hmp-bf16", default="", help="path to bf16 ops list in hmp O1 mode")
    parser.add_argument("--hmp-fp32", default="", help="path to fp32 ops list in hmp O1 mode")
    parser.add_argument("--hmp-opt-level", default="O1", help="choose optimization level for hmp")
    parser.add_argument("--hmp-verbose", action="store_true", help="enable verbose mode for hmp")

    return parser.parse_args()


class LitClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.l1 = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self(x), y)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        probs = self(x)
        acc = self.accuracy(probs, y)
        return acc

    def test_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        acc = self.accuracy(logits, y)
        return acc

    def accuracy(self, logits, y):
        acc = torch.sum(torch.eq(torch.argmax(logits, -1), y).to(torch.float32)) / len(y)
        return acc

    def validation_epoch_end(self, outputs) -> None:
        self.log("val_acc", torch.stack(outputs).mean(), prog_bar=True)

    def test_epoch_end(self, outputs) -> None:
        self.log("test_acc", torch.stack(outputs).mean())

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)


if __name__ == "__main__":

    if _HPU_AVAILABLE:

        args = parse_args()

        # Init our model
        model = LitClassifier()

        # Init DataLoader from MNIST Dataset
        dm = MNISTDataModule(batch_size=args.batch_size)

        # TBD: import these keys from hmp
        hmp_keys = ["level", "verbose", "bf16_ops", "fp32_ops"]
        hmp_params = dict.fromkeys(hmp_keys)
        hmp_params["level"] = args.hmp_opt_level
        hmp_params["verbose"] = args.hmp_verbose
        hmp_params["bf16_ops"] = args.hmp_bf16  # e.g. "./pl_examples/hpu_examples/simple_mnist/ops_bf16_mnist.txt"
        hmp_params["fp32_ops"] = args.hmp_fp32  # e.g. "./pl_examples/hpu_examples/simple_mnist/ops_fp32_mnist.txt"

        parallel_devices = args.hpus
        hpustrat_1 = HPUStrategy(
            device=torch.device("hpu"), precision_plugin=HPUPrecisionPlugin(precision=16, hmp_params=hmp_params)
        )
        hpustrat_8 = HPUParallelStrategy(
            parallel_devices=[torch.device("hpu")] * parallel_devices,
            precision_plugin=HPUPrecisionPlugin(precision=16, hmp_params=hmp_params),
        )

        # Initialize a trainer
        trainer = pl.Trainer(
            strategy=hpustrat_8 if (parallel_devices == 8) else hpustrat_1,
            devices=parallel_devices,
            max_epochs=args.epochs,
            default_root_dir=os.getcwd(),
            accelerator="hpu",
        )

        # Train the model ⚡
        trainer.fit(model, datamodule=dm)
        trainer.test(model, datamodule=dm)
        trainer.validate(model, datamodule=dm)

    else:
        print("This example is supported only on HPU!")
2 changes: 2 additions & 0 deletions pl_examples/hpu_examples/simple_mnist/ops_bf16_mnist.txt
@@ -0,0 +1,2 @@
linear
relu
1 change: 1 addition & 0 deletions pl_examples/hpu_examples/simple_mnist/ops_fp32_mnist.txt
@@ -0,0 +1 @@
cross_entropy
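
These two op lists drive Habana mixed precision (hmp): ops named in ops_bf16_mnist.txt are cast to bfloat16 while ops named in ops_fp32_mnist.txt are kept in fp32. A minimal sketch of wiring such lists into the precision plugin, mirroring the example above (the paths assume the pl_examples layout shown in this diff):

from pytorch_lightning.plugins import HPUPrecisionPlugin

precision_plugin = HPUPrecisionPlugin(
    precision=16,
    hmp_params={
        "level": "O1",  # hmp optimization level, as with --hmp-opt-level
        "verbose": False,
        "bf16_ops": "pl_examples/hpu_examples/simple_mnist/ops_bf16_mnist.txt",
        "fp32_ops": "pl_examples/hpu_examples/simple_mnist/ops_fp32_mnist.txt",
    },
)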
1 change: 1 addition & 0 deletions pytorch_lightning/accelerators/__init__.py
@@ -13,5 +13,6 @@
from pytorch_lightning.accelerators.accelerator import Accelerator # noqa: F401
from pytorch_lightning.accelerators.cpu import CPUAccelerator # noqa: F401
from pytorch_lightning.accelerators.gpu import GPUAccelerator # noqa: F401
from pytorch_lightning.accelerators.hpu import HPUAccelerator # noqa: F401
from pytorch_lightning.accelerators.ipu import IPUAccelerator # noqa: F401
from pytorch_lightning.accelerators.tpu import TPUAccelerator # noqa: F401
1 change: 1 addition & 0 deletions pytorch_lightning/accelerators/accelerator.py
@@ -28,6 +28,7 @@ class Accelerator(ABC):
    - GPU
    - TPU
    - IPU
    - HPU
    """

    def setup_environment(self, root_device: torch.device) -> None:
53 changes: 53 additions & 0 deletions pytorch_lightning/accelerators/hpu.py
@@ -0,0 +1,53 @@
# Copyright The PyTorch Lightning team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from typing import Any, Dict, List, Union

import torch

from pytorch_lightning.accelerators.accelerator import Accelerator
from pytorch_lightning.utilities import _HPU_AVAILABLE


class HPUAccelerator(Accelerator):
    """Accelerator for HPU devices."""

    @staticmethod
    def name() -> str:
        """Name of the Accelerator."""
        return "hpu"

    def get_device_stats(self, device: Union[str, torch.device]) -> Dict[str, Any]:
        """HPU device stats aren't supported yet."""
        return {}

    @staticmethod
    def parse_devices(devices: int) -> int:
        """Accelerator device parsing logic."""
        return devices

    @staticmethod
    def get_parallel_devices(devices: int) -> List[int]:
        """Gets parallel devices for the Accelerator."""
        return list(range(devices))

    @staticmethod
    def auto_device_count() -> int:
        """Get the number of devices when set to auto."""
        # TODO: Update this when the API is exposed by the Habana team
        return 8

    @staticmethod
    def is_available() -> bool:
        """Return whether HPUs are available."""
        return _HPU_AVAILABLE
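
A quick sketch of how these static hooks behave, following the implementation above (is_available depends on the habana_frameworks import check behind _HPU_AVAILABLE):

from pytorch_lightning.accelerators.hpu import HPUAccelerator

HPUAccelerator.auto_device_count()      # 8, fixed until Habana exposes a query API
HPUAccelerator.parse_devices(4)         # 4, passed through unchanged
HPUAccelerator.get_parallel_devices(4)  # [0, 1, 2, 3]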