Launch options for Lightning Lite #14992
Merged

Commits (55):
ec164b9  squash all (awaelchli)
43b9768  [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
bd2cde2  support script args (awaelchli)
bae0c67  wip (awaelchli)
71c7b60  refactor (awaelchli)
39ff946  reset (awaelchli)
98cb643  cli stuff (awaelchli)
2450512  cli tests (awaelchli)
e541119  test connector (awaelchli)
198c59c  function inspection (awaelchli)
7f71abe  types (awaelchli)
a956e55  tests for collision (awaelchli)
52894b7  add notice (awaelchli)
71cafff  docs (awaelchli)
002e5f7  mypy stuff (awaelchli)
145d88f  changelog (awaelchli)
a8cf41f  remove demo examples (awaelchli)
70d8b78  error handling for run and cli (awaelchli)
36f177d  Merge branch 'master' into lite/launcher-poc (awaelchli)
1592d7d  Merge branch 'master' into lite/launcher-poc (awaelchli)
d08791d  remove handled todo (awaelchli)
71421dc  [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
784d59b  Merge branch 'master' into lite/launcher-poc (awaelchli)
3a5964a  fix test (awaelchli)
5d9dd67  update cli detection (awaelchli)
7762963  mypy (awaelchli)
ffa18e6  Merge branch 'master' into lite/launcher-poc (awaelchli)
5d6d983  add description (awaelchli)
9b200a9  address review comments (awaelchli)
5a0e6fc  fix env variable selection (awaelchli)
85c9463  [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
1a9c982  Merge branch 'master' into lite/launcher-poc (awaelchli)
43742d3  Merge branch 'master' into lite/launcher-poc (awaelchli)
7332980  fix test (awaelchli)
9b8ec68  unused import (awaelchli)
bda824a  Merge branch 'master' into lite/launcher-poc (awaelchli)
d92b37f  notebook (awaelchli)
c08b3f7  notebook (awaelchli)
9698440  raise error on win + 1.13 (awaelchli)
c47a0af  fix (awaelchli)
ce470c8  fix (awaelchli)
0a1c0f5  Update src/lightning_lite/CHANGELOG.md (awaelchli)
84184ab  fix type (awaelchli)
0c937a2  nit (awaelchli)
6a63051  update gpu parsing (awaelchli)
c37a4fe  chlog (awaelchli)
add70ef  skip (awaelchli)
5f829fa  [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
0544a7b  Update src/lightning_lite/cli.py (awaelchli)
2d79cb3  update test from code review (awaelchli)
d69523a  local import (awaelchli)
63f5931  [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
c00ff74  address review (awaelchli)
61d091e  Merge branch 'lite/launcher-poc' of github.com:Lightning-AI/lightning… (awaelchli)
20ee566  fix windows test import (awaelchli)

src/lightning_lite/cli.py (new file):
@@ -0,0 +1,168 @@
# Copyright The PyTorch Lightning team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
import os
from argparse import ArgumentParser, Namespace
from typing import List, Tuple

from lightning_lite.accelerators import CPUAccelerator, CUDAAccelerator, MPSAccelerator
from lightning_lite.utilities.imports import _IS_WINDOWS, _TORCH_GREATER_EQUAL_1_13

# torchrun in PyTorch 1.13.0 has a bug on the Windows platform and is thus not importable:
# https://github.com/pytorch/pytorch/issues/85427
if _IS_WINDOWS and _TORCH_GREATER_EQUAL_1_13:
    torchrun = None
else:
    import torch.distributed.run as torchrun

_log = logging.getLogger(__name__)

_SUPPORTED_ACCELERATORS = ("cpu", "gpu", "cuda", "mps", "tpu")
_SUPPORTED_STRATEGIES = (None, "ddp", "dp", "deepspeed")
_SUPPORTED_PRECISION = ("64", "32", "16", "bf16")


def _parse_args() -> Tuple[Namespace, List[str]]:
    parser = ArgumentParser(description="Launch your script with the Lightning Lite CLI.")
    parser.add_argument("script", type=str, help="Path to the Python script with Lightning Lite inside.")
    parser.add_argument(
        "--accelerator",
        type=str,
        default="cpu",
        choices=_SUPPORTED_ACCELERATORS,
        help="The hardware accelerator to run on.",
    )
    parser.add_argument(
        "--strategy",
        type=str,
        default=None,
        choices=_SUPPORTED_STRATEGIES,
        help="Strategy for how to run across multiple devices.",
    )
    parser.add_argument(
        "--devices",
        type=str,
        default="1",
        help=(
            "Number of devices to run on (``int``), which devices to run on (``list`` or ``str``), or ``'auto'``."
            " The value applies per node."
        ),
    )
    parser.add_argument(
        "--num-nodes",
        type=int,
        default=1,
        help="Number of machines (nodes) for distributed execution.",
    )
    parser.add_argument(
        "--node-rank",
        type=int,
        default=0,
        help=(
            "The index of the machine (node) this command gets started on. Must be a number in the range"
            " 0, ..., num_nodes - 1."
        ),
    )
    parser.add_argument(
        "--main-address",
        type=str,
        default="127.0.0.1",
        help="The hostname or IP address of the main machine (usually the one with node_rank = 0).",
    )
    parser.add_argument(
        "--main-port",
        type=int,
        default=29400,
        help="The main port to connect to the main machine.",
    )
    parser.add_argument(
        "--precision",
        type=str,
        default="32",
        choices=_SUPPORTED_PRECISION,
        help=(
            "Double precision (``64``), full precision (``32``), half precision (``16``) or bfloat16 precision"
            " (``'bf16'``)"
        ),
    )

    args, script_args = parser.parse_known_args()
    return args, script_args


def _set_env_variables(args: Namespace) -> None:
    """Set the environment variables for the new processes.

    The Lite connector will parse the arguments set here.
    """
    os.environ["LT_CLI_USED"] = "1"
    os.environ["LT_ACCELERATOR"] = str(args.accelerator)
    if args.strategy is not None:
        os.environ["LT_STRATEGY"] = str(args.strategy)
    os.environ["LT_DEVICES"] = str(args.devices)
    os.environ["LT_NUM_NODES"] = str(args.num_nodes)
    os.environ["LT_PRECISION"] = str(args.precision)


def _get_num_processes(accelerator: str, devices: str) -> int:
    """Parse the `devices` argument to determine how many processes need to be launched on the current machine."""
    if accelerator in ("cuda", "gpu"):
        parsed_devices = CUDAAccelerator.parse_devices(devices)
    elif accelerator in ("mps", "gpu"):
        parsed_devices = MPSAccelerator.parse_devices(devices)
    elif accelerator == "tpu":
        raise ValueError("Launching processes for TPU through the CLI is not supported.")
    else:
        return CPUAccelerator.parse_devices(devices)
    return len(parsed_devices) if parsed_devices is not None else 0


def _torchrun_launch(args: Namespace, script_args: List[str]) -> None:
    """This will invoke `torchrun` programmatically to launch the given script in new processes."""
    if _IS_WINDOWS and _TORCH_GREATER_EQUAL_1_13:
        # TODO: remove once import issue is resolved: https://github.com/pytorch/pytorch/issues/85427
        _log.error(
            "On the Windows platform, this launcher is currently only supported on torch < 1.13 due to a bug"
            " upstream: https://github.com/pytorch/pytorch/issues/85427"
        )
        exit(1)

    if args.strategy == "dp":
        num_processes = 1
    else:
        num_processes = _get_num_processes(args.accelerator, args.devices)

    torchrun_args = []
    torchrun_args.extend(["--nproc_per_node", str(num_processes)])
    torchrun_args.extend(["--nnodes", str(args.num_nodes)])
    torchrun_args.extend(["--node_rank", str(args.node_rank)])
    torchrun_args.extend(["--master_addr", args.main_address])
    torchrun_args.extend(["--master_port", str(args.main_port)])
    torchrun_args.append(args.script)
    torchrun_args.extend(script_args)

    # set a good default number of threads for OMP to avoid warnings being emitted to the user
    os.environ.setdefault("OMP_NUM_THREADS", str(max(1, (os.cpu_count() or 1) // num_processes)))

    torchrun.main(torchrun_args)


def main() -> None:
    args, script_args = _parse_args()
    _set_env_variables(args)
    _torchrun_launch(args, script_args)


if __name__ == "__main__":
    main()
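
For reference, here is a minimal sketch of how the argument parsing behaves. It is not part of the diff above and assumes the module is importable as lightning_lite.cli; the script name train.py and the flag --my_arg are made-up placeholders. Anything parse_known_args does not recognize is returned separately and later forwarded to the user script.

import sys

from lightning_lite.cli import _parse_args  # assumed module path for the file above

sys.argv = [
    "lightning_lite.cli",
    "train.py",            # hypothetical user script
    "--accelerator=cuda",
    "--devices=2",
    "--precision=16",
    "--my_arg=1",          # unknown to the launcher, so it is passed through
]
args, script_args = _parse_args()
assert args.accelerator == "cuda" and args.devices == "2" and args.precision == "16"
assert script_args == ["--my_arg=1"]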
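_set_env_variables defines the handoff between the CLI and the Lite connector purely through LT_* environment variables. The connector side is not shown in this file; the snippet below is only a sketch of what reading those variables back could look like, under the assumption that the same keys are consumed unchanged.

import os

# Hypothetical consumer-side read of the variables set by _set_env_variables.
if os.environ.get("LT_CLI_USED") == "1":
    accelerator = os.environ.get("LT_ACCELERATOR", "cpu")
    strategy = os.environ.get("LT_STRATEGY")              # unset unless --strategy was given
    devices = os.environ.get("LT_DEVICES", "1")           # still a string, e.g. "2" or "0,1"
    num_nodes = int(os.environ.get("LT_NUM_NODES", "1"))
    precision = os.environ.get("LT_PRECISION", "32")      # "64", "32", "16" or "bf16"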
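How --devices becomes a local process count is delegated to the accelerators' parse_devices helpers. The expectations below are illustrative only and assume the usual semantics (an integer count or a comma-separated index list); the actual return values come from CUDAAccelerator, MPSAccelerator and CPUAccelerator, which are not redefined here.

from lightning_lite.cli import _get_num_processes  # assumed module path

# Illustrative expectations on a machine with at least two CUDA devices:
#   _get_num_processes("cuda", "2")   -> 2  (one process per device)
#   _get_num_processes("gpu", "0,1")  -> 2  (one process per listed device index)
#   _get_num_processes("cpu", "3")    -> 3  (CPUAccelerator returns the count directly)
#   _get_num_processes("tpu", "8")    -> raises ValueError (TPU launching is not supported here)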
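_torchrun_launch never shells out; it calls torchrun.main() with an argument list it assembles itself. For the hypothetical invocation from the first sketch (two CUDA devices, a single node, default address and port), that list is equivalent to the torchrun command shown in the comments below. The OMP_NUM_THREADS default simply splits the visible CPU cores across the launched processes unless the user already set it.

# Arguments assembled by _torchrun_launch for accelerator="cuda", devices="2",
# num_nodes=1, node_rank=0 and the default main address/port:
torchrun_args = [
    "--nproc_per_node", "2",
    "--nnodes", "1",
    "--node_rank", "0",
    "--master_addr", "127.0.0.1",
    "--master_port", "29400",
    "train.py",       # hypothetical user script
    "--my_arg=1",     # forwarded script args
]
# torchrun.main(torchrun_args) then behaves like running:
#   torchrun --nproc_per_node 2 --nnodes 1 --node_rank 0 \
#       --master_addr 127.0.0.1 --master_port 29400 train.py --my_arg=1
#
# OMP_NUM_THREADS example: on a 16-core machine with 2 processes,
# max(1, 16 // 2) == 8 threads per process.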