CI: Use self-hosted Azure GPU runners #14632

Merged
merged 68 commits into from
Oct 5, 2022
Changes from 67 commits
Commits
69df775
move config
Borda Sep 9, 2022
ab7ef13
dir
Borda Sep 9, 2022
cb14d43
update
Borda Sep 13, 2022
f2276bc
rev
Borda Sep 13, 2022
44ec3fa
all
Borda Sep 15, 2022
ff819f5
Empty-Commit
Borda Sep 15, 2022
b070450
Empty-Commit
Borda Sep 15, 2022
c2218df
Merge branch 'master' into ci/azure-runner
Borda Sep 15, 2022
41b8181
export
Borda Sep 16, 2022
ae53c9f
devices
Borda Sep 16, 2022
de871ab
env
Borda Sep 16, 2022
f9a3619
env
Borda Sep 16, 2022
77b1dde
env
Borda Sep 16, 2022
519de03
env
Borda Sep 16, 2022
0690053
hard
Borda Sep 16, 2022
7c00074
0,1,2,3,4,5,6,7
Borda Sep 16, 2022
dec98bc
all
Borda Sep 16, 2022
9865044
other
Borda Sep 16, 2022
20f9e79
var
Borda Sep 16, 2022
989a69d
var
Borda Sep 16, 2022
6ed7c2e
var
Borda Sep 16, 2022
08882a9
echo
Borda Sep 16, 2022
73f9a0b
[]
Borda Sep 16, 2022
045d79f
{}
Borda Sep 16, 2022
4c59f32
()
Borda Sep 16, 2022
60039a1
var_val
Borda Sep 16, 2022
f696493
var_val
Borda Sep 16, 2022
dc76d49
var_val
Borda Sep 16, 2022
662fcdd
[]
Borda Sep 16, 2022
41c81d6
[]
Borda Sep 16, 2022
f7e89c5
var
Borda Sep 16, 2022
30199cd
var_val
Borda Sep 16, 2022
9fd24a8
val
Borda Sep 16, 2022
130fa24
()
Borda Sep 16, 2022
daa0982
env
Borda Sep 16, 2022
c26d3ef
env
Borda Sep 16, 2022
3179036
env
Borda Sep 16, 2022
e165df8
env
Borda Sep 16, 2022
92b19cd
env
Borda Sep 16, 2022
4084256
env
Borda Sep 16, 2022
aa0e79c
readme
Borda Sep 16, 2022
7d54bf8
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 16, 2022
b24dd96
Merge branch 'master' into ci/azure-runner
Borda Sep 16, 2022
decb292
Fix test blocklist. Skip nvprof test on CUDA>8.0
carmocca Sep 18, 2022
bbdf63f
Fix nvprof skip
carmocca Sep 18, 2022
97ea70e
REVERT ME - DEBUGGING
carmocca Sep 18, 2022
c722e68
DEBUG - Is it Bagua?
carmocca Sep 19, 2022
cc428d3
skip bagua
Borda Sep 19, 2022
7eab43f
Add print when trap triggers
carmocca Sep 19, 2022
a11efba
Skip Bagua installation
carmocca Sep 19, 2022
1ea606b
Merge branch 'ci/azure-runner' of https://github.com/PyTorchLightning…
Borda Sep 19, 2022
d81d422
skip bagua
Borda Sep 19, 2022
68c4316
Merge branch 'master' into ci/azure-runner
Borda Sep 24, 2022
09328dc
runif
Borda Sep 24, 2022
313a5b7
dockers
Borda Sep 24, 2022
073ff8e
Apply suggestions from code review
carmocca Sep 27, 2022
8adf5db
Skip Bagua async test
carmocca Sep 27, 2022
b1e3d0e
Fix installation
carmocca Sep 27, 2022
98aef0f
DEBUG - skip to standalone
carmocca Sep 27, 2022
905dc62
Revert "DEBUG - skip to standalone"
carmocca Sep 27, 2022
0df21cb
Undo change
carmocca Sep 27, 2022
b0c04e6
Fix env var
carmocca Sep 27, 2022
be7ec9e
Merge branch 'master' into ci/azure-runner
akihironitta Sep 29, 2022
4ad1014
Merge branch 'master' into ci/azure-runner
carmocca Sep 29, 2022
078b2a3
Merge branch 'master' into ci/azure-runner
otaj Sep 30, 2022
f1eed82
Merge branch 'master' into ci/azure-runner
Borda Oct 5, 2022
0445742
args
Borda Oct 5, 2022
08abf66
Apply suggestions from code review
carmocca Oct 5, 2022
46 changes: 46 additions & 0 deletions .azure/README.md
@@ -0,0 +1,46 @@
# Creating a GPU self-hosted agent pool

## Prepare the machine

This is a slightly modified version of the script from
https://docs.microsoft.com/en-us/azure/devops/pipelines/agents/docker

```bash
apt-get update
apt-get install -y --no-install-recommends \
    ca-certificates \
    curl \
    jq \
    git \
    iputils-ping \
    libcurl4 \
    libunwind8 \
    netcat \
    libssl1.0

curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
mkdir /azp
```

## Starting the agents

```bash
export TARGETARCH=linux-x64
export AZP_URL="https://dev.azure.com/Lightning-AI"
export AZP_TOKEN="xxxxxxxxxxxxxxxxxxxxxxxxxx"
export AZP_POOL="lit-rtx-3090"

for i in {0..7..2}
do
    nohup bash .azure/start.sh \
        "AZP_AGENT_NAME=litGPU-YX_$i,$((i+1))" \
        "CUDA_VISIBLE_DEVICES=$i,$((i+1))" \
    > "agent-$i.log" &
done
```
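
The loop above starts four agents, each pinned to one pair of GPUs, and encodes that pair in the agent name after the last underscore. The pipelines later recover the pair with `"$(Agent.Name)".split("_")[-1]`. A minimal Python sketch of that naming round trip, assuming an 8-GPU machine; the helper names `agent_names` and `devices_from_agent_name` are illustrative and do not exist in the repository:

```python
def agent_names(num_gpus=8, per_agent=2, prefix="litGPU-YX"):
    """Mirror the bash loop `for i in {0..7..2}`: one agent name per GPU pair."""
    names = []
    for first in range(0, num_gpus, per_agent):
        # e.g. first=2, per_agent=2 -> "2,3"
        devices = ",".join(str(first + k) for k in range(per_agent))
        names.append(f"{prefix}_{devices}")
    return names


def devices_from_agent_name(name):
    """Same expression the pipeline applies to Agent.Name."""
    return name.split("_")[-1]


# Print each agent name next to the device list recovered from it.
for name in agent_names():
    print(name, "->", devices_from_agent_name(name))
```

Because only the text after the last underscore is used, the prefix may itself contain underscores, but the device list must not.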

## Check running agents

```bash
ps aux | grep start.sh
```
15 changes: 12 additions & 3 deletions .azure/gpu-tests-lite.yml
@@ -41,12 +41,16 @@ jobs:
  timeoutInMinutes: "20"
  # how much time to give 'run always even if cancelled tasks' before stopping them
  cancelTimeoutInMinutes: "2"
- pool: azure-jirka-spot
+ pool: lit-rtx-3090
+ variables:
+ DEVICES: $( python -c 'print("$(Agent.Name)".split("_")[-1])' )
  container:
  image: "pytorchlightning/pytorch_lightning:base-cuda-py3.9-torch1.12-cuda11.6.1"
  # default shm size is 64m. Increase it to avoid:
  # 'Error while creating shared memory: unhandled system error, NCCL version 2.7.8'
- options: "--runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all --shm-size=512m"
+ # argument `--runtime=nvidia` was deprecated in favor of `--gpus=all`
+ # see: https://gitlab.com/gitlab-org/gitlab-runner/-/issues/4585
+ options: "--gpus=all --shm-size=2gb"
  workspace:
  clean: all

@@ -61,6 +65,10 @@ jobs:
  pip list
  displayName: 'Image info & NVIDIA'

+ - bash: |
+ echo "##vso[task.setvariable variable=CUDA_VISIBLE_DEVICES]$(DEVICES)"
+ displayName: 'set visible devices'
+
  - bash: |
  set -e
  PYTORCH_VERSION=$(python -c "import torch; print(torch.__version__.split('+')[0])")
Expand All @@ -78,8 +86,9 @@ jobs:

- bash: |
set -e
echo $CUDA_VISIBLE_DEVICES
python requirements/collect_env_details.py
python -c "import torch ; mgpu = torch.cuda.device_count() ; assert mgpu >= 2, f'GPU: {mgpu}'"
python -c "import torch ; mgpu = torch.cuda.device_count() ; assert mgpu == 2, f'GPU: {mgpu}'"
displayName: 'Env details'

- bash: python -m coverage run --source lightning_lite -m pytest --ignore benchmarks -v --junitxml=$(Build.StagingDirectory)/test-results.xml --durations=50
Expand Down
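
The tightened `mgpu == 2` assertion relies on each agent exposing exactly the GPU pair named in `CUDA_VISIBLE_DEVICES`. A rough Python sketch of the same check without importing torch follows. The helpers `visible_devices` and `check_gpu_pair` are hypothetical, and they only parse the environment variable rather than querying the driver as `torch.cuda.device_count()` does:

```python
import os


def visible_devices(env=None):
    """Return the device ids listed in CUDA_VISIBLE_DEVICES.

    Assumption: an unset or empty variable yields an empty list here, whereas
    CUDA itself treats an unset variable as "all GPUs visible".
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    return [part.strip() for part in raw.split(",") if part.strip()]


def check_gpu_pair(env=None, expected=2):
    """Fail unless exactly `expected` devices are listed, like `assert mgpu == 2`."""
    devices = visible_devices(env)
    if len(devices) != expected:
        raise AssertionError(f"GPU: {len(devices)}")
    return devices
```

An exact count catches an agent that accidentally sees all eight GPUs, which the earlier `>= 2` check would have let through.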
17 changes: 13 additions & 4 deletions .azure/gpu-tests.yml
@@ -67,12 +67,16 @@ jobs:
  timeoutInMinutes: "80"
  # how much time to give 'run always even if cancelled tasks' before stopping them
  cancelTimeoutInMinutes: "2"
- pool: azure-jirka-spot
+ pool: lit-rtx-3090
+ variables:
+ DEVICES: $( python -c 'print("$(Agent.Name)".split("_")[-1])' )
  container:
  image: $(image)
  # default shm size is 64m. Increase it to avoid:
  # 'Error while creating shared memory: unhandled system error, NCCL version 2.7.8'
- options: "--runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all --shm-size=512m"
+ # argument `--runtime=nvidia` was deprecated in favor of `--gpus=all`
+ # see: https://gitlab.com/gitlab-org/gitlab-runner/-/issues/4585
+ options: "--gpus=all --shm-size=2gb"
  workspace:
  clean: all

@@ -87,6 +91,10 @@ jobs:
  pip list
  displayName: 'Image info & NVIDIA'

+ - bash: |
+ echo "##vso[task.setvariable variable=CUDA_VISIBLE_DEVICES]$(DEVICES)"
+ displayName: 'set visible devices'
+
  - bash: |
  set -e
  python -c "fname = 'requirements/pytorch/strategies.txt' ; lines = [line for line in open(fname).readlines() if 'horovod' not in line] ; open(fname, 'w').writelines(lines)"
@@ -112,16 +120,17 @@ jobs:

  - bash: |
  set -e
+ echo $CUDA_VISIBLE_DEVICES
  python requirements/collect_env_details.py
- python -c "import torch ; mgpu = torch.cuda.device_count() ; assert mgpu >= 2, f'GPU: {mgpu}'"
+ python -c "import torch ; mgpu = torch.cuda.device_count() ; assert mgpu == 2, f'GPU: {mgpu}'"
  python requirements/pytorch/check-avail-strategies.py
  python requirements/pytorch/check-avail-extras.py
  displayName: 'Env details'

  - bash: bash .actions/pull_legacy_checkpoints.sh
  displayName: 'Get legacy checkpoints'

- - bash: python -m coverage run --source pytorch_lightning -m pytest
+ - bash: python -m coverage run --source pytorch_lightning -m pytest .
  workingDirectory: src/pytorch_lightning
  displayName: 'Testing: PyTorch doctests'

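
The 'set visible devices' step works by printing an Azure Pipelines logging command: a line of the form `##vso[task.setvariable variable=NAME]VALUE` on standard output asks the agent to define `NAME` for the following steps of the job. A small Python sketch of emitting the same line; the function names are illustrative only and are not part of this PR:

```python
def format_setvariable(name, value):
    """Build the logging-command line that the pipeline step echoes."""
    return f"##vso[task.setvariable variable={name}]{value}"


def set_pipeline_variable(name, value):
    """Print the command so the agent can pick it up from the step's stdout."""
    print(format_setvariable(name, value), flush=True)


# Equivalent of the bash step for an agent named "litGPU-YX_0,1".
set_pipeline_variable("CUDA_VISIBLE_DEVICES", "litGPU-YX_0,1".split("_")[-1])
```

The variable only takes effect in later steps, which is presumably why the PR sets it in its own step before the environment checks and the tests run.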
17 changes: 13 additions & 4 deletions dockers/ci-runner-hpu/start.sh → .azure/start.sh
@@ -5,6 +5,15 @@

  set -e

+ # export all args as env variables
+ for var in "$@"
+ do
+   echo "$var"
+   eval "export $var"
+ done
+
+ printenv
+
  if [ -z "$AZP_URL" ]; then
    echo 1>&2 "error: missing AZP_URL environment variable"
    exit 1
@@ -26,9 +35,9 @@ if [ -n "$AZP_WORK" ]; then
    mkdir -p "$AZP_WORK"
  fi

- rm -rf /azp/agent
- mkdir /azp/agent
- cd /azp/agent
+ rm -rf /azp/agent-$AZP_AGENT_NAME
+ mkdir /azp/agent-$AZP_AGENT_NAME
+ cd /azp/agent-$AZP_AGENT_NAME

  export AGENT_ALLOW_RUNASROOT="1"

@@ -74,7 +83,7 @@ curl -LsS $AZP_AGENTPACKAGE_URL | tar -xz & wait $!

  source ./env.sh

- print_header "3. Configuring Azure Pipelines agent..."
+ print_header "3. Configuring Azure Pipelines agent $AZP_AGENT_NAME..."

  ./config.sh --unattended \
    --agent "${AZP_AGENT_NAME:-$(hostname)}" \
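
The loop added at the top of `start.sh` turns every `NAME=value` argument into an exported variable through `eval`, which is how the README passes `AZP_AGENT_NAME` and `CUDA_VISIBLE_DEVICES` to each agent. For illustration only, here is a Python sketch of the same idea that parses the arguments instead of evaluating them. `parse_env_args` is a hypothetical helper, not what the script does, and unlike `eval` it performs no shell expansion on the value:

```python
import re

# A shell-style variable name, then "=", then the raw value (may be empty).
_ASSIGNMENT = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*)=(.*)$", re.DOTALL)


def parse_env_args(args):
    """Turn ["NAME=value", ...] into {"NAME": "value", ...} without eval."""
    parsed = {}
    for arg in args:
        match = _ASSIGNMENT.match(arg)
        if match is None:
            raise ValueError(f"expected NAME=value, got {arg!r}")
        parsed[match.group(1)] = match.group(2)
    return parsed


# The two arguments the README loop passes for the first agent.
print(parse_env_args(["AZP_AGENT_NAME=litGPU-YX_0,1", "CUDA_VISIBLE_DEVICES=0,1"]))
```

A caller could then apply the result with `os.environ.update(...)`. Rejecting anything that is not a plain assignment avoids executing arbitrary text, at the cost of the flexibility `eval` gives the shell version.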
2 changes: 1 addition & 1 deletion dockers/ci-runner-hpu/Dockerfile
@@ -59,7 +59,7 @@ RUN pip uninstall pytorch-lightning -y

  WORKDIR /azp

- COPY ./dockers/ci-runner-hpu/start.sh /usr/local/bin/
+ COPY ./.azure/start.sh /usr/local/bin/
  RUN chmod +x /usr/local/bin/start.sh

  ENTRYPOINT ["/usr/local/bin/start.sh"]
2 changes: 1 addition & 1 deletion dockers/ci-runner-ipu/Dockerfile
@@ -23,7 +23,7 @@ RUN echo "ALL ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers

  WORKDIR /azp

- COPY ./dockers/ci-runner-ipu/start.sh /usr/local/bin/
+ COPY ./.azure/start.sh /usr/local/bin/

  RUN curl -o /usr/local/bin/installdependencies.sh \
    "https://raw.githubusercontent.com/microsoft/azure-pipelines-agent/d2acd5f77c6b3914cdb6ed0e5fbea672929c7da9/src/Misc/layoutbin/installdependencies.sh" && \
96 changes: 0 additions & 96 deletions dockers/ci-runner-ipu/start.sh

This file was deleted.