Docker tests consistently flaky and failing #10060

Closed
daniellepintz opened this issue Oct 20, 2021 · 6 comments
Labels: ci (Continuous Integration), won't fix (This will not be worked on)

Comments


daniellepintz commented Oct 20, 2021

The Docker jobs on our CI are very flaky and fail often, making our CI red :/
Some recent examples:
https://github.com/PyTorchLightning/pytorch-lightning/runs/3956925878
https://github.com/PyTorchLightning/pytorch-lightning/runs/3956925126

I have spent a good amount of time trying to debug them (e.g. #9676), but it is difficult since the failures come and go.

As part of #9445, we want to work towards a state where all of our CI jobs are required, which means we should eventually either make these Docker tests required or delete them.

Personally, I believe we should delete them: they seem to be extremely flaky, and we spend a lot of time trying to fix them that could be better spent elsewhere. However, I don't know much about the history of these tests, so I'd like to hear others' opinions on this matter. What do people think is the best course of action here?

cc @carmocca @akihironitta @Borda

carmocca added the ci label Oct 22, 2021

carmocca commented Oct 22, 2021

I'll summarize my knowledge about this, as I've also banged my head against it a lot.

Our Conda and Azure jobs run from a dockerhub image:

https://github.com/PyTorchLightning/pytorch-lightning/blob/c9bc10ce8473a2249ffa4e00972c0c3c1d2641c4/.azure-pipelines/gpu-tests.yml#L32
https://github.com/PyTorchLightning/pytorch-lightning/blob/c9bc10ce8473a2249ffa4e00972c0c3c1d2641c4/.github/workflows/ci_test-conda.yml#L13
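
For context, a job that consumes one of these images looks roughly like this (a minimal sketch; the image tag, runner, and test command are placeholders, not the exact values pinned in the files above):

```yaml
# Minimal sketch of a CI job that runs inside a pre-built dockerhub image.
# The image tag and runner are placeholders, not the exact values pinned
# in the workflow files linked above.
name: conda-tests
on: [push, pull_request]

jobs:
  conda:
    runs-on: ubuntu-20.04
    container: pytorchlightning/pytorch_lightning:base-conda-py3.8-torch1.9
    steps:
      - uses: actions/checkout@v2
      - run: python -m pytest tests
```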

There is a scheduled workflow (.github/workflows/events-nightly.yml) that tries to build the latest docker images and push them to dockerhub.
https://github.com/PyTorchLightning/pytorch-lightning/blob/c9bc10ce8473a2249ffa4e00972c0c3c1d2641c4/.github/workflows/events-nightly.yml#L7-L8
As you mentioned, these often fail for a variety of reasons, and there's little visibility into the failures since they are nightly jobs that run automatically.
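
The shape of that workflow is roughly the following (a sketch only, with illustrative Dockerfile path, tag, and secret names; the real definition is in events-nightly.yml):

```yaml
# Rough sketch of the scheduled build-and-push job. The Dockerfile path,
# tag, and secret names are illustrative, not the repo's actual values.
name: nightly-docker
on:
  schedule:
    - cron: "0 0 * * *"  # once per day

jobs:
  build-and-push:
    runs-on: ubuntu-20.04
    steps:
      - uses: actions/checkout@v2
      - uses: docker/setup-buildx-action@v1
      - uses: docker/login-action@v1
        with:
          username: ${{ secrets.DOCKER_USERNAME }}
          password: ${{ secrets.DOCKER_PASSWORD }}
      - uses: docker/build-push-action@v2
        with:
          context: .
          file: dockers/base-conda/Dockerfile
          push: true  # this is the only place images get published
          tags: pytorchlightning/pytorch_lightning:base-conda-py3.8-torch1.9
```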

Additionally, when you touch a file in the requirements or dockers directories, the PR will also try to create the docker images again.
https://github.com/PyTorchLightning/pytorch-lightning/blob/c9bc10ce8473a2249ffa4e00972c0c3c1d2641c4/.github/workflows/ci_dockers.yml#L10-L18
Note that even if these docker builds succeed, they do not get pushed to dockerhub
https://github.com/PyTorchLightning/pytorch-lightning/blob/c9bc10ce8473a2249ffa4e00972c0c3c1d2641c4/.github/workflows/ci_dockers.yml#L40
until the PR is merged to master and the nightly job runs.
https://github.com/PyTorchLightning/pytorch-lightning/blob/c9bc10ce8473a2249ffa4e00972c0c3c1d2641c4/.github/workflows/events-nightly.yml#L10-L11
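
That PR-time check looks roughly like this (again a sketch with illustrative paths and tags; the important bit is that push is disabled):

```yaml
# Rough sketch of the PR-time docker check. Paths and tags are
# illustrative; because push is false, nothing built in a PR ever
# reaches dockerhub.
name: ci-dockers
on:
  pull_request:
    paths:
      - "dockers/**"
      - "requirements/**"

jobs:
  build-only:
    runs-on: ubuntu-20.04
    steps:
      - uses: actions/checkout@v2
      - uses: docker/setup-buildx-action@v1
      - uses: docker/build-push-action@v2
        with:
          context: .
          file: dockers/base-conda/Dockerfile
          push: false  # built only to validate the Dockerfile
```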

Since the Conda and Azure jobs rely on these images being available in dockerhub, if your CI or test changes need the new image you just tried to generate (for example, if you added a new docker image for a new test job), you find yourself in a deadlock:

  • The PR needs dockerhub to be updated to successfully run the tests.
  • The tests need to pass for the PR to be merged and published to dockerhub (on the night after the merge).

There's a trick to break this deadlock by manually forcing your PR to run the nightly routine: set

```yaml
on:
  push: {}
```

in .github/workflows/events-nightly.yml, and then remove the change once the workflow has finished successfully.

This is dangerous because it can impact the current master build, which will now install the new image that you force-published.

To recap, the main issues are:

  1. There is little visibility into failures of the docker publishing job.
  2. The mechanism is prone to deadlocks, and the workaround is manual and dangerous.
  3. Images are stale until the end of the day, when the nightly job runs.
  4. The number of docker images can be overwhelming.

If you have ideas on how to improve and simplify this, please share them!

cc @Borda

daniellepintz commented

Thanks for this very helpful summary @carmocca! Question: why do we need the images to be available on dockerhub (besides the Conda and Azure jobs relying on them)? Do we know how many people are using these docker images?

carmocca commented

The list of images is here: https://hub.docker.com/r/pytorchlightning/pytorch_lightning/tags

I don't think we keep any sort of backward compatibility. We just mold them to what our CI needs.

> Why do we need the images to be available on dockerhub

The only other reason I can think of is ease of debugging when something fails in CI but not locally.

> Do we know how many people are using these docker images?

I don't.

stale bot added the won't fix label Nov 25, 2021
Lightning-AI deleted a comment from stale bot Nov 26, 2021
stale bot removed the won't fix label Nov 26, 2021

stale bot commented Dec 27, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

stale bot added the won't fix label Dec 27, 2021
daniellepintz removed the won't fix label Dec 27, 2021

stale bot commented Feb 18, 2022

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

stale bot added the won't fix label Feb 18, 2022
daniellepintz commented

It seems this issue has been miraculously fixed!
