Docker tests consistently flaky and failing #10060

Closed
daniellepintz opened this issue Oct 20, 2021 · 6 comments
Labels: ci (Continuous Integration), won't fix (This will not be worked on)

Comments


daniellepintz commented Oct 20, 2021

The Docker jobs on our CI are very flaky and fail often, making our CI red :/
Some recent examples:
https://github.com/PyTorchLightning/pytorch-lightning/runs/3956925878
https://github.com/PyTorchLightning/pytorch-lightning/runs/3956925126

I have spent a good amount of time trying to debug them (e.g. #9676), but it is difficult since the failures come and go.

As part of #9445, we want to work towards a state where all of our CI jobs are required, which means we should eventually either make these Docker tests required or delete them.

Personally, I believe we should delete them: they seem to be extremely flaky, and we spend a lot of time trying to fix them that could be better spent elsewhere. However, I don't know much about the history of these tests, so I'd like to hear others' opinions on this matter. What do people think is the best course of action here?

cc @carmocca @akihironitta @Borda

carmocca added the ci label Oct 22, 2021

carmocca commented Oct 22, 2021

I'll summarize my knowledge about this, as I've also banged my head against it a lot.

Our Conda and Azure jobs run from a dockerhub image:

https://github.com/PyTorchLightning/pytorch-lightning/blob/c9bc10ce8473a2249ffa4e00972c0c3c1d2641c4/.azure-pipelines/gpu-tests.yml#L32
https://github.com/PyTorchLightning/pytorch-lightning/blob/c9bc10ce8473a2249ffa4e00972c0c3c1d2641c4/.github/workflows/ci_test-conda.yml#L13
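
For context, a job that consumes one of these images looks roughly like this (a minimal sketch; the image tag, runner, and test command are placeholders, not the exact values pinned in the files above):

```yaml
# Minimal sketch of a CI job that runs inside a pre-built dockerhub image.
# The image tag and runner are placeholders, not the exact values pinned
# in the workflow files linked above.
name: conda-tests
on: [push, pull_request]

jobs:
  conda:
    runs-on: ubuntu-20.04
    container: pytorchlightning/pytorch_lightning:base-conda-py3.8-torch1.9
    steps:
      - uses: actions/checkout@v2
      - run: python -m pytest tests
```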

There is a scheduled workflow (.github/workflows/events-nightly.yml) that tries to build the latest docker images and push them to dockerhub.
https://github.com/PyTorchLightning/pytorch-lightning/blob/c9bc10ce8473a2249ffa4e00972c0c3c1d2641c4/.github/workflows/events-nightly.yml#L7-L8
As you mentioned, these often fail for a variety of reasons, and there's little visibility into the failures since they are nightly jobs that run automatically.
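
The shape of that workflow is roughly the following (a sketch only, with illustrative Dockerfile path, tag, and secret names; the real definition is in events-nightly.yml):

```yaml
# Rough sketch of the scheduled build-and-push job. The Dockerfile path,
# tag, and secret names are illustrative, not the repo's actual values.
name: nightly-docker
on:
  schedule:
    - cron: "0 0 * * *"  # once per day

jobs:
  build-and-push:
    runs-on: ubuntu-20.04
    steps:
      - uses: actions/checkout@v2
      - uses: docker/setup-buildx-action@v1
      - uses: docker/login-action@v1
        with:
          username: ${{ secrets.DOCKER_USERNAME }}
          password: ${{ secrets.DOCKER_PASSWORD }}
      - uses: docker/build-push-action@v2
        with:
          context: .
          file: dockers/base-conda/Dockerfile
          push: true  # this is the only place images get published
          tags: pytorchlightning/pytorch_lightning:base-conda-py3.8-torch1.9
```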

Additionally, when you touch a file in the requirements or dockers directories, the PR will also try to create the docker images again.
https://github.com/PyTorchLightning/pytorch-lightning/blob/c9bc10ce8473a2249ffa4e00972c0c3c1d2641c4/.github/workflows/ci_dockers.yml#L10-L18
Note that even if these docker builds succeed, they do not get pushed to dockerhub
https://github.com/PyTorchLightning/pytorch-lightning/blob/c9bc10ce8473a2249ffa4e00972c0c3c1d2641c4/.github/workflows/ci_dockers.yml#L40
until the PR is merged to master and the nightly job runs.
https://github.com/PyTorchLightning/pytorch-lightning/blob/c9bc10ce8473a2249ffa4e00972c0c3c1d2641c4/.github/workflows/events-nightly.yml#L10-L11
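
That PR-time check looks roughly like this (again a sketch with illustrative paths and tags; the important bit is that push is disabled):

```yaml
# Rough sketch of the PR-time docker check. Paths and tags are
# illustrative; because push is false, nothing built in a PR ever
# reaches dockerhub.
name: ci-dockers
on:
  pull_request:
    paths:
      - "dockers/**"
      - "requirements/**"

jobs:
  build-only:
    runs-on: ubuntu-20.04
    steps:
      - uses: actions/checkout@v2
      - uses: docker/setup-buildx-action@v1
      - uses: docker/build-push-action@v2
        with:
          context: .
          file: dockers/base-conda/Dockerfile
          push: false  # built only to validate the Dockerfile
```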

Since the Conda and Azure jobs rely on these images being available in dockerhub, if your CI or test changes need the new image you just tried to generate (for example, if you added a new docker image for a new test job), you find yourself in a deadlock:

  • The PR needs dockerhub to be updated to successfully run the tests.
  • The tests need to pass for the PR to be merged and published to dockerhub (on the night after the merge).

There's a trick to break this deadlock by manually forcing your PR to run the nightly routine: set

```yaml
on:
  push: {}
```

in .github/workflows/events-nightly.yml, and then remove the change once the workflow has finished successfully.

This is dangerous because it can impact the current master build, which will now install the new image that you force-published.

To recap, the main issues are:

  1. There is little visibility into failures of the docker publishing job.
  2. The mechanism is prone to deadlocks, and the workaround is manual and dangerous.
  3. Images are stale until the end of the day, when the nightly job runs.
  4. The number of docker images can be overwhelming.

If you have ideas on how to improve and simplify this, please share them!

cc @Borda

daniellepintz commented

Thanks for this very helpful summary @carmocca! Question: why do we need the images to be available on dockerhub (besides the Conda and Azure jobs relying on them)? Do we know how many people are using these docker images?

carmocca commented

The list of images is here: https://hub.docker.com/r/pytorchlightning/pytorch_lightning/tags

I don't think we keep any sort of backward compatibility. We just mold them to what our CI needs.

> Why do we need the images to be available on dockerhub

The only other reason I can think of is ease of debugging when something fails in CI but not locally.

> Do we know how many people are using these docker images?

I don't.

stale bot added the won't fix label Nov 25, 2021
Lightning-AI deleted a comment from stale bot Nov 26, 2021
stale bot removed the won't fix label Nov 26, 2021

stale bot commented Dec 27, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

stale bot added the won't fix label Dec 27, 2021
daniellepintz removed the won't fix label Dec 27, 2021

stale bot commented Feb 18, 2022

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

stale bot added the won't fix label Feb 18, 2022
daniellepintz commented

It seems this issue has been miraculously fixed!
