Docker tests consistently flaky and failing #10060
I'll summarize my knowledge about this, as I've also banged my head against it a lot.

Our Conda and Azure jobs run from a dockerhub image: https://github.com/PyTorchLightning/pytorch-lightning/blob/c9bc10ce8473a2249ffa4e00972c0c3c1d2641c4/.azure-pipelines/gpu-tests.yml#L32

There is a scheduled job (the nightly routine mentioned below) that builds these images and publishes them to dockerhub. Additionally, when you touch a file in the requirements or dockers directories, the PR will also try to build the docker images again. Since the Conda and Azure jobs rely on these images already being available on dockerhub, if your CI or test changes need the new image you just tried to generate (for example, if you added a new docker image for a new test job), you find yourself in a deadlock: the PR's jobs need an image that is only published once the changes reach master, but the changes cannot be merged while those jobs are failing.
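To make the dependency concrete, here is a minimal sketch of what the consuming side of this setup looks like, assuming standard Azure Pipelines container syntax; the pool name, image tag, docker options, and test command below are placeholders, not the actual values from gpu-tests.yml:

```yaml
# Rough shape of an Azure Pipelines job that runs inside a dockerhub image.
# All concrete names/tags here are illustrative assumptions.
jobs:
  - job: gpu_tests
    pool: gpu-pool                   # placeholder agent pool name
    # The whole job runs inside a container pulled from dockerhub.
    # If the referenced tag has not been published yet, the job cannot run.
    container:
      image: "pytorchlightning/pytorch_lightning:<base-cuda-placeholder-tag>"
      options: "--gpus=all"          # illustrative docker run options
    steps:
      - bash: python -m pytest tests/ -v
        displayName: "Run tests"
```

The important point is that the `image:` reference is resolved against dockerhub at job start, so the image must already exist there before the PR's CI runs.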
There's a trick to break this deadlock by manually forcing your PR to run the nightly routine: it's basically setting `on: { push: {} }` in the docker-build workflow file. This is dangerous because it can impact the current master build, which will now install the new image that you forcefully published.

To recap, the main issues are:
- The Conda and Azure jobs can only use images that have already been published to dockerhub, and new images are normally published only by the nightly routine.
- A PR that needs a new image therefore cannot pass CI on its own, and the workaround of force-publishing the image from the PR can break the current master build.
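As a sketch of the trigger mechanics being described, assuming the docker builds are driven by a GitHub Actions workflow (the file name, cron expression, and path filters are illustrative, not the repository's actual configuration):

```yaml
# Illustrative trigger block for a docker-build workflow
# (e.g. .github/workflows/docker-builds.yml -- name assumed).
on:
  # The routine that builds the images and pushes them to dockerhub,
  # run on a schedule (here: nightly at midnight UTC).
  schedule:
    - cron: "0 0 * * *"
  # PRs touching these paths rebuild the images for validation;
  # the publish step is assumed to be skipped for PR builds.
  pull_request:
    paths:
      - "requirements/**"
      - "dockers/**"
  # The deadlock-breaking trick: uncommenting this makes the workflow
  # (including the publish step) run on every push, which is how the new
  # image ends up on dockerhub before the PR is merged -- and why it can
  # break the current master build.
  # push: {}
```

In practice the `push: {}` line also has to be reverted before merging, which is part of what makes the workaround error-prone.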
If you have ideas on how to improve and simplify this, please share them! cc @Borda
Thanks for this very helpful summary @carmocca! Question: why do we need the images to be available on dockerhub (besides the Conda and Azure jobs relying on them)? Do we know how many people are using these docker images?
The list of images is here: https://hub.docker.com/r/pytorchlightning/pytorch_lightning/tags I don't think we keep any sort of backward compatibility. We just mold it to what our CI needs.
The only other reason I can think of is ease of debugging when something fails in CI but not locally.
I don't know how many people are using them.
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!
It seems this issue has been miraculously fixed!
The Docker jobs on our CI are very flaky and fail often, making our CI red :/
Some recent examples:
https://github.com/PyTorchLightning/pytorch-lightning/runs/3956925878
https://github.com/PyTorchLightning/pytorch-lightning/runs/3956925126
I have spent a good amount of time trying to debug them (e.g. #9676), but it is difficult since the failures come and go.
As part of #9445, we want to work towards a state where all of our CI jobs are required, which means we should eventually either mark these Docker tests as required or delete them.
Personally, I believe we should delete them: they seem to be extremely flaky, and we spend a lot of time trying to fix them that could be better spent elsewhere. However, I don't know much about the history of these tests, so I'd like to hear others' opinions. What do people think is the best course of action here?
cc @carmocca @akihironitta @Borda