3 Docker CI jobs failing on master #9676

Closed
daniellepintz opened this issue Sep 24, 2021 · 9 comments · Fixed by #9679
Labels: bug (Something isn't working), ci (Continuous Integration), won't fix (This will not be worked on)

Comments

@daniellepintz
Contributor

daniellepintz commented Sep 24, 2021

🐛 Bug

These are the failing jobs:

To Reproduce

Check the latest CI jobs on master.

Expected behavior

Green CI!

daniellepintz added the bug label on Sep 24, 2021
@daniellepintz
Contributor Author

daniellepintz commented Sep 24, 2021

For build-XLA (3.7, nightly), it seems to be failing because the nightly version of PyTorch was updated to 1.11, and we don't have torch 1.11 in
https://github.com/PyTorchLightning/pytorch-lightning/blob/41e3be197f5a2fd0f65b37b743ebfd157a55595d/requirements/adjust_versions.py#L7-L17

The two options I see to solve this are to update this list in adjust_versions, or to remove the "nightly" version, similar to what we are doing in #9673 (a rough sketch of the first option is below).
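
To illustrate the first option only: the real mapping lives in requirements/adjust_versions.py (linked above), so its actual structure and the torchvision/torchtext versions that pair with torch 1.11 would have to be taken from that file and from the PyTorch nightly release matrix. The entries below are placeholders, not verified pairings:

```python
# Illustrative sketch only: extend the torch -> companion-package mapping with a
# nightly entry. The companion versions for torch 1.11 are placeholders here and
# must be checked against the actual PyTorch nightly release matrix.
VERSIONS = [
    dict(torch="1.11.0", torchvision="0.12.0", torchtext="0.12.0"),  # nightly (new entry)
    dict(torch="1.10.0", torchvision="0.11.0", torchtext="0.11.0"),
    dict(torch="1.9.1", torchvision="0.10.1", torchtext="0.10.1"),
    # ... older releases unchanged ...
]
```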

@daniellepintz
Contributor Author

daniellepintz commented Sep 24, 2021

For the build-CUDA timeout errors, we are seeing this logged a lot right before the timeout: `INFO: pip is looking at multiple versions of jsonargparse[signatures] to determine which version is compatible with other requirements. This could take a while.`

cc @Borda

@daniellepintz
Contributor Author

We may have to pin the dependencies to a specific version number as per https://stackoverflow.com/questions/65122957/resolving-new-pip-backtracking-runtime-issue

Still doesn't explain why we didn't have this problem before :/
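
To illustrate the pinning idea: the package name below is the one pip reports in the log above, but the exact version is hypothetical and would have to be copied from a known-good environment (e.g. `pip freeze` on a passing build) rather than from this sketch.

```
# hypothetical exact pin in the requirements file, to stop pip's backtracking;
# the version shown is a placeholder, not a verified compatible release
jsonargparse[signatures]==3.19.3
```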

@Borda
Member

Borda commented Sep 24, 2021

The download timeouts are most likely temporary connection issues... Or how frequently is it happening across all PRs?

@daniellepintz
Contributor Author

It has been happening pretty frequently in the past few PRs, but it doesn't seem completely deterministic. Maybe it is just temporary connection issues.

@Borda
Member

Borda commented Sep 24, 2021

The "looking at multiple versions" message is because pip has become stricter about dependency intersection, so it is possible that jsonargparse bans some package versions that are required by other packages...
You can try commenting out some packages to find which two are incompatible (a rough sketch of this is below); another sanity check is that this should happen only for the latest configuration, not for the minimal one, right?

On the requirements side, the conda and pip lists are installed separately, so there should not be any conflicts; in fact, the conda packages may be overwritten by pip since it runs afterwards.
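
A rough sketch of that bisection, under stated assumptions: the package names are placeholders (not taken from the actual requirements files), the virtualenv paths are POSIX-style, and a timeout is used as a cheap proxy for "the resolver started backtracking":

```python
# Hypothetical bisection helper: drop one requirement at a time and re-run the
# resolver in a throwaway virtualenv to see when the backtracking goes away.
import subprocess
import tempfile
import venv


def resolves_within(requirements, timeout_s=300):
    """Return True if pip installs the given requirements before the timeout."""
    with tempfile.TemporaryDirectory() as env_dir:
        venv.create(env_dir, with_pip=True)
        pip = f"{env_dir}/bin/pip"  # assumes a POSIX layout; Windows uses Scripts/pip.exe
        try:
            result = subprocess.run([pip, "install", "--quiet", *requirements], timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0


# Placeholder requirements; in practice these would come from the repo's requirements files.
reqs = ["jsonargparse[signatures]>=3.19.3", "omegaconf>=2.0.5", "hydra-core>=1.0.5"]
for i, dropped in enumerate(reqs):
    subset = reqs[:i] + reqs[i + 1:]
    status = "resolves fine without it" if resolves_within(subset) else "still slow/failing"
    print(f"without {dropped}: {status}")
```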

awaelchli added the ci label on Sep 24, 2021
@daniellepintz
Contributor Author

This is interesting: most recent commits on master have 7-8 failing jobs with this timeout, but a few of them are just fine, such as this one: https://github.com/PyTorchLightning/pytorch-lightning/runs/3702218949
This successful job still has some errors in the logs though, which are different from those in the timeout:
https://gist.github.com/daniellepintz/4c5c2aa0827a1e39fcfe94a3dd5d3e39

@stale

stale bot commented Oct 25, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, PyTorch Lightning Team!

stale bot added the won't fix label on Oct 25, 2021
@carmocca
Contributor

carmocca commented Oct 26, 2021

I believe all of these have been fixed by #10087 and #10088 (see the PR checks on the latter).

To be more concrete: the consistent failures have been fixed, but the jobs are still flaky due to caches/timeouts/connectivity. This is already being discussed in #10060, though.
