[RFC] Make all CI jobs required for Merge #9445

Closed
daniellepintz opened this issue Sep 10, 2021 · 6 comments
Labels
ci Continuous Integration discussion In a discussion stage

Comments

@daniellepintz
Contributor

daniellepintz commented Sep 10, 2021

Motivation

Currently our CI is red on master, which has the following negative repercussions:

  1. It is confusing to new contributors (like myself) and requires ramp-up time to learn which failures are "normal" and which are not. This may make people less likely to submit PRs, since it takes extra energy to sift through the CI failures.
  2. It hurts the public perception and reputation of Lightning when people visit the repo just to check it out and see that CI is red.
  3. It leads to cascading failures. For example, the conda (3.7, 1.10) CI job has been failing consistently on everyone's PRs for a while. A few days ago it had only 4 test failures, but now it has 30 - https://github.com/PyTorchLightning/pytorch-lightning/pull/9118/checks?check_run_id=3570542595. Whether 4 or 30 tests are failing within a single CI "job", the check on your PR looks the same, so people may merge a PR that introduces new test failures.

Proposal

Make all CI jobs required for merging a PR. If we have a job running on a PR that is "okay" if it is failing, why have it at all? I propose we make all CI jobs required, and delete ones that we don't want to make required. This will also incentivize/force everyone to keep the CI green.
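For concreteness, here is a rough sketch of how "required" could be wired up: a hypothetical script that marks a set of job names as required status checks on master through GitHub's branch-protection REST API. The repository slug is real, but the job names, token handling, and the other protection settings are placeholders, not the actual configuration.

```python
# Hypothetical sketch: mark specific CI job names as required status checks on
# master via GitHub's branch-protection API. Requires a token with admin rights
# on the repository; the job names below are placeholders.
import os

import requests

REPO = "PyTorchLightning/pytorch-lightning"
BRANCH = "master"
REQUIRED_CHECKS = ["conda (3.7, 1.10)", "pytest (ubuntu-20.04, 3.8)"]  # placeholders

resp = requests.put(
    f"https://api.github.com/repos/{REPO}/branches/{BRANCH}/protection",
    headers={
        "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github.v3+json",
    },
    json={
        # Every context listed here must pass before a PR can be merged.
        "required_status_checks": {"strict": True, "contexts": REQUIRED_CHECKS},
        # Example values for the remaining protection settings, not Lightning's.
        "enforce_admins": True,
        "required_pull_request_reviews": None,
        "restrictions": None,
    },
)
resp.raise_for_status()
```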

@PyTorchLightning/core-contributors

@carmocca
Contributor

I don't think marking all as required is feasible given the number of tests we run and the overall process. Some reasons:

  • Flaky tests can appear out of the blue (dependency update, bad network, environment change, bad caches, ...) which would block the full pipeline. They should be avoided at all costs but sometimes they just happen.
  • GitHub doesn't allow restarting individual jobs.
  • Network errors are bound to happen if running enough jobs. Keep in mind that some tests download datasets so it's not just the initial request to run the job.

To me, the key is investing resources in improving everything around the CI. Some ideas:

  • Simplifying the CI code.
  • Reducing the duplication of jobs.
  • Properly defining the responsibilities and scope of each job run.
  • Better Docker image handling and better image-cache handling.
  • Reducing test duplication, sorting the tests, and relying more on unit testing.
  • Reducing and scoping the big integration tests (special tests, benchmarks, examples).
  • Using Dependabot to minimize failures due to dependency updates.
  • Enabling alerts on failures for better visibility, or adding a CI bot that comments on issues with a summary of the failed jobs (see the sketch after this list).
  • Better tracking of CI times per step and per job to evaluate performance regressions.
  • Documenting everything around the CI: the what, the why, and the how.
  • ... many more potential improvements?
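As a sketch of the "CI bot" idea above (not an existing bot), the following script lists the failed check runs for a commit via GitHub's check-runs API, which a bot could then post as a comment. The commit SHA and token handling are placeholders.

```python
# Rough sketch: collect the failed check runs for a commit so a bot could post
# them as a summary comment. Repo is real; the SHA and token are placeholders.
import os

import requests

REPO = "PyTorchLightning/pytorch-lightning"
SHA = "HEAD_SHA_OF_THE_PR"  # placeholder

resp = requests.get(
    f"https://api.github.com/repos/{REPO}/commits/{SHA}/check-runs",
    headers={
        "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github.v3+json",
    },
)
resp.raise_for_status()

# Keep only runs that finished unsuccessfully.
failed = [
    run
    for run in resp.json()["check_runs"]
    if run["conclusion"] in ("failure", "timed_out")
]
summary = "\n".join(f"- {run['name']}: {run['html_url']}" for run in failed)
print(summary or "All checks green")
```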

The community feels helpless when things fail because there is no resource explaining how our CI is structured or what the goals of each job are. They rely on us entirely.

But at this stage, setting all jobs as required would just make things worse.

If you or anybody from the community is motivated to improve the efficiency of all this, please reach out ❤️

@akihironitta akihironitta added discussion In a discussion stage ci Continuous Integration labels Sep 13, 2021
@daniellepintz
Contributor Author

@carmocca Going back to this, I must say I still believe the best thing to do is go through all our CI jobs and either make each one required or delete it. Otherwise, I feel we will always be fighting a never-ending battle against the CI, because as long as people are able to merge PRs with failing jobs, they will (not out of malice, just by nature of the system). Currently there are no real incentives for people to fix failing CI jobs.

To address your concerns:

Flaky tests can appear out of the blue (dependency update, bad network, environment change, bad caches, ...) which would block the full pipeline. They should be avoided at all costs but sometimes they just happen.

True. But in this case here are the scenarios I see:

  1. There is a quick fix (such as in the case of Update versions after recent PyTorch releases #9623), so no harm done, and the benefit is that it actually gets fixed quickly rather than causing hours or days of broken CI before someone gets to it.
  2. There is no quick fix, but at least the issue is getting the attention it deserves and there are multiple motivated hands on deck trying to fix it. If we are unable to find a solution, there are workarounds, like temporarily disabling the job or overriding the job on a high-priority PR.
  3. If this happens often for a certain test, this might point to some fundamental flaws in the test's design, and perhaps we should consider changing it or deleting it (a couple of pytest-level containment options are sketched below). After all, our tests should not be inherently flaky.
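To be concrete about the containment workarounds mentioned above, here is an illustrative sketch, not how the Lightning test suite actually handles flakiness: two common pytest patterns for keeping a known-flaky test from blocking a required job while a real fix is investigated. The test names are made up, and the `flaky` marker assumes the pytest-rerunfailures plugin is installed.

```python
# Illustrative only: two standard pytest patterns for containing a known-flaky
# test. Neither is specific to Lightning's test suite.
import random

import pytest


@pytest.mark.flaky(reruns=2)  # provided by the pytest-rerunfailures plugin
def test_download_dataset():
    # Retried up to two more times before being reported as a failure.
    assert random.random() > 0.01


@pytest.mark.xfail(reason="tracked in a follow-up issue", strict=False)
def test_known_flaky_behaviour():
    # A failure is reported as xfail instead of turning the job red.
    assert random.random() > 0.5
```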

GitHub doesn't allow restarting individual jobs.

True, but this is a pain point either way, and relatively minor.

Network errors are bound to happen if running enough jobs. Keep in mind that some tests download datasets so it's not just the initial request to run the job.

Then we should be running fewer jobs. If in the current state we already know some CI jobs are bound to fail, we are essentially resigning ourselves to an eternally red CI, whether the jobs are required or not. We must fix this by deleting some jobs.

@carmocca
Contributor

You raise good points, Danielle. My proposal would be to keep things as they are right now, fix existing broken tests (you are on this already!), reduce flakiness, etc. Then, as our confidence in the CI increases, we can start marking things as required.

@daniellepintz
Contributor Author

Sounds great, I will continue working towards that!

@stale stale bot added the won't fix This will not be worked on label Oct 28, 2021
@Lightning-AI Lightning-AI deleted a comment from stale bot Oct 28, 2021
@stale stale bot removed the won't fix This will not be worked on label Oct 28, 2021
@carmocca carmocca added this to the v1.6 milestone Oct 28, 2021
@carmocca carmocca modified the milestones: 1.6, future Feb 1, 2022
@carmocca
Contributor

I think the CI is in a much better state right now compared to back in September 🤞. Ideally we would mark everything as required, but there are still sources of flakiness around.

Closing this for now.

@daniellepintz
Contributor Author

Agree we are in a better state! It is so nice to see the green checkmarks on PRs! Although there are still a few red X's we have to fix...

@carmocca carmocca removed this from the future milestone Mar 1, 2022