[RFC] Make all CI jobs required for Merge #9445

Closed
daniellepintz opened this issue Sep 10, 2021 · 6 comments
Labels
ci Continuous Integration discussion In a discussion stage

Comments

@daniellepintz
Contributor

daniellepintz commented Sep 10, 2021

Motivation

Currently our CI is red on master, which has the following negative repercussions:

  1. It is confusing to new contributors (like myself) and requires ramp-up time to learn which failures are "normal" and which are not. This may make people less likely to submit PRs, since it takes extra energy to sift through the CI failures.
  2. It hurts the public perception and reputation of Lightning when people visit the repo just to check it out and see that CI is red.
  3. It leads to cascading failures. For example, the conda (3.7, 1.10) CI job has been failing consistently on everyone's PRs for a while. A few days ago it had only 4 test failures, but now it has 30 - https://github.com/PyTorchLightning/pytorch-lightning/pull/9118/checks?check_run_id=3570542595. Whether 4 or 30 tests are failing within a single CI "job", the check on your PR looks the same, so people may merge a PR that introduces new test failures.

Proposal

Make all CI jobs required for merging a PR. If we have a job running on a PR that is "okay" if it is failing, why have it at all? I propose we make all CI jobs required, and delete ones that we don't want to make required. This will also incentivize/force everyone to keep the CI green.
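For concreteness, here is a rough sketch of how "required" could be wired up: a hypothetical script that marks a set of job names as required status checks on master through GitHub's branch-protection REST API. The repository slug is real, but the job names, token handling, and the other protection settings are placeholders, not the actual configuration.

```python
# Hypothetical sketch: mark specific CI job names as required status checks on
# master via GitHub's branch-protection API. Requires a token with admin rights
# on the repository; the job names below are placeholders.
import os

import requests

REPO = "PyTorchLightning/pytorch-lightning"
BRANCH = "master"
REQUIRED_CHECKS = ["conda (3.7, 1.10)", "pytest (ubuntu-20.04, 3.8)"]  # placeholders

resp = requests.put(
    f"https://api.github.com/repos/{REPO}/branches/{BRANCH}/protection",
    headers={
        "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github.v3+json",
    },
    json={
        # Every context listed here must pass before a PR can be merged.
        "required_status_checks": {"strict": True, "contexts": REQUIRED_CHECKS},
        # Example values for the remaining protection settings, not Lightning's.
        "enforce_admins": True,
        "required_pull_request_reviews": None,
        "restrictions": None,
    },
)
resp.raise_for_status()
```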

@PyTorchLightning/core-contributors

@carmocca
Contributor

I don't think marking all as required is feasible given the number of tests we run and the overall process. Some reasons:

  • Flaky tests can appear out of the blue (dependency update, bad network, environment change, bad caches, ...) which would block the full pipeline. They should be avoided at all costs but sometimes they just happen.
  • GitHub doesn't allow restarting individual jobs.
  • Network errors are bound to happen if running enough jobs. Keep in mind that some tests download datasets so it's not just the initial request to run the job.

To me, the key is investing resources in improving everything around the CI. Some ideas:

  • Simplifying the CI code.
  • Reducing the duplication of jobs.
  • Properly defining the responsibilities and scope of each job run.
  • Better Docker image handling and better image-cache handling.
  • Reducing test duplication, sorting the tests, and relying more on unit testing.
  • Reducing and scoping the big integration tests (special tests, benchmarks, examples).
  • Using Dependabot to minimize failures due to dependency updates.
  • Enabling alerts on failures for better visibility, or adding a CI bot that comments on issues with a summary of the failed jobs (see the sketch after this list).
  • Better tracking of CI times per step and per job to evaluate performance regressions.
  • Documenting everything around the CI: the what, the why, and the how.
  • ... many more potential improvements?
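As a sketch of the "CI bot" idea above (not an existing bot), the following script lists the failed check runs for a commit via GitHub's check-runs API, which a bot could then post as a comment. The commit SHA and token handling are placeholders.

```python
# Rough sketch: collect the failed check runs for a commit so a bot could post
# them as a summary comment. Repo is real; the SHA and token are placeholders.
import os

import requests

REPO = "PyTorchLightning/pytorch-lightning"
SHA = "HEAD_SHA_OF_THE_PR"  # placeholder

resp = requests.get(
    f"https://api.github.com/repos/{REPO}/commits/{SHA}/check-runs",
    headers={
        "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github.v3+json",
    },
)
resp.raise_for_status()

# Keep only runs that finished unsuccessfully.
failed = [
    run
    for run in resp.json()["check_runs"]
    if run["conclusion"] in ("failure", "timed_out")
]
summary = "\n".join(f"- {run['name']}: {run['html_url']}" for run in failed)
print(summary or "All checks green")
```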

The community feels helpless when things fail because there is no resource explaining how our CI is structured or what the goals of each job are. They rely on us entirely.

But at this stage, setting all jobs as required would just make things worse.

If you or anybody from the community is motivated to improve the efficiency of all this, please reach out ❤️

@akihironitta akihironitta added discussion In a discussion stage ci Continuous Integration labels Sep 13, 2021
@daniellepintz
Contributor Author

@carmocca Going back to this, I must say I still believe the best thing to do is go through all our CI jobs and either make each one required or delete it. Otherwise, I feel we will always be fighting a never-ending battle against the CI, because as long as people are able to merge PRs with failing jobs, they will (not out of malice, just by nature of the system). Currently there are no real incentives for people to fix failing CI jobs.

To address your concerns:

Flaky tests can appear out of the blue (dependency update, bad network, environment change, bad caches, ...) which would block the full pipeline. They should be avoided at all costs but sometimes they just happen.

True. But in this case here are the scenarios I see:

  1. There is a quick fix (such as in the case of Update versions after recent PyTorch releases #9623), so no harm done, and the benefit is that it actually gets fixed quickly rather than causing hours or days of broken CI before someone gets to it.
  2. There is no quick fix, but at least the issue is getting the attention it deserves and there are multiple motivated hands on deck trying to fix it. If we are unable to find a solution, there are workarounds, like temporarily disabling the job or overriding the job on a high-priority PR.
  3. If this happens often for a certain test, this might point to some fundamental flaws in the test's design, and perhaps we should consider changing it or deleting it (a couple of pytest-level containment options are sketched below). After all, our tests should not be inherently flaky.
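To be concrete about the containment workarounds mentioned above, here is an illustrative sketch, not how the Lightning test suite actually handles flakiness: two common pytest patterns for keeping a known-flaky test from blocking a required job while a real fix is investigated. The test names are made up, and the `flaky` marker assumes the pytest-rerunfailures plugin is installed.

```python
# Illustrative only: two standard pytest patterns for containing a known-flaky
# test. Neither is specific to Lightning's test suite.
import random

import pytest


@pytest.mark.flaky(reruns=2)  # provided by the pytest-rerunfailures plugin
def test_download_dataset():
    # Retried up to two more times before being reported as a failure.
    assert random.random() > 0.01


@pytest.mark.xfail(reason="tracked in a follow-up issue", strict=False)
def test_known_flaky_behaviour():
    # A failure is reported as xfail instead of turning the job red.
    assert random.random() > 0.5
```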

GitHub doesn't allow restarting individual jobs.

True, but this is a pain point either way, and relatively minor.

Network errors are bound to happen if running enough jobs. Keep in mind that some tests download datasets so it's not just the initial request to run the job.

Then we should be running fewer jobs. If in the current state we already know some CI jobs are bound to fail, we are essentially resigning ourselves to an eternally red CI, whether the jobs are required or not. We must fix this by deleting some jobs.

@carmocca
Contributor

You raise good points, Danielle. My proposal would be to keep things as they are right now, fix existing broken tests (you are on this already!), reduce flakiness, etc. Then, as our confidence in the CI increases, we can start marking things as required.

@daniellepintz
Contributor Author

Sounds great, I will continue working towards that!

@stale stale bot added the won't fix This will not be worked on label Oct 28, 2021
@Lightning-AI Lightning-AI deleted a comment from stale bot Oct 28, 2021
@stale stale bot removed the won't fix This will not be worked on label Oct 28, 2021
@carmocca carmocca added this to the v1.6 milestone Oct 28, 2021
@carmocca carmocca modified the milestones: 1.6, future Feb 1, 2022
@carmocca
Contributor

I think the CI is in a much better state right now compared to back in September 🤞. Ideally we would mark everything as required, but there are still sources of flakiness around.

Closing this for now.

@daniellepintz
Contributor Author

Agree we are in a better state! It is so nice to see the green checkmarks on PRs! Although there are still a few red X's we have to fix...

@carmocca carmocca removed this from the future milestone Mar 1, 2022