microbench-ci: improve signal to noise ratio #142206
Conversation
Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
PR Overview
This PR enhances the microbenchmarking process in CI to reduce false positives and improve accuracy. Key changes include:
- Incorporating a retry mechanism and adjusted iteration indexing to require three consecutive benchmark runs.
- Introducing configurable compare alpha thresholds in the metrics builder and updating benchmark configuration accordingly.
- Updating reporting (both JSON and GitHub markdown) and workflow configurations to align with the new benchmarking strategy.
Reviewed Changes
| File | Description |
| --- | --- |
| pkg/cmd/microbench-ci/run.go | Refactored benchmark run loop with retries and updated iteration indexing for profiling. |
| pkg/cmd/roachprod-microbench/model/options.go | Added builder options to support configurable statistical thresholds. |
| pkg/cmd/microbench-ci/compare.go | Updated comparison logic to use log tail extraction and redefined status constants. |
| pkg/cmd/microbench-ci/report.go | Modified JSON and markdown report generation, including float formatting refinements. |
| pkg/cmd/roachprod-microbench/model/builder.go | Revised builder initialization to derive the confidence level from compare_alpha. |
| pkg/cmd/microbench-ci/template/github_summary.md | Adjusted markdown table columns and legend to reflect updated status symbols. |
| .github/workflows/microbenchmarks-ci.yaml | Modified trigger event and increased job timeout durations. |
| pkg/cmd/microbench-ci/config/pull-request-suite.yml | Updated benchmark configuration (count, iterations, compare_alpha, retries, metrics). |
| pkg/cmd/microbench-ci/benchmark.go | Altered benchmark struct to incorporate compare_alpha, retries, and metrics fields. |
Copilot reviewed 12 out of 12 changed files in this pull request and generated no comments.
Comments suppressed due to low confidence (3)
pkg/cmd/microbench-ci/run.go:89
- [nitpick] The calculation for the profile suffix uses (tries*b.Count)+i, which may be unclear at first glance. Consider renaming the computed value (e.g., to 'iterationIndex') or adding a comment to explain its purpose.
err := b.runIteration(revision, fmt.Sprintf("%d", (tries*b.Count)+i))
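A minimal sketch of the suggested rename (the variable name is hypothetical; `tries`, `b.Count`, and `i` are taken from the loop quoted above, not reproduced here):

```go
// iterationIndex is the global iteration number across all retry rounds,
// so profile artifacts from different retries do not overwrite each other.
iterationIndex := (tries * b.Count) + i
err := b.runIteration(revision, fmt.Sprintf("%d", iterationIndex))
```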
pkg/cmd/microbench-ci/report.go:246
- [nitpick] The regular expression used to reformat floating-point numbers may inadvertently match unintended numeric strings in the JSON output. Ensure that it targets only the intended values to maintain accuracy.
func formatFloats(jsonData []byte, precision int) []byte {
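A hedged sketch of one way to narrow the match, assuming the goal is to round only numbers that appear as JSON values (anchored on the preceding key separator); this is illustrative, not the PR's actual implementation:

```go
import (
	"regexp"
	"strconv"
)

// Match a float that appears as a JSON value (preceded by ':'), capturing the
// separator and the number separately so only the number is rewritten.
var jsonFloat = regexp.MustCompile(`(:\s*)(-?\d+\.\d+)`)

func formatFloats(jsonData []byte, precision int) []byte {
	return jsonFloat.ReplaceAllFunc(jsonData, func(m []byte) []byte {
		parts := jsonFloat.FindSubmatch(m)
		f, err := strconv.ParseFloat(string(parts[2]), 64)
		if err != nil {
			return m // leave anything unparsable untouched
		}
		return []byte(string(parts[1]) + strconv.FormatFloat(f, 'f', precision, 64))
	})
}
```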
pkg/cmd/roachprod-microbench/model/builder.go:77
- [nitpick] Replacing a fixed confidence value with an expression derived from compare_alpha could be confusing. Consider adding a comment to document why '1.0 - b.thresholds.CompareAlpha' is used and what range of values is expected for compare_alpha.
summary := assumption.Summary(samples, 1.0-b.thresholds.CompareAlpha)
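For reference, a small illustration of the mapping (the example value is assumed, not taken from the PR's configuration):

```go
// compare_alpha is a significance level in (0, 1); the confidence level for
// the sample summary is its complement, e.g. alpha = 0.01 -> 99% confidence.
confidence := 1.0 - b.thresholds.CompareAlpha
summary := assumption.Summary(samples, confidence)
```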
pkg/cmd/microbench-ci/run.go
Outdated
```go
if compareStatus == NoChange {
	break
}
// Track the most significant change we've seen (Regressed > Improved > NoChange)
```
Shouldn't this be the other way around? If only one of the three runs showed a regression while the other two were NoChange, we want it to be NoChange? i.e. if we marked all regressions as such with only one occurrence of a regression, why retry?
Although that leaves an edge case we may or may not care about. If we see two regressions and one improved, it would be marked as improved even though it should be NoChange. Perhaps that's rare, and it seems not that consequential.
Shouldn't we be looking at the amplitude of a regression? If the regression is just due to noise, then retries don't really help.
Shouldn't we be looking at the amplitude of a regression? If the regression is just due to noise, then retries don't really help.
We rely on the p-value to decide if it's a regression (big or small); the confidence is what tells us if we should try to reproduce again. If we're able to reproduce this 3 times, the chance of it being random is extremely low.
Shouldn't this be the other way around?
There is a case here I didn't think about which I need to address: if the status flips from Regressed to Improved, that should also throw out the results.
The idea behind this is to only keep running if we keep seeing a regression; the moment we don't, the results are deemed insignificant.
In other words, the only scenario we deem regression-worthy is 3 consecutive runs with regressions. Any other scenario fails early so that it doesn't waste any more CI time.
if only one of the three runs showed a regression while the other two were NoChange, we want it to be NoChange? i.e. if we marked all regressions as such with only one occurrence of a regression, why retry?
Good catch, the comment there is silly (I'll update it), but yes, there's a small bug here that should be addressed: if the status changes between Regressed and Improved, the results will be wrong. i.e., only 2 scenarios lead to a positive report: [regressed x 3] or [improved x 3], not a mix. And any NoChange should short-circuit and mark the whole set as insignificant.
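A minimal sketch of that rule (type, constant, and function names here are hypothetical stand-ins, not the PR's actual code): only N consecutive identical Regressed or Improved results produce a report; any NoChange, or a flip between Regressed and Improved, short-circuits and marks the whole set insignificant.

```go
type Status int

const (
	NoChange Status = iota
	Improved
	Regressed
)

// finalStatus runs the comparison up to `retries` times and only reports a
// change if every run agrees; anything else stops early to save CI time.
func finalStatus(retries int, runOnce func() Status) Status {
	var last Status
	for try := 0; try < retries; try++ {
		status := runOnce()
		if status == NoChange || (try > 0 && status != last) {
			return NoChange // insignificant: short-circuit
		}
		last = status
	}
	return last // Regressed x retries or Improved x retries
}
```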
In other words the only scenario we deem regression worthy is 3 consecutive runs with regressions. Any other scenario fails early so that it doesn't waste any more CI time.
That makes sense! Assuming the runs are totally independent, we can derive the p-value for each independent run and compute the resulting one, e.g., using Fisher's method [1]. But we don't have to get too fancy; intuitively, it makes sense to call it a potential regression.
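For what it's worth, a hedged sketch of the Fisher's method idea mentioned above, using gonum's chi-squared distribution (an assumed dependency, purely illustrative and not part of the PR):

```go
package main

import (
	"fmt"
	"math"

	"gonum.org/v1/gonum/stat/distuv"
)

// fisherCombined combines independent p-values: under the null hypothesis,
// -2 * sum(ln p_i) follows a chi-squared distribution with 2k degrees of freedom.
func fisherCombined(pValues []float64) float64 {
	stat := 0.0
	for _, p := range pValues {
		stat += -2 * math.Log(p)
	}
	chi2 := distuv.ChiSquared{K: float64(2 * len(pValues))}
	return 1 - chi2.CDF(stat)
}

func main() {
	// Three runs, each individually significant at alpha = 0.05.
	fmt.Printf("combined p ≈ %.2g\n", fisherCombined([]float64{0.04, 0.03, 0.05}))
}
```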
pkg/cmd/microbench-ci/run.go
Outdated
```go
for _, revision := range []Revision{New, Old} {
	err = os.WriteFile(path.Join(suite.artifactsDir(revision), ".FAILED"), nil, 0644)
	marker := "CHANGED"
```
Nit: status should always be either Improved or Regressed at this point, right? Maybe add a func (s Status) String() string method and just set marker := status.String()?
Good point, a stringer interface makes more sense here.
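A minimal sketch of what that could look like (constant names follow the discussion above; the exact marker strings are assumed):

```go
// String implements fmt.Stringer so the marker file name can follow the status.
func (s Status) String() string {
	switch s {
	case Improved:
		return "IMPROVED"
	case Regressed:
		return "REGRESSED"
	default:
		return "NO_CHANGE"
	}
}
```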
Curious if anyone found Copilot's review summary useful?
Previously, thresholds were used to discount any regressions that were too small to be significant. This accounted for the larger compare alpha passed to the p-test, which resulted in more false positives that then had to be subject to an additional threshold filter. After removing the threshold, adjustments will be made to lower the overall probability of false positives through a different mechanism.
Epic: None
Release note: None
Previously, the thresholds were not configurable on the metrics builder. This change adds options that can be passed to the builder to configure the thresholds (compare alpha) used when comparing samples.
Epic: None
Release note: None
Previously, the default thresholds (compare alpha) were used during benchmark sample comparison. However, this is too sensitive, resulting in too many false positives. To reduce noise on PRs, the threshold should be configurable and tuned to provide a better signal-to-noise ratio. This change adds an option to the suite configuration to adjust the compare alpha for each benchmark.
Epic: None
Release note: None
Previously, only a single loop of the microbenchmarks was performed, relying on a single probability to detect a regression. Considering that each PR will have several commits and benchmark metrics, the probabilities add up quite quickly, resulting in false positives and wasted engineering effort. This change reduces the chance of false positives by requiring 3 consecutive runs to all have regressed. The change makes the total running time on CI dynamic, with each consecutive run becoming less likely if there is no regression. Ultimately, if there is a regression, CI will have the longest possible running time of around 45 minutes.
Epic: None
Release note: None
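To make the intuition concrete (the alpha value here is assumed for illustration, not taken from the suite configuration): if each run independently has false-positive probability α, then three consecutive false positives occur with probability α³; for example, α = 0.05 gives α³ = 1.25 × 10⁻⁴, versus 0.05 for a single run.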
TFTR! bors r=Darrylwong,srosenberg
Nope, not yet. I was just curious since I saw it on a different PR, and was wondering what it would come up with here.
This PR introduces several enhancements to the microbenchmarking process in CI. It modifies the microbenchmarks to require three consecutive runs to detect a regression, significantly reducing the chance of false positives. As a result, the total CI running time will adjust dynamically, ensuring that if a regression is detected, CI will take at most approximately 45 minutes to complete.
Additionally, it adds configurable compare alpha thresholds to reduce noise during benchmark comparisons. This allows for better tuning and more accurate results. The metrics builder has also been updated to accept options for configuring these thresholds, improving flexibility.
Lastly, the previous use of delta thresholds to filter out insignificant regressions has been removed. This change aims to lower the probability of false positives through alternative mechanisms.
Epic: None
Release note: None