
microbench-ci: improve signal to noise ratio #142206


Merged
merged 4 commits into cockroachdb:master on Mar 17, 2025

Conversation

herkolategan
Collaborator

@herkolategan herkolategan commented Mar 3, 2025

This PR introduces several enhancements to the microbenchmarking process in CI. It modifies the microbenchmarks to require three consecutive runs to detect a regression, significantly reducing the chance of false positives. As a result, the total CI running time adjusts dynamically: if a regression is detected, CI takes at most approximately 45 minutes to complete.

Additionally, it adds configurable compare alpha thresholds to reduce noise during benchmark comparisons. This allows for better tuning and more accurate results. The metrics builder has also been updated to accept options for configuring these thresholds, improving flexibility.

Lastly, the previous use of delta thresholds to filter out insignificant regressions has been removed. This change aims to lower the probability of false positives through alternative mechanisms.

Epic: None
Release note: None

@cockroach-teamcity
Member

This change is Reviewable

@herkolategan herkolategan force-pushed the hbl/microbench-ci-snr branch 2 times, most recently from 8ade54f to b92bd20 Compare March 4, 2025 22:32

blathers-crl bot commented Mar 5, 2025

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@herkolategan herkolategan force-pushed the hbl/microbench-ci-snr branch 2 times, most recently from c484222 to 2c6a873 Compare March 6, 2025 12:11
@herkolategan herkolategan changed the title microbench-ci: remove thresholds microbench-ci: improve signal to noise ratio Mar 6, 2025
@herkolategan herkolategan force-pushed the hbl/microbench-ci-snr branch from 2c6a873 to 81aad5f Compare March 10, 2025 11:43
@herkolategan herkolategan requested a review from Copilot March 10, 2025 11:51

@Copilot Copilot AI left a comment


PR Overview

This PR enhances the microbenchmarking process in CI to reduce false positives and improve accuracy. Key changes include:

  • Incorporating a retry mechanism and adjusted iteration indexing to require three consecutive benchmark runs.
  • Introducing configurable compare alpha thresholds in the metrics builder and updating benchmark configuration accordingly.
  • Updating reporting (both JSON and GitHub markdown) and workflow configurations to align with the new benchmarking strategy.

Reviewed Changes

  • pkg/cmd/microbench-ci/run.go: Refactored benchmark run loop with retries and updated iteration indexing for profiling.
  • pkg/cmd/roachprod-microbench/model/options.go: Added builder options to support configurable statistical thresholds.
  • pkg/cmd/microbench-ci/compare.go: Updated comparison logic to use log tail extraction and redefined status constants.
  • pkg/cmd/microbench-ci/report.go: Modified JSON and markdown report generation, including float formatting refinements.
  • pkg/cmd/roachprod-microbench/model/builder.go: Revised builder initialization to derive the confidence level from compare_alpha.
  • pkg/cmd/microbench-ci/template/github_summary.md: Adjusted markdown table columns and legend to reflect updated status symbols.
  • .github/workflows/microbenchmarks-ci.yaml: Modified trigger event and increased job timeout durations.
  • pkg/cmd/microbench-ci/config/pull-request-suite.yml: Updated benchmark configuration (count, iterations, compare_alpha, retries, metrics).
  • pkg/cmd/microbench-ci/benchmark.go: Altered benchmark struct to incorporate compare_alpha, retries, and metrics fields.

Copilot reviewed 12 out of 12 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (3)

pkg/cmd/microbench-ci/run.go:89

  • [nitpick] The calculation for the profile suffix uses (tries*b.Count)+i, which may be unclear at first glance. Consider renaming the computed value (e.g., to 'iterationIndex') or adding a comment to explain its purpose.
err := b.runIteration(revision, fmt.Sprintf("%d", (tries*b.Count)+i))

pkg/cmd/microbench-ci/report.go:246

  • [nitpick] The regular expression used to reformat floating-point numbers may inadvertently match unintended numeric strings in the JSON output. Ensure that it targets only the intended values to maintain accuracy.
func formatFloats(jsonData []byte, precision int) []byte {

pkg/cmd/roachprod-microbench/model/builder.go:77

  • [nitpick] Replacing a fixed confidence value with an expression derived from compare_alpha could be confusing. Consider adding a comment to document why '1.0 - b.thresholds.CompareAlpha' is used and what range of values is expected for compare_alpha.
summary := assumption.Summary(samples, 1.0-b.thresholds.CompareAlpha)
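For context, the expression Copilot flags is just the standard complement relationship between a significance level (alpha) and a confidence level. A minimal sketch of that relationship, with a hypothetical helper name (not the repo's actual code):

```go
package main

import "fmt"

// confidenceFromAlpha converts a significance level (compare_alpha,
// the p-value cutoff for the comparison test) into the confidence
// level expected by a summary routine. Smaller alpha means a stricter
// test and a higher confidence level.
func confidenceFromAlpha(compareAlpha float64) float64 {
	return 1.0 - compareAlpha
}

func main() {
	// e.g. a compare_alpha of 0.01 corresponds to a 99% confidence level.
	fmt.Println(confidenceFromAlpha(0.01)) // 0.99
}
```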

@herkolategan herkolategan force-pushed the hbl/microbench-ci-snr branch from 81aad5f to 00d8511 Compare March 10, 2025 15:42
@herkolategan herkolategan marked this pull request as ready for review March 10, 2025 16:42
@herkolategan herkolategan requested review from a team as code owners March 10, 2025 16:42
@herkolategan herkolategan requested review from srosenberg and DarrylWong and removed request for a team March 10, 2025 16:42
if compareStatus == NoChange {
break
}
// Track the most significant change we've seen (Regressed > Improved > NoChange)
Contributor

@DarrylWong DarrylWong Mar 10, 2025


Shouldn't this be the other way around? If only one of the three runs showed a regression while the other two were NoChange, we want it to be NoChange? i.e. if we marked all regressions as such with only one occurrence of a regression, why retry?

Contributor


Although that leaves an edge case we may or may not care about: if we see two regressions and one improved, it would be marked as improved even though it should be NoChange. Perhaps that's rare enough not to be consequential.

Member


Shouldn't we be looking at the amplitude of a regression? If the regression is just due to noise, then retries don't really help.

Collaborator Author

@herkolategan herkolategan Mar 11, 2025


Shouldn't we be looking at the amplitude of a regression? If the regression is just due to noise, then retries don't really help.

We rely on the p-value to decide if it's a regression (big or small); the confidence is what tells us if we should try to reproduce it again. If we're able to reproduce this 3 times, the chance of it being random noise is extremely low.

Shouldn't this be the other way around?

There is a case here I didn't think about which I need to address: if the status flips from Regressed to Improved, that should also throw out the results.

The idea behind this is to keep running only while we keep seeing a regression; the moment we don't, the results are deemed insignificant.

In other words, the only scenario we deem regression-worthy is 3 consecutive runs that all show regressions. Any other scenario fails early so that it doesn't waste any more CI time.
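The early-exit rule described here can be sketched as follows. This is an illustrative reconstruction, not the PR's actual code; the `Status` type and `detect` helper are hypothetical names:

```go
package main

import "fmt"

type Status int

const (
	NoChange Status = iota
	Improved
	Regressed
)

// detect requires the same non-NoChange status on every consecutive
// run. Any NoChange, or a flip between Regressed and Improved,
// short-circuits to NoChange so CI stops spending time on retries.
func detect(results []Status, maxRetries int) Status {
	first := NoChange
	for i := 0; i < maxRetries && i < len(results); i++ {
		s := results[i]
		if s == NoChange {
			return NoChange // insignificant; fail early
		}
		if first == NoChange {
			first = s
		} else if s != first {
			return NoChange // flipped Regressed <-> Improved; throw out
		}
	}
	return first
}

func main() {
	fmt.Println(detect([]Status{Regressed, Regressed, Regressed}, 3)) // 2 (Regressed)
	fmt.Println(detect([]Status{Regressed, NoChange}, 3))             // 0 (NoChange)
	fmt.Println(detect([]Status{Regressed, Improved, Regressed}, 3))  // 0 (NoChange)
}
```

Only three consecutive regressions (or three consecutive improvements) survive; everything else is reported as no change.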

Collaborator Author


if only one of the three runs showed a regression while the other two were NoChange, we want it to be NoChange? i.e. if we marked all regressions as such with only one occurrence of a regression, why retry?

Good catch; the comment there is silly (I'll update it). And yes, there's a small bug here that should be addressed: if it flips between regressed and improved, the results will be wrong. In other words, only 2 scenarios lead to a positive report: [regressed x 3] or [improved x 3], not a mix. Any NoChange should short-circuit and mark the whole set as insignificant.

Member


In other words the only scenario we deem regression worthy is 3 consecutive runs with regressions. Any other scenario fails early so that it doesn't waste any more CI time.

That makes sense! Assuming the runs are totally independent, we can derive the p-value for each independent run and compute the resulting one, e.g., using Fisher's method [1]. But we don't have to get too fancy; intuitively, it makes sense to call it a potential regression.

[1] https://en.wikipedia.org/wiki/Fisher%27s_method

for _, revision := range []Revision{New, Old} {
err = os.WriteFile(path.Join(suite.artifactsDir(revision), ".FAILED"), nil, 0644)
marker := "CHANGED"
Contributor


Nit: status should always be either Improved or Regressed at this point, right? Maybe add a:

func (s Status) String() string method

and just set marker := status.String()?
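The suggested Stringer could look like the sketch below; the `Status` constants mirror the ones discussed above, but the exact marker strings are an assumption, not the PR's actual values:

```go
package main

import "fmt"

type Status int

const (
	NoChange Status = iota
	Improved
	Regressed
)

// String implements fmt.Stringer so status values can be written to
// reports directly instead of via hardcoded marker literals.
func (s Status) String() string {
	switch s {
	case Improved:
		return "IMPROVED"
	case Regressed:
		return "REGRESSED"
	default:
		return "NO CHANGE"
	}
}

func main() {
	marker := Regressed.String()
	fmt.Println(marker) // REGRESSED
}
```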

Collaborator Author


Good point, a stringer interface makes more sense here.

@herkolategan herkolategan force-pushed the hbl/microbench-ci-snr branch from 00d8511 to 52046f9 Compare March 11, 2025 16:02
@srosenberg
Member

Curious if anyone found Copilot's review summary useful?

Previously, thresholds were used to discount any regressions that were too small
to be significant. This compensated for the larger compare alpha passed to the
p-test, which resulted in more false positives that then had to be filtered by
an additional threshold.

After removing the threshold, adjustments will be made to lower the overall
probability of false positives through a different mechanism.

Epic: None
Release note: None
Previously, the thresholds were not configurable on the metrics builder. This
change adds passing options to the builder to configure the thresholds (Compare
Alpha) when comparing samples.

Epic: None
Release note: None
Previously, the default thresholds (compare alpha) were used during benchmark
sample comparison. However, the defaults are too sensitive, resulting in too
many false positives. To reduce noise on PRs, the threshold should be
configurable and tuned to provide a better signal-to-noise ratio.

This change adds an option to the suite configuration to adjust the compare
alpha for each benchmark.

Epic: None
Release note: None
Previously, only a single loop of the microbenchmarks was performed. This
relied on a single probability to detect a regression. Considering that each PR
will have several commits and benchmark metrics, the probabilities add up
quickly, resulting in false positives and wasted engineering effort. This
change reduces the chance of false positives by requiring 3 consecutive runs to
all have regressed. It also makes the total running time on CI dynamic, since
each consecutive run only happens if the previous one detected a change.
Ultimately, if there is a regression, CI will have the longest possible running
time of around 45 minutes.

Epic: None
Release note: None
@herkolategan herkolategan force-pushed the hbl/microbench-ci-snr branch from 52046f9 to 03f2bcc Compare March 17, 2025 12:58
@herkolategan
Collaborator Author

TFTR!

bors r=Darrylwong,srosenberg

@herkolategan
Collaborator Author

Curious if anyone found Copilot's review summary useful?

Nope, not yet. I was just curious since I saw it on a different PR and was wondering what it would come up with here.

@craig
Contributor

craig bot commented Mar 17, 2025

@craig craig bot merged commit c3a4d77 into cockroachdb:master Mar 17, 2025
24 checks passed