microbench-ci: improve signal to noise ratio #142206
Conversation
Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
PR Overview
This PR enhances the microbenchmarking process in CI to reduce false positives and improve accuracy. Key changes include:
- Incorporating a retry mechanism and adjusted iteration indexing to require three consecutive benchmark runs.
- Introducing configurable compare alpha thresholds in the metrics builder and updating benchmark configuration accordingly.
- Updating reporting (both JSON and GitHub markdown) and workflow configurations to align with the new benchmarking strategy.
Reviewed Changes
| File | Description |
| --- | --- |
| pkg/cmd/microbench-ci/run.go | Refactored benchmark run loop with retries and updated iteration indexing for profiling. |
| pkg/cmd/roachprod-microbench/model/options.go | Added builder options to support configurable statistical thresholds. |
| pkg/cmd/microbench-ci/compare.go | Updated comparison logic to use log tail extraction and redefined status constants. |
| pkg/cmd/microbench-ci/report.go | Modified JSON and markdown report generation, including float formatting refinements. |
| pkg/cmd/roachprod-microbench/model/builder.go | Revised builder initialization to derive the confidence level from compare_alpha. |
| pkg/cmd/microbench-ci/template/github_summary.md | Adjusted markdown table columns and legend to reflect updated status symbols. |
| .github/workflows/microbenchmarks-ci.yaml | Modified trigger event and increased job timeout durations. |
| pkg/cmd/microbench-ci/config/pull-request-suite.yml | Updated benchmark configuration (count, iterations, compare_alpha, retries, metrics). |
| pkg/cmd/microbench-ci/benchmark.go | Altered benchmark struct to incorporate compare_alpha, retries, and metrics fields. |
Copilot reviewed 12 out of 12 changed files in this pull request and generated no comments.
Comments suppressed due to low confidence (3)
pkg/cmd/microbench-ci/run.go:89
- [nitpick] The calculation for the profile suffix uses (tries*b.Count)+i, which may be unclear at first glance. Consider renaming the computed value (e.g., to 'iterationIndex') or adding a comment to explain its purpose.
err := b.runIteration(revision, fmt.Sprintf("%d", (tries*b.Count)+i))
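A minimal sketch of the suggested rename (the variable name is hypothetical; `tries`, `b.Count`, and `i` are taken from the loop quoted above, not reproduced here):

```go
// iterationIndex is the global iteration number across all retry rounds,
// so profile artifacts from different retries do not overwrite each other.
iterationIndex := (tries * b.Count) + i
err := b.runIteration(revision, fmt.Sprintf("%d", iterationIndex))
```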
pkg/cmd/microbench-ci/report.go:246
- [nitpick] The regular expression used to reformat floating-point numbers may inadvertently match unintended numeric strings in the JSON output. Ensure that it targets only the intended values to maintain accuracy.
func formatFloats(jsonData []byte, precision int) []byte {
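A hedged sketch of one way to narrow the match, assuming the goal is to round only numbers that appear as JSON values (anchored on the preceding key separator); this is illustrative, not the PR's actual implementation:

```go
import (
	"regexp"
	"strconv"
)

// Match a float that appears as a JSON value (preceded by ':'), capturing the
// separator and the number separately so only the number is rewritten.
var jsonFloat = regexp.MustCompile(`(:\s*)(-?\d+\.\d+)`)

func formatFloats(jsonData []byte, precision int) []byte {
	return jsonFloat.ReplaceAllFunc(jsonData, func(m []byte) []byte {
		parts := jsonFloat.FindSubmatch(m)
		f, err := strconv.ParseFloat(string(parts[2]), 64)
		if err != nil {
			return m // leave anything unparsable untouched
		}
		return []byte(string(parts[1]) + strconv.FormatFloat(f, 'f', precision, 64))
	})
}
```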
pkg/cmd/roachprod-microbench/model/builder.go:77
- [nitpick] Replacing a fixed confidence value with an expression derived from compare_alpha could be confusing. Consider adding a comment to document why '1.0 - b.thresholds.CompareAlpha' is used and what range of values is expected for compare_alpha.
summary := assumption.Summary(samples, 1.0-b.thresholds.CompareAlpha)
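For reference, a small illustration of the mapping (the example value is assumed, not taken from the PR's configuration):

```go
// compare_alpha is a significance level in (0, 1); the confidence level for
// the sample summary is its complement, e.g. alpha = 0.01 -> 99% confidence.
confidence := 1.0 - b.thresholds.CompareAlpha
summary := assumption.Summary(samples, confidence)
```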
pkg/cmd/microbench-ci/run.go
Outdated
```go
if compareStatus == NoChange {
	break
}
// Track the most significant change we've seen (Regressed > Improved > NoChange)
```
Shouldn't this be the other way around? If only one of the three runs showed a regression while the other two were NoChange, we want it to be NoChange? i.e. if we marked all regressions as such with only one occurrence of a regression, why retry?
Although that leaves an edge case we may or may not care about. If we see two regressions and one improved, it would be marked as improved even though it should be NoChange. Perhaps that's rare, and it seems not that consequential.
Shouldn't we be looking at the amplitude of a regression? If the regression is just due to noise, then retries don't really help.
Shouldn't we be looking at the amplitude of a regression? If the regression is just due to noise, then retries don't really help.
We rely on the p-value to decide if it's a regression (big or small); the confidence is what tells us if we should try to reproduce again. If we're able to reproduce this 3 times, the chance of it being random is extremely low.
Shouldn't this be the other way around?
There is a case here I didn't think about which I need to address: if the status flips from Regressed to Improved, that should also throw out the results.
The idea behind this is to only keep running if we keep seeing a regression; the moment we don't, the results are deemed insignificant.
In other words, the only scenario we deem regression-worthy is 3 consecutive runs with regressions. Any other scenario fails early so that it doesn't waste any more CI time.
if only one of the three runs showed a regression while the other two were NoChange, we want it to be NoChange? i.e. if we marked all regressions as such with only one occurrence of a regression, why retry?
Good catch, the comment there is silly (I'll update it), but yes, there's a small bug here that should be addressed: if the status changes between Regressed and Improved, the results will be wrong. i.e., only 2 scenarios lead to a positive report: [regressed x 3] or [improved x 3], not a mix. And any NoChange should short-circuit and mark the whole set as insignificant.
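A minimal sketch of that rule (type, constant, and function names here are hypothetical stand-ins, not the PR's actual code): only N consecutive identical Regressed or Improved results produce a report; any NoChange, or a flip between Regressed and Improved, short-circuits and marks the whole set insignificant.

```go
type Status int

const (
	NoChange Status = iota
	Improved
	Regressed
)

// finalStatus runs the comparison up to `retries` times and only reports a
// change if every run agrees; anything else stops early to save CI time.
func finalStatus(retries int, runOnce func() Status) Status {
	var last Status
	for try := 0; try < retries; try++ {
		status := runOnce()
		if status == NoChange || (try > 0 && status != last) {
			return NoChange // insignificant: short-circuit
		}
		last = status
	}
	return last // Regressed x retries or Improved x retries
}
```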
In other words the only scenario we deem regression worthy is 3 consecutive runs with regressions. Any other scenario fails early so that it doesn't waste any more CI time.
That makes sense! Assuming the runs are totally independent, we can derive the p-value for each independent run and compute the resulting one, e.g., using Fisher's method [1]. But we don't have to get too fancy; intuitively, it makes sense to call it a potential regression.
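For what it's worth, a hedged sketch of the Fisher's method idea mentioned above, using gonum's chi-squared distribution (an assumed dependency, purely illustrative and not part of the PR):

```go
package main

import (
	"fmt"
	"math"

	"gonum.org/v1/gonum/stat/distuv"
)

// fisherCombined combines independent p-values: under the null hypothesis,
// -2 * sum(ln p_i) follows a chi-squared distribution with 2k degrees of freedom.
func fisherCombined(pValues []float64) float64 {
	stat := 0.0
	for _, p := range pValues {
		stat += -2 * math.Log(p)
	}
	chi2 := distuv.ChiSquared{K: float64(2 * len(pValues))}
	return 1 - chi2.CDF(stat)
}

func main() {
	// Three runs, each individually significant at alpha = 0.05.
	fmt.Printf("combined p ≈ %.2g\n", fisherCombined([]float64{0.04, 0.03, 0.05}))
}
```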
pkg/cmd/microbench-ci/run.go
Outdated
```go
for _, revision := range []Revision{New, Old} {
	err = os.WriteFile(path.Join(suite.artifactsDir(revision), ".FAILED"), nil, 0644)
	marker := "CHANGED"
```
Nit: status should always be either Improved or Regressed at this point, right? Maybe add a func (s Status) String() string method and just set marker := status.String()?
Good point, a stringer interface makes more sense here.
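A minimal sketch of what that could look like (constant names follow the discussion above; the exact marker strings are assumed):

```go
// String implements fmt.Stringer so the marker file name can follow the status.
func (s Status) String() string {
	switch s {
	case Improved:
		return "IMPROVED"
	case Regressed:
		return "REGRESSED"
	default:
		return "NO_CHANGE"
	}
}
```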
Curious if anyone found Copilot's review summary useful?
Previously, thresholds were used to discount any regressions that were too small to be significant. This accounted for the larger compare alpha passed to the p-test, which resulted in more false positives that then had to be subject to an additional threshold filter. After removing the threshold, adjustments will be made to lower the overall probability of false positives through a different mechanism.
Epic: None
Release note: None
Previously, the thresholds were not configurable on the metrics builder. This change adds options that can be passed to the builder to configure the thresholds (compare alpha) used when comparing samples.
Epic: None
Release note: None
Previously, the default thresholds (compare alpha) were used during benchmark sample comparison. However, this is too sensitive, resulting in too many false positives. To reduce noise on PRs, the threshold should be configurable and tuned to provide a better signal-to-noise ratio. This change adds an option to the suite configuration to adjust the compare alpha for each benchmark.
Epic: None
Release note: None
Previously, only a single loop of the microbenchmarks was performed, relying on a single probability to detect a regression. Considering that each PR will have several commits and benchmark metrics, the probabilities add up quite quickly, resulting in false positives and wasted engineering effort. This change reduces the chance of false positives by requiring 3 consecutive runs to all have regressed. The change makes the total running time on CI dynamic, with each consecutive run becoming less likely if there is no regression. Ultimately, if there is a regression, CI will have the longest possible running time of around 45 minutes.
Epic: None
Release note: None
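To make the intuition concrete (the alpha value here is assumed for illustration, not taken from the suite configuration): if each run independently has false-positive probability α, then three consecutive false positives occur with probability α³; for example, α = 0.05 gives α³ = 1.25 × 10⁻⁴, versus 0.05 for a single run.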
TFTR! bors r=Darrylwong,srosenberg
Nope, not yet. I was just curious since I saw it on a different PR, and was wondering what it would come up with here.
This PR introduces several enhancements to the microbenchmarking process in CI. It modifies the microbenchmarks to require three consecutive runs to detect a regression, significantly reducing the chance of false positives. As a result, the total CI running time will adjust dynamically, ensuring that if a regression is detected, CI will take at most approximately 45 minutes to complete.
Additionally, it adds configurable compare alpha thresholds to reduce noise during benchmark comparisons. This allows for better tuning and more accurate results. The metrics builder has also been updated to accept options for configuring these thresholds, improving flexibility.
Lastly, the previous use of delta thresholds to filter out insignificant regressions has been removed. This change aims to lower the probability of false positives through alternative mechanisms.
Epic: None
Release note: None