Lower contention on requests with many aggs #66895

nik9000 · 2020-12-30T17:25:58Z

This lowers the contention on the REQUEST circuit breaker when building
many aggregations on many threads by preallocating a chunk of breaker
up front. This cuts down on the number of times we enter the busy loop
in ChildMemoryCircuitBreaker.limit. Now we hit it one time when building
aggregations. We still hit the busy loop if we collect many buckets.

We let the AggregationBuilder pick size of the "chunk" that we
preallocate but it doesn't have much to go on - not even the field types.
But it is available in a convenient spot and the estimates don't have to
be particularly accurate.

The benchmarks on my 12 core desktop are interesting:

Benchmark         (breaker)  Mode  Cnt    Score    Error  Units
sum                    noop  avgt   10    1.672 ±  0.042  us/op
sum                    real  avgt   10    4.100 ±  0.027  us/op
sum             preallocate  avgt   10    4.230 ±  0.034  us/op
termsSixtySums         noop  avgt   10   92.658 ±  0.939  us/op
termsSixtySums         real  avgt   10  278.764 ± 39.751  us/op
termsSixtySums  preallocate  avgt   10  120.896 ± 16.097  us/op
termsSum               noop  avgt   10    4.573 ±  0.095  us/op
termsSum               real  avgt   10    9.932 ±  0.211  us/op
termsSum        preallocate  avgt   10    7.695 ±  0.313  us/op

They show pretty clearly that not using the circuit breaker at all is
faster. But we can't do that because we don't want to bring the node
down on bad aggs. When there are many aggs (termsSixtySums) the
preallocation claws back much of the performance. It even helps
marginally when there are two aggs (termsSum). For a single agg (sum)
we see a 130 nanosecond hit. Fine.

But these values are all pretty small. At best we're seeing a 160
microsecond savings. Not so on a 160 vCPU machine:

Benchmark         (breaker)  Mode  Cnt      Score       Error  Units
sum                    noop  avgt   10     44.956 ±     8.851  us/op
sum                    real  avgt   10    118.008 ±    19.505  us/op
sum             preallocate  avgt   10    241.234 ±   305.998  us/op
termsSixtySums         noop  avgt   10   1339.802 ±    51.410  us/op
termsSixtySums         real  avgt   10  12077.671 ± 12110.993  us/op
termsSixtySums  preallocate  avgt   10   3804.515 ±  1458.702  us/op
termsSum               noop  avgt   10     59.478 ±     2.261  us/op
termsSum               real  avgt   10    293.756 ±   253.854  us/op
termsSum        preallocate  avgt   10    197.963 ±    41.578  us/op

All of these numbers are larger because we're running all the CPUs
flat out and we're seeing more contention everywhere. Even the "noop"
breaker sees some contention, but I think it is mostly around memory
allocation. Anyway, with many many (termsSixtySums) aggs we're looking
at 8 milliseconds of savings by preallocating. Just by dodging the busy
loop as much as possible. The error in the measurements there are
substantial. Here are the runs:

real:
Iteration   1: 8679.417 ±(99.9%) 273.220 us/op
Iteration   2: 5849.538 ±(99.9%) 179.258 us/op
Iteration   3: 5953.935 ±(99.9%) 152.829 us/op
Iteration   4: 5763.465 ±(99.9%) 150.759 us/op
Iteration   5: 14157.592 ±(99.9%) 395.224 us/op
Iteration   1: 24857.020 ±(99.9%) 1133.847 us/op
Iteration   2: 24730.903 ±(99.9%) 1107.718 us/op
Iteration   3: 18894.383 ±(99.9%) 738.706 us/op
Iteration   4: 5493.965 ±(99.9%) 120.529 us/op
Iteration   5: 6396.493 ±(99.9%) 143.630 us/op
preallocate:
Iteration   1: 5512.590 ±(99.9%) 110.222 us/op
Iteration   2: 3087.771 ±(99.9%) 120.084 us/op
Iteration   3: 3544.282 ±(99.9%) 110.373 us/op
Iteration   4: 3477.228 ±(99.9%) 107.270 us/op
Iteration   5: 4351.820 ±(99.9%) 82.946 us/op
Iteration   1: 3185.250 ±(99.9%) 154.102 us/op
Iteration   2: 3058.000 ±(99.9%) 143.758 us/op
Iteration   3: 3199.920 ±(99.9%) 61.589 us/op
Iteration   4: 3163.735 ±(99.9%) 71.291 us/op
Iteration   5: 5464.556 ±(99.9%) 59.034 us/op

That variability from 5.5ms to 25ms is terrible. It makes me not
particularly trust the 8ms savings from the report. But still,
the preallocating method has much less variability between runs
and almost all the runs are faster than all of the non-preallocated
runs. Maybe the savings is more like 2 or 3 milliseconds, but still.
Or maybe we should think of hte savings as worst vs worst? If so its
19 milliseconds.

Anyway, its hard to measure how much this helps. But, certainly some.

Closes #58647

elasticmachine · 2020-12-30T17:26:02Z

Pinging @elastic/es-analytics-geo (Team:Analytics)

nik9000 · 2020-12-30T17:27:52Z

server/src/main/java/org/elasticsearch/search/aggregations/support/AggregationContext.java

@@ -63,7 +66,7 @@
 * this is {@code abstract} so that tests can build it without creating the
 * massing {@link QueryShardContext}.
 */
-public abstract class AggregationContext {
+public abstract class AggregationContext implements Releasable {


Making AggregationContext Releasable lets us free the preallocated breaker after freeing all of the aggs without adding complexity to SearchContext.

DJRickyB · 2020-12-30T19:46:57Z

server/src/main/java/org/elasticsearch/search/aggregations/support/AggregationContext.java

+                this.breakerService = null;
+                this.bigArrays = context.bigArrays().withCircuitBreaking();
+            } else {
+                this.breakerService = bytesToPreallocate == 0 ? null : new PreallocatedCircuitBreakerService(


this ternary shouldn't be necessary, right?

Correct! A leftover.

I'll zap it.

DJRickyB

LGTM. reproduced benchmark results at a high level, fwiw

not-napoleon

Left a few comments, but nothing urgent.

not-napoleon · 2020-12-30T19:23:42Z

server/src/internalClusterTest/java/org/elasticsearch/action/search/TransportSearchIT.java

            };
        }

        @Override
        public InternalAggregation buildEmptyAggregation() {
-            return new InternalMax(name(), Double.NaN, DocValueFormat.RAW, Collections.emptyMap());
+            return new InternalMax(name(), Double.NaN, DocValueFormat.RAW, null);


Why do we change this to a null here?

I don't think I had to and on reflection I'll change it back. null here is actually more like what we do in production. emptyMap is more like when I user sends an empty meta.

not-napoleon · 2020-12-30T19:24:54Z

server/src/main/java/org/elasticsearch/common/breaker/PreallocatedCircuitBreakerService.java

+
+/**
+ * {@link CircuitBreakerService} that preallocates some bytes on construction.
+ * Use this when you know you'll be allocating many small


Suggested change

* Use this when you know you'll be allocating many small

* Use this when you know you'll be allocating many small bits of memory.

or something like that. Just a dangling sentence fragment right now.

not-napoleon · 2020-12-30T19:34:46Z

server/src/main/java/org/elasticsearch/common/breaker/PreallocatedCircuitBreakerService.java

+            if (newUsed > preallocated) {
+                preallocationUsed = preallocated;
+                long toAllocate = newUsed - preallocated;
+                if (toAllocate > 0) {


Isn't this always going to be true, since three lines previous, we checked that newUsed > preallocated? Nothing involved is volatile, right?

Why did I have this........ Now I've got to reread and see what I was thinking.

Yeah. When I remove the if > 0 everything just works too. I imagine this is a left over from weird thinking. I mean, math works the way we think it does.

not-napoleon · 2020-12-30T19:40:14Z

server/src/main/java/org/elasticsearch/common/breaker/PreallocatedCircuitBreakerService.java

+            }
+            // This is the fast case. No volatile reads or writes here, ma!
+            preallocationUsed = newUsed;
+            // We return garbage here but callers never use the result for anything interesting


...That is a terrifying comment. Maybe let's open an issue to drop the return value from this method?

Yeah. I feel like there are several "perspectives" on CircuitBreaker which might be useful to split up. The "I'm allocating and deallocating stuff" one. The "how much is used and how many times has this tripped?" one. There is kind of also a "testing" one. And that one is where the caller uses the result here. Maybe tests could just read the result rather than look at the percent. I'll take another look.

not-napoleon · 2020-12-30T20:00:44Z

server/src/main/java/org/elasticsearch/search/aggregations/support/AggregationContext.java

+                 * to preallocate 0 bytes. Mostly this is for testing other
+                 * things, but we should honor it and just not preallocate
+                 * anything. Setting the breakerService reference to null will
+                 * cause us to skip it when we close this context.


This is a great comment. I was literally about to ask why it would ever be 0.

This lowers the contention on the `REQUEST` circuit breaker when building many aggregations on many threads by preallocating a chunk of breaker up front. This cuts down on the number of times we enter the busy loop in `ChildMemoryCircuitBreaker.limit`. Now we hit it one time when building aggregations. We still hit the busy loop if we collect many buckets. We let the `AggregationBuilder` pick size of the "chunk" that we preallocate but it doesn't have much to go on - not even the field types. But it is available in a convenient spot and the estimates don't have to be particularly accurate. The benchmarks on my 12 core desktop are interesting: ``` Benchmark (breaker) Mode Cnt Score Error Units sum noop avgt 10 1.672 ± 0.042 us/op sum real avgt 10 4.100 ± 0.027 us/op sum preallocate avgt 10 4.230 ± 0.034 us/op termsSixtySums noop avgt 10 92.658 ± 0.939 us/op termsSixtySums real avgt 10 278.764 ± 39.751 us/op termsSixtySums preallocate avgt 10 120.896 ± 16.097 us/op termsSum noop avgt 10 4.573 ± 0.095 us/op termsSum real avgt 10 9.932 ± 0.211 us/op termsSum preallocate avgt 10 7.695 ± 0.313 us/op ``` They show pretty clearly that not using the circuit breaker at all is faster. But we can't do that because we don't want to bring the node down on bad aggs. When there are many aggs (termsSixtySums) the preallocation claws back much of the performance. It even helps marginally when there are two aggs (termsSum). For a single agg (sum) we see a 130 nanosecond hit. Fine. But these values are all pretty small. At best we're seeing a 160 microsecond savings. Not so on a 160 vCPU machine: ``` Benchmark (breaker) Mode Cnt Score Error Units sum noop avgt 10 44.956 ± 8.851 us/op sum real avgt 10 118.008 ± 19.505 us/op sum preallocate avgt 10 241.234 ± 305.998 us/op termsSixtySums noop avgt 10 1339.802 ± 51.410 us/op termsSixtySums real avgt 10 12077.671 ± 12110.993 us/op termsSixtySums preallocate avgt 10 3804.515 ± 1458.702 us/op termsSum noop avgt 10 59.478 ± 2.261 us/op termsSum real avgt 10 293.756 ± 253.854 us/op termsSum preallocate avgt 10 197.963 ± 41.578 us/op ``` All of these numbers are larger because we're running all the CPUs flat out and we're seeing more contention everywhere. Even the "noop" breaker sees some contention, but I think it is mostly around memory allocation. Anyway, with many many (termsSixtySums) aggs we're looking at 8 milliseconds of savings by preallocating. Just by dodging the busy loop as much as possible. The error in the measurements there are substantial. Here are the runs: ``` real: Iteration 1: 8679.417 ±(99.9%) 273.220 us/op Iteration 2: 5849.538 ±(99.9%) 179.258 us/op Iteration 3: 5953.935 ±(99.9%) 152.829 us/op Iteration 4: 5763.465 ±(99.9%) 150.759 us/op Iteration 5: 14157.592 ±(99.9%) 395.224 us/op Iteration 1: 24857.020 ±(99.9%) 1133.847 us/op Iteration 2: 24730.903 ±(99.9%) 1107.718 us/op Iteration 3: 18894.383 ±(99.9%) 738.706 us/op Iteration 4: 5493.965 ±(99.9%) 120.529 us/op Iteration 5: 6396.493 ±(99.9%) 143.630 us/op preallocate: Iteration 1: 5512.590 ±(99.9%) 110.222 us/op Iteration 2: 3087.771 ±(99.9%) 120.084 us/op Iteration 3: 3544.282 ±(99.9%) 110.373 us/op Iteration 4: 3477.228 ±(99.9%) 107.270 us/op Iteration 5: 4351.820 ±(99.9%) 82.946 us/op Iteration 1: 3185.250 ±(99.9%) 154.102 us/op Iteration 2: 3058.000 ±(99.9%) 143.758 us/op Iteration 3: 3199.920 ±(99.9%) 61.589 us/op Iteration 4: 3163.735 ±(99.9%) 71.291 us/op Iteration 5: 5464.556 ±(99.9%) 59.034 us/op ``` That variability from 5.5ms to 25ms is terrible. It makes me not particularly trust the 8ms savings from the report. But still, the preallocating method has much less variability between runs and almost all the runs are faster than all of the non-preallocated runs. Maybe the savings is more like 2 or 3 milliseconds, but still. Or maybe we should think of hte savings as worst vs worst? If so its 19 milliseconds. Anyway, its hard to measure how much this helps. But, certainly some. Closes elastic#58647

) This lowers the contention on the `REQUEST` circuit breaker when building many aggregations on many threads by preallocating a chunk of breaker up front. This cuts down on the number of times we enter the busy loop in `ChildMemoryCircuitBreaker.limit`. Now we hit it one time when building aggregations. We still hit the busy loop if we collect many buckets. We let the `AggregationBuilder` pick size of the "chunk" that we preallocate but it doesn't have much to go on - not even the field types. But it is available in a convenient spot and the estimates don't have to be particularly accurate. The benchmarks on my 12 core desktop are interesting: ``` Benchmark (breaker) Mode Cnt Score Error Units sum noop avgt 10 1.672 ± 0.042 us/op sum real avgt 10 4.100 ± 0.027 us/op sum preallocate avgt 10 4.230 ± 0.034 us/op termsSixtySums noop avgt 10 92.658 ± 0.939 us/op termsSixtySums real avgt 10 278.764 ± 39.751 us/op termsSixtySums preallocate avgt 10 120.896 ± 16.097 us/op termsSum noop avgt 10 4.573 ± 0.095 us/op termsSum real avgt 10 9.932 ± 0.211 us/op termsSum preallocate avgt 10 7.695 ± 0.313 us/op ``` They show pretty clearly that not using the circuit breaker at all is faster. But we can't do that because we don't want to bring the node down on bad aggs. When there are many aggs (termsSixtySums) the preallocation claws back much of the performance. It even helps marginally when there are two aggs (termsSum). For a single agg (sum) we see a 130 nanosecond hit. Fine. But these values are all pretty small. At best we're seeing a 160 microsecond savings. Not so on a 160 vCPU machine: ``` Benchmark (breaker) Mode Cnt Score Error Units sum noop avgt 10 44.956 ± 8.851 us/op sum real avgt 10 118.008 ± 19.505 us/op sum preallocate avgt 10 241.234 ± 305.998 us/op termsSixtySums noop avgt 10 1339.802 ± 51.410 us/op termsSixtySums real avgt 10 12077.671 ± 12110.993 us/op termsSixtySums preallocate avgt 10 3804.515 ± 1458.702 us/op termsSum noop avgt 10 59.478 ± 2.261 us/op termsSum real avgt 10 293.756 ± 253.854 us/op termsSum preallocate avgt 10 197.963 ± 41.578 us/op ``` All of these numbers are larger because we're running all the CPUs flat out and we're seeing more contention everywhere. Even the "noop" breaker sees some contention, but I think it is mostly around memory allocation. Anyway, with many many (termsSixtySums) aggs we're looking at 8 milliseconds of savings by preallocating. Just by dodging the busy loop as much as possible. The error in the measurements there are substantial. Here are the runs: ``` real: Iteration 1: 8679.417 ±(99.9%) 273.220 us/op Iteration 2: 5849.538 ±(99.9%) 179.258 us/op Iteration 3: 5953.935 ±(99.9%) 152.829 us/op Iteration 4: 5763.465 ±(99.9%) 150.759 us/op Iteration 5: 14157.592 ±(99.9%) 395.224 us/op Iteration 1: 24857.020 ±(99.9%) 1133.847 us/op Iteration 2: 24730.903 ±(99.9%) 1107.718 us/op Iteration 3: 18894.383 ±(99.9%) 738.706 us/op Iteration 4: 5493.965 ±(99.9%) 120.529 us/op Iteration 5: 6396.493 ±(99.9%) 143.630 us/op preallocate: Iteration 1: 5512.590 ±(99.9%) 110.222 us/op Iteration 2: 3087.771 ±(99.9%) 120.084 us/op Iteration 3: 3544.282 ±(99.9%) 110.373 us/op Iteration 4: 3477.228 ±(99.9%) 107.270 us/op Iteration 5: 4351.820 ±(99.9%) 82.946 us/op Iteration 1: 3185.250 ±(99.9%) 154.102 us/op Iteration 2: 3058.000 ±(99.9%) 143.758 us/op Iteration 3: 3199.920 ±(99.9%) 61.589 us/op Iteration 4: 3163.735 ±(99.9%) 71.291 us/op Iteration 5: 5464.556 ±(99.9%) 59.034 us/op ``` That variability from 5.5ms to 25ms is terrible. It makes me not particularly trust the 8ms savings from the report. But still, the preallocating method has much less variability between runs and almost all the runs are faster than all of the non-preallocated runs. Maybe the savings is more like 2 or 3 milliseconds, but still. Or maybe we should think of hte savings as worst vs worst? If so its 19 milliseconds. Anyway, its hard to measure how much this helps. But, certainly some. Closes #58647

nik9000 added 11 commits December 8, 2020 12:27

WIP

2c8890d

WIP

85dacd4

Sixty sums

26d9809

Doc

9166f27

Test preallocation pattern

46a2295

Merge branch 'master' into agg_construction_contention

bb0ddcc

Compile plz

385607d

Working?

795c89a

Drop .new

ff48353

Release properly

d3f04a4

Drop benchmark we don't need

2c8aeb5

nik9000 added >enhancement :Analytics/Aggregations Aggregations v8.0.0 v7.12.0 labels Dec 30, 2020

nik9000 requested review from not-napoleon, polyfractal and DJRickyB December 30, 2020 17:25

elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Dec 30, 2020

nik9000 commented Dec 30, 2020

View reviewed changes

nik9000 added 2 commits December 30, 2020 13:57

tests

24addcf

Test

e7a64a8

DJRickyB reviewed Dec 30, 2020

View reviewed changes

zap

0bd4bc5

DJRickyB approved these changes Dec 30, 2020

View reviewed changes

not-napoleon approved these changes Dec 30, 2020

View reviewed changes

nik9000 added 3 commits January 4, 2021 09:07

Merge branch 'master' into agg_construction_contention

4e01f0e

Clean up

2d42334

Moar words!

9002674

nik9000 merged commit de8b39e into elastic:master Jan 4, 2021

nik9000 added the backport pending label Jan 4, 2021

nik9000 mentioned this pull request Jan 4, 2021

Aggregations can be bottlenecked on ChildMemoryCircuitBreaker.limit() #58647

Closed

nik9000 mentioned this pull request Jan 4, 2021

Remove return from breakers #66943

Merged

nik9000 removed the backport pending label Jan 4, 2021

This was referenced May 7, 2021

Fix PreallocatedCircuitBreakerService leak #72846

Merged

Fix PreallocatedCircuitBreakerService leak (#72846) #72868

Merged

Fix PreallocatedCircuitBreakerService leak (#72846) #72869

Merged

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021

	* Use this when you know you'll be allocating many small
	* Use this when you know you'll be allocating many small bits of memory.

Lower contention on requests with many aggs #66895

Lower contention on requests with many aggs #66895

Uh oh!

Conversation

nik9000 commented Dec 30, 2020

Uh oh!

elasticmachine commented Dec 30, 2020

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

DJRickyB left a comment

Choose a reason for hiding this comment

Uh oh!

not-napoleon left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!