Lower contention on requests with many aggs (backport of #66895) #66941
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This lowers the contention on the
REQUEST
circuit breaker when buildingmany aggregations on many threads by preallocating a chunk of breaker
up front. This cuts down on the number of times we enter the busy loop
in
ChildMemoryCircuitBreaker.limit
. Now we hit it one time when buildingaggregations. We still hit the busy loop if we collect many buckets.
We let the
AggregationBuilder
pick size of the "chunk" that wepreallocate but it doesn't have much to go on - not even the field types.
But it is available in a convenient spot and the estimates don't have to
be particularly accurate.
The benchmarks on my 12 core desktop are interesting:
They show pretty clearly that not using the circuit breaker at all is
faster. But we can't do that because we don't want to bring the node
down on bad aggs. When there are many aggs (termsSixtySums) the
preallocation claws back much of the performance. It even helps
marginally when there are two aggs (termsSum). For a single agg (sum)
we see a 130 nanosecond hit. Fine.
But these values are all pretty small. At best we're seeing a 160
microsecond savings. Not so on a 160 vCPU machine:
All of these numbers are larger because we're running all the CPUs
flat out and we're seeing more contention everywhere. Even the "noop"
breaker sees some contention, but I think it is mostly around memory
allocation. Anyway, with many many (termsSixtySums) aggs we're looking
at 8 milliseconds of savings by preallocating. Just by dodging the busy
loop as much as possible. The error in the measurements there are
substantial. Here are the runs:
That variability from 5.5ms to 25ms is terrible. It makes me not
particularly trust the 8ms savings from the report. But still,
the preallocating method has much less variability between runs
and almost all the runs are faster than all of the non-preallocated
runs. Maybe the savings is more like 2 or 3 milliseconds, but still.
Or maybe we should think of hte savings as worst vs worst? If so its
19 milliseconds.
Anyway, its hard to measure how much this helps. But, certainly some.
Closes #58647