Commit 2acafd4

Optimize composite aggregation based on index sorting (#48399) (#50272)
Co-authored-by: Daniel Huang <[email protected]>

This is a spinoff of #48130 that generalizes the proposal to allow early termination with the composite aggregation when the leading sources match a prefix of, or the entire, index sort specification. In that case the composite aggregation can use the natural order of the index sort to terminate the collection early when it reaches a composite key that is greater than the bottom of the queue. The optimization is also applicable when a query other than match_all is provided.

However, the optimization is deactivated for sources that match the index sort in the following cases:

* The source is multi-valued; in that case early termination is not possible.
* missing_bucket is set to true.
1 parent 1c7bfeb commit 2acafd4
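
The commit message describes stopping the collection once a composite key is greater than the bottom of the queue. The following is a minimal Java sketch of that idea, not the code from this commit. It assumes single-valued string keys that already arrive in ascending index-sort order; `EarlyTerminationSketch`, `topKeys`, and `visited` are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class EarlyTerminationSketch {
    /** Documents visited by the last call to topKeys; kept only to show the saving. */
    static int visited;

    /**
     * Collects the `size` smallest distinct keys from a list that is ALREADY sorted
     * ascending (an assumed stand-in for a segment whose index sort matches the source).
     */
    static List<String> topKeys(List<String> sortedKeys, int size) {
        visited = 0;
        if (size <= 0) {
            return new ArrayList<>();
        }
        // Max-heap: peek() is the "bottom" of the queue, the largest key kept so far.
        PriorityQueue<String> queue = new PriorityQueue<>(Comparator.reverseOrder());
        for (String key : sortedKeys) {
            visited++;
            if (queue.contains(key)) {
                continue; // document falls into a bucket that is already held
            }
            if (queue.size() < size) {
                queue.add(key);
            } else if (key.compareTo(queue.peek()) > 0) {
                // Sorted input: every remaining key is >= this one, so none can compete.
                break;
            }
        }
        List<String> result = new ArrayList<>(queue);
        result.sort(Comparator.naturalOrder());
        return result;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "a", "b", "c", "c", "d", "e");
        System.out.println(topKeys(keys, 2) + " after visiting " + visited + " of " + keys.size());
    }
}
```

With a requested size of 2, the loop stops at the first `c`, so only 4 of the 7 keys are visited. On unsorted input the `break` would be incorrect, which is why the optimization depends on the index sort matching the sources.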

18 files changed, with 660 additions and 127 deletions

docs/reference/aggregations/bucket/composite-aggregation.asciidoc

+124 -1
@@ -117,6 +117,7 @@ Example:
 --------------------------------------------------
 GET /_search
 {
+    "size": 0,
     "aggs" : {
         "my_buckets": {
             "composite" : {
@@ -135,6 +136,7 @@ Like the `terms` aggregation it is also possible to use a script to create the v
 --------------------------------------------------
 GET /_search
 {
+    "size": 0,
     "aggs" : {
         "my_buckets": {
             "composite" : {
@@ -170,6 +172,7 @@ Example:
 --------------------------------------------------
 GET /_search
 {
+    "size": 0,
     "aggs" : {
         "my_buckets": {
             "composite" : {
@@ -188,6 +191,7 @@ The values are built from a numeric field or a script that return numerical valu
 --------------------------------------------------
 GET /_search
 {
+    "size": 0,
     "aggs" : {
         "my_buckets": {
             "composite" : {
@@ -220,6 +224,7 @@ is specified by date/time expression:
 --------------------------------------------------
 GET /_search
 {
+    "size": 0,
     "aggs" : {
         "my_buckets": {
             "composite" : {
@@ -249,6 +254,7 @@ the format specified with the format parameter:
 --------------------------------------------------
 GET /_search
 {
+    "size": 0,
     "aggs" : {
         "my_buckets": {
             "composite" : {
@@ -291,6 +297,7 @@ For example:
 --------------------------------------------------
 GET /_search
 {
+    "size": 0,
     "aggs" : {
         "my_buckets": {
             "composite" : {
@@ -313,6 +320,7 @@ in the composite buckets.
 --------------------------------------------------
 GET /_search
 {
+    "size": 0,
     "aggs" : {
         "my_buckets": {
             "composite" : {
@@ -342,6 +350,7 @@ For example:
 --------------------------------------------------
 GET /_search
 {
+    "size": 0,
     "aggs" : {
         "my_buckets": {
             "composite" : {
@@ -368,6 +377,7 @@ It is possible to include them in the response by setting `missing_bucket` to
 --------------------------------------------------
 GET /_search
 {
+    "size": 0,
     "aggs" : {
         "my_buckets": {
             "composite" : {
@@ -393,7 +403,7 @@ first 10 composite buckets created from the values source.
 The response contains the values for each composite bucket in an array containing the values extracted
 from each value source.
 
-==== After
+==== Pagination
 
 If the number of composite buckets is too high (or unknown) to be returned in a single response
 it is possible to split the retrieval in multiple requests.
@@ -407,6 +417,7 @@ For example:
 --------------------------------------------------
 GET /_search
 {
+    "size": 0,
     "aggs" : {
         "my_buckets": {
             "composite" : {
@@ -472,6 +483,7 @@ round of result can be retrieved with:
 --------------------------------------------------
 GET /_search
 {
+    "size": 0,
     "aggs" : {
         "my_buckets": {
             "composite" : {
@@ -489,6 +501,116 @@ GET /_search
 
 <1> Should restrict the aggregation to buckets that sort **after** the provided values.
 
+==== Early termination
+
+For optimal performance the <<index-modules-index-sorting,index sort>> should be set on the index so that it matches
+parts or fully the source order in the composite aggregation.
+For instance the following index sort:
+
+[source,console]
+--------------------------------------------------
+PUT twitter
+{
+    "settings" : {
+        "index" : {
+            "sort.field" : ["username", "timestamp"], <1>
+            "sort.order" : ["asc", "desc"] <2>
+        }
+    },
+    "mappings": {
+        "properties": {
+            "username": {
+                "type": "keyword",
+                "doc_values": true
+            },
+            "timestamp": {
+                "type": "date"
+            }
+        }
+    }
+}
+--------------------------------------------------
+
+<1> This index is sorted by `username` first then by `timestamp`.
+<2> ... in ascending order for the `username` field and in descending order for the `timestamp` field.
+
+.. could be used to optimize these composite aggregations:
+
+[source,console]
+--------------------------------------------------
+GET /_search
+{
+    "size": 0,
+    "aggs" : {
+        "my_buckets": {
+            "composite" : {
+                "sources" : [
+                    { "user_name": { "terms" : { "field": "user_name" } } } <1>
+                ]
+            }
+        }
+    }
+}
+--------------------------------------------------
+
+<1> `user_name` is a prefix of the index sort and the order matches (`asc`).
+
+[source,console]
+--------------------------------------------------
+GET /_search
+{
+    "size": 0,
+    "aggs" : {
+        "my_buckets": {
+            "composite" : {
+                "sources" : [
+                    { "user_name": { "terms" : { "field": "user_name" } } }, <1>
+                    { "date": { "date_histogram": { "field": "timestamp", "calendar_interval": "1d", "order": "desc" } } } <2>
+                ]
+            }
+        }
+    }
+}
+--------------------------------------------------
+
+<1> `user_name` is a prefix of the index sort and the order matches (`asc`).
+<2> `timestamp` matches also the prefix and the order matches (`desc`).
+
+In order to optimize the early termination it is advised to set `track_total_hits` in the request
+to `false`. The number of total hits that match the request can be retrieved on the first request
+and it would be costly to compute this number on every page:
+
+[source,console]
+--------------------------------------------------
+GET /_search
+{
+    "size": 0,
+    "track_total_hits": false,
+    "aggs" : {
+        "my_buckets": {
+            "composite" : {
+                "sources" : [
+                    { "user_name": { "terms" : { "field": "user_name" } } },
+                    { "date": { "date_histogram": { "field": "timestamp", "calendar_interval": "1d", "order": "desc" } } }
+                ]
+            }
+        }
+    }
+}
+--------------------------------------------------
+
+Note that the order of the source is important, in the example below switching the `user_name` with the `timestamp`
+would deactivate the sort optimization since this configuration wouldn't match the index sort specification.
+If the order of sources do not matter for your use case you can follow these simple guidelines:
+
+  * Put the fields with the highest cardinality first.
+  * Make sure that the order of the field matches the order of the index sort.
+  * Put multi-valued fields last since they cannot be used for early termination.
+
+WARNING: <<index-modules-index-sorting,index sort>> can slowdown indexing, it is very important to test index sorting
+with your specific use case and dataset to ensure that it matches your requirement. If it doesn't note that `composite`
+aggregations will also try to early terminate on non-sorted indices if the query matches all document (`match_all` query).
+
 ==== Sub-aggregations
 
 Like any `multi-bucket` aggregations the `composite` aggregation can hold sub-aggregations.
@@ -501,6 +623,7 @@ per composite bucket:
 --------------------------------------------------
 GET /_search
 {
+    "size": 0,
     "aggs" : {
         "my_buckets": {
             "composite" : {
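
The documentation added in this file notes that the order of the sources matters because the leading sources must line up with the index sort. A rough Java sketch of such a prefix check follows, assuming each source is reduced to a field name and a sort order. `SortPrefixSketch`, `SortKey`, and `matchingPrefixLength` are hypothetical names and not Elasticsearch classes, and the sketch ignores the multi-valued and `missing_bucket` cases in which the commit disables the optimization.

```java
import java.util.Arrays;
import java.util.List;

class SortPrefixSketch {
    /** Hypothetical (field, order) pair used only for this illustration. */
    static final class SortKey {
        final String field;
        final String order; // "asc" or "desc"

        SortKey(String field, String order) {
            this.field = field;
            this.order = order;
        }
    }

    /** Counts how many leading sources agree with the index sort, position by position. */
    static int matchingPrefixLength(List<SortKey> indexSort, List<SortKey> sources) {
        int limit = Math.min(indexSort.size(), sources.size());
        int matched = 0;
        while (matched < limit
                && indexSort.get(matched).field.equals(sources.get(matched).field)
                && indexSort.get(matched).order.equals(sources.get(matched).order)) {
            matched++;
        }
        return matched;
    }

    public static void main(String[] args) {
        List<SortKey> indexSort = Arrays.asList(
                new SortKey("username", "asc"), new SortKey("timestamp", "desc"));
        List<SortKey> swapped = Arrays.asList(
                new SortKey("timestamp", "desc"), new SortKey("username", "asc"));
        System.out.println(matchingPrefixLength(indexSort, indexSort)); // both sources match
        System.out.println(matchingPrefixLength(indexSort, swapped));   // first source differs
    }
}
```

A result of zero means that no leading source matches, which corresponds to the swapped-order case that the documentation says deactivates the sort optimization.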

server/src/main/java/org/elasticsearch/search/aggregations/bucket/composite/CompositeAggregationBuilder.java

+2 -1
@@ -235,7 +235,8 @@ protected AggregatorFactory doBuild(QueryShardContext queryShardContext, Aggrega
         } else {
             afterKey = null;
         }
-        return new CompositeAggregationFactory(name, queryShardContext, parent, subfactoriesBuilder, metaData, size, configs, afterKey);
+        return new CompositeAggregationFactory(name, queryShardContext, parent, subfactoriesBuilder, metaData, size,
+            configs, afterKey);
     }
 
 