Commit 2478dbd

Optimize composite aggregation based on index sorting (#48399)

Co-authored-by: Daniel Huang <[email protected]>

This is a spinoff of #48130 that generalizes the proposal to allow early termination with the composite aggregation when leading sources match a prefix or the entire index sort specification. In such case the composite aggregation can use the index sort natural order to early terminate the collection when it reaches a composite key that is greater than the bottom of the queue. The optimization is also applicable when a query other than match_all is provided.

However the optimization is deactivated for sources that match the index sort in the following cases:

* Multi-valued source, in such case early termination is not possible.
* missing_bucket is set to true
1 parent 098f540 commit 2478dbd
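
The early-termination idea in the commit message can be illustrated with a minimal sketch. This is not the Elasticsearch implementation: `collect_sorted` is a hypothetical helper, and it assumes each document yields exactly one composite key (single-valued sources, no `missing_bucket`) and that documents arrive in index-sort order.

```python
from typing import Dict, Hashable, Iterable, Tuple


def collect_sorted(keys: Iterable[Hashable], size: int) -> Tuple[Dict[Hashable, int], int]:
    """Collect the first `size` distinct composite keys from sorted input.

    `keys` is assumed to already be in index-sort order (an assumption of
    this sketch). Returns (doc count per bucket, documents visited).
    """
    buckets: Dict[Hashable, int] = {}  # stands in for the bounded queue
    visited = 0
    for key in keys:
        visited += 1
        if key in buckets:
            buckets[key] += 1  # same bucket as an earlier document
            continue
        if len(buckets) == size:
            # The queue is full and this key sorts after its bottom entry.
            # Since the input is sorted, no later document can compete: stop.
            break
        buckets[key] = 1
    return buckets, visited


# Documents sorted by username (asc) then timestamp (desc), as (username, timestamp).
docs = [("alice", 3), ("alice", 3), ("alice", 1), ("bob", 2), ("carol", 5), ("dave", 1)]
buckets, visited = collect_sorted(docs, size=2)
print(buckets, visited)  # {('alice', 3): 2, ('alice', 1): 1} 4
```

With `size=2` the loop stops at the fourth document instead of scanning all six. On an unsorted index every document would have to be visited, because a smaller key could still appear later.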

18 files changed, +660 -127 lines changed

docs/reference/aggregations/bucket/composite-aggregation.asciidoc

+124-1
@@ -116,6 +116,7 @@ Example:
 --------------------------------------------------
 GET /_search
 {
+    "size": 0,
     "aggs" : {
         "my_buckets": {
             "composite" : {
@@ -134,6 +135,7 @@ Like the `terms` aggregation it is also possible to use a script to create the v
 --------------------------------------------------
 GET /_search
 {
+    "size": 0,
     "aggs" : {
         "my_buckets": {
             "composite" : {
@@ -168,6 +170,7 @@ Example:
 --------------------------------------------------
 GET /_search
 {
+    "size": 0,
     "aggs" : {
         "my_buckets": {
             "composite" : {
@@ -186,6 +189,7 @@ The values are built from a numeric field or a script that return numerical valu
 --------------------------------------------------
 GET /_search
 {
+    "size": 0,
     "aggs" : {
         "my_buckets": {
             "composite" : {
@@ -218,6 +222,7 @@ is specified by date/time expression:
 --------------------------------------------------
 GET /_search
 {
+    "size": 0,
     "aggs" : {
         "my_buckets": {
             "composite" : {
@@ -247,6 +252,7 @@ the format specified with the format parameter:
 --------------------------------------------------
 GET /_search
 {
+    "size": 0,
     "aggs" : {
         "my_buckets": {
             "composite" : {
@@ -289,6 +295,7 @@ For example:
 --------------------------------------------------
 GET /_search
 {
+    "size": 0,
     "aggs" : {
         "my_buckets": {
             "composite" : {
@@ -311,6 +318,7 @@ in the composite buckets.
 --------------------------------------------------
 GET /_search
 {
+    "size": 0,
     "aggs" : {
         "my_buckets": {
             "composite" : {
@@ -340,6 +348,7 @@ For example:
 --------------------------------------------------
 GET /_search
 {
+    "size": 0,
     "aggs" : {
         "my_buckets": {
             "composite" : {
@@ -366,6 +375,7 @@ It is possible to include them in the response by setting `missing_bucket` to
 --------------------------------------------------
 GET /_search
 {
+    "size": 0,
     "aggs" : {
         "my_buckets": {
             "composite" : {
@@ -391,7 +401,7 @@ first 10 composite buckets created from the values source.
 The response contains the values for each composite bucket in an array containing the values extracted
 from each value source.
 
-==== After
+==== Pagination
 
 If the number of composite buckets is too high (or unknown) to be returned in a single response
 it is possible to split the retrieval in multiple requests.
@@ -405,6 +415,7 @@ For example:
 --------------------------------------------------
 GET /_search
 {
+    "size": 0,
     "aggs" : {
         "my_buckets": {
             "composite" : {
@@ -470,6 +481,7 @@ round of result can be retrieved with:
 --------------------------------------------------
 GET /_search
 {
+    "size": 0,
     "aggs" : {
         "my_buckets": {
             "composite" : {
@@ -487,6 +499,116 @@ GET /_search
 
 <1> Should restrict the aggregation to buckets that sort **after** the provided values.
 
+==== Early termination
+
+For optimal performance the <<index-modules-index-sorting,index sort>> should be set on the index so that it matches
+part or all of the source order in the composite aggregation.
+For instance the following index sort:
+
+[source,console]
+--------------------------------------------------
+PUT twitter
+{
+    "settings" : {
+        "index" : {
+            "sort.field" : ["username", "timestamp"],   <1>
+            "sort.order" : ["asc", "desc"]              <2>
+        }
+    },
+    "mappings": {
+        "properties": {
+            "username": {
+                "type": "keyword",
+                "doc_values": true
+            },
+            "timestamp": {
+                "type": "date"
+            }
+        }
+    }
+}
+--------------------------------------------------
+
+<1> This index is sorted by `username` first then by `timestamp`.
+<2> ... in ascending order for the `username` field and in descending order for the `timestamp` field.
+
+... could be used to optimize these composite aggregations:
+
+[source,console]
+--------------------------------------------------
+GET /_search
+{
+    "size": 0,
+    "aggs" : {
+        "my_buckets": {
+            "composite" : {
+                "sources" : [
+                    { "user_name": { "terms" : { "field": "username" } } }     <1>
+                ]
+            }
+        }
+     }
+}
+--------------------------------------------------
+
+<1> `username` is a prefix of the index sort and the order matches (`asc`).
+
+[source,console]
+--------------------------------------------------
+GET /_search
+{
+    "size": 0,
+    "aggs" : {
+        "my_buckets": {
+            "composite" : {
+                "sources" : [
+                    { "user_name": { "terms" : { "field": "username" } } }, <1>
+                    { "date": { "date_histogram": { "field": "timestamp", "calendar_interval": "1d", "order": "desc" } } } <2>
+                ]
+            }
+        }
+     }
+}
+--------------------------------------------------
+
+<1> `username` is a prefix of the index sort and the order matches (`asc`).
+<2> `timestamp` also matches the prefix and the order matches (`desc`).
+
+In order to optimize the early termination it is advised to set `track_total_hits` in the request
+to `false`. The number of total hits that match the request can be retrieved on the first request
+and it would be costly to compute this number on every page:
+
+[source,console]
+--------------------------------------------------
+GET /_search
+{
+    "size": 0,
+    "track_total_hits": false,
+    "aggs" : {
+        "my_buckets": {
+            "composite" : {
+                "sources" : [
+                    { "user_name": { "terms" : { "field": "username" } } },
+                    { "date": { "date_histogram": { "field": "timestamp", "calendar_interval": "1d", "order": "desc" } } }
+                ]
+            }
+        }
+     }
+}
+--------------------------------------------------
+
+Note that the order of the sources is important: in the example above, switching the `username` and `timestamp` sources
+would deactivate the sort optimization since this configuration wouldn't match the index sort specification.
+If the order of sources does not matter for your use case you can follow these simple guidelines:
+
+  * Put the fields with the highest cardinality first.
+  * Make sure that the order of the fields matches the order of the index sort.
+  * Put multi-valued fields last since they cannot be used for early termination.
+
+WARNING: <<index-modules-index-sorting,index sort>> can slow down indexing, so it is very important to test index sorting
+with your specific use case and dataset to ensure that it matches your requirements. If it doesn't, note that `composite`
+aggregations will also try to early terminate on non-sorted indices if the query matches all documents (`match_all` query).
+
 ==== Sub-aggregations
 
 Like any `multi-bucket` aggregations the `composite` aggregation can hold sub-aggregations.
@@ -499,6 +621,7 @@ per composite bucket:
 --------------------------------------------------
 GET /_search
 {
+    "size": 0,
     "aggs" : {
         "my_buckets": {
             "composite" : {
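
The pagination and `track_total_hits` advice in the documentation change above can be sketched as a small request builder. This is an illustrative sketch, not part of the commit: `composite_page_request` is a hypothetical helper, and tracking total hits only on the first page is one reading of the documented guidance. The `size`, `sources` and `after` keys are the composite aggregation parameters shown in the documentation.

```python
from typing import Any, Dict, List, Optional


def composite_page_request(sources: List[Dict[str, Any]], size: int,
                           after: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build a search body for one page of a composite aggregation.

    Hypothetical helper: total hits are tracked only on the first page
    (no `after` key), so that later pages can terminate early.
    """
    composite: Dict[str, Any] = {"size": size, "sources": sources}
    if after is not None:
        composite["after"] = after  # resume after the last key of the previous page
    return {
        "size": 0,                         # no search hits, only aggregation buckets
        "track_total_hits": after is None,
        "aggs": {"my_buckets": {"composite": composite}},
    }


sources = [
    {"user_name": {"terms": {"field": "username"}}},
    {"date": {"date_histogram": {"field": "timestamp",
                                 "calendar_interval": "1d", "order": "desc"}}},
]
first_page = composite_page_request(sources, size=10)
next_page = composite_page_request(sources, size=10,
                                   after={"user_name": "alice", "date": 1494288000000})
```

The `after` value here is made up for illustration. In practice it would be copied from the `after_key` returned with the previous page.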

server/src/main/java/org/elasticsearch/search/aggregations/bucket/composite/CompositeAggregationBuilder.java

+2-1
@@ -235,7 +235,8 @@ protected AggregatorFactory doBuild(QueryShardContext queryShardContext, Aggrega
         } else {
             afterKey = null;
         }
-        return new CompositeAggregationFactory(name, queryShardContext, parent, subfactoriesBuilder, metaData, size, configs, afterKey);
+        return new CompositeAggregationFactory(name, queryShardContext, parent, subfactoriesBuilder, metaData, size,
+            configs, afterKey);
     }
 
 