Skip to content

Dynamic pruning for aggregations? #85759

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
jpountz opened this issue Apr 8, 2022 · 1 comment
Open

Dynamic pruning for aggregations? #85759

jpountz opened this issue Apr 8, 2022 · 1 comment
Labels
:Analytics/Aggregations Aggregations >enhancement Meta Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)

Comments

@jpountz
Copy link
Contributor

jpountz commented Apr 8, 2022

Description

Some queries do not need to look at all matches to return results: while matches are being collected, they can take advantage of the hits that have been collected so far to skip over uninteresting matches. This is what Elasticsearch does for sorted queries for instance, if you want to get the top 100 most recent hits by descending @timestamp and the 100th hit you collected so far has a timestamp of 2022-04-08, then you know you only need to look at hits whole timestamp is greater than or equal to this value, and it's easy to narrow down the set of documents to evaluate thanks to index structures.

Dynamic pruning has the potential of yielding massive speedups. See for instance annotation DD on Lucene's nightly benchmarks for queries sorted by date, dynamic pruning made that query 7.5x faster.

Aggregations have some logic to short-circuit collection entirely already. This is even more efficient than dynamic pruning but has the downside of usually only being applicable to very specific queries, e.g. queries that rewrite to a match_all. On the other hand, dynamic pruning applies to any query.

This made me want to think more of whether dynamic pruning could be applied to aggregations:

  • min and max could leverage dynamic pruning, since they are very similar to sorted queries.
  • geo_bounds might be able to leverage dynamic pruning by filtering documents that are outside of the bounds that have been computed so far.
  • top_hits and top_metrics can leverage dynamic pruning similarly to sorted queries.
  • cardinality aggregations could skip all documents whose values have already been seen. In practice this might only work well on fields that have cardinalities in the order of tens though, otherwise the filter that matches all unseen values might be more costly than collecting all matches naively.

TSDB is also going to introduce more opportunities for dynamic pruning thanks to its index sort. For instance, cardinality aggregations on dimensions could leverage dynamic pruning regardless of the field's cardinality (#85523) and top_metrics by descending @timestamp field could perform even better thanks to the secondary sort on @timestamp: getting the last sample for each timeseries would be very cheap.

Maybe we could also think of aggregations that we do not have that could be interesting mostly because they could leverage dynamic pruning. For instance, listing unique values of a field, which is like terms without the doc_count. Such an aggregation could implement dynamic pruning for the same reasons that the cardinality aggregation could, and could help answer questions like "What are all the hosts that sent data within the last hour" very efficiently.

@elasticmachine elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Apr 8, 2022
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/Aggregations Aggregations >enhancement Meta Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)
Projects
None yet
Development

No branches or pull requests

2 participants