Skip to content

Properly size empty filters #71864

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 5, 2021
Merged

Conversation

nik9000
Copy link
Member

@nik9000 nik9000 commented Apr 19, 2021

This saves a few ArrayList realocations by fixing an order of operations
bug issue in the "empty" handling arm for filters. I've also added a
test so I can debug that branch.

@nik9000 nik9000 requested a review from not-napoleon April 19, 2021 16:59
@elasticmachine elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Apr 19, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

This saves a few ArrayList realocations by fixing an order of operations
bug issue in the "empty" handling arm for `filters`. I've also added a
test so I can debug that branch.
nik9000 referenced this pull request Apr 19, 2021
This speeds up the `terms` agg in a very specific case:
1. It has no child aggregations
2. It has no parent aggregations
3. There are no deleted documents
4. You are not using document level security
5. There is no top level query
6. The field has global ordinals
7. There are less than one thousand distinct terms

That is a lot of restirctions! But the speed up pretty substantial because
in those cases we can serve the entire aggregation using metadata that
lucene precomputes while it builds the index. In a real rally track we
have we get a 92% speed improvement, but the index isn't *that* big:

```
| 90th percentile service time | keyword-terms-low-cardinality |     446.031 |     36.7677 | -409.263 |     ms |
```

In a rally track with a larger index I ran some tests by hand and the
aggregation went from 2200ms to 8ms.

Even though there are 7 restrictions on this, I expect it to come into
play enough to matter. Restriction 6 just means you are aggregating on
a `keyword` field. Or an `ip`. And its fairly common for `keyword`s to
have less than a thousand distinct values. Certainly not everywhere, but
some places.

I expect "cold tier" indices are very very likely not to have deleted
documents at all. And the optimization works segment by segment - so
it'll save some time on each segment without deleted documents. But more
time if the entire index doesn't have any.

The optimization builds on #68871 which translates `terms` aggregations
against low cardinality fields with global ordinals into a `filters`
aggregation. This teaches the `filters` aggregation to recognize when
it can get its results from the index metadata. Rather, it creates the
infrastructure to make that fairly simple and applies it in the case of
the queries generated by the terms aggregation.
Copy link
Member

@not-napoleon not-napoleon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -251,7 +251,7 @@ private FiltersAggregator(String name, AggregatorFactories factories, List<Query
@Override
public InternalAggregation buildEmptyAggregation() {
InternalAggregations subAggs = buildEmptySubAggregations();
List<InternalFilters.InternalBucket> buckets = new ArrayList<>(filters.size() + otherBucketKey == null ? 0 : 1);
List<InternalFilters.InternalBucket> buckets = new ArrayList<>(filters.size() + (otherBucketKey == null ? 0 : 1));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch. I always say, if you have to think about operator precedence, you should add more parenthesis.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Truth.

@nik9000 nik9000 merged commit 47719f8 into elastic:master May 5, 2021
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request May 5, 2021
This saves a few ArrayList realocations by fixing an order of operations
bug issue in the "empty" handling arm for `filters`. I've also added a
test so I can debug that branch.
nik9000 added a commit that referenced this pull request May 10, 2021
This saves a few ArrayList realocations by fixing an order of operations
bug issue in the "empty" handling arm for `filters`. I've also added a
test so I can debug that branch.

Co-authored-by: Elastic Machine <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/Aggregations Aggregations >bug Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v7.14.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants