Remove single shard optimization when suggesting shard_size #37041

javanna · 2018-12-31T13:26:12Z

When executing a terms aggregation we set shard_size, the number of
buckets to collect on each shard, to a value higher than the number of
requested buckets, to guarantee a basic level of precision. We have an
optimization in place that leaves shard_size set to size whenever we are
searching against a single shard, in which case maximum precision is
guaranteed by definition.

This optimization requires access to the total number of shards that
the search is executing against. In the context of cross-cluster search,
once we introduce multiple reduction steps (one per cluster), each
cluster will only know its number of local shards. That is problematic,
as we should only optimize when searching against a single shard in a
single cluster. If we were searching against one shard per cluster, the
current code would apply the optimization on every cluster and collect
only size terms per shard, causing a loss of precision.
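
The one-shard-per-cluster case can be illustrated with a small simulation. This is only a sketch: the class, the method names and the term counts are made up for illustration and are not Elasticsearch code. Each "cluster" holds a single shard, returns its top shard_size terms, and the results are summed in a final reduction.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration, not Elasticsearch code.
final class OneShardPerClusterSketch {

    // Made-up per-shard doc counts; each cluster holds exactly one shard.
    static final Map<String, Long> CLUSTER_A_SHARD = Map.of("a", 10L, "b", 9L, "c", 8L);
    static final Map<String, Long> CLUSTER_B_SHARD = Map.of("c", 10L, "d", 9L, "a", 1L);

    // Returns the top `shardSize` terms of one shard by descending count (ties broken by term).
    static Map<String, Long> topTerms(Map<String, Long> shardCounts, int shardSize) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(shardCounts.entrySet());
        entries.sort((x, y) -> {
            int byCount = Long.compare(y.getValue(), x.getValue());
            return byCount != 0 ? byCount : x.getKey().compareTo(y.getKey());
        });
        Map<String, Long> top = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : entries.subList(0, Math.min(shardSize, entries.size()))) {
            top.put(e.getKey(), e.getValue());
        }
        return top;
    }

    // Final reduction: sums the counts that the shards actually returned.
    static Map<String, Long> reduce(List<Map<String, Long>> shardResults) {
        Map<String, Long> merged = new TreeMap<>();
        for (Map<String, Long> result : shardResults) {
            result.forEach((term, count) -> merged.merge(term, count, Long::sum));
        }
        return merged;
    }

    // Runs the two-cluster search with the given per-shard bucket limit.
    static Map<String, Long> search(int shardSize) {
        return reduce(List.of(topTerms(CLUSTER_A_SHARD, shardSize),
                              topTerms(CLUSTER_B_SHARD, shardSize)));
    }

    public static void main(String[] args) {
        // size = 2 and every cluster sees "a single shard", so shard_size stays at 2:
        System.out.println(search(2));  // {a=10, b=9, c=10, d=9}
        // A larger shard_size lets both shards report all of their terms:
        System.out.println(search(13)); // {a=11, b=9, c=18, d=9}
    }
}
```

With shard_size left at size, term c is reported with a count of 10 although its true count across both shards is 18, because the shard in cluster A never returns it.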

While discussing how to address the CCS scenario, we decided not to
introduce further complexity for the sake of this single shard
optimization, as it benefits only a minority of cases and the benefits
are modest.

This commit removes the single shard optimization, meaning that the
heuristic that determines how many buckets to collect on each shard is
always applied, even when searching against a single shard.
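
The behaviour change can be sketched as follows. The class and method names are hypothetical, and the formula `size * 1.5 + 10` is an assumption taken from the documented default for shard_size; the PR description itself does not spell out the heuristic.

```java
// Hypothetical sketch of the before/after behaviour, not the actual Elasticsearch source.
final class ShardSizeSketch {

    // Assumed heuristic: request 50% more buckets than `size`, plus a small constant.
    static int heuristic(int size) {
        if (size < 1) {
            throw new IllegalArgumentException("size must be positive, got " + size);
        }
        long suggested = (long) (size * 1.5 + 10);
        return (int) Math.min(Integer.MAX_VALUE, suggested);
    }

    // Before this change: the heuristic was skipped when searching a single shard.
    static int suggestBefore(int size, int numberOfShards) {
        return numberOfShards == 1 ? size : heuristic(size);
    }

    // After this change: the heuristic always applies, so no shard count is needed.
    static int suggestAfter(int size) {
        return heuristic(size);
    }

    public static void main(String[] args) {
        System.out.println(suggestBefore(10, 1)); // 10
        System.out.println(suggestBefore(10, 5)); // 25
        System.out.println(suggestAfter(10));     // 25
    }
}
```

Dropping the `numberOfShards` parameter is what removes the dependency on the total shard count that is problematic for cross-cluster search.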

This will cause more buckets to be collected than before when searching
against a single shard. If that becomes a problem for some users, they
can work around it by setting shard_size equal to size.
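
As an illustration of the workaround, a `_search` request body along these lines sets shard_size explicitly. The aggregation name `top_tags` and the field `tags` are hypothetical; `size` and `shard_size` are the terms aggregation parameters discussed above.

```json
{
  "size": 0,
  "aggs": {
    "top_tags": {
      "terms": {
        "field": "tags",
        "size": 10,
        "shard_size": 10
      }
    }
  }
}
```

The top-level `"size": 0` only suppresses search hits; the `size` inside `terms` is the number of requested buckets.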

Relates to #32125

elasticmachine · 2018-12-31T13:26:14Z

Pinging @elastic/es-analytics-geo

jimczi

LGTM

jpountz

It looks good. Should we remove SearchContext#numberOfShards entirely to avoid adding back something like that in the future?

javanna · 2018-12-31T14:11:16Z

> Should we remove SearchContext#numberOfShards entirely to avoid adding back something like that in the future?

Would be nice, but it is still used in SearchSlowlog and ScrollingTopDocsCollectorContext

javanna · 2019-01-02T12:50:55Z

retest this please

javanna · 2019-01-02T13:35:32Z

run gradle build tests 2

javanna · 2019-01-02T16:46:01Z

thanks @jpountz & @jimczi !

javanna added >enhancement :Analytics/Aggregations Aggregations v7.0.0 v6.7.0 labels Dec 31, 2018

javanna requested review from jpountz and jimczi December 31, 2018 13:26

javanna mentioned this pull request Dec 31, 2018

Cross-cluster search alternate execution mode #32125

Closed

11 tasks

jimczi approved these changes Dec 31, 2018

jpountz approved these changes Dec 31, 2018

fix TermsShardMinDocCountIT

a795b38

update docs

affffe8

Merge branch 'master' into enhancement/remove_single_shard_size_optimization

21b146b

javanna added the backport pending label Jan 2, 2019

javanna merged commit 42ea644 into elastic:master Jan 2, 2019

javanna removed backport pending labels Jan 7, 2019

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
