Skip to content

Remove single shard optimization when suggesting shard_size #37041

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

javanna
Copy link
Member

@javanna javanna commented Dec 31, 2018

When executing terms aggregations we set the shard_size, meaning the
number of buckets to collect on each shard, to a value that's higher than
the number of requested buckets, to guarantee some basic level of
precision. We have an optimization in place so that we leave shard_size
set to size whenever we are searching against a single shard, in which
case maximum precision is guaranteed by definition.

Such optimization requires us access to the total number of shards that
the search is executing against. In the context of cross-cluster search,
once we will introduce multiple reduction steps (one per cluster) each
cluster will only know the number of local shards, which is problematic
as we should only optimize if we are searching against a single shard in a
single cluster. It could be that we are searching against one shard per cluster
in which case the current code would optimize number of terms causing
a loss of precision.

While discussing how to address the CCS scenario, we decided that we do
not want to introduce further complexity caused by this single shard
optimization, as it benefits only a minority of cases, especially when
the benefits are not so great.

This commit removes the single shard optimization, meaning that we will
always have heuristic enabled on how many number of buckets to collect
on the shards, even when searching against a single shard.

This will cause more buckets to be collected when searching against a single
shard compared to before. If that becomes a problem for some users, they
can work around that by setting the shard_size equal to the size.

Relates to #32125

When executing terms aggregations we set the shard_size, meaning the
number of buckets to collect on each shard, to a value that's higher than
 the number of requested buckets, to guarantee some basic level of
 precision. We have an optimization in place so that we leave shard_size
 set to size whenever we are searching against a single shard, in which
 case maximum precision is guaranteed by definition.

Such optimization requires us access to the total number of shards that
the search is executing against. In the context of cross-cluster search,
once we will introduce multiple reduction steps (one per cluster) each
cluster will only know the number of local shards, which is problematic
as we can only optimize if we are searching against a single shard in a
single cluster.

While discussing how to address the CCS scenario, we decided that we do
not want to introduce further complexity caused by this single shard
optimization that benefits only a minority of cases, especially when
the benefits are not so huge.

This commit removes the single shard optimization, meaning that we will
always have heuristic enabled on how many number of buckets to collect
on the shards, even when searching against a single shard.
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good. Should we remove SearchContext#numberOfShards entirely to avoid adding back something like that in the future?

@javanna
Copy link
Member Author

javanna commented Dec 31, 2018

Should we remove SearchContext#numberOfShards entirely to avoid adding back something like that in the future?

Would be nice, but it is still used in SearchSlowlog and ScrollingTopDocsCollectorContext

@javanna
Copy link
Member Author

javanna commented Jan 2, 2019

retest this please

@javanna
Copy link
Member Author

javanna commented Jan 2, 2019

run gradle build tests 2

@javanna javanna merged commit 42ea644 into elastic:master Jan 2, 2019
@javanna
Copy link
Member Author

javanna commented Jan 2, 2019

thanks @jpountz & @jimczi !

javanna added a commit that referenced this pull request Jan 7, 2019
When executing terms aggregations we set the shard_size, meaning the
number of buckets to collect on each shard, to a value that's higher than
the number of requested buckets, to guarantee some basic level of
precision. We have an optimization in place so that we leave shard_size
set to size whenever we are searching against a single shard, in which
case maximum precision is guaranteed by definition.

Such optimization requires us access to the total number of shards that
the search is executing against. In the context of cross-cluster search,
once we will introduce multiple reduction steps (one per cluster) each
cluster will only know the number of local shards, which is problematic
as we should only optimize if we are searching against a single shard in a
single cluster. It could be that we are searching against one shard per cluster
in which case the current code would optimize number of terms causing
a loss of precision.

While discussing how to address the CCS scenario, we decided that we do
not want to introduce further complexity caused by this single shard
optimization, as it benefits only a minority of cases, especially when
the benefits are not so great.

This commit removes the single shard optimization, meaning that we will
always have heuristic enabled on how many number of buckets to collect
on the shards, even when searching against a single shard.

This will cause more buckets to be collected when searching against a single
shard compared to before. If that becomes a problem for some users, they
can work around that by setting the shard_size equal to the size.

Relates to #32125
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants