Automatic tie-breaking for sorted queries within a PIT #65450
This change automatically generates a tiebreaker for sorted queries that are executed under a PIT (point in time reader). This makes it possible to paginate consistently over the matching documents without having to provide a sort criterion that is unique per document. The tiebreaker is automatically added as the last sort value of each search hit in the response. It is then used by `search_after` to ensure that pagination does not miss any documents and that each document appears only once. This commit also allows queries sorted by internal Lucene id (`_doc`) to be optimized when they are executed under a PIT, in the same way as scroll queries. Closes elastic#56828
Pinging @elastic/es-search (Team:Search)
@@ -71,9 +71,9 @@ To get the first page of results, submit a search request with a `sort`
argument. If using a PIT, specify the PIT ID in the `pit.id` parameter and omit
the target data stream or index from the request path.

IMPORTANT: We recommend you include a tiebreaker field in your `sort`. This
tiebreaker field should contain a unique value for each document. If you don't
include a tiebreaker field, your paged results could miss or duplicate hits.
What if a user doesn't use a PIT in a sort request? In that case, should we still keep the recommendation to add a tiebreaker field to the sort fields?
Or is our official recommendation to always use a PIT with sort?
Thanks @jimczi. A great addition to sort! I have two main questions, just to better understand how things work:
- Does a PIT always have the same shards, in the same order, even if the PIT ID changes?
- What is our story about _asc sort vs. _desc sort? Usually, a search with a _desc sort returns docs in the opposite order from the same search with an _asc sort. With this new tiebreaker field that is no longer the case: docs with equal sort fields come back in the same increasing order of shard/_docId regardless of _asc or _desc sort. This looks unusual, and it could matter for backwards pagination. Since we don't have search_before, we suggested that users use `search_after` with a reverse sort, but that strategy will no longer work with tie-breaking because they would get the same documents.
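To make the second question concrete, here is a minimal sketch in plain Java, not Elasticsearch code. It assumes each hit is a hypothetical `{sortValue, shardIndex, docId}` triple and that the tiebreaker always compares ascending by shard index and then doc ID, as the comment above describes. `TieOrderSketch` and `sorted` are names invented for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class TieOrderSketch {
    private TieOrderSketch() {}

    // Each hit is {sortValue, shardIndex, docId} (hypothetical representation).
    static List<int[]> sorted(List<int[]> hits, boolean descending) {
        Comparator<int[]> primary = Comparator.comparingInt(h -> h[0]);
        if (descending) {
            // Only the primary sort is reversed.
            primary = primary.reversed();
        }
        // Assumption: the tiebreaker is always ascending on (shardIndex, docId).
        Comparator<int[]> full = primary
            .thenComparingInt(h -> h[1])
            .thenComparingInt(h -> h[2]);
        List<int[]> copy = new ArrayList<>(hits);
        copy.sort(full);
        return copy;
    }

    public static void main(String[] args) {
        List<int[]> hits = new ArrayList<>();
        hits.add(new int[] {5, 1, 0});
        hits.add(new int[] {7, 0, 0});
        hits.add(new int[] {5, 0, 1});
        for (boolean desc : new boolean[] {false, true}) {
            StringBuilder sb = new StringBuilder(desc ? "desc:" : "asc:");
            for (int[] h : sorted(hits, desc)) {
                sb.append(" (").append(h[0]).append(", shard ").append(h[1])
                  .append(", doc ").append(h[2]).append(")");
            }
            System.out.println(sb);
        }
    }
}
```

Under these assumptions the two hits with sort value 5 appear as shard 0 then shard 1 in both directions, so a reversed sort is not a mirror image of the forward sort within a group of ties.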
private boolean canReturnNullResponseIfMatchNoDocs;
private SearchSortValuesAndFormats bottomSortValues;

private int shardIndex = -1;
Can we make the `shardIndex` field final and always required?
-a PIT, the response's `pit_id` parameter contains an updated PIT ID.
+a PIT, the response's `pit_id` parameter contains an updated PIT ID and a tiebreaker
+is included as the last `sort` values for each hit. This tiebreaker is a unique value for each document that allows
+consistent pagination within a `pit_id`.
Is it worth adding more details about what this tie_breaker is made of, e.g. "tie_breaker is a combination of shard index and internal _doc id"?
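As an illustration of what "a combination of shard index and internal _doc id" could look like, here is a minimal sketch that packs both into a single `long`. The bit layout (shard index in the high 32 bits, doc ID in the low 32 bits) is an assumption made for this example and is not confirmed by the PR. Only the name `decodeDocID` appears in the diff; `TiebreakerSketch`, `encode` and `decodeShardIndex` are hypothetical.

```java
final class TiebreakerSketch {
    private TiebreakerSketch() {}

    // Assumed layout: shard index in the high 32 bits, Lucene doc ID in the low 32 bits.
    static long encode(int shardIndex, int docId) {
        return ((long) shardIndex << 32) | (docId & 0xFFFFFFFFL);
    }

    static int decodeShardIndex(long tiebreaker) {
        return (int) (tiebreaker >>> 32);
    }

    static int decodeDocID(long tiebreaker) {
        // Narrowing to int keeps the low 32 bits.
        return (int) tiebreaker;
    }

    public static void main(String[] args) {
        long tiebreaker = encode(3, 42);
        System.out.println(decodeShardIndex(tiebreaker) + " " + decodeDocID(tiebreaker)); // prints "3 42"
    }
}
```

With this layout and non-negative inputs, comparing two encoded values as plain longs orders hits by shard index first and by doc ID second, which is what makes the value usable as the last sort key.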
InternalAggregations aggregations, SearchProfileShardResults shardResults, SortedTopDocs sortedTopDocs,
DocValueFormat[] sortValueFormats, int numReducePhases, int size, int from, boolean isEmptyResult) {
// <code>true</code> if the search request uses a point in time reader
final boolean hasPIT;
nit: extra space
@@ -52,6 +52,8 @@
private final TimeValue searchContextKeepAlive;
private final PlainIterator<String> targetNodesIterator;

private int searchShardIndex;
Should the default value be -1, similar to `ShardSearchRequest`?
if (Sort.RELEVANCE.equals(sort) || shardIndex == fieldDocShard) {
    fieldDoc.doc = decodeDocID(tiebreaker);
} else if (shardIndex < fieldDocShard) {
    fieldDoc.doc = Integer.MAX_VALUE;
It may be worth adding more comments explaining what a given value of the field doc means, e.g.
fieldDoc.doc = Integer.MAX_VALUE; // skipping docs of this shard with equal field values
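To spell out the reasoning behind those values, here is a simplified sketch in plain Java. It assumes that, among hits whose sort values equal the `search_after` values, a shard only collects docs whose ID is strictly greater than the resume doc. The final branch returning -1 is not visible in the diff and is an assumption, the `Sort.RELEVANCE` case is left out, and `SearchAfterDocSketch`, `resumeDoc` and `isCollected` are hypothetical names.

```java
final class SearchAfterDocSketch {
    private SearchAfterDocSketch() {}

    // shardIndex: the shard running the query.
    // fieldDocShard / fieldDocId: shard index and doc ID decoded from the search_after tiebreaker.
    static int resumeDoc(int shardIndex, int fieldDocShard, int fieldDocId) {
        if (shardIndex == fieldDocShard) {
            // Same shard as the last returned hit: resume right after that doc.
            return fieldDocId;
        } else if (shardIndex < fieldDocShard) {
            // This shard sorts before the last hit's shard, so its docs with equal
            // sort values were already returned: skip all of them.
            return Integer.MAX_VALUE;
        } else {
            // Assumption: this shard sorts after the last hit's shard, so none of its
            // docs with equal sort values were returned yet: keep all of them.
            return -1;
        }
    }

    // Assumed collection rule for hits whose sort values tie with the search_after values.
    static boolean isCollected(int docId, int resumeDoc) {
        return docId > resumeDoc;
    }
}
```

For example, if the last hit of the previous page came from shard 2 with doc ID 17, shard 1 would skip every tied doc, shard 2 would resume at doc 18, and shard 3 would keep all of its tied docs, starting from doc 0.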
Thanks for looking @mayya-sharipova .
The order depends on the primary sort so it's consistent for the same sort order. We don't need to use the same ordering for the execution of the shards and the tiebreaker so I opened #65706. That will make the shard index that we use for tie-breaking consistent between requests that target the same shards (even if they change the sort order).
Good catch. I opened #65706 to create a consistent shard index, so I am going to close this PR. I'll open a new one that will allow setting the