Automatic tie-breaking for sorted queries within a PIT #65450

jimczi · 2020-11-24T17:42:36Z

This change automatically generates a tiebreaker for sorted queries that are executed
under a PIT (point in time reader). This makes it possible to paginate consistently over the matching
documents without having to provide a sort criterion that is unique per document.
The tiebreaker is automatically added as the last sort value of each search hit in the response.
It is then used by search_after to ensure that pagination does not miss any documents and that each document
appears only once.
This commit also allows queries sorted by internal Lucene id (_doc) to be optimized when they are executed
under a PIT, in the same way as scroll queries.
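
As a rough illustration of why a unique tiebreaker makes search_after pagination consistent, here is a minimal in-memory sketch. It does not use any Elasticsearch API: the Hit class, the page and paginateAll helpers, and the tiebreaker values are all hypothetical and only model the idea that each page resumes strictly after the last (sort value, tiebreaker) pair of the previous page.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class TiebreakPaginationSketch {
    /** Hypothetical hit model: a possibly non-unique sort value plus a unique tiebreaker. */
    static final class Hit {
        final String id;
        final long sortValue;
        final long tiebreaker;

        Hit(String id, long sortValue, long tiebreaker) {
            this.id = id;
            this.sortValue = sortValue;
            this.tiebreaker = tiebreaker;
        }
    }

    // Primary sort first, then the tiebreaker as the last sort value.
    static final Comparator<Hit> ORDER =
        Comparator.comparingLong((Hit h) -> h.sortValue).thenComparingLong(h -> h.tiebreaker);

    /** Returns up to size hits that sort strictly after the given hit (null means first page). */
    static List<Hit> page(List<Hit> all, Hit after, int size) {
        List<Hit> sorted = new ArrayList<>(all);
        sorted.sort(ORDER);
        List<Hit> out = new ArrayList<>();
        for (Hit h : sorted) {
            if (after != null && ORDER.compare(h, after) <= 0) {
                continue; // already returned on a previous page
            }
            out.add(h);
            if (out.size() == size) {
                break;
            }
        }
        return out;
    }

    /** Walks every page, feeding the last hit of each page back in as the "after" value. */
    static List<String> paginateAll(List<Hit> all, int size) {
        List<String> ids = new ArrayList<>();
        Hit after = null;
        while (true) {
            List<Hit> p = page(all, after, size);
            if (p.isEmpty()) {
                break;
            }
            for (Hit h : p) {
                ids.add(h.id);
            }
            after = p.get(p.size() - 1);
        }
        return ids;
    }

    public static void main(String[] args) {
        // Three hits share the sort value 10; only the tiebreaker separates them.
        List<Hit> hits = List.of(
            new Hit("d", 20, 3), new Hit("b", 10, 1), new Hit("a", 10, 0), new Hit("c", 10, 2));
        System.out.println(paginateAll(hits, 2)); // prints [a, b, c, d]
    }
}
```

Without the tiebreaker component in the comparison, resuming "strictly after" sort value 10 would skip hit c, and resuming "at or after" it would return a and b again.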

Closes #56828

elasticmachine · 2020-11-24T17:42:40Z

Pinging @elastic/es-search (Team:Search)

mayya-sharipova · 2020-11-30T10:56:47Z

docs/reference/search/search-your-data/paginate-search-results.asciidoc

@@ -71,9 +71,9 @@ To get the first page of results, submit a search request with a `sort`
 argument. If using a PIT, specify the PIT ID in the `pit.id` parameter and omit
 the target data stream or index from the request path.

-IMPORTANT: We recommend you include a tiebreaker field in your `sort`. This
-tiebreaker field should contain a unique value for each document. If you don't
-include a tiebreaker field, your paged results could miss or duplicate hits.


What if a user doesn't use a PIT in a sort request? In that case, should we still keep the recommendation to add a tie-breaking field to the sort fields?
Or is our official recommendation to always use a PIT with sort?

mayya-sharipova

Thanks @jimczi. A great addition to sort!!! I have 2 main questions just to better understand how things work:

Does a PIT always have the same shards, in the same order, even if the PIT ID changes?
What is our story about _asc sort vs. _desc sort? Usually when we search with a _desc sort, we get docs in the opposite order from an _asc sort. With this new tiebreaker field this does not happen: docs with equal sort fields come up in the same increasing order of shard/_docId regardless of whether the sort is _asc or _desc, which looks unusual. This could matter for backwards pagination, for example: since we don't have search_before, we have suggested that users use search_after with a reversed sort, but this strategy will no longer work for them in a tie-breaking search, because they would get the same documents.

mayya-sharipova · 2020-11-30T11:34:30Z

server/src/main/java/org/elasticsearch/search/internal/ShardSearchRequest.java

+    private boolean canReturnNullResponseIfMatchNoDocs;
+    private SearchSortValuesAndFormats bottomSortValues;
+
+    private int shardIndex = -1;


Can we make the shardIndex field final and always required?

mayya-sharipova · 2020-11-30T14:15:08Z

docs/reference/search/search-your-data/paginate-search-results.asciidoc

-a PIT, the response's `pit_id` parameter contains an updated PIT ID.
+a PIT, the response's `pit_id` parameter contains an updated PIT ID and a tiebreaker
+is included as the last `sort` values for each hit. This tiebreaker is a unique value for each document that allows
+consistent pagination within a `pit_id`.


Is it worth adding more details about what this tie_breaker is made of, e.g. "tie_breaker is a combination of shard index and internal _doc id"?
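
To make the "combination of shard index and internal _doc id" idea concrete, here is a minimal sketch of one way two ints could be packed into a single sortable long. The bit layout, the class name, and the method names are assumptions for illustration only; the thread does not show the encoding this PR actually uses.

```java
class TiebreakerCodecSketch {
    // ASSUMED layout: shard index in the high 32 bits, doc ID in the low 32 bits,
    // so that tiebreakers order by shard index first and doc ID second.
    static long encode(int shardIndex, int docId) {
        return ((long) shardIndex << 32) | (docId & 0xFFFFFFFFL);
    }

    static int decodeShardIndex(long tiebreaker) {
        return (int) (tiebreaker >>> 32);
    }

    static int decodeDocId(long tiebreaker) {
        return (int) tiebreaker; // keeps the low 32 bits
    }

    public static void main(String[] args) {
        long tiebreaker = encode(3, 42);
        System.out.println(decodeShardIndex(tiebreaker) + " / " + decodeDocId(tiebreaker)); // prints 3 / 42
    }
}
```

With non-negative inputs, every document on shard 1 gets a smaller tiebreaker than any document on shard 2, which is what makes the value unique and totally ordered across shards.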

mayya-sharipova · 2020-11-30T14:19:45Z

server/src/main/java/org/elasticsearch/action/search/SearchPhaseController.java

-                          InternalAggregations aggregations, SearchProfileShardResults shardResults, SortedTopDocs sortedTopDocs,
-                          DocValueFormat[] sortValueFormats, int numReducePhases, int size, int from, boolean isEmptyResult) {
+        // <code>true</code> if the search request uses a point in time reader
+        final  boolean hasPIT;


nit: extra space

mayya-sharipova · 2020-11-30T15:51:17Z

server/src/main/java/org/elasticsearch/action/search/SearchShardIterator.java

@@ -52,6 +52,8 @@
    private final TimeValue searchContextKeepAlive;
    private final PlainIterator<String> targetNodesIterator;

+    private int searchShardIndex;


Should the default value be -1, as in ShardSearchRequest?

mayya-sharipova · 2020-11-30T16:27:21Z

server/src/main/java/org/elasticsearch/search/searchafter/SearchAfterBuilder.java

+            if (Sort.RELEVANCE.equals(sort) || shardIndex == fieldDocShard) {
+                fieldDoc.doc = decodeDocID(tiebreaker);
+            } else if (shardIndex < fieldDocShard) {
+                fieldDoc.doc = Integer.MAX_VALUE;


It may be worth adding more comments explaining what a given value of fieldDoc.doc means, e.g.
fieldDoc.doc = Integer.MAX_VALUE; // skipping docs of this shard with equal field values
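
The reasoning behind those values can be sketched as follows. Lucene's search-after paging only collects a document whose sort values tie with the "after" values if its doc ID is greater than the "after" doc ID. The helper below mirrors the two branches visible in the hunk; the final branch (returning -1) is NOT shown in the diff and is purely an assumption, as are the class and method names. The Sort.RELEVANCE case from the hunk is left out.

```java
class AfterDocSketch {
    /**
     * Picks the doc ID to resume after on a given shard, given the shard and doc ID
     * decoded from the last hit of the previous page.
     */
    static int afterDocFor(int shardIndex, int fieldDocShard, int fieldDocId) {
        if (shardIndex == fieldDocShard) {
            return fieldDocId; // same shard: resume right after the last returned doc
        } else if (shardIndex < fieldDocShard) {
            return Integer.MAX_VALUE; // earlier shard: skip all docs with equal field values
        }
        // ASSUMPTION (not in the diff): a later shard has returned none of its ties yet,
        // so every doc with equal field values should still be eligible.
        return -1;
    }

    /** Models the paging rule for ties: only docs after the "after" doc ID are collected. */
    static boolean tieIsCompetitive(int docId, int afterDoc) {
        return docId > afterDoc;
    }

    public static void main(String[] args) {
        int lastShard = 1;
        int lastDoc = 7;
        for (int shard = 0; shard < 3; shard++) {
            int afterDoc = afterDocFor(shard, lastShard, lastDoc);
            System.out.println("shard " + shard + ": doc 8 competitive = " + tieIsCompetitive(8, afterDoc));
        }
    }
}
```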

jimczi · 2020-12-01T21:33:57Z

Thanks for looking, @mayya-sharipova.

Does a PIT always have the same shards, in the same order, even if the PIT ID changes?

The order depends on the primary sort, so it's consistent for the same sort order. We don't need to use the same ordering for the execution of the shards and for the tiebreaker, so I opened #65706. That will make the shard index that we use for tie-breaking consistent between requests that target the same shards (even if they change the sort order).

What is our story about _asc sort vs. _desc sort? Usually when we search with a _desc sort, we get docs in the opposite order from an _asc sort. With this new tiebreaker field this does not happen: docs with equal sort fields come up in the same increasing order of shard/_docId regardless of whether the sort is _asc or _desc, which looks unusual. This could matter for backwards pagination, for example: since we don't have search_before, we have suggested that users use search_after with a reversed sort, but this strategy will no longer work for them in a tie-breaking search, because they would get the same documents.

Good catch. I opened #65706 to create a consistent shard index, so I am going to close this PR. Once #65706 is merged, I'll open a new one that allows setting the _tiebreaker in the sort criteria directly.

jimczi added the >enhancement, :Search/Search, v8.0.0, and v7.11.0 labels Nov 24, 2020

elasticmachine added the Team:Search label Nov 24, 2020

jimczi added 3 commits November 25, 2020 10:23

Fix yml tests

b5cac78

yml test

e248522

Merge branch 'master' into automatic_tiebreak_pit

77a5d2b

mayya-sharipova reviewed Nov 30, 2020

droberts195 mentioned this pull request Nov 30, 2020

[ML] Make sort order for datafeeds deterministic #39187

Open

jimczi closed this Dec 1, 2020

stevejgordon mentioned this pull request Dec 17, 2020

7.11.0 Meta Ticket elastic/elasticsearch-net#5198

Closed

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
