search_after unexpected/undocumented behaviour #34232

Closed
patrykk21 opened this issue Oct 2, 2018 · 8 comments
Labels
>docs General docs changes help wanted adoptme :Search/Search Search-related issues that do not fall into other categories

Comments

@patrykk21
Contributor

Elasticsearch version (curl localhost:9200):
"version" : {
  "number" : "6.2.4",
  "build_hash" : "ccec39f",
  "build_date" : "2018-04-12T20:37:28.497551Z",
  "build_snapshot" : false,
  "lucene_version" : "7.2.1",
  "minimum_wire_compatibility_version" : "5.6.0",
  "minimum_index_compatibility_version" : "5.0.0"
}

Plugins installed: []

JVM version (java -version):
openjdk version "1.8.0_181"
OpenJDK Runtime Environment (build 1.8.0_181-8u181-b13-0ubuntu0.16.04.1-b13)
OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)

OS version (uname -a if on a Unix-like system):
Linux ********* 4.4.0-127-generic #153-Ubuntu SMP Sat May 19 10:58:46 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:
When using the search_after API, it doesn't look for an exact match on _id but for a partial one.

More specific example:

I've set a query like so (ruby):

{:sort=>[{:timestamp=>"desc"}, {:_id=>"desc"}],
:search_after=>["1538490843000", "fd5e06f3-ded0-4cc1-8dc5-f798f3165ca2"]}

So I'd expect Elasticsearch to look for a record with an _id of fd5e06f3-ded0-4cc1-8dc5-f798f3165ca2.

However using:

{:sort=>[{:timestamp=>"desc"}, {:_id=>"desc"}],
:search_after=>["1538490843000", "fd5e06f3"]}

OR

{:sort=>[{:timestamp=>"desc"}, {:_id=>"desc"}],
:search_after=>["1538490843000", "f798f3165ca2"]}

These use the first part and the last part of the UUID respectively, and both yield the same result as the full UUID, at least for my data.

My understanding is that the matching between the _id and the value I provide to the search_after API is not an exact match but a partial one. If the string I'm passing is fully contained in the _id of one of the records, then the results after that specific record are returned.

The expected behaviour is exact matching rather than partial matching: only the record whose _id is exactly the value I'm passing should be used as a starting point.

Steps to reproduce:

  1. Create an index with the fields timestamp and identifier.
  2. Insert some data into the index so that two consecutive documents have the same timestamp and _id values that share a prefix, for example "test12345" and "test45678" with a timestamp of your choice.
  3. Run a search_after with a sort of [timestamp, _id] and values of [chosen_timestamp, "test"].

You'll see results that start right after one of the documents whose _id starts with "test", even though no document's _id exactly matches the given value.

This has been quite a hassle for me. It might not even be undesired behaviour, but it is not documented, or at least I couldn't find it.

Provide logs (if relevant):

I'm not sure if this is really an issue but I'd like to know if this is an expected behaviour or rather a lacking documentation issue. Either way, let me know.

Thank you

@nik9000 nik9000 added the :Search/Search Search-related issues that do not fall into other categories label Oct 2, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-search-aggs

@nik9000
Member

nik9000 commented Oct 2, 2018

@jimczi, could you have a look at this?

@jimczi
Contributor

jimczi commented Oct 2, 2018

This is the expected behavior. The sort values are used to filter documents that compare smaller (or greater, depending on the order of the sort). If you want to ensure that the response starts exactly at the search_after values, you'll need to check the first hit in the response and compare its sort values with those provided in the query. However, the values used in a search_after query are usually extracted from the last hit of the previous search, so you shouldn't have to deal with partial values at all. Can you explain what you're trying to achieve with search_after?
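
A rough in-memory model of the behavior described above may help. It is only a sketch, not how Elasticsearch implements search_after: Doc, search_after and DOCS are hypothetical names, timestamps are plain integers here, and Ruby's array comparison stands in for the real sort-value comparison. It assumes the sort from the original report (timestamp desc, _id desc).

```ruby
# Illustrative in-memory model only; not Elasticsearch code.
Doc = Struct.new(:timestamp, :id) do
  # Sort values in the same order as the sort clause: [timestamp, _id].
  def sort_values
    [timestamp, id]
  end
end

# Returns up to `size` documents that sort strictly after `cursor`.
# With a descending sort, "after" means the sort tuple compares smaller.
# The cursor is never looked up as a document; it is only compared against.
def search_after(docs, cursor, size)
  ordered = docs.sort_by(&:sort_values).reverse
  ordered = ordered.select { |doc| (doc.sort_values <=> cursor) < 0 } if cursor
  ordered.first(size)
end

DOCS = [
  Doc.new(1538490843000, "test45678"),
  Doc.new(1538490843000, "test12345"),
  Doc.new(1538490843000, "abc"),
  Doc.new(1538490842000, "zzz")
].freeze

# Usual usage: the cursor is taken from the last hit of the previous page.
page1 = search_after(DOCS, nil, 2)
page2 = search_after(DOCS, page1.last.sort_values, 2)

# A "partial" cursor that matches no document is still a valid boundary.
partial = search_after(DOCS, [1538490843000, "test"], 2)

puts page1.map(&:id).inspect   # ["test45678", "test12345"]
puts page2.map(&:id).inspect   # ["abc", "zzz"]
puts partial.map(&:id).inspect # ["abc", "zzz"]
```

In this model "test" sorts below both "test12345" and "test45678", so with a descending sort both documents fall before the boundary and the page starts at the next smaller _id. That is the same page the full cursor produces, which is why a prefix can look like a partial match.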

@patrykk21
Contributor Author

Sorry for the late reply, and thank you for answering so quickly!

Yes, the reason I was encountering confusion and/or issues is that the same module is being included into multiple classes.

It wasn't an issue as long as the models we were using relied only on a primary key. However, it became one as soon as we introduced an Elasticsearch index for a model using a composite key (DynamoDB).

Let's say those keys are called id and owner_id. Uniqueness on DynamoDB is maintained because the composite key is made of both id and owner_id: there can be documents with the same id, but not documents with the same id AND the same owner_id. Therefore I started saving those documents not with _id = id but with _id = composite_id, where _id is the Elasticsearch document ID.

The common module we were using was running the search_after API using the id key.

All tests were passing because pagination still worked: even though the _id in the Elasticsearch index was the composite key (id + owner_id), the id value I passed matched the first part of that _id.

Not long after, while reviewing the code, I noticed that I was using id instead of composite_id in the search_after API, and I played around trying to understand why all tests were passing anyway.

In fact, it was because the search_after API seems to return all results after a document with a partial match on the _id value.

So if we had a document like:

_id: 'testid_testownerid',
_source: {
  id: 'testid',
  owner_id: 'testownerid'
}

and we ran a search_after with a sort on _id for the value of 'testid', it would still return all results after that document, even though the two strings weren't completely matching ('testid_testownerid' != 'testid').
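
To illustrate why this can pass tests and still be fragile, here is a small sketch that models a descending sort on _id with plain Ruby string comparison. The helper ids_after and the extra _id values are hypothetical, made up for illustration, and this is not Elasticsearch code. Under those assumptions, the short id gives the same page as the composite id only while no other _id sorts between the two values.

```ruby
# Illustrative model only: descending sort on _id, returning the ids that
# sort strictly after `cursor` (i.e. compare smaller than it).
def ids_after(ids, cursor)
  ids.sort.reverse.select { |id| id < cursor }
end

ids = ["testid_testownerid", "otherid_owner1", "aaa_owner2"]

full    = ids_after(ids, "testid_testownerid")
partial = ids_after(ids, "testid")
puts full == partial # true: nothing sorts between the two cursors

# A second owner for the same id sorts between "testid" and
# "testid_testownerid", so the two cursors no longer agree.
ids << "testid_anotherowner"

full    = ids_after(ids, "testid_testownerid")
partial = ids_after(ids, "testid")
puts full.inspect    # ["testid_anotherowner", "otherid_owner1", "aaa_owner2"]
puts partial.inspect # ["otherid_owner1", "aaa_owner2"]
```

In the second case the short cursor silently skips 'testid_anotherowner', which is the kind of gap that a test with one document per id would not reveal.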

I'm not saying this behaviour is wrong, but it would have saved me a lot of time if it had been documented. It doesn't strike me as a straightforward or obvious way of working, so I think the docs should say a few words about it.

Overall I think you did a great job with Elasticsearch, and I'd like to help you improve it, even by a small margin, if possible.

Also, sorry if I sound so repetitive and specific, I'm just trying to give you the best explanation of the situation to help you understand.

Thank you and let me know!

@jimczi
Contributor

jimczi commented Oct 5, 2018

Overall I think you did a great job with Elasticsearch, and I'd like to help you improve it, even by a small margin, if possible.

Thanks, your help is welcome! I agree that we could add a small note about how we handle the provided sort values in search_after. Would you like to contribute by improving the documentation?

@jimczi jimczi added >docs General docs changes help wanted adoptme and removed feedback_needed labels Oct 5, 2018
@patrykk21
Contributor Author

I think the most fitting place for such information would be this doc page. Do you think it would be possible?

However, if we are talking about this repo I'll create a PR in the next couple of days if I find a fitting place.

Thank you

@jimczi
Contributor

jimczi commented Oct 5, 2018

I think the most fitting place for such information would be this doc page. Do you think it would be possible?

Yes, that's possible. The doc page is in this repo; you can find it here and create a PR to modify it.

@patrykk21
Contributor Author

I created a PR

Let me know.

@jimczi jimczi closed this as completed in bb2cf7e Nov 30, 2018
jimczi pushed a commit that referenced this issue Nov 30, 2018
jimczi pushed a commit that referenced this issue Nov 30, 2018
5 participants