Use doc_values for streaming _uid / _id #15155

Closed
pickypg opened this issue Dec 1, 2015 · 2 comments
Labels
discuss, >enhancement, high hanging fruit, :Search/Search (Search-related issues that do not fall into other categories)

Comments

@pickypg
Member

pickypg commented Dec 1, 2015

This issue relates heavily to #11887.

In many use cases, there is a need to stream all (or many) _ids from Elasticsearch to munge them together with some other data set. In these cases, Elasticsearch is often the actual search platform, but something else is acting as the source of "truth" or a more complete representation of the data (as opposed to just what is indexed to make search work). For example, imagine a scroll request that disables _source:

{
  "_source" : false,
  "query" : { ... }
}

For these use cases, it's not uncommon to want to stream more than 50K document IDs per second (that is, as fast as possible). In practice, however, streaming _ids is bottlenecked by the need to fetch the stored _uid field, decompress it, split out the _id, and finally serialize it as part of the response. If the aforementioned issue (#11887) is resolved, we can use doc_values to stream these values from disk more efficiently in this use case.
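As a rough illustration of the steps described above, here is a minimal Python sketch. It assumes the stored _uid has the form `type#id` (the layout used by Elasticsearch versions that still have a _uid field); the function names and the `page_size` parameter are hypothetical and only for illustration. No request is actually sent.

```python
def build_id_only_scroll_body(query, page_size=1000):
    """Build a search body that disables _source, as in the request above.

    `page_size` is an illustrative assumption, not a recommended value.
    """
    return {"_source": False, "size": page_size, "query": query}


def split_uid(uid):
    """Split a stored _uid of the assumed form 'type#id' into (type, id).

    Only the first '#' is treated as the delimiter, so an _id that itself
    contains '#' is preserved intact.
    """
    doc_type, sep, doc_id = uid.partition("#")
    if not sep:
        raise ValueError("expected a _uid of the form 'type#id': %r" % uid)
    return doc_type, doc_id


body = build_id_only_scroll_body({"match_all": {}})
doc_type, doc_id = split_uid("tweet#42")
```

The split has to happen for every hit, on top of the stored-field fetch and decompression, which is why the per-document cost adds up at tens of thousands of IDs per second.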

Note: It may be worthwhile to consider this for other use cases where source filtering is enabled and all of the selected fields exist in doc values, especially if the user supplies the list of fields using fielddata_fields.
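A request of the kind this note has in mind might look like the following Python sketch. It assumes the `fielddata_fields` search body parameter named above (later Elasticsearch versions call the equivalent parameter `docvalue_fields`); the helper name and the `external_id` field are hypothetical.

```python
def build_doc_values_body(query, fields):
    """Build a search body that skips _source and asks for the listed
    fields via fielddata_fields instead.

    Whether the server can serve these purely from doc values depends on
    the mapping; this helper only assembles the request body.
    """
    fields = list(fields)
    if not fields:
        raise ValueError("at least one field is required")
    return {"_source": False, "fielddata_fields": fields, "query": query}


# 'external_id' is a made-up field name used only for illustration.
doc_values_body = build_doc_values_body({"match_all": {}}, ["external_id"])
```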

@jpountz
Contributor

jpountz commented Dec 2, 2015

> Note: It may be worthwhile to consider this for other use cases where source filtering is enabled and all of the selected fields exist in doc values, especially if the user supplies the list of fields using fielddata_fields.

For the record, this optimization to go to doc values is only safe if there is a single field to fetch. Otherwise, it could be much slower than going to stored fields if the index size is much larger than the amount of free RAM on the server running Elasticsearch.
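One way to see why the number of fields matters is a deliberately crude cost model, sketched below in Python. It rests on assumptions that are not stated in this thread: stored fields keep all of a document's fields together (about one random read per document), doc values are stored per field (about one random read per document per field), and nothing is in the filesystem cache. The function name is hypothetical, and real behavior depends on access order, caching, and compression.

```python
def estimated_random_reads(num_docs, num_fields, storage):
    """Worst-case random reads under the toy assumptions described above."""
    if num_docs < 0 or num_fields < 1:
        raise ValueError("num_docs must be >= 0 and num_fields >= 1")
    if storage == "stored_fields":
        # Row-oriented: one read brings back every field of the document.
        return num_docs
    if storage == "doc_values":
        # Column-oriented: each field lives in its own structure.
        return num_docs * num_fields
    raise ValueError("unknown storage: %r" % storage)


single_field = estimated_random_reads(50000, 1, "doc_values")
five_fields = estimated_random_reads(50000, 5, "doc_values")
stored = estimated_random_reads(50000, 5, "stored_fields")
```

Under this model a single field costs the same number of reads either way (and doc values avoid the decompression step), while every extra field multiplies the doc values cost once the index no longer fits in free RAM.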

@clintongormley
Contributor

This streaming use case can be implemented client side today by using another field to contain the ID and setting it to use doc values. The question of whether _uid should have doc values and what format they should be in is a different one, which I think we should continue discussing on #11887.
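A minimal Python sketch of that client-side workaround follows. It assumes the mapping syntax of the Elasticsearch versions current when this issue was opened (a `string` field with `"index": "not_analyzed"`; later versions use `keyword` instead). The `id_copy` field name, the `tweet` type, and both helper names are hypothetical.

```python
def build_id_copy_mapping(doc_type, id_field):
    """Mapping with an extra non-analyzed field that holds a copy of the ID
    and has doc values enabled."""
    return {
        "mappings": {
            doc_type: {
                "properties": {
                    id_field: {
                        "type": "string",
                        "index": "not_analyzed",
                        "doc_values": True,
                    }
                }
            }
        }
    }


def with_id_copy(doc_id, source, id_field):
    """Return a copy of the document with its ID duplicated into id_field,
    leaving the caller's dict untouched."""
    doc = dict(source)
    doc[id_field] = doc_id
    return doc


mapping = build_id_copy_mapping("tweet", "id_copy")
doc = with_id_copy("42", {"user": "kimchy"}, "id_copy")
```

The client then indexes `doc` under ID `42` and, when streaming, requests only `id_copy` through a doc-values-backed fetch rather than relying on the stored _uid.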

@clintongormley clintongormley added :Search/Search Search-related issues that do not fall into other categories and removed :Scroll labels Feb 14, 2018

3 participants