Separate metadata from the actual response in async search index #71223
Pinging @elastic/es-search (Team:Search) |
Currently, _async_search/status returns the following for stored searches.
For successfully completed searches:
{
"id" : "FmRldE8zREVEUzA2ZVpUeGs2ejJFUFEaMkZ5QTVrSTZSaVN3WlNFVmtlWHJsdzoxMDc=",
"is_running" : false,
"is_partial" : false,
"start_time_in_millis" : 1583945890986,
"expiration_time_in_millis" : 1584377890986,
"_shards" : {
"total" : 562,
"successful" : 562,
"skipped" : 0,
"failed" : 0
},
"completion_status" : 200
}
For unsuccessfully completed searches:
{
"id" : "FmRldE8zREVEUzA2ZVpUeGs2ejJFUFEaMkZ5QTVrSTZSaVN3WlNFVmtlWHJsdzoxMDc=",
"is_running" : false,
"is_partial" : true,
"start_time_in_millis" : 1583945890986,
"expiration_time_in_millis" : 1584377890986,
"_shards" : {
"total" : 562,
"successful" : 450,
"skipped" : 0,
"failed" : 112
},
"completion_status" : 503
}
Currently all stored responses are expected to be completed, but in the future we may store partial results as well. |
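As a reading aid for the two status bodies above, here is a minimal sketch of how a client might classify them. The function name `summarize_status` and the returned strings are hypothetical; only the field names come from the responses shown in this issue.

```python
from typing import Any, Dict


def summarize_status(status: Dict[str, Any]) -> str:
    """Classify an _async_search/status body using only the fields shown above."""
    if status["is_running"]:
        # A running search has no final HTTP status yet.
        return "running"
    shards = status["_shards"]
    code = status["completion_status"]
    if status["is_partial"] or shards["failed"] > 0:
        failed, total = shards["failed"], shards["total"]
        return f"completed with failures ({failed}/{total} shards failed, HTTP {code})"
    return f"completed (HTTP {code})"
```

Applied to the two examples above, this distinguishes the 200 response from the 503 response without looking at the search hits at all, which is the kind of metadata-only access the issue asks for.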
Problem
To retrieve status from a stored response, we do:
If stored
The current mapping for
"properties" : {
"expiration_time" : {
"type" : "long"
},
"headers" : {
"type" : "object",
"enabled" : false
},
"response_headers" : {
"type" : "object",
"enabled" : false
},
"result" : {
"type" : "object",
"enabled" : false
}
}
There are several ways we can separate metadata from the actual response:
Proposal 1: add status field as a base64-encoded, stored-only binary field:
Mapping:
"status": {
"type" : "binary",
"doc_values": false,
"store": true
}
GET request will retrieve status as a separate source field, and decode it into
GET .async-search/_doc/<ID>?stored_fields=status
Upgrade scenario and BWC:
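Returning to the Proposal 1 request itself: the round trip through a base64-encoded stored field could look roughly like the sketch below. It assumes a JSON payload purely for illustration (the actual serialization format is not stated here), and the helper names `encode_status` and `decode_status_from_get` are hypothetical. The one thing taken from Elasticsearch behaviour is that a GET with `stored_fields` returns values under `fields`, each wrapped in a list.

```python
import base64
import json
from typing import Any, Dict


def encode_status(status: Dict[str, Any]) -> str:
    """Serialize the status and base64-encode it for a `binary` field.

    Assumption: JSON is used as the payload format for illustration only.
    """
    raw = json.dumps(status, separators=(",", ":")).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def decode_status_from_get(get_response: Dict[str, Any]) -> Dict[str, Any]:
    """Decode the status from a GET ...?stored_fields=status response body."""
    # Stored fields are returned under "fields", each value wrapped in a list.
    encoded = get_response["fields"]["status"][0]
    return json.loads(base64.b64decode(encoded))


# Simulated GET response; no cluster is contacted in this sketch.
status = {"is_running": False, "is_partial": False, "completion_status": 200}
simulated = {"_id": "<ID>", "found": True, "fields": {"status": [encode_status(status)]}}
decoded = decode_status_from_get(simulated)
```

The point of the design is that only the small `status` value is read and decoded; the large `result` field is never touched.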
Proposal 2: add status fields as separate stored-only fields:
Mapping:
"status": {
"type" : "object",
"properties": {
"is_running": {
"type": "boolean",
"store": true,
"index": false,
"doc_values" : false
},
"is_partial": {
"type": "boolean",
"store": true,
"index": false,
"doc_values" : false
},
"start_time": {
"type": "long",
"store": true,
"index": false,
"doc_values" : false
},
"expiration_time": {
"type": "long",
"store": true,
"index": false,
"doc_values" : false
},
"shards.total": {
"type": "integer",
"store": true,
"index": false,
"doc_values" : false
},
"shards.successful": {
"type": "integer",
"store": true,
"index": false,
"doc_values" : false
},
"shards.skipped": {
"type": "integer",
"store": true,
"index": false,
"doc_values" : false
},
"shards.failed": {
"type": "integer",
"store": true,
"index": false,
"doc_values" : false
},
"completion_status": {
"type": "short",
"store": true,
"index": false,
"doc_values" : false
}
}
}
This GET request will retrieve only these stored fields, without retrieving and decoding the expensive _source field:
GET .async-search/_doc/<ID>?stored_fields=status.is_running,status.is_partial,status.start_time,status.expiration_time,status.shards.total,status.shards.successful,status.shards.skipped,status.shards.failed,status.completion_status
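For Proposal 2, the long `stored_fields` list and the flat response it produces can be handled mechanically. The sketch below is hypothetical (the names `STATUS_FIELDS`, `build_status_get_path` and `status_from_stored_fields` are not from the codebase); it builds the request path shown above and folds the flat `fields` section of a GET response back into a nested object.

```python
from typing import Any, Dict, List

# Field paths taken from the Proposal 2 mapping above.
STATUS_FIELDS: List[str] = [
    "status.is_running",
    "status.is_partial",
    "status.start_time",
    "status.expiration_time",
    "status.shards.total",
    "status.shards.successful",
    "status.shards.skipped",
    "status.shards.failed",
    "status.completion_status",
]


def build_status_get_path(doc_id: str, index: str = ".async-search") -> str:
    """Build the GET path that asks only for the stored status fields."""
    return f"{index}/_doc/{doc_id}?stored_fields={','.join(STATUS_FIELDS)}"


def status_from_stored_fields(fields: Dict[str, List[Any]]) -> Dict[str, Any]:
    """Rebuild a nested status dict from the flat `fields` section of a GET response."""
    status: Dict[str, Any] = {}
    for name in STATUS_FIELDS:
        if name not in fields:
            continue  # a field that was never stored is simply absent
        parts = name.split(".")[1:]  # drop the leading "status" prefix
        node = status
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = fields[name][0]  # stored values arrive wrapped in a list
    return status
```

Note that the rebuilt object uses the mapping's names (`shards`, `start_time`), so translating to the API names (`_shards`, `start_time_in_millis`) would be a separate step that this sketch leaves out.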
Proposal 3: create a separate index for status updates: .async-search-status
Pros:
Cons:
|
We had a team discussion and decided to go with Proposal 1. So the plan is:
|
@mayya-sharipova I think we can add a version to |
@dnhatn Thanks, that's a great proposal. I will experiment with this idea |
We had another discussion on this topic today, and decided to go with GET response, and for a search response we need to worry about refresh interval. We need to measure the performance of retrieving doc values fields vs. stored fields for status, and if retrieving doc values fields turns out to be much faster than stored fields, we can think about adding support for retrieving doc values fields to the GET API. |
I've benchmarked the time needed for retrieving stored fields through a GET request vs. retrieving doc values fields through a SEARCH request depending on the size of
GET request with stored fields VS SEARCH request with doc values fields
Number of docs in the index: 1000
It looks like retrieving just doc values fields through a search request doesn't bring significant benefits. This could be explained by the fact that a _search request still needs to access the stored _id field, and so still has to read and decompress stored fields.
GET request with _source VS GET request with stored fields
Number of docs in the index: 1000
So the conclusion is to proceed with a GET request where the status field is a separate stored field, as this makes the request at least 3 times faster. cc @jimczi |
Are you sure that you're benchmarking the doc values case with |
Great question! No, I only disabled _source, _id field was still being read from stored field. Doing
|
Considering that response in the |
I'd also be really curious to see a comparison against loading from _source in the 100KB - 1MB range. I tried a similar experiment on metricbeat data and only saw a small improvement when moving from _source to stored fields: #9034 (comment). This made me think that JSON parsing overhead is low. However the metricbeat documents are substantially smaller, it would be helpful to see this datapoint as well. |
Index contains 1000 docs
@jimczi @jtibshirani Thanks for the comments. Indeed, for smaller documents there is not much difference between retrieving the whole _source from disk and retrieving separate stored fields. Also, a note, in my experiments, |
We've discussed this again, and considering that But we still would like to keep the issue open, as there could be other ways to improve retrieving metadata:
|
I am closing this issue as we have no concrete plans to work on it. |
Getting the status of an async search is fast when the query is running but can be very costly if the response is already stored.
When the task is gone/finished, we interrogate the async search index to retrieve the full response if it is still available. The metadata and statistics are stored inside a binary field that also contains the actual response, so the cost of retrieving the status depends greatly on the size of the search response.
We should separate the actual response from the metadata so that this information can be queried separately.
That would make the cost of status calls more constant and cheaper than retrieving the full response each time.