-
Notifications
You must be signed in to change notification settings - Fork 25.2k
Queries using allowPartialSearchResults=false involving only successful retries fail with status 503 #40743
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Pinging @elastic/es-search |
@markharwood can you take a look at this ? |
@jimczi I was surprised to see InitialSearchPhase retries searches:
Presumably we don't do this for all queries? I'm particularly thinking of the effect of killer queries on a cluster. Do you know what failures do or don't warrant a retry? |
@markharwood @jimczi did you reach any conclusions regarding the retry policy? |
…tries When set to false, allowPartialSearchResults option does not check if the shard failures have been reseted to null. The atomic array, that is used to record shard failures, is filled with a null value if a successful request on a shard happens after a failure on a shard of another replica. In this case the atomic array is not empty but contains only null values so this shouldn't be considered as a failure since all shards are successful (some replicas have failed but the retries on another replica succeeded). This change fixes this bug by checking the content of the atomic array and fails the request only if allowPartialSearchResults is set to false and at least one shard failure is not null. Closes elastic#40743
Sorry @amirhadadi this felt through the cracks. This is indeed a bug so I opened #43095 |
…tries (#43095) When set to false, allowPartialSearchResults option does not check if the shard failures have been reseted to null. The atomic array, that is used to record shard failures, is filled with a null value if a successful request on a shard happens after a failure on a shard of another replica. In this case the atomic array is not empty but contains only null values so this shouldn't be considered as a failure since all shards are successful (some replicas have failed but the retries on another replica succeeded). This change fixes this bug by checking the content of the atomic array and fails the request only if allowPartialSearchResults is set to false and at least one shard failure is not null. Closes #40743
…tries (#43095) When set to false, allowPartialSearchResults option does not check if the shard failures have been reseted to null. The atomic array, that is used to record shard failures, is filled with a null value if a successful request on a shard happens after a failure on a shard of another replica. In this case the atomic array is not empty but contains only null values so this shouldn't be considered as a failure since all shards are successful (some replicas have failed but the retries on another replica succeeded). This change fixes this bug by checking the content of the atomic array and fails the request only if allowPartialSearchResults is set to false and at least one shard failure is not null. Closes #40743
…tries (#43095) When set to false, allowPartialSearchResults option does not check if the shard failures have been reseted to null. The atomic array, that is used to record shard failures, is filled with a null value if a successful request on a shard happens after a failure on a shard of another replica. In this case the atomic array is not empty but contains only null values so this shouldn't be considered as a failure since all shards are successful (some replicas have failed but the retries on another replica succeeded). This change fixes this bug by checking the content of the atomic array and fails the request only if allowPartialSearchResults is set to false and at least one shard failure is not null. Closes #40743
Elasticsearch version 6.3.2
Plugins installed: []
JVM version: 1.8.144
OS version: Linux 3.13.0-88-generic #135-Ubuntu SMP x86_64 x86_64 x86_64 GNU/Linux
In
AbstractSearchAsyncAction::executeNextPhase
there's the following code:if (allowPartialResults == false && shardFailures.get() != null )
This code assumes that
shardFailures.get() != null
indicates shard failures.However, since shard failures can be retried and then nulled out in
AbstractSearchAsyncAction::onShardSuccess
, it's possible that shardFailures.get() consists of only nullShardSearchFailure
s. When that happens,executeNextPhase
fails with "Partial shards failure".In addition, the status code in this case is 503.
This is our query configuration:
SearchRequest{searchType=QUERY_THEN_FETCH, indices=[index], indicesOptions=IndicesOptions[id=38, ignore_unavailable=false, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, allow_aliases_to_multiple_indices=true, forbid_closed_indices=true, ignore_aliases=false], types=[], routing='null', preference='_local', requestCache=null, scroll=null, maxConcurrentShardRequests=30, batchedReduceSize=512, preFilterShardSize=128, allowPartialSearchResults=false
Steps to reproduce:
Provide logs (if relevant):
After a query fails (due to NPE in a custom java search script we use) with
org.elasticsearch.search.query.QueryPhaseExecutionException: Query Failed [Failed to execute main query]
and the query is retried on a different node and succeeds, the following appears in the log:
[2019-04-02T09:30:24,986][TRACE][o.e.a.s.TransportSearchAction] [esrec11d-10001-prod-nydc1.nydc1] got first-phase result from [t7GpBRj0TUCXKaYYwiMJVA][index][1]
[2019-04-02T09:30:24,990][TRACE][o.e.a.s.TransportSearchAction] [esrec11d-10001-prod-nydc1.nydc1] got first-phase result from [QBZwuD5MSLyDMD56SoZpWg][index][0]
[2019-04-02T09:30:24,990][DEBUG][o.e.a.s.TransportSearchAction] [esrec11d-10001-prod-nydc1.nydc1] 0 shards failed for phase: [query]
The text was updated successfully, but these errors were encountered: