Queries using allowPartialSearchResults=false involving only successful retries fail with status 503 #40743

amirhadadi · 2019-04-02T15:26:44Z

Elasticsearch version 6.3.2

Plugins installed: []

JVM version: 1.8.144

OS version: Linux 3.13.0-88-generic #135-Ubuntu SMP x86_64 x86_64 x86_64 GNU/Linux

In AbstractSearchAsyncAction::executeNextPhase there's the following code:
if (allowPartialResults == false && shardFailures.get() != null )

This code assumes that a non-null shardFailures.get() means at least one shard has failed.
However, a failed shard request can be retried on another copy of the shard, and when the retry succeeds AbstractSearchAsyncAction::onShardSuccess nulls out the recorded failure. As a result, shardFailures.get() can return an array that contains only null ShardSearchFailure entries. When that happens, executeNextPhase fails with "Partial shards failure" even though every shard ultimately succeeded.
In addition, the status code in this case is 503.
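To make the failure mode concrete, here is a minimal standalone sketch. It is not the Elasticsearch source: `NulledFailuresDemo`, its method bodies, and the use of `AtomicReferenceArray<String>` in place of the real failure array of `ShardSearchFailure` are all assumptions made for illustration. It only shows how a lazily created array that has been nulled out still passes a `!= null` check.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Hypothetical stand-in for the bookkeeping described above; names and
// types are illustrative, not copied from AbstractSearchAsyncAction.
class NulledFailuresDemo {
    // Created lazily on the first failure, one slot per shard.
    static final AtomicReference<AtomicReferenceArray<String>> shardFailures =
        new AtomicReference<>();

    static void onShardFailure(int shardIndex, String reason, int shardCount) {
        // Allocate the array only if no failure has been recorded yet.
        shardFailures.compareAndSet(null, new AtomicReferenceArray<>(shardCount));
        shardFailures.get().set(shardIndex, reason);
    }

    static void onShardSuccess(int shardIndex) {
        // A successful retry clears the slot but leaves the array in place.
        AtomicReferenceArray<String> failures = shardFailures.get();
        if (failures != null) {
            failures.set(shardIndex, null);
        }
    }

    // Same shape as the condition quoted in the report.
    static boolean flawedCheck(boolean allowPartialResults) {
        return allowPartialResults == false && shardFailures.get() != null;
    }

    public static void main(String[] args) {
        onShardFailure(1, "QueryPhaseExecutionException", 2); // first copy fails
        onShardSuccess(1);                                    // retry succeeds
        System.out.println(flawedCheck(false));               // prints "true"
    }
}
```

In this sketch the request would be rejected even though no slot holds a failure any more, which matches the "0 shards failed for phase: [query]" log line further down.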

This is our query configuration:

SearchRequest{searchType=QUERY_THEN_FETCH, indices=[index], indicesOptions=IndicesOptions[id=38, ignore_unavailable=false, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, allow_aliases_to_multiple_indices=true, forbid_closed_indices=true, ignore_aliases=false], types=[], routing='null', preference='_local', requestCache=null, scroll=null, maxConcurrentShardRequests=30, batchedReduceSize=512, preFilterShardSize=128, allowPartialSearchResults=false

Steps to reproduce:

Provide logs (if relevant):
After a query fails (due to an NPE in a custom Java search script we use) with
org.elasticsearch.search.query.QueryPhaseExecutionException: Query Failed [Failed to execute main query]
and the query is retried on a different node and succeeds, the following appears in the log:

[2019-04-02T09:30:24,986][TRACE][o.e.a.s.TransportSearchAction] [esrec11d-10001-prod-nydc1.nydc1] got first-phase result from [t7GpBRj0TUCXKaYYwiMJVA][index][1]

[2019-04-02T09:30:24,990][TRACE][o.e.a.s.TransportSearchAction] [esrec11d-10001-prod-nydc1.nydc1] got first-phase result from [QBZwuD5MSLyDMD56SoZpWg][index][0]

[2019-04-02T09:30:24,990][DEBUG][o.e.a.s.TransportSearchAction] [esrec11d-10001-prod-nydc1.nydc1] 0 shards failed for phase: [query]


elasticmachine · 2019-04-03T07:23:19Z

Pinging @elastic/es-search

jimczi · 2019-04-03T07:40:59Z

@markharwood can you take a look at this?

markharwood · 2019-05-03T16:42:44Z

@jimczi I was surprised to see InitialSearchPhase retries searches:

If a shard request returns a failure this class handles the advance to the next replica of the shard until the shards replica iterator is exhausted.

Presumably we don't do this for all queries? I'm particularly thinking of the effect of killer queries on a cluster. Do you know what failures do or don't warrant a retry?
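For readers unfamiliar with the behaviour quoted above, the following is a rough sketch of "advance to the next copy until the iterator is exhausted". `ReplicaRetrySketch`, `queryShard`, and the node names are hypothetical, and the sketch retries on any `RuntimeException`, which is an assumption: this thread does not establish which failures actually trigger a retry in Elasticsearch.

```java
import java.util.Iterator;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical illustration of retrying a shard request on its other copies.
class ReplicaRetrySketch {
    /**
     * Tries each shard copy in turn and returns the first successful result.
     * Returns an empty Optional once every copy has failed.
     */
    static <R> Optional<R> queryShard(Iterator<String> copies, Function<String, R> search) {
        while (copies.hasNext()) {
            String node = copies.next();
            try {
                return Optional.of(search.apply(node));
            } catch (RuntimeException e) {
                // Assumption: every failure is retried on the next copy.
            }
        }
        return Optional.empty();
    }
}
```

Under this model a query that fails on one copy (for example because of an exception thrown by a script) and succeeds on the next is a full success from the caller's point of view, which is the situation described in the report.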

amirhadadi · 2019-05-30T19:53:38Z

@markharwood @jimczi did you reach any conclusions regarding the retry policy?

…tries When set to false, the allowPartialSearchResults option does not check whether the shard failures have been reset to null. The atomic array that is used to record shard failures is filled with a null value if a request on a shard succeeds after a request on another replica of that shard has failed. In this case the atomic array is not empty but contains only null values, so this shouldn't be considered a failure since all shards are successful (some replicas have failed but the retries on another replica succeeded). This change fixes the bug by checking the content of the atomic array and failing the request only if allowPartialSearchResults is set to false and at least one shard failure is not null. Closes elastic#40743

jimczi · 2019-06-11T11:12:26Z

Sorry @amirhadadi, this fell through the cracks. This is indeed a bug so I opened #43095

…tries (#43095) When set to false, the allowPartialSearchResults option does not check whether the shard failures have been reset to null. The atomic array that is used to record shard failures is filled with a null value if a request on a shard succeeds after a request on another replica of that shard has failed. In this case the atomic array is not empty but contains only null values, so this shouldn't be considered a failure since all shards are successful (some replicas have failed but the retries on another replica succeeded). This change fixes the bug by checking the content of the atomic array and failing the request only if allowPartialSearchResults is set to false and at least one shard failure is not null. Closes #40743
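The check described in this commit message could look roughly like the sketch below. This is not the code from #43095: `ShardFailureCheck`, `hasRealFailures`, and `shouldFailRequest` are hypothetical names, and `AtomicReferenceArray<String>` stands in for the atomic array of shard failures.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

// Hypothetical sketch of "fail only if at least one shard failure is not null".
class ShardFailureCheck {
    /** Returns true only if at least one slot still holds a failure. */
    static boolean hasRealFailures(AtomicReferenceArray<String> failures) {
        if (failures == null) {
            return false; // no failure was ever recorded
        }
        for (int i = 0; i < failures.length(); i++) {
            if (failures.get(i) != null) {
                return true; // this shard has a failure that was not cleared
            }
        }
        return false; // every recorded failure was cleared by a successful retry
    }

    static boolean shouldFailRequest(boolean allowPartialResults,
                                     AtomicReferenceArray<String> failures) {
        return allowPartialResults == false && hasRealFailures(failures);
    }
}
```

With this shape, an array that exists but holds only nulls no longer causes the request to be rejected, while a single remaining failure still does when partial results are disallowed.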

amirhadadi changed the title ~~AbstractSearchAsyncAction::executeNextPhase sometimes fails if allowPartialResults=false when it shouldn't~~ AbstractSearchAsyncAction::executeNextPhase fails if allowPartialResults=false when it shouldn't Apr 2, 2019

astefan added the :Search/Search Search-related issues that do not fall into other categories label Apr 3, 2019

amirhadadi changed the title ~~AbstractSearchAsyncAction::executeNextPhase fails if allowPartialResults=false when it shouldn't~~ Queries using allowPartialResults=false involving only successful retries fail with status 503 Apr 5, 2019

amirhadadi changed the title ~~Queries using allowPartialResults=false involving only successful retries fail with status 503~~ Queries using allowPartialSearchResults =false involving only successful retries fail with status 503 Apr 5, 2019

amirhadadi changed the title ~~Queries using allowPartialSearchResults =false involving only successful retries fail with status 503~~ Queries using allowPartialSearchResults=false involving only successful retries fail with status 503 Apr 5, 2019

jimczi mentioned this issue Jun 11, 2019

SearchRequest#allowPartialSearchResults does not handle successful retries #43095

Merged

jimczi added the >bug label Jun 11, 2019

jimczi closed this as completed in #43095 Jun 12, 2019
