Skip to content

Queries using allowPartialSearchResults=false involving only successful retries fail with status 503 #40743

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
amirhadadi opened this issue Apr 2, 2019 · 5 comments · Fixed by #43095
Labels
>bug :Search/Search Search-related issues that do not fall into other categories

Comments

@amirhadadi
Copy link

amirhadadi commented Apr 2, 2019

Elasticsearch version 6.3.2

Plugins installed: []

JVM version: 1.8.144

OS version: Linux 3.13.0-88-generic #135-Ubuntu SMP x86_64 x86_64 x86_64 GNU/Linux

In AbstractSearchAsyncAction::executeNextPhase there's the following code:
if (allowPartialResults == false && shardFailures.get() != null )

This code assumes that shardFailures.get() != null indicates shard failures.
However, since shard failures can be retried and then nulled out in AbstractSearchAsyncAction::onShardSuccess, it's possible that shardFailures.get() consists of only null ShardSearchFailures. When that happens, executeNextPhase fails with "Partial shards failure".
In addition, the status code in this case is 503.

This is our query configuration:

SearchRequest{searchType=QUERY_THEN_FETCH, indices=[index], indicesOptions=IndicesOptions[id=38, ignore_unavailable=false, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, allow_aliases_to_multiple_indices=true, forbid_closed_indices=true, ignore_aliases=false], types=[], routing='null', preference='_local', requestCache=null, scroll=null, maxConcurrentShardRequests=30, batchedReduceSize=512, preFilterShardSize=128, allowPartialSearchResults=false

Steps to reproduce:

Provide logs (if relevant):
After a query fails (due to NPE in a custom java search script we use) with
org.elasticsearch.search.query.QueryPhaseExecutionException: Query Failed [Failed to execute main query]
and the query is retried on a different node and succeeds, the following appears in the log:

[2019-04-02T09:30:24,986][TRACE][o.e.a.s.TransportSearchAction] [esrec11d-10001-prod-nydc1.nydc1] got first-phase result from [t7GpBRj0TUCXKaYYwiMJVA][index][1]

[2019-04-02T09:30:24,990][TRACE][o.e.a.s.TransportSearchAction] [esrec11d-10001-prod-nydc1.nydc1] got first-phase result from [QBZwuD5MSLyDMD56SoZpWg][index][0]

[2019-04-02T09:30:24,990][DEBUG][o.e.a.s.TransportSearchAction] [esrec11d-10001-prod-nydc1.nydc1] 0 shards failed for phase: [query]

@amirhadadi amirhadadi changed the title AbstractSearchAsyncAction::executeNextPhase sometimes fails if allowPartialResults=false when it shouldn't AbstractSearchAsyncAction::executeNextPhase fails if allowPartialResults=false when it shouldn't Apr 2, 2019
@astefan astefan added the :Search/Search Search-related issues that do not fall into other categories label Apr 3, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search

@jimczi
Copy link
Contributor

jimczi commented Apr 3, 2019

@markharwood can you take a look at this ?

@amirhadadi amirhadadi changed the title AbstractSearchAsyncAction::executeNextPhase fails if allowPartialResults=false when it shouldn't Queries using allowPartialResults=false involving only successful retries fail with status 503 Apr 5, 2019
@amirhadadi amirhadadi changed the title Queries using allowPartialResults=false involving only successful retries fail with status 503 Queries using allowPartialSearchResults =false involving only successful retries fail with status 503 Apr 5, 2019
@amirhadadi amirhadadi changed the title Queries using allowPartialSearchResults =false involving only successful retries fail with status 503 Queries using allowPartialSearchResults=false involving only successful retries fail with status 503 Apr 5, 2019
@markharwood
Copy link
Contributor

@jimczi I was surprised to see InitialSearchPhase retries searches:

If a shard request returns a failure this class handles the advance to the next replica of the shard until the shards replica iterator is exhausted.

Presumably we don't do this for all queries? I'm particularly thinking of the effect of killer queries on a cluster. Do you know what failures do or don't warrant a retry?

@amirhadadi
Copy link
Author

@markharwood @jimczi did you reach any conclusions regarding the retry policy?

jimczi added a commit to jimczi/elasticsearch that referenced this issue Jun 11, 2019
…tries

When set to false, allowPartialSearchResults option does not check if the
shard failures have been reseted to null. The atomic array, that is used to record
shard failures, is filled with a null value if a successful request on a shard happens
after a failure on a shard of another replica. In this case the atomic array is not empty
but contains only null values so this shouldn't be considered as a failure since all
shards are successful (some replicas have failed but the retries on another replica succeeded).
This change fixes this bug by checking the content of the atomic array and fails the request only
if allowPartialSearchResults is set to false and at least one shard failure is not null.

Closes elastic#40743
@jimczi jimczi added the >bug label Jun 11, 2019
@jimczi
Copy link
Contributor

jimczi commented Jun 11, 2019

Sorry @amirhadadi this felt through the cracks. This is indeed a bug so I opened #43095

jimczi added a commit that referenced this issue Jun 12, 2019
…tries (#43095)

When set to false, allowPartialSearchResults option does not check if the
shard failures have been reseted to null. The atomic array, that is used to record
shard failures, is filled with a null value if a successful request on a shard happens
after a failure on a shard of another replica. In this case the atomic array is not empty
but contains only null values so this shouldn't be considered as a failure since all
shards are successful (some replicas have failed but the retries on another replica succeeded).
This change fixes this bug by checking the content of the atomic array and fails the request only
if allowPartialSearchResults is set to false and at least one shard failure is not null.

Closes #40743
jimczi added a commit that referenced this issue Jun 12, 2019
…tries (#43095)

When set to false, allowPartialSearchResults option does not check if the
shard failures have been reseted to null. The atomic array, that is used to record
shard failures, is filled with a null value if a successful request on a shard happens
after a failure on a shard of another replica. In this case the atomic array is not empty
but contains only null values so this shouldn't be considered as a failure since all
shards are successful (some replicas have failed but the retries on another replica succeeded).
This change fixes this bug by checking the content of the atomic array and fails the request only
if allowPartialSearchResults is set to false and at least one shard failure is not null.

Closes #40743
jimczi added a commit that referenced this issue Jun 12, 2019
…tries (#43095)

When set to false, allowPartialSearchResults option does not check if the
shard failures have been reseted to null. The atomic array, that is used to record
shard failures, is filled with a null value if a successful request on a shard happens
after a failure on a shard of another replica. In this case the atomic array is not empty
but contains only null values so this shouldn't be considered as a failure since all
shards are successful (some replicas have failed but the retries on another replica succeeded).
This change fixes this bug by checking the content of the atomic array and fails the request only
if allowPartialSearchResults is set to false and at least one shard failure is not null.

Closes #40743
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Search/Search Search-related issues that do not fall into other categories
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants