Shard Search Scroll failures consistency #62061

jimczi · 2020-09-07T13:51:07Z

Today, some uncaught shard failures, such as RejectedExecutionException, skip the release of the shard context
and let subsequent scroll requests access the same shard context again. Depending on how the other shards have advanced,
this behavior can lead to missing data, since scrolls always move forward.
To avoid hidden data loss, this commit ensures that we always release the context of shard search scroll requests whenever a failure occurs locally.
The shard search context will no longer exist in subsequent scroll requests, which will lead to consistent shard failures in the responses.
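
To illustrate the release-on-failure behavior described above, here is a minimal sketch. It is not the actual SearchService code: `ScrollContext`, `ScrollContextRegistry`, `execute`, and the use of `IllegalStateException` for a missing context are all hypothetical stand-ins chosen for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.RejectedExecutionException;
import java.util.function.Function;

// Hypothetical stand-in for a shard-level scroll context; not the Elasticsearch class.
final class ScrollContext {
    final long id;
    int position;   // how far this shard's scroll has advanced
    boolean closed; // set once the registry has released the context

    ScrollContext(long id) {
        this.id = id;
    }
}

// Hypothetical registry of the scroll contexts that are active on one shard.
final class ScrollContextRegistry {
    private final Map<Long, ScrollContext> active = new ConcurrentHashMap<>();

    ScrollContext open(long id) {
        ScrollContext ctx = new ScrollContext(id);
        active.put(id, ctx);
        return ctx;
    }

    boolean isActive(long id) {
        return active.containsKey(id);
    }

    // Runs one scroll phase. Any local runtime failure releases the context
    // before the exception is rethrown, so a later request cannot reuse it.
    <T> T execute(long id, Function<ScrollContext, T> phase) {
        ScrollContext ctx = active.get(id);
        if (ctx == null) {
            // Assumption: a plain IllegalStateException models the "context missing" failure.
            throw new IllegalStateException("No search context found for id [" + id + "]");
        }
        try {
            return phase.apply(ctx);
        } catch (RuntimeException e) {
            ScrollContext removed = active.remove(id);
            if (removed != null) {
                removed.closed = true;
            }
            throw e;
        }
    }
}

final class ScrollContextDemo {
    public static void main(String[] args) {
        ScrollContextRegistry registry = new ScrollContextRegistry();
        registry.open(1L);
        try {
            registry.execute(1L, ctx -> {
                throw new RejectedExecutionException("queue full");
            });
        } catch (RejectedExecutionException e) {
            System.out.println("first attempt failed: " + e.getMessage());
        }
        try {
            registry.execute(1L, ctx -> ++ctx.position);
        } catch (IllegalStateException e) {
            System.out.println("retry failed: " + e.getMessage());
        }
    }
}
```

In this sketch the second request fails with a missing-context error instead of silently resuming from a context whose position may no longer line up with the other shards.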

This change also modifies the retry tests of the reindex feature. Reindex retries scroll search requests that contain a shard failure and moves on whenever the failure disappears. That is not compatible with how scrolls work and can lead to missing data, as explained above.
This means that reindex will now report scroll failures when search rejections happen during the operation instead of silently skipping documents.
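
As a rough illustration of "report instead of retry", the sketch below shows a scroll consumer that stops at the first page carrying shard failures and surfaces them to the caller. It is not the reindex implementation: `ScrollPage`, `ScrollCopyResult`, and `ScrollCopier` are hypothetical names, and not copying the documents of a failed page is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical shape of one scroll response page; not an Elasticsearch type.
final class ScrollPage {
    final List<String> docs;
    final List<String> shardFailures;

    ScrollPage(List<String> docs, List<String> shardFailures) {
        this.docs = docs;
        this.shardFailures = shardFailures;
    }
}

// Hypothetical outcome of a copy operation driven by a scroll.
final class ScrollCopyResult {
    final List<String> copied = new ArrayList<>();
    final List<String> failures = new ArrayList<>();
}

final class ScrollCopier {
    // Pulls pages until an empty page arrives. The first page with shard
    // failures ends the operation and the failures are reported, rather than
    // requesting the next page and hoping the failure has gone away.
    static ScrollCopyResult copyAll(Supplier<ScrollPage> nextPage) {
        ScrollCopyResult result = new ScrollCopyResult();
        while (true) {
            ScrollPage page = nextPage.get();
            if (!page.shardFailures.isEmpty()) {
                // Assumption: documents on a failed page are not copied.
                result.failures.addAll(page.shardFailures);
                return result;
            }
            if (page.docs.isEmpty()) {
                return result;
            }
            result.copied.addAll(page.docs);
        }
    }
}
```

A caller would pass a supplier that fetches the next scroll page, then check `failures` before treating the copy as complete.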

Finally this change removes an old TODO that was fulfilled with #61062.

Relates #61062

Note for reviewer: The last commit contains a fix for #62046 and #62056 since these tests failed in CI when checking this PR. The change ensures that we create a single searcher for scrolls.

Closes #62046
Closes #62056

elasticmachine · 2020-09-07T13:51:09Z

Pinging @elastic/es-distributed (:Distributed/Reindex)

elasticmachine · 2020-09-07T13:51:09Z

Pinging @elastic/es-search (:Search/Search)


nik9000

The reindex tests look right to me. I don't know the rest of the code well enough to be sure about it, but I think that is mostly for @dnhatn.

dnhatn

LGTM with a small comment

server/src/main/java/org/elasticsearch/search/SearchService.java


dnhatn · 2020-09-10T23:30:53Z

I've backported this in 3fc35aa.

jimczi added the >bug, :Search/Search, v8.0.0, :Distributed Indexing/Reindex, and v7.10.0 labels Sep 7, 2020

jimczi requested review from nik9000 and dnhatn September 7, 2020 13:51

elasticmachine added the Team:Distributed (Obsolete) and Team:Search labels Sep 7, 2020

jimczi added 2 commits September 8, 2020 01:09

Make the searcher for legacy reader context final

1f12505

This commit ensures that the searcher that we create for scrolls is initialized only once.

jimczi force-pushed the scroll_failures branch from e4d690a to 1f12505 Compare September 7, 2020 23:10

nik9000 reviewed Sep 8, 2020

dnhatn approved these changes Sep 8, 2020

server/src/main/java/org/elasticsearch/search/SearchService.java (outdated)

jimczi added 2 commits September 9, 2020 01:27

add keepAlive in markAsUsed

7d92769

Merge branch 'master' into scroll_failures

6a569c3

jimczi merged commit aefca5e into elastic:master Sep 9, 2020

jimczi deleted the scroll_failures branch September 9, 2020 07:14

henningandersen mentioned this pull request Sep 9, 2020

Fix cluster health when closing #61709

Merged

dnhatn mentioned this pull request Sep 9, 2020

Release search context when scroll keep_alive is too large #62179

Merged

dnhatn added a commit that referenced this pull request Sep 9, 2020

Release search context when scroll keep_alive is too large (#62179)

4e974bc

Previously, we closed the related search contexts if the keep_alive of a scroll was too large, but we accidentally changed this behavior in #62061.

dnhatn added the backport pending label Sep 10, 2020

dnhatn added a commit that referenced this pull request Sep 10, 2020

Release search context when scroll keep_alive is too large (#62179)

063a6d0

Previously, we closed the related search contexts if the keep_alive of a scroll was too large, but we accidentally changed this behavior in #62061.

dnhatn removed the backport pending label Sep 10, 2020

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
