Skip to content

Shard Search Scroll failures consistency #62061

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Sep 9, 2020
Merged

Conversation

jimczi
Copy link
Contributor

@jimczi jimczi commented Sep 7, 2020

Today some uncaught shard failures such as RejectedExecutionException skips the release of shard context
and let subsequent scroll requests access the same shard context again. Depending on how the other shards advanced,
this behavior can lead to missing data since scrolls always move forward.
In order to avoid hidden data loss, this commit ensures that we always release the context of shard search scroll requests whenever a failure occurs locally.
The shard search context will no longer exist in subsequent scroll requests which will lead to consistent shard failures in the responses.

This change also modifies the retry tests of the reindex feature. Reindex retries scroll search request that contains a shard failure and move on whenever the failure disappears. That is not compatible with how scrolls work and can lead to missing data as explained above.
That means that reindex will now report scroll failures when search rejection happen during the operation instead of skipping document silently.

Finally this change removes an old TODO that was fulfilled with #61062.

Relates #61062

Note for reviewer: The last commit contains a fix for #62046 and #62056 since these tests failed in CI when checking this PR. The change ensures that we create a single searcher for scrolls.

Closes #62046
Closes #62056

@jimczi jimczi added >bug :Search/Search Search-related issues that do not fall into other categories v8.0.0 :Distributed Indexing/Reindex Issues relating to reindex that are not caused by issues further down v7.10.0 labels Sep 7, 2020
@jimczi jimczi requested review from nik9000 and dnhatn September 7, 2020 13:51
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (:Distributed/Reindex)

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (:Search/Search)

@elasticmachine elasticmachine added Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. Team:Search Meta label for search team labels Sep 7, 2020
Today some uncaught shard failures such as RejectedExecutionException skips the release of shard context
and let subsequent scroll requests access the same shard context again. Depending on how the other shards advanced,
this behavior can lead to missing data since scrolls always move forward.
In order to avoid hidden data loss, this commit ensures that we always release the context of shard search scroll requests whenever a failure
occurs locally. The shard search context will no longer exist in subsequent scroll requests which will lead to consistent shard failures
in the responses.
This change also modifies the retry tests of the reindex feature. Reindex retries scroll search request that contains a shard failure and
move on whenever the failure disappears. That is not compatible with how scrolls work and can lead to missing data as explained above.
That means that reindex will now report scroll failures when search rejection happen during the operation instead of skipping document
silently.
Finally this change removes an old TODO that was fulfilled with elastic#61062.
This commit ensures that the searcher that we create for scrolls is initialized only once.t
Copy link
Member

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reindex tests look right to me. I don't know the rest of the code well enough to be sure about it but I think that is most for @dnhatn .

Copy link
Member

@dnhatn dnhatn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with a small comment

@jimczi jimczi merged commit aefca5e into elastic:master Sep 9, 2020
@jimczi jimczi deleted the scroll_failures branch September 9, 2020 07:14
dnhatn added a commit that referenced this pull request Sep 9, 2020
Previously, we close related search contexts if the keep_alive of a scroll is too large. 
But we accidentally change this behavior in #62061.
dnhatn pushed a commit to dnhatn/elasticsearch that referenced this pull request Sep 10, 2020
Today some uncaught shard failures such as RejectedExecutionException skips the release of shard context
and let subsequent scroll requests access the same shard context again. Depending on how the other shards advanced,
this behavior can lead to missing data since scrolls always move forward.
In order to avoid hidden data loss, this commit ensures that we always release the context of shard search scroll requests whenever a failure
occurs locally. The shard search context will no longer exist in subsequent scroll requests which will lead to consistent shard failures
in the responses.
This change also modifies the retry tests of the reindex feature. Reindex retries scroll search request that contains a shard failure and
move on whenever the failure disappears. That is not compatible with how scrolls work and can lead to missing data as explained above.
That means that reindex will now report scroll failures when search rejection happen during the operation instead of skipping document
silently.
Finally this change removes an old TODO that was fulfilled with elastic#61062.
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Sep 10, 2020
…2179)

Previously, we close related search contexts if the keep_alive of a scroll is too large. 
But we accidentally change this behavior in elastic#62061.
dnhatn pushed a commit that referenced this pull request Sep 10, 2020
Today some uncaught shard failures such as RejectedExecutionException skips the release of shard context
and let subsequent scroll requests access the same shard context again. Depending on how the other shards advanced,
this behavior can lead to missing data since scrolls always move forward.
In order to avoid hidden data loss, this commit ensures that we always release the context of shard search scroll requests whenever a failure
occurs locally. The shard search context will no longer exist in subsequent scroll requests which will lead to consistent shard failures
in the responses.
This change also modifies the retry tests of the reindex feature. Reindex retries scroll search request that contains a shard failure and
move on whenever the failure disappears. That is not compatible with how scrolls work and can lead to missing data as explained above.
That means that reindex will now report scroll failures when search rejection happen during the operation instead of skipping document
silently.
Finally this change removes an old TODO that was fulfilled with #61062.
dnhatn added a commit that referenced this pull request Sep 10, 2020
Previously, we close related search contexts if the keep_alive of a scroll is too large. 
But we accidentally change this behavior in #62061.
@dnhatn
Copy link
Member

dnhatn commented Sep 10, 2020

I've backported this in 3fc35aa.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Distributed Indexing/Reindex Issues relating to reindex that are not caused by issues further down :Search/Search Search-related issues that do not fall into other categories Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. Team:Search Meta label for search team v7.10.0 v8.0.0-alpha1
Projects
None yet
5 participants