Elasticsearch cluster blocked by frozen task #35338
Comments
Hello @KeyZer, thanks for reaching out. I'd like to understand this better: is this a feature request (auto-expiration of old tasks) or a general question?
I would classify it as a bug report. The cluster should never end up in a state where no changes can be made to the cluster state because a high-priority pending task has not completed for X days. The task probably got frozen because there was an issue with creating a shard (see log file), but the cluster keeps waiting for the task to complete. Auto-expiration or heartbeats for tasks are a possible solution to the issue.
Pinging @elastic/es-distributed
The issue here has nothing to do with task priorities or cancelling / auto-expiring tasks. We execute the cluster state update tasks using a single-threaded executor, and execution of a task here seems to have gotten stuck (see "executing": true), indefinitely blocking the single thread of the executor. It would be interesting to find out where this thread is hanging. Can you provide hot_threads or jstack output of the master node?
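For anyone hitting the same state, a minimal sketch of how that output could be captured before restarting anything (localhost:9200 and the PID are placeholders; the master:true filter selects master-eligible nodes):

```sh
# Hot threads of the master-eligible nodes; raise the thread count so the stuck
# cluster-state update thread is included in the report.
curl -s 'http://localhost:9200/_nodes/master:true/hot_threads?threads=9999'

# Full JVM thread dump of the elected master's Elasticsearch process,
# taken on the master host itself.
jstack <elasticsearch-master-pid> > master-threads.txt
```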
After we restarted the master node in the cluster, the pending_tasks cleared, so running hot_threads or jstack will not help now, unfortunately. I have all the Elasticsearch log files from that day if it helps.
I don't expect the log files to help unless you see a warning of the form
There were no errors like that in the log file. If it happens again, I will get a stack dump of all threads.
No further feedback received. @KeyZer, if you have the requested
This change:
- Adds functionality to invalidate all (refresh+access) tokens for all users of a realm
- Adds functionality to invalidate all (refresh+access) tokens for a user in all realms
- Adds functionality to invalidate all (refresh+access) tokens for a user in a specific realm
- Changes the response format for the invalidate token API to contain information about the number of invalidated tokens and possible errors that were encountered
- Updates the API documentation

Relates: #35338
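Not part of this issue itself, but for context, a sketch of invalidating all tokens for a user in a specific realm with the extended API described above (the endpoint path shown is the 7.x _security form; the username, realm name, and host are placeholders):

```sh
# Invalidate every access and refresh token issued to user "jsmith" by realm "native1".
# The response reports invalidated_tokens, previously_invalidated_tokens and error_count.
curl -s -X DELETE 'http://localhost:9200/_security/oauth2/token' \
  -H 'Content-Type: application/json' \
  -d '{ "username": "jsmith", "realm_name": "native1" }'
```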
Elasticsearch version (bin/elasticsearch --version): Version: 6.3.0, Build: default/deb/424e937/2018-06-11T23:38:03.357887Z, JVM: 1.8.0_162
Plugins installed: [analysis-icu]
JVM version (java -version):
java version "1.8.0_162"
Java(TM) SE Runtime Environment (build 1.8.0_162-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.162-b12, mixed mode)
OS version (uname -a if on a Unix-like system): Linux search01 4.4.0-1070-aws #80-Ubuntu SMP Thu Oct 4 13:56:07 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
An urgent cluster task blocked all other tasks that make changes to the cluster state (node joins/leaves, index setting changes, ...). Basically it forced the cluster into a read-only state and made it impossible to scale up or down, since the node-join tasks also get blocked by the frozen high-priority task. There should at least be a timeout if a task takes many days to complete (see below for the output of GET _cluster/pending_tasks).
We solved this by migrating to a new cluster, but that might not always be possible. After we had migrated to the new cluster, we restarted the master node in the failing cluster, and after a while it started to work normally again.
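For reference, a minimal sketch of the call that produces the pending-tasks view mentioned above (localhost:9200 is a placeholder; the actual output from the affected cluster is not reproduced here):

```sh
# Lists the queued cluster-state update tasks with their priority, source,
# time_in_queue, and the "executing" flag of the task at the head of the queue.
curl -s 'http://localhost:9200/_cluster/pending_tasks?pretty'
```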
Steps to reproduce:
We have not been able to reproduce the issue. It happened after 6 months in production, during which we had increased the load on the cluster. It might have been our autoscaling of the Elasticsearch cluster that triggered the issue, but it had been running for many months without problems. The only change we made in the last few weeks was increasing the read load and starting to add nodes to the cluster more rapidly (two at the same time).
Provide logs (if relevant):
Logfile from the master node