Monitoring plugin indices stuck in failed recovery status #19275

Closed
krystalcode opened this issue Jul 5, 2016 · 12 comments
Labels
:Distributed Indexing/Recovery, feedback_needed

Comments

@krystalcode

Elasticsearch version:

5.0.0-alpha3

JVM version:

The docker image inherits from the official java:8-jre image.

OS version:

Official docker image (https://hub.docker.com/_/elasticsearch/, Debian Jessie) running on a Fedora 23 host.

Description of the problem including expected versus actual behavior:

Shards created by the Monitoring X-Pack plugin are stuck trying to recover. Kibana reports "Elasticsearch is still initializing the Monitoring indices". The logs are given below; note in particular "marking and sending shard failed due to [failed recovery]" and "IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine];".

This has happened twice (out of the 2 times tried) in a testing environment, following the steps given below. A few things to note:

I do not recall having the same problem with Marvel and Elasticsearch 2 when following the same steps.

Any suggestions on how to figure out whether this is due to data corruption, and why it affects only the Monitoring plugin indices? I am not sure if it is related, but the other indices have a very low write rate since this is a test environment, and I don't know how many writes per second the Monitoring plugin performs.
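
A minimal sketch, assuming Elasticsearch is reachable on localhost:9200 with no authentication, of how the state of the stuck shards can be inspected through the standard cat shards and index recovery APIs:

```python
# Diagnostic sketch: list shard states and recovery progress for the
# Monitoring indices. Assumes Elasticsearch on localhost:9200 with no auth.
import json
import urllib.request

BASE = "http://localhost:9200"

def get(path):
    with urllib.request.urlopen(BASE + path) as resp:
        return resp.read().decode("utf-8")

# Shard-level view: a stuck primary shows up as INITIALIZING or UNASSIGNED.
print(get("/_cat/shards/.monitoring-*?v"))

# Recovery details for the same indices (recovery type and stage per shard).
recovery = json.loads(get("/.monitoring-*/_recovery"))
for index, data in recovery.items():
    for shard in data.get("shards", []):
        print(index, shard["id"], shard["type"], shard["stage"])
```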

Steps to reproduce:

  1. Initialise a single-node Elasticsearch cluster in a Docker container, connected to another container running Kibana. The X-Pack plugin should be installed in both images.
  2. Give it some time to initialise the indices, then load Kibana; everything should look good, including the Monitoring plugin. Use the cluster for a while and send some records to Elasticsearch.
  3. Stop the cluster, for example with "docker-compose stop".
  4. Restart the cluster. The Monitoring indices cannot be recovered (see the sketch after this list).
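
For step 4, a minimal sketch of one way to confirm that the Monitoring indices stay red after the restart (assuming a single node on localhost:9200 with no authentication):

```python
# One way to verify step 4: poll index-level cluster health after the restart
# and report any index that stays red. Assumes a single node on localhost:9200.
import json
import time
import urllib.request

BASE = "http://localhost:9200"

def indices_health():
    url = BASE + "/_cluster/health?level=indices"
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))["indices"]

deadline = time.time() + 300  # allow five minutes for recovery
while time.time() < deadline:
    red = sorted(name for name, info in indices_health().items()
                 if info["status"] == "red")
    if not red:
        print("all indices recovered")
        break
    print("still red:", ", ".join(red))
    time.sleep(10)
```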

Provide logs (if relevant):

{"log":"[2016-07-05 20:56:55,776][INFO ][gateway                  ] [Flying Tiger] recovered [18] indices into cluster_state\n","stream":"stdout","time":"2016-07-05T20:56:55.776542565Z"}
{"log":"[2016-07-05 20:57:05,422][INFO ][monitor.jvm              ] [Flying Tiger] [gc][35] overhead, spent [369ms] collecting in the last [1.2s]\n","stream":"stdout","time":"2016-07-05T20:57:05.468691606Z"}
{"log":"[2016-07-05 20:57:26,236][WARN ][indices.cluster          ] [Flying Tiger] [[.monitoring-data-2][0]] marking and sending shard failed due to [failed recovery]\n","stream":"stdout","time":"2016-07-05T20:57:26.238632342Z"}
{"log":"RecoveryFailedException[[.monitoring-data-2][0]: Recovery failed from null into {Flying Tiger}{CgS1ucvZSay5Rjw07LqhMQ}{172.17.0.6}{172.17.0.6:9300}]; nested: IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: EOFException;\n","stream":"stdout","time":"2016-07-05T20:57:26.238689525Z"}
{"log":"\u0009at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$0(IndexShard.java:1369)\n","stream":"stdout","time":"2016-07-05T20:57:26.238706279Z"}
{"log":"\u0009at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:392)\n","stream":"stdout","time":"2016-07-05T20:57:26.238720854Z"}
{"log":"\u0009at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)\n","stream":"stdout","time":"2016-07-05T20:57:26.23873453Z"}
{"log":"\u0009at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)\n","stream":"stdout","time":"2016-07-05T20:57:26.238750303Z"}
{"log":"\u0009at java.lang.Thread.run(Thread.java:745)\n","stream":"stdout","time":"2016-07-05T20:57:26.238766104Z"}
{"log":"Caused by: [.monitoring-data-2/-4ZQISQpRQu5B58rxnkYgg][[.monitoring-data-2][0]] IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: EOFException;\n","stream":"stdout","time":"2016-07-05T20:57:26.238789176Z"}
{"log":"\u0009at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:229)\n","stream":"stdout","time":"2016-07-05T20:57:26.238805723Z"}
{"log":"\u0009at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:78)\n","stream":"stdout","time":"2016-07-05T20:57:26.238818972Z"}
{"log":"\u0009at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:123)\n","stream":"stdout","time":"2016-07-05T20:57:26.238834596Z"}
{"log":"\u0009at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:76)\n","stream":"stdout","time":"2016-07-05T20:57:26.238849856Z"}
{"log":"\u0009at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1119)\n","stream":"stdout","time":"2016-07-05T20:57:26.238862794Z"}
{"log":"\u0009at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$0(IndexShard.java:1365)\n","stream":"stdout","time":"2016-07-05T20:57:26.23887556Z"}
{"log":"\u0009... 4 more\n","stream":"stdout","time":"2016-07-05T20:57:26.238889044Z"}
{"log":"Caused by: [.monitoring-data-2/-4ZQISQpRQu5B58rxnkYgg][[.monitoring-data-2][0]] EngineCreationFailureException[failed to create engine]; nested: EOFException;\n","stream":"stdout","time":"2016-07-05T20:57:26.23890146Z"}
{"log":"\u0009at org.elasticsearch.index.engine.InternalEngine.\u003cinit\u003e(InternalEngine.java:139)\n","stream":"stdout","time":"2016-07-05T20:57:26.238917752Z"}
{"log":"\u0009at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25)\n","stream":"stdout","time":"2016-07-05T20:57:26.238934861Z"}
{"log":"\u0009at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1430)\n","stream":"stdout","time":"2016-07-05T20:57:26.238947889Z"}
{"log":"\u0009at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1414)\n","stream":"stdout","time":"2016-07-05T20:57:26.238960702Z"}
{"log":"\u0009at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:928)\n","stream":"stdout","time":"2016-07-05T20:57:26.238972961Z"}
{"log":"\u0009at org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:890)\n","stream":"stdout","time":"2016-07-05T20:57:26.238985065Z"}
{"log":"\u0009at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:225)\n","stream":"stdout","time":"2016-07-05T20:57:26.239019441Z"}
{"log":"\u0009... 9 more\n","stream":"stdout","time":"2016-07-05T20:57:26.239032857Z"}
{"log":"Caused by: java.io.EOFException\n","stream":"stdout","time":"2016-07-05T20:57:26.239044567Z"}
{"log":"\u0009at org.apache.lucene.store.InputStreamDataInput.readByte(InputStreamDataInput.java:37)\n","stream":"stdout","time":"2016-07-05T20:57:26.239056161Z"}
{"log":"\u0009at org.apache.lucene.store.DataInput.readInt(DataInput.java:101)\n","stream":"stdout","time":"2016-07-05T20:57:26.239068626Z"}
{"log":"\u0009at org.apache.lucene.store.DataInput.readLong(DataInput.java:157)\n","stream":"stdout","time":"2016-07-05T20:57:26.239084091Z"}
{"log":"\u0009at org.elasticsearch.index.translog.Checkpoint.\u003cinit\u003e(Checkpoint.java:52)\n","stream":"stdout","time":"2016-07-05T20:57:26.23910018Z"}
{"log":"\u0009at org.elasticsearch.index.translog.Checkpoint.read(Checkpoint.java:81)\n","stream":"stdout","time":"2016-07-05T20:57:26.239113599Z"}
{"log":"\u0009at org.elasticsearch.index.translog.Translog.recoverFromFiles(Translog.java:238)\n","stream":"stdout","time":"2016-07-05T20:57:26.239125431Z"}
{"log":"\u0009at org.elasticsearch.index.translog.Translog.\u003cinit\u003e(Translog.java:180)\n","stream":"stdout","time":"2016-07-05T20:57:26.239137437Z"}
{"log":"\u0009at org.elasticsearch.index.engine.InternalEngine.openTranslog(InternalEngine.java:229)\n","stream":"stdout","time":"2016-07-05T20:57:26.239149796Z"}
{"log":"\u0009at org.elasticsearch.index.engine.InternalEngine.\u003cinit\u003e(InternalEngine.java:136)\n","stream":"stdout","time":"2016-07-05T20:57:26.239164805Z"}
{"log":"\u0009... 15 more\n","stream":"stdout","time":"2016-07-05T20:57:26.239179966Z"}
{"log":"[2016-07-05 20:57:26,238][WARN ][cluster.action.shard     ] [Flying Tiger] [.monitoring-data-2][0] received shard failed for target shard [[.monitoring-data-2][0], node[CgS1ucvZSay5Rjw07LqhMQ], [P], s[INITIALIZING], a[id=kB0PjIVzSkSdXq_MYghS-A], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-07-05T20:56:47.076Z]]], source shard [[.monitoring-data-2][0], node[CgS1ucvZSay5Rjw07LqhMQ], [P], s[INITIALIZING], a[id=kB0PjIVzSkSdXq_MYghS-A], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-07-05T20:56:47.076Z]]], message [failed recovery], failure [RecoveryFailedException[[.monitoring-data-2][0]: Recovery failed from null into {Flying Tiger}{CgS1ucvZSay5Rjw07LqhMQ}{172.17.0.6}{172.17.0.6:9300}]; nested: IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: EOFException; ]\n","stream":"stdout","time":"2016-07-05T20:57:26.251208298Z"}
{"log":"RecoveryFailedException[[.monitoring-data-2][0]: Recovery failed from null into {Flying Tiger}{CgS1ucvZSay5Rjw07LqhMQ}{172.17.0.6}{172.17.0.6:9300}]; nested: IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: EOFException;\n","stream":"stdout","time":"2016-07-05T20:57:26.25132659Z"}
{"log":"\u0009at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$0(IndexShard.java:1369)\n","stream":"stdout","time":"2016-07-05T20:57:26.251345996Z"}
{"log":"\u0009at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:392)\n","stream":"stdout","time":"2016-07-05T20:57:26.25136007Z"}
{"log":"\u0009at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)\n","stream":"stdout","time":"2016-07-05T20:57:26.251373122Z"}
{"log":"\u0009at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)\n","stream":"stdout","time":"2016-07-05T20:57:26.251386727Z"}
{"log":"\u0009at java.lang.Thread.run(Thread.java:745)\n","stream":"stdout","time":"2016-07-05T20:57:26.251399962Z"}
{"log":"Caused by: [.monitoring-data-2/-4ZQISQpRQu5B58rxnkYgg][[.monitoring-data-2][0]] IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: EOFException;\n","stream":"stdout","time":"2016-07-05T20:57:26.251413566Z"}
{"log":"\u0009at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:229)\n","stream":"stdout","time":"2016-07-05T20:57:26.251446542Z"}
{"log":"\u0009at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:78)\n","stream":"stdout","time":"2016-07-05T20:57:26.251462264Z"}
{"log":"\u0009at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:123)\n","stream":"stdout","time":"2016-07-05T20:57:26.251474711Z"}
{"log":"\u0009at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:76)\n","stream":"stdout","time":"2016-07-05T20:57:26.251487195Z"}
{"log":"\u0009at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1119)\n","stream":"stdout","time":"2016-07-05T20:57:26.251499122Z"}
{"log":"\u0009at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$0(IndexShard.java:1365)\n","stream":"stdout","time":"2016-07-05T20:57:26.25151102Z"}
{"log":"\u0009... 4 more\n","stream":"stdout","time":"2016-07-05T20:57:26.251523225Z"}
{"log":"Caused by: [.monitoring-data-2/-4ZQISQpRQu5B58rxnkYgg][[.monitoring-data-2][0]] EngineCreationFailureException[failed to create engine]; nested: EOFException;\n","stream":"stdout","time":"2016-07-05T20:57:26.251535417Z"}
{"log":"\u0009at org.elasticsearch.index.engine.InternalEngine.\u003cinit\u003e(InternalEngine.java:139)\n","stream":"stdout","time":"2016-07-05T20:57:26.251548104Z"}
{"log":"\u0009at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25)\n","stream":"stdout","time":"2016-07-05T20:57:26.251560111Z"}
{"log":"\u0009at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1430)\n","stream":"stdout","time":"2016-07-05T20:57:26.251572432Z"}
{"log":"\u0009at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1414)\n","stream":"stdout","time":"2016-07-05T20:57:26.251584408Z"}
{"log":"\u0009at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:928)\n","stream":"stdout","time":"2016-07-05T20:57:26.251596442Z"}
{"log":"\u0009at org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:890)\n","stream":"stdout","time":"2016-07-05T20:57:26.251608984Z"}
{"log":"\u0009at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:225)\n","stream":"stdout","time":"2016-07-05T20:57:26.251621031Z"}
{"log":"\u0009... 9 more\n","stream":"stdout","time":"2016-07-05T20:57:26.251633349Z"}
{"log":"Caused by: java.io.EOFException\n","stream":"stdout","time":"2016-07-05T20:57:26.251644844Z"}
{"log":"\u0009at org.apache.lucene.store.InputStreamDataInput.readByte(InputStreamDataInput.java:37)\n","stream":"stdout","time":"2016-07-05T20:57:26.251656861Z"}
{"log":"\u0009at org.apache.lucene.store.DataInput.readInt(DataInput.java:101)\n","stream":"stdout","time":"2016-07-05T20:57:26.251668845Z"}
{"log":"\u0009at org.apache.lucene.store.DataInput.readLong(DataInput.java:157)\n","stream":"stdout","time":"2016-07-05T20:57:26.251680411Z"}
{"log":"\u0009at org.elasticsearch.index.translog.Checkpoint.\u003cinit\u003e(Checkpoint.java:52)\n","stream":"stdout","time":"2016-07-05T20:57:26.251692468Z"}
{"log":"\u0009at org.elasticsearch.index.translog.Checkpoint.read(Checkpoint.java:81)\n","stream":"stdout","time":"2016-07-05T20:57:26.251704403Z"}
{"log":"\u0009at org.elasticsearch.index.translog.Translog.recoverFromFiles(Translog.java:238)\n","stream":"stdout","time":"2016-07-05T20:57:26.251719643Z"}
{"log":"\u0009at org.elasticsearch.index.translog.Translog.\u003cinit\u003e(Translog.java:180)\n","stream":"stdout","time":"2016-07-05T20:57:26.251731953Z"}
{"log":"\u0009at org.elasticsearch.index.engine.InternalEngine.openTranslog(InternalEngine.java:229)\n","stream":"stdout","time":"2016-07-05T20:57:26.251744128Z"}
{"log":"\u0009at org.elasticsearch.index.engine.InternalEngine.\u003cinit\u003e(InternalEngine.java:136)\n","stream":"stdout","time":"2016-07-05T20:57:26.25175681Z"}
{"log":"\u0009... 15 more\n","stream":"stdout","time":"2016-07-05T20:57:26.251779075Z"}
@jasontedor
Member

Official docker image (https://hub.docker.com/_/elasticsearch/, Debian Jessie) running on a Fedora 23 host.

To be clear, the "official" Docker image on Docker Hub is not affiliated with Elastic.

@clintongormley added the discuss and :Distributed Indexing/Recovery labels Jul 6, 2016
@clintongormley
Contributor

clintongormley commented Jul 6, 2016

It looks like your monitoring index is corrupt for some reason. What I don't understand is why it keeps trying to recover; it should try 5 times and then stop, given that #18467 is also in alpha3.

/cc @s1monw
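
A minimal sketch, assuming localhost:9200 with no authentication and that the allocation explain API and the retry_failed reroute flag from #18467 are available in this build, of how the unassigned reason could be inspected and a retry requested once the underlying store problem is fixed:

```python
# Sketch: ask why the primary of .monitoring-data-2 is unassigned and, once the
# underlying store problem is fixed, request a retry of failed allocations.
# Assumes localhost:9200, no auth, and that the allocation explain API and the
# retry_failed reroute flag from #18467 exist in this build.
import json
import urllib.request

BASE = "http://localhost:9200"

def post(path, body=None):
    data = json.dumps(body).encode("utf-8") if body is not None else b""
    req = urllib.request.Request(BASE + path, data=data, method="POST",
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Why is the shard unassigned, and how many allocation attempts have failed?
explanation = post("/_cluster/allocation/explain",
                   {"index": ".monitoring-data-2", "shard": 0, "primary": True})
print(json.dumps(explanation, indent=2))

# Retry allocations that already hit index.allocation.max_retries (default 5).
print(post("/_cluster/reroute?retry_failed=true"))
```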

@clintongormley
Contributor

Was this a fresh install? Was the monitoring index created the first time you started Elasticsearch? Was there a disk-full event?
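
A minimal sketch, assuming localhost:9200 with no authentication, of one way to check whether the node ran low on disk, via the nodes stats filesystem section:

```python
# Sketch: report free vs. total disk space per node from the nodes stats API,
# one way to rule out a disk-full event. Assumes localhost:9200 with no auth.
import json
import urllib.request

BASE = "http://localhost:9200"

with urllib.request.urlopen(BASE + "/_nodes/stats/fs") as resp:
    stats = json.loads(resp.read().decode("utf-8"))

for node_id, node in stats["nodes"].items():
    total = node["fs"]["total"]
    free_gb = total["free_in_bytes"] / 1024 ** 3
    total_gb = total["total_in_bytes"] / 1024 ** 3
    print("%s: %.1f GiB free of %.1f GiB" % (node["name"], free_gb, total_gb))
```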

@clintongormley
Contributor

Also, I don't understand why the log message says "Recovery failed from null into {Flying Tiger}". Why null?

@clintongormley
Contributor

@krystalcode Any chance we can have the logs from the first startup through shutdown, and from the second startup up to the point where you start seeing this message?

@clintongormley
Contributor

@krystalcode I've just tried and failed to replicate this without Docker. Any chance you could try without Docker to see if it still happens?

@krystalcode
Author

@clintongormley yes, this was a fresh install (or rather a fresh launch of the Docker image), with the Elasticsearch data volume mounted on a clean directory. The monitoring index must have been created when I first started Elasticsearch, since it certainly did not exist before, and Kibana did not report any problems with the Monitoring plugin the first time, so it must have been created successfully. I will look for more logs tomorrow, even though I don't think there is more information beyond the 1+ gigabyte of the log provided, repeated. I will also try to recreate the situation with and without Docker and report back.

@krystalcode
Author

Also, no, there was no disk-full event, and I do not know what "from null" should normally be, so I cannot interpret it.

@s1monw
Contributor

s1monw commented Jul 7, 2016

@clintongormley the null is not important; it just means the source node is null, since it's a store recovery. We can improve that.

@clintongormley
Contributor

@krystalcode so far we have been unable to reproduce this. Any chance you can check again, starting from a clean Docker instance, and see whether you can reliably reproduce it?

@krystalcode
Author

I tried to reproduce this 2 more times, but I haven't been able to either. Unfortunately, I haven't kept the logs from the 2 times I hit this failure; they were larger than 2 gigabytes and I deleted them.

I think we can assume there was a disk failure that caused data corruption. The fact that it happened consistently on the Monitoring plugin indices, which is why I raised this issue, might be because the Monitoring indices held more data than the other indices, so they were more likely to get corrupted.

Is there any need to discuss the fact that Elasticsearch failed to stop reallocating the shard (as pointed out in #19164 (comment))? If not, or if that discussion belongs in #19164, we can close this issue.

@clintongormley
Contributor

thanks for the feedback @krystalcode
