ES shard fails to recover due to "number of docs differ" #12661
Comments
@brwe any idea how shards with the same sync id can end up with differing numbers of documents?
Are you using delete-by-query by any chance? Can you tell me more about your usage pattern, the ES version, and the events that happened before you closed your index?
Before we closed the index, the two nodes lost their connection due to a ping timeout; the slave could not connect to the primary. The log is below:
[2015-08-04 04:02:09,424][WARN ][transport ] [ipattern-es01.iad1] Received response for a request that has timed out, sent [55643ms] ago, timed out [25642ms] ago, action [internal:discovery/zen/fd/master_ping], node [[ipattern-es02.iad1][CB3p3ZnZR_qUHGc05ZKK7Q][ipattern-es02.iad1][inet[/10.40.146.47:9300]]], id [15173706]
And one more question: after indexing 12 million documents, the Java heap usage is about 20G even after old GC. Is that normal? Unlike common log files, our documents are a little complicated, with an average size of about 500K. Monitoring with Marvel, I see the value under "Index Statistics -> Memory -> LUCENE MEMORY" keeps increasing; it is now about 30G. Is that related to the Java heap usage? What is LUCENE MEMORY for, and how can we mitigate the memory issue? We use Elasticsearch 1.6.0 with two data nodes, 5 shards, and 1 replica. The total index size on each node is now about 5TB.
What does your cluster look like: how many nodes, and are they in the same datacenter? Also, you're better off putting that second question on https://discuss.elastic.co :)
Hi markwalkom, thanks for your suggestion. I have already submitted it to https://discuss.elastic.co/t/large-index-size-cause-high-java-heap-occupation/26936
@kenny-ye Thanks a lot for the logs! This line:
might indicate that you were running into #7572. I assume Elasticsearch issued a synced flush at some point after the primary and replica got out of sync. Synced flush does not check the number of documents on the primary and replica before executing; it only makes sure that no indexing is going on while flushing. Such an inconsistency between primary and replica remains undetected until recovery starts. To allocate the replica, can you try the following:
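A minimal sketch of that step, assuming the standard flush API and the index name from the log below (the host and port are placeholders):
# a plain flush writes a new Lucene commit on the primary; per the comment above, this removes the existing sync id
curl -XPOST 'http://localhost:9200/ipattern-2015-07/_flush'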
This will remove the sync id from the primary and make sure Elasticsearch will not try to recover from the files on the replica copy that is missing the documents.
This should then cause the replica to be recovered from the primary, which has the documents.
No further info. Closing.
Ran into the same error with replica shards on one of my indexes. My fix was to change the index setting to
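The setting itself is cut off above, but given the later comment about toggling replicas for broken indexes, it was presumably number_of_replicas. A hedged sketch of that workaround, assuming the affected index is ipattern-2015-07 and using the standard index settings API (drop the replica so the broken copy is deleted, then add it back so a fresh copy is rebuilt from the primary):
# remove the replica copies of the affected index
curl -XPUT 'http://localhost:9200/ipattern-2015-07/_settings' -d '{"index": {"number_of_replicas": 0}}'
# once the cluster is green again, restore the replica count so a new copy is built from the primary
curl -XPUT 'http://localhost:9200/ipattern-2015-07/_settings' -d '{"index": {"number_of_replicas": 1}}'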
Just had the same problem after restarting one of my three nodes.
Happened to us as well on a 24-node cluster running 2.3.5; this cluster receives mostly bulk index requests. It looks like the problem happened after many nodes dropped out of the cluster because of some network hiccups.
The replica fix is ridiculous though; there should be a better way to fix it than toggling replicas for all broken indexes. That a disk-full cluster can even lead to broken shards is odd in itself.
@tomsommer this is a 2-year-old issue. I'm not sure what your issue is or which version it relates to, but chances are high it's a different one. Can you please either open a new issue with a good reproduction/description, or post on the discuss forums, so we can figure out what's going on?
We use two ES data nodes with -Xmx30g -Xms30g. The two ES servers each have 128G of physical memory and 24 CPU cores.
After about 2 weeks of bulk inserts, we indexed about 10 million documents in total (5 shards, 1 replica). The index size on disk is about 4.7TB. The ES servers then ran into frequent, long old-GC pauses, and even after old GC the heap usage stays above 20G.
I tried to free heap by stopping the bulk inserts and closing the index. When I reopened the index, the cluster went yellow: one replica shard failed to recover because its document count did not match the primary shard (see the diagnostic sketch after the log below). How can I recover the failed shard?
The server log:
[2015-08-05 03:30:26,461][WARN ][indices.cluster ] [ipattern-es01.iad1] [[ipattern-2015-07][3]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [ipattern-2015-07][3]: Recovery failed from [ipattern-es02.iad1][CB3p3ZnZR_qUHGc05ZKK7Q][ipattern-es02.iad1][inet[/10.40.146.47:9300]] into [ipattern-es01.iad1][gqF0BDR_S9SAlSIfGZVx_g][ipattern-es01.iad1][inet[ipattern-es01.iad1/10.40.146.46:9300]]
at org.elasticsearch.indices.recovery.RecoveryTarget.doRecovery(RecoveryTarget.java:280)
at org.elasticsearch.indices.recovery.RecoveryTarget.access$700(RecoveryTarget.java:70)
at org.elasticsearch.indices.recovery.RecoveryTarget$RecoveryRunner.doRun(RecoveryTarget.java:561)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.RemoteTransportException: [ipattern-es02.iad1][inet[/10.40.146.47:9300]][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: [ipattern-2015-07][3] Phase[1] Execution failed
at org.elasticsearch.index.engine.InternalEngine.recover(InternalEngine.java:898)
at org.elasticsearch.index.shard.IndexShard.recover(IndexShard.java:780)
at org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:125)
at org.elasticsearch.indices.recovery.RecoverySource.access$200(RecoverySource.java:49)
at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:146)
at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:132)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:279)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.indices.recovery.RecoverFilesRecoveryException: [ipattern-2015-07][3] Failed to transfer [0] files with total size of [0b]
at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:431)
at org.elasticsearch.index.engine.InternalEngine.recover(InternalEngine.java:893)
... 10 more
Caused by: java.lang.IllegalStateException: try to recover [ipattern-2015-07][3] from primary shard with sync id but number of docs differ: 2435163 (ipattern-es02.iad1, primary) vs 2435160(ipattern-es01.iad1)
at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:177)
... 11 more
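For reference, a hedged diagnostic sketch for the doc-count mismatch reported in the exception above, assuming the _cat/shards API (the host and port are placeholders): it lists each copy of the index's shards with its role, state, node, and document count, so the primary and replica of shard 3 can be compared once both copies are assigned (an unassigned copy shows no doc count).
# show role (p/r), state, doc count, and node for every copy of the affected index's shards
curl -XGET 'http://localhost:9200/_cat/shards/ipattern-2015-07?v&h=index,shard,prirep,state,docs,node'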