After upgrade from 1.0.* to 1.4.1, checksum check fails after restart causing cluster to go red after yellow state #8805
Comments
We do check checksums for small files on startup, and larger files are checksummed on merge. If you delete the checksum marker, you are just bringing back your corrupted shard. If you have primaries that are corrupted, by the way, they might have been corrupted for a long time already, but the old ES version didn't check this. What I am wondering about is why you only see this now, since 1.0.x didn't even write checksums at the Lucene level, so the on-merge theory is wrong. I also don't see the problem in the logs.
When the cluster was running before, we didn't observe any checksum errors, except for 3 indexes which really had checksum errors (Lucene's CheckIndex tool also detected them). We then restarted the cluster, and it turned yellow on those indexes, so all primaries were successfully allocated and the cluster was in YELLOW state. Then ES started recovery on the non-primaries to bring up all replicas, which I assume caused the primaries to be marked as corrupted, as they were no longer loaded, marking the indexes RED. Lucene's CheckIndex tool doesn't find any corruption in the indexes, so I strongly doubt that they are corrupted. It's also very unlikely to be a hardware error, since the affected machines only hold 2 shards each and it occurred for a lot of shards (about 30). We had the same issue on our test cluster before (though there we suspected an error on our side), so hopefully we can reproduce it there with a simpler test case.
We are still trying to reproduce the issue in isolation. I'm not sure whether it is related, but just to clarify: we are using two data directories. Currently it seems the issue appears when the primary goes down and the replica takes over. I'm not sure if it occurs immediately or only after the replica is promoted to primary and streams the data to a new replica. When closing the indices and running CheckIndex on the data files, no error is found, but the node complains about wrong checksums after reopening the index. Deleting the checksums as well as the corrupted-marker file seems to resolve the issue in this case. It looks like an issue with the legacy checksums, perhaps not being updated correctly, or it may be related to deleted documents.
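For reference, CheckIndex can be run directly against a shard's Lucene directory while the index is closed. A minimal sketch, assuming a 1.x on-disk layout; both the jar path and the shard path below are placeholders, not values taken from this issue:

```python
import subprocess
import sys

# Placeholder paths -- adjust to your installation and shard layout.
LUCENE_CORE_JAR = "/usr/share/elasticsearch/lib/lucene-core-4.10.2.jar"
SHARD_INDEX_DIR = "/var/lib/elasticsearch/<cluster>/nodes/0/indices/index1/6/index"

# CheckIndex walks every segment and verifies it can be read. Without the
# -fix flag it is read-only; -fix drops unreadable segments (data loss!).
result = subprocess.run(
    ["java", "-cp", LUCENE_CORE_JAR,
     "org.apache.lucene.index.CheckIndex", SHARD_INDEX_DIR],
    capture_output=True, text=True,
)
print(result.stdout)
sys.exit(result.returncode)
```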
This has also happened to our team when upgrading from 1.1.2 to 1.4.1.
The issue occurred on our cluster for >100 shards. Here is an example of one index which is not corrupted (according to CheckIndex) but fails to recover because of the checksum. The affected files are _e.cfe and _e.cfs. What is strange is that both files are newer than the checksum file. https://gist.github.com/miccon/b8df3402bdf32bdf6366 We solved the issue on our side by deleting the checksums as well as the corrupted file and updating the indices. Since then the issue has not reappeared, but it looks like a bug related to the legacy checksums and replication.
@miccon could you please elaborate on your solution in detail?
In the data directory you will find the checksum-xxx file containing the checksums, as well as the corrupted-xxx file, which tells Elasticsearch that it should not reopen the index because it's broken. If you then delete first the checksums file and then the corrupted file (you don't need to start or stop the node or close the index in this case), Elasticsearch should open the index. Please note that if your index is indeed broken and needs repair, it will fail again sooner or later. But as noted above, we also had the issue with indices that had no problems (we successfully ran Lucene's CheckIndex on them), and there the problem has been solved, although upon replica allocation the whole index will be copied over (as the checksum file is missing).
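To make that workaround concrete, here is a minimal, hedged sketch of the deletion @miccon describes. The file-name patterns (`_checksums-*` for the legacy checksum file, `corrupted_*` for the marker) are assumptions based on how this thread refers to them, and the path is a placeholder; only do this on shards that CheckIndex reports as clean.

```python
import glob
import os

# Placeholder path to the affected shard's Lucene directory.
SHARD_INDEX_DIR = "/var/lib/elasticsearch/<cluster>/nodes/0/indices/index1/6/index"

# Order matters (see the follow-up below): delete the legacy checksum file(s)
# first, then the corruption marker, otherwise the marker gets recreated.
for pattern in ("_checksums-*", "corrupted_*"):
    for path in glob.glob(os.path.join(SHARD_INDEX_DIR, pattern)):
        print("removing", path)
        os.remove(path)
```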
thx @miccon. Real quick: why that order (the checksums file first, then the corrupted file)? And given the number of corrupt files we have, would a …
The corrupted file will be recreated if the checksum file is still present.
thx again
This happened for us on a 1.4.0 to 1.4.2 upgrade. We deleted the _checksum files and corruption files. We noticed that we have really old Lucene versions on our shards, from 3.6.2 to 4.10.2. I guess we need to upgrade all shards? We don't have the option to reindex.
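To see which Lucene versions your segments are on, the segments API reports a version per live segment. A rough sketch, assuming a node reachable on localhost:9200 and the 1.x response layout; adjust the host to your cluster:

```python
import json
import urllib.request

ES = "http://localhost:9200"  # placeholder node address

with urllib.request.urlopen(ES + "/_segments") as resp:
    data = json.load(resp)

# Group indices by the Lucene version of their live segments so that
# old-format segments (e.g. 3.6.x) stand out before the next upgrade.
versions = {}
for index_name, index_data in data["indices"].items():
    for shard_copies in index_data["shards"].values():
        for shard_copy in shard_copies:
            for segment in shard_copy["segments"].values():
                versions.setdefault(segment["version"], set()).add(index_name)

for version in sorted(versions):
    print(version, "->", ", ".join(sorted(versions[version])))
```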
Would optimizing the index to 1 segment drop the older versions?
The issue also occurred for us when upgrading from 1.2.1 to 1.4.2. We fixed it like the others, deleting the _checksum files and corruption files, but I'm afraid to restart our cluster...
@francoisforster yes
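For completeness, a sketch of the optimize call in question (1.x API; the host and index name are placeholders). Forcing a merge down to a single segment rewrites the segments in the current format, which drops the old 3.x segment formats, but it is I/O-heavy on indices of this size:

```python
import urllib.request

ES = "http://localhost:9200"   # placeholder node address
INDEX = "index1"               # placeholder index name

# POST /<index>/_optimize?max_num_segments=1 rewrites the index into a
# single segment, discarding the old segment formats along the way.
req = urllib.request.Request(
    ES + "/" + INDEX + "/_optimize?max_num_segments=1", method="POST"
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode())
```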
Closing, as this was resolved several versions ago.
We did upgrade our cluster from 1.0.* to 1.4.1.
After the upgrade, we indeed had 3 shards correctly identified as broken (the checksum check failed), which we fixed (we repaired the indexes and confirmed they really had errors). Before we had to restart, the cluster state was nearly completely green.
Then we had to restart our cluster again:
We ran CheckIndex on the shards on disk and they were not corrupt. A hardware error is also very unlikely: these servers only hold 2-3 shards each, so this many failures would have required many independent hardware errors, which is implausible.
Could it be related to the merging of old and new segments? (We didn't observe this on shards we weren't indexing into.) At the moment we delete the checksum and the marker file. What should we do?
Master log:
[2014-12-07 00:03:59,192][WARN ][cluster.action.shard ] [master] [index1][6] received shard failed for [index1][6], node[QPUH7WcyT3SSBuYjCvKHaQ], [P], s[STARTED], indexUUID [vxjN24PlRROou8Y-W6ObPw], reason [engine failure, message [corrupt file detected source: [recovery phase 1]][RecoverFilesRecoveryException[[index1][6] Failed to transfer [86] files with total size of [80.3gb]]; nested: CorruptIndexException[checksum failed (hardware problem?) : expected=xzqmes actual=a2zr8o resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@40ea4a1e)]; ]] {elasticsearch[master][[transport_server_worker.default]][T#11]{New I/O worker #28}}
[2014-12-07 00:03:59,265][WARN ][cluster.action.shard ] [master] [index2][8] received shard failed for [index2][8], node[n-kMHaf-QH2LTjncGjmkLw], [P], s[STARTED], indexUUID [VU0RN4QtRo2Ciae8b6oT7w], reason [engine failure, message [corrupt file detected source: [recovery phase 1]][RecoverFilesRecoveryException[[index2][8] Failed to transfer [110] files with total size of [82.1gb]]; nested: CorruptIndexException[checksum failed (hardware problem?) : expected=wtmawb actual=85psa3 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@1a87889d)]; ]] {elasticsearch[master][[transport_server_worker.default]][T#3]{New I/O worker #20}}
[2014-12-07 00:03:59,660][WARN ][cluster.action.shard ] [master] [index3][7] received shard failed for [index3][7], node[pHKQxOBYTuqReDWDStP6JQ], [P], s[STARTED], indexUUID [v_5dSwWwQQ-ylb000A9s5Q], reason [engine failure, message [corrupt file detected source: [recovery phase 1]][RecoverFilesRecoveryException[[index3][7] Failed to transfer [113] files with total size of [82.7gb]]; nested: CorruptIndexException[checksum failed (hardware problem?) : expected=1ut1u4d actual=dsmbzp resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@2e3275f)]; ]] {elasticsearch[master][[transport_server_worker.default]][T#10]{New I/O worker #27}}
[2014-12-07 00:03:59,822][WARN ][cluster.action.shard ] [master] [index4][7] received shard failed for [index4][7], node[Petlv8BJTXeAldR66ar_RQ], [P], s[STARTED], indexUUID [EVdW2JJLSwmhCQcQ9zWiuQ], reason [engine failure, message [corrupt file detected source: [recovery phase 1]][RecoverFilesRecoveryException[[index4][7] Failed to transfer [133] files with total size of [81.8gb]]; nested: CorruptIndexException[checksum failed (hardware problem?) : expected=er2pdw actual=1a0wwft resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@77e5ccc1)]; ]] {elasticsearch[master][[transport_server_worker.default]][T#3]{New I/O worker #20}}
[2014-12-07 00:03:59,839][WARN ][cluster.action.shard ] [master] [index5][1] received shard failed for [index5][1], node[iCaUmle9SZOeK5z_VqAwwQ], [P], s[STARTED], indexUUID [_SdOrcFJSj6I8jI3Rxus0Q], reason [engine failure, message [corrupt file detected source: [recovery phase 1]][RecoverFilesRecoveryException[[index5][1] Failed to transfer [136] files with total size of [81.8gb]]; nested: CorruptIndexException[checksum failed (hardware problem?) : expected=1s2u9d3 actual=cggxwd resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@1d4e624)]; ]] {elasticsearch[master][[transport_server_worker.default]][T#6]{New I/O worker #23}}
[2014-12-07 00:03:59,975][WARN ][cluster.action.shard ] [master] [index6][0] received shard failed for [index6][0], node[t5ieNHyPScOzEew0Rd0EcA], [P], s[STARTED], indexUUID [2lk2p8AKQSSxFF_iLffPUA], reason [engine failure, message [corrupt file detected source: [recovery phase 1]][RecoverFilesRecoveryException[[index6][0] Failed to transfer [120] files with total size of [82gb]]; nested: CorruptIndexException[checksum failed (hardware problem?) : expected=mgcerl actual=14hf0lf resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@661abd91)]; ]] {elasticsearch[master][[transport_server_worker.default]][T#15]{New I/O worker #32}}
[2014-12-07 00:04:00,251][WARN ][cluster.action.shard ] [master] [index7][8] received shard failed for [index7][8], node[GwY2MBlwRHWbKYOgoAqBiA], [P], s[STARTED], indexUUID [bYPu5KhiTYqumxrzRh7OZg], reason [engine failure, message [corrupt file detected source: [recovery phase 1]][RecoverFilesRecoveryException[[index7][8] Failed to transfer [176] files with total size of [77.6gb]]; nested: CorruptIndexException[checksum failed (hardware problem?) : expected=1xnasey actual=pmztvp resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@1e3484f)]; ]] {elasticsearch[master][[transport_server_worker.default]][T#12]{New I/O worker #29}}
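When failures like the above pile up, a quick way to see which shard copies are left in a non-STARTED state is the cat shards API. A minimal sketch against a placeholder host; in 1.x the default columns are index, shard, prirep, state, docs, store, ip, node:

```python
import urllib.request

ES = "http://localhost:9200"  # placeholder node address

# Print every shard copy whose state is not STARTED (UNASSIGNED,
# INITIALIZING, RELOCATING), i.e. the ones keeping the cluster red/yellow.
with urllib.request.urlopen(ES + "/_cat/shards") as resp:
    for line in resp.read().decode().splitlines():
        columns = line.split()
        if len(columns) >= 4 and columns[3] != "STARTED":
            print(line)
```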