After upgrade from 1.0.* to 1.4.1, checksum check fails after restart causing cluster to go red after yellow state #8805

Closed
bluelu opened this issue Dec 7, 2014 · 15 comments

Comments

@bluelu

bluelu commented Dec 7, 2014

We upgraded our cluster from 1.0.* to 1.4.1.

After the upgrade, 3 shards were indeed correctly identified as broken (checksum check failed); we fixed those indices and confirmed they really had errors. Before the restart described below, the cluster state was nearly completely green.

Then we had to restart our cluster again:

  • The cluster state turned yellow (all primaries allocated)
  • Then it turned red again, caused by the allocation of the non-primary shards (replicas) for some indices.
  • The checksum check failed on about 1/50th of the shards we had indexed data into:
    We ran CheckIndex on the shards on disk and they were not corrupt. A hardware error is also very unlikely, since these servers only hold 2-3 shards each, so many independent hardware failures would have been required.
  • We deleted the checksum and the marker file and ES loaded the shards again automatically.

Could it be related to the merging of old and new segments? (We didn't observe this on shards we didn't index into.) At the moment we just delete the checksum and the marker file. What should we do?

Master log:
[2014-12-07 00:03:59,192][WARN ][cluster.action.shard ] [master] [index1][6] received shard failed for [index1][6], node[QPUH7WcyT3SSBuYjCvKHaQ], [P], s[STARTED], indexUUID [vxjN24PlRROou8Y-W6ObPw], reason [engine failure, message [corrupt file detected source: [recovery phase 1]][RecoverFilesRecoveryException[[index1][6] Failed to transfer [86] files with total size of [80.3gb]]; nested: CorruptIndexException[checksum failed (hardware problem?) : expected=xzqmes actual=a2zr8o resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@40ea4a1e)]; ]] {elasticsearch[master][[transport_server_worker.default]][T#11]{New I/O worker #28}}
[2014-12-07 00:03:59,265][WARN ][cluster.action.shard ] [master] [index2][8] received shard failed for [index2][8], node[n-kMHaf-QH2LTjncGjmkLw], [P], s[STARTED], indexUUID [VU0RN4QtRo2Ciae8b6oT7w], reason [engine failure, message [corrupt file detected source: [recovery phase 1]][RecoverFilesRecoveryException[[index2][8] Failed to transfer [110] files with total size of [82.1gb]]; nested: CorruptIndexException[checksum failed (hardware problem?) : expected=wtmawb actual=85psa3 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@1a87889d)]; ]] {elasticsearch[master][[transport_server_worker.default]][T#3]{New I/O worker #20}}
[2014-12-07 00:03:59,660][WARN ][cluster.action.shard ] [master] [index3][7] received shard failed for [index3][7], node[pHKQxOBYTuqReDWDStP6JQ], [P], s[STARTED], indexUUID [v_5dSwWwQQ-ylb000A9s5Q], reason [engine failure, message [corrupt file detected source: [recovery phase 1]][RecoverFilesRecoveryException[[index3][7] Failed to transfer [113] files with total size of [82.7gb]]; nested: CorruptIndexException[checksum failed (hardware problem?) : expected=1ut1u4d actual=dsmbzp resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@2e3275f)]; ]] {elasticsearch[master][[transport_server_worker.default]][T#10]{New I/O worker #27}}
[2014-12-07 00:03:59,822][WARN ][cluster.action.shard ] [master] [index4][7] received shard failed for [index4][7], node[Petlv8BJTXeAldR66ar_RQ], [P], s[STARTED], indexUUID [EVdW2JJLSwmhCQcQ9zWiuQ], reason [engine failure, message [corrupt file detected source: [recovery phase 1]][RecoverFilesRecoveryException[[index4][7] Failed to transfer [133] files with total size of [81.8gb]]; nested: CorruptIndexException[checksum failed (hardware problem?) : expected=er2pdw actual=1a0wwft resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@77e5ccc1)]; ]] {elasticsearch[master][[transport_server_worker.default]][T#3]{New I/O worker #20}}
[2014-12-07 00:03:59,839][WARN ][cluster.action.shard ] [master] [index5][1] received shard failed for [index5][1], node[iCaUmle9SZOeK5z_VqAwwQ], [P], s[STARTED], indexUUID [_SdOrcFJSj6I8jI3Rxus0Q], reason [engine failure, message [corrupt file detected source: [recovery phase 1]][RecoverFilesRecoveryException[[index5][1] Failed to transfer [136] files with total size of [81.8gb]]; nested: CorruptIndexException[checksum failed (hardware problem?) : expected=1s2u9d3 actual=cggxwd resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@1d4e624)]; ]] {elasticsearch[master][[transport_server_worker.default]][T#6]{New I/O worker #23}}
[2014-12-07 00:03:59,975][WARN ][cluster.action.shard ] [master] [index6][0] received shard failed for [index6][0], node[t5ieNHyPScOzEew0Rd0EcA], [P], s[STARTED], indexUUID [2lk2p8AKQSSxFF_iLffPUA], reason [engine failure, message [corrupt file detected source: [recovery phase 1]][RecoverFilesRecoveryException[[index6][0] Failed to transfer [120] files with total size of [82gb]]; nested: CorruptIndexException[checksum failed (hardware problem?) : expected=mgcerl actual=14hf0lf resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@661abd91)]; ]] {elasticsearch[master][[transport_server_worker.default]][T#15]{New I/O worker #32}}
[2014-12-07 00:04:00,251][WARN ][cluster.action.shard ] [master] [index7][8] received shard failed for [index7][8], node[GwY2MBlwRHWbKYOgoAqBiA], [P], s[STARTED], indexUUID [bYPu5KhiTYqumxrzRh7OZg], reason [engine failure, message [corrupt file detected source: [recovery phase 1]][RecoverFilesRecoveryException[[index7][8] Failed to transfer [176] files with total size of [77.6gb]]; nested: CorruptIndexException[checksum failed (hardware problem?) : expected=1xnasey actual=pmztvp resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@1e3484f)]; ]] {elasticsearch[master][[transport_server_worker.default]][T#12]{New I/O worker #29}}
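
(For reference, we listed the affected shards with something along these lines; host and port are whatever your nodes use:)

  # list every shard that is not STARTED (failed/unassigned replicas show up here)
  curl -s 'http://localhost:9200/_cat/shards?v' | grep -v STARTED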

@s1monw
Contributor

s1monw commented Dec 7, 2014

We do check checksums for small files on startup, and larger files are checksummed on merge. If you delete the checksum marker you are just bringing back your corrupted shard. If you have primaries that are corrupted, note that they might have been corrupted for a long time already; old ES versions simply didn't check this. What I am wondering about is why you only see this now, since 1.0.x didn't even write checksums at the Lucene level, so the on-merge theory doesn't hold. I also don't see that problem in the logs. ES 1.4.1 now verifies old files against the legacy Adler32 checksums that ES itself wrote, which could explain your problems, but apparently they only happen on relocation / recovery, right? Can you tell whether it happened during a recovery or a relocation? Is your primary affected too? If so, I don't think it's easy to recover without reindexing, to be honest...
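
If it helps with debugging, you can also let ES run a check when a shard opens via the index.shard.check_on_startup index setting (it is expensive, so only use it on a test or suspect index). A rough sketch, with the index name and host purely illustrative:

  # create a throwaway index that runs CheckIndex every time one of its shards is opened
  curl -XPUT 'http://localhost:9200/checksum_debug_test' -d '{
    "settings": { "index.shard.check_on_startup": "true" }
  }'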

@bluelu
Author

bluelu commented Dec 7, 2014

While the cluster was running before the restart, we didn't observe any checksum errors, except for 3 indices which really had checksum errors (Lucene's CheckIndex also detected them).

We then restarted the cluster and all primaries were successfully allocated, so the cluster was in YELLOW state. Then ES started recovering the non-primaries to bring up all replicas, which I assume caused the primaries to be marked as corrupted; they were no longer loaded, turning those indices RED.

Lucene's CheckIndex doesn't find any corruption in the indices, so I highly doubt that they are corrupted. Also, since this happened on machines which only hold 2 shards each, I highly doubt it's a hardware error, as it occurred for a lot of shards (about 30).

We had the same issue on our test cluster before (but there we suspected just an error on our side), so hopefully we can reproduce it there with a simpler test case.
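
(For reference, the CheckIndex runs look roughly like this; the lucene-core jar version and the shard path below are illustrative, and -fix should only ever be run on a copy:)

  # verify one shard's Lucene index; the trailing ... enables assertions for the whole package
  java -ea:org.apache.lucene... -cp lucene-core-4.10.2.jar org.apache.lucene.index.CheckIndex \
    /data/elasticsearch/nodes/0/indices/index1/6/index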

@ghost

ghost commented Dec 9, 2014

We are still trying to reproduce the issue in isolation. I'm not sure if it has something to do with it, but just to clarify: we are using two data directories.

Currently it seems that the issue appears when the primary goes down and the replica takes over. We are not sure if it occurs immediately or only after the replica gets promoted to primary and streams the data to a new replica. When closing the indices and running CheckIndex on the data files, no error is found, but the node complains about wrong checksums after reopening the index. Deleting the checksums as well as the corrupted file seems to resolve the issue in this case. It looks like an issue with the legacy checksums, maybe not being updated correctly, or it may be related to deleted documents.
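
(To see when the replica copies and promotions actually happen, we watched the recovery API; the host is illustrative:)

  # shows ongoing and completed recoveries per shard, including replica copies and relocations
  curl -s 'http://localhost:9200/_cat/recovery?v'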

@tomcashman

This has also happened to our team when upgrading from 1.1.2 to 1.4.1.

@ghost

ghost commented Dec 10, 2014

The issue occurred on our cluster for >100 shards.

Here is an example of one index which is not corrupted (according to CheckIndex) but fails to recover because of the checksum. The affected files are _e.cfe and _e.cfs. What is strange is that both files are newer than the checksum file.

https://gist.github.com/miccon/b8df3402bdf32bdf6366

We solved the issue on our side by deleting the checksums as well as the corrupted file and updating the indices. Since then the issue has not reappeared, but it looks like a bug related to the legacy checksums and replication.

@cywjackson

@miccon could you please elaborate on your solution, "We solved the issue on our side by deleting the checksums as well as the corrupted file and updating the indices"? I.e. which files to remove, and the steps (stop / start / disable allocation? what exactly did you update on the indices?). It looks to me like I am having the same issue, please see #7430 (comment). Much appreciated.

@ghost

ghost commented Jan 7, 2015

In the shard's data directory you will find the checksum-xxx file containing the checksums, as well as the corrupted-xxx file, which tells Elasticsearch that it should not reopen the index because it is broken. If you delete first the checksums file and then the corrupted file (you don't need to start or stop the node or close the index in this case), Elasticsearch should open the index again.

Please note that if your index is indeed broken and needs repair, it will fail again sooner or later. But as noted above, we also hit this with indices that had no actual problem (Lucene CheckIndex ran successfully on them), and there the problem was resolved, although on replica allocation the whole index will be copied over (as the checksum file is missing).
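
A sketch of what that looks like per affected shard (the data path below is illustrative and the exact file names may differ slightly on your setup, so check the directory first; only do this for shards that CheckIndex reports as clean):

  # go to the affected shard's index directory (path illustrative)
  cd /data/elasticsearch/nodes/0/indices/index1/6/index
  # 1) delete the checksum file(s) first
  rm _checksums-*
  # 2) only then delete the corruption marker, so it is not immediately recreated
  rm corrupted_*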

@cywjackson

Thanks @miccon. Real quick: why the order of "first the checksums file and then the corrupted file"?
Right now the number of checksum files we have:

usw2a-search5-prod. 360
usw2a-search4-prod. 380
usw2c-search6-prod. 401
usw2c-search7-prod. 406
usw2b-search1-prod. 385
usw2b-search2-prod. 472

and the number of corrupted files we have:

usw2a-search5-prod. 33
usw2a-search4-prod. 0
usw2b-search1-prod. 9
usw2c-search6-prod. 12
usw2c-search7-prod. 27
usw2b-search2-prod. 5

Would a find ... -exec rm {} on them work? (I could run 2 commands, first the checksums then the corrupted, if the order matters...)

@ghost

ghost commented Jan 7, 2015

The corrupted file will be recreated if the checksum file is still present.
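
(So two passes in that order should work; the data path and name patterns below are illustrative, double-check them against what you actually see on disk first:)

  # pass 1: remove the checksum files across all shards
  find /data/elasticsearch -name '_checksums-*' -exec rm {} \;
  # pass 2: only afterwards remove the corruption markers
  find /data/elasticsearch -name 'corrupted_*' -exec rm {} \;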

@cywjackson

thx again

@johd01

johd01 commented Jan 16, 2015

This happened for us on a 1.4.0 to 1.4.2 upgrade.

We deleted the _checksum files and the corruption marker files.

We noticed that we have really old Lucene segment versions in our shards, from 3.6.2 to 4.10.2.

I guess we need to upgrade all shards? We don't have the option to reindex.
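
(For anyone wanting to check the same thing, the per-segment Lucene version shows up in the segments API, roughly like this; host illustrative:)

  # per-shard segment details, including the Lucene version each segment was written with
  curl -s 'http://localhost:9200/_segments?pretty'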

@francoisforster

Would optimizing the index to 1 segment drop the older versions?

@gregoryb

The issue also occurred for us when upgrading from 1.2.1 to 1.4.2. We fixed it like the others, by deleting the _checksum files and corruption marker files, but I'm afraid to restart our cluster...

@clintongormley
Contributor

Would optimizing the index to 1 segment drop the older versions?

@francoisforster yes
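
In 1.x that's the optimize API; something like the following (index name and host illustrative), keeping in mind it rewrites the whole index, so it is I/O heavy:

  # merge the index down to a single segment, rewriting segments written by older Lucene versions
  curl -XPOST 'http://localhost:9200/index1/_optimize?max_num_segments=1'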

@clintongormley
Contributor

Closing, as this was resolved several versions ago.
