Internal: Indexes unusable after upgrade from 0.2 to 1.3 and cluster restart #7430
@philnate lemme ask a few questions:
Here is the script to reproduce the issue (just the steps; the script is only loosely synchronous and doesn't wait for ES to start...): https://gist.github.com/philnate/cfee1d171022b9eb3b23 Hope this helps.
Updating from 0.20 to 1.0.3 works, but trying to migrate to any version newer than 1.0.x gives me the described error. So this issue seems to have been introduced with 1.1.0, be it through the newer Lucene version or ES itself.
This is a Lucene issue; I managed to make a standalone test exposing it: https://issues.apache.org/jira/browse/LUCENE-5907
The Lucene issue is fixed, and will be included in Lucene 4.10.
@mikemccand thank you for your fast solution. I've patched ES 1.3.2 with the fix and a quick test seemed to be OK. Will do some additional testing.
Thanks for testing @philnate
@philnate thanks so much for verifying!! that is super helpful
Seems to be all fine. A colleague did some extended testing and everything was good. @s1monw do you plan to pick up Lucene 4.10 in some upcoming release of ES 1.3? That would allow us to upgrade directly to the latest and greatest ES, rather than migrating first to 1.0 and then to 1.3.
Lucene 4.9.1 is released, and I upgraded 1.3.x to it: c998d8d
It seems we face this issue when upgrading from 1.1.2 to 1.3.6 (6-node cluster, rolling upgrade). Here is a count from one of the logs:
So most of the errors are coming from the .si files. We also seem to be hitting issues like … There are … We have > 3000 shards total, so it's hard to say "which phase" or "primary/replica"… I am very curious to know exactly what steps are … Currently I am attempting to recover from this by setting …
So what do we do when this has already happened? Is this error recoverable? Is there a way for us to recreate the missing .si file? Currently it is happening to two of my indices, but only for one shard (out of 5) each. Even being able to somehow retain the data on the other 4 shards would be preferred.
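(Not from the original thread, just a hedged note for readers in this situation: a missing .si file cannot be regenerated, but Lucene's CheckIndex tool can report and, if asked, drop the segment that can no longer be opened, so the remaining segments of that shard stay readable; the documents in the dropped segment are lost. A minimal sketch against the Lucene 4.x API, with a hypothetical shard path, to be run only on a copy of the shard directory while the node is stopped:)

```java
import java.io.File;
import java.io.PrintStream;

import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Hedged sketch: inspect one shard's Lucene index and report segments that
// can no longer be opened (for example because their .si file is gone).
public class ShardCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical path; ES 1.x keeps shard data under
        // <data>/<cluster>/nodes/0/indices/<index>/<shard>/index
        File shardIndexDir = new File(args[0]);

        try (Directory dir = FSDirectory.open(shardIndexDir)) {
            CheckIndex checker = new CheckIndex(dir);
            checker.setInfoStream(new PrintStream(System.out, true, "UTF-8"));

            CheckIndex.Status status = checker.checkIndex();
            if (status.clean) {
                System.out.println("Shard index is clean.");
            } else {
                System.out.println("Broken segments: " + status.numBadSegments
                        + ", documents that would be lost: " + status.totLoseDocCount);
                // Uncomment to actually drop the broken segments
                // (irreversible -- the documents in them are gone):
                // checker.fixIndex(status);
            }
        }
    }
}
```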
We recently tried to upgrade an ES cluster from 0.2 to 1.3. The actual upgrade worked out fine, but once we restarted the whole cluster, we saw these warnings for all shards (constantly repeating):
When we shut down the cluster a couple of minutes after bringing it up with the new version, we saw this behavior just for the newest index. After about an hour, the behavior would be the same for other indexes after a cluster restart.
We found out that the indexes are upgraded, and on shutdown nearly all segment info (*.si) files are deleted (those which have a corresponding _upgraded.si marker). The surviving .si files seemed to be not upgraded (at least they don't have those marker files), and their content looks like this or this:
While the upgraded ones contain this kind of information afterwards:
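(A hedged side note, not part of the original report: to see which segments of a shard carry the upgrade marker described above, a small directory scan is enough. The _upgraded.si naming below follows the description in this report, and the shard path is a placeholder.)

```java
import java.io.File;

// Hedged sketch: list a shard's segment info files and flag which of them
// have the _upgraded.si marker mentioned above (e.g. _0.si alongside
// _0_upgraded.si for an upgraded segment).
public class ListSegmentInfos {
    public static void main(String[] args) {
        File shardIndexDir = new File(args[0]); // placeholder shard index path

        File[] files = shardIndexDir.listFiles();
        if (files == null) {
            System.err.println("Not a directory: " + shardIndexDir);
            return;
        }
        for (File f : files) {
            String name = f.getName();
            if (name.endsWith(".si") && !name.endsWith("_upgraded.si")) {
                // Marker name assumed from the report: <segment>_upgraded.si
                String marker = name.substring(0, name.length() - ".si".length()) + "_upgraded.si";
                boolean upgraded = new File(shardIndexDir, marker).exists();
                System.out.println(name + (upgraded ? "  (upgrade marker present)" : "  (no upgrade marker)"));
            }
        }
    }
}
```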
We could force the same behavior by triggering an optimize for a given index. By restarting one node at a time and waiting until it had fully rejoined the cluster, we were able to restore the deleted .si files (including the _upgraded.si marker files) from the other nodes. Afterwards the .si files were safe and didn't get deleted. The REST calls involved are sketched below.
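(Again a hedged sketch rather than the reporter's actual tooling: the waiting part of that workaround can be scripted against the ES 1.x REST API by forcing an optimize and then blocking on cluster health going green after each node restart. Host, port and index name are placeholders, and stopping/starting the node itself is left to whatever init system is in use.)

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Hedged sketch of the ES 1.x REST calls used in the workaround above:
// trigger _optimize on an index and wait for cluster health to report green
// again after each node restart. Node stop/start is not shown.
public class RollingRestartHelper {
    private static final String ES = "http://localhost:9200"; // placeholder

    public static void main(String[] args) throws Exception {
        // 1. Trigger an optimize on the affected index (placeholder name).
        httpCall("POST", ES + "/myindex/_optimize");

        // 2. After restarting each node (outside this sketch), block until the
        //    cluster reports green before moving on to the next node.
        httpCall("GET", ES + "/_cluster/health?wait_for_status=green&timeout=10m");
    }

    private static void httpCall(String method, String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod(method);
        int code = conn.getResponseCode();
        try (InputStream in = code < 400 ? conn.getInputStream() : conn.getErrorStream()) {
            StringBuilder body = new StringBuilder();
            byte[] buf = new byte[4096];
            int n;
            while (in != null && (n = in.read(buf)) > 0) {
                body.append(new String(buf, 0, n, "UTF-8"));
            }
            System.out.println(method + " " + url + " -> " + code + " " + body);
        } finally {
            conn.disconnect();
        }
    }
}
```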
To me it looks like either ES or Lucene remembers to delete the _upgraded.si files on VM shutdown but accidentally deletes the actual .si files as well.