es server always restart because of reading metadata file incorrectly #37286
Comments
Pinging @elastic/es-distributed
@kkewwei On node startup, Elasticsearch reads metadata from disk. If the metadata file is corrupted, node startup will fail. In this case, it seems that
It never appeared with ES 2 under any circumstances, and it does indeed arise with number of replicas = 0. Only a few files are corrupted and most files are fine, so formatting the data may not be a good idea given how important the data is. As a workaround, I deleted the corrupted file and it works well. Can we improve the process by skipping corrupted files on node startup? If so, we could recover most of the data.
@andrershov as part of #32006, do you think it would be useful to have a command-line tool that would allow a node to recover all index metadata except corrupted ones? In particular, rewrite the cluster state manifest file to remove a "corrupted" index?
@ywelsch I think it's a bigger discussion: not only index metadata, but also global metadata and the manifest itself could be affected. Do we want to recover from these kinds of situations?
I think we can treat this in a similar way to shard corruptions, for which we currently have a command-line tool. We could do a best-effort recovery of the metadata, with plenty of warnings. I don't think we should do this automatically at startup, however, but require an explicit administrative step. I also think that there are two levels of severity here: master-eligible or non-master-eligible node. For non-master-eligible nodes, temporarily removing the metadata might be relatively harmless, because this metadata will be recreated on the node when it joins the cluster. For master-eligible nodes it is trickier, as the revised metadata might now be published to other nodes, overriding the intact metadata that they might have.
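To illustrate what a non-automatic, report-only first step could look like, here is a minimal Python sketch. It is not an existing Elasticsearch tool. It assumes the on-disk layout visible in the reported path (`nodes/<n>/indices/<uuid>/_state/*.st`) and that a healthy state file ends with a 16-byte footer starting with the magic value the log prints as `expected footer=-1071082520`. The function names `footer_magic_ok` and `find_suspect_state_files` are hypothetical. The scan only lists suspect files and never modifies or deletes anything.

```python
import struct
from pathlib import Path

# Assumption: footer magic taken from the log line "expected footer=-1071082520",
# which is 0xC02893E8 when read as an unsigned 32-bit value.
FOOTER_MAGIC = 0xC02893E8


def footer_magic_ok(path):
    """Cheap check: does the file end with a 16-byte footer that starts with the magic?"""
    data = Path(path).read_bytes()
    if len(data) < 16:
        return False
    (magic,) = struct.unpack(">I", data[-16:-12])  # big-endian, as assumed for this format
    return magic == FOOTER_MAGIC


def find_suspect_state_files(data_path, is_ok=footer_magic_ok):
    """Dry run: return index state files that fail the check. Nothing is modified."""
    suspects = []
    pattern = "nodes/*/indices/*/_state/*.st"  # layout assumed from the reported path
    for state_file in sorted(Path(data_path).glob(pattern)):
        if not is_ok(state_file):
            suspects.append(state_file)
    return suspects
```

An administrator could run `find_suspect_state_files("/path/to/data")` on a stopped node and then decide per file what to do. As noted above, acting on the result is much riskier on a master-eligible node than on a data-only node.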
The storage format for metadata has changed in recent ES versions (7.6+), and is now Lucene-based. Given how few cases of metadata corruption we have seen (due to faulty hardware), I don't see the need to build automated tooling to support this (instead it should be treated as a full node failure).
ES_VERSION: 5.6.8
JVM version : JDK1.8.0_112
OS version : linux
Description of the problem including expected versus actual behavior:
This does not happen often. When the machine was turned off because of a hardware malfunction, the ES server was out of the cluster for a long time. Later, once the machine was working again and the ES server restarted, the ES service was expected to read its metadata automatically as planned, but instead it reported the error logs below and went down. As a result, ES keeps cycling between restarting and going down.
Provide logs (if relevant):
at org.elasticsearch.gateway.GatewayMetaState.<init>(Unknown Source)
while locating org.elasticsearch.gateway.GatewayMetaState
for parameter 4 at org.elasticsearch.gateway.GatewayService.<init>(Unknown Source)
while locating org.elasticsearch.gateway.GatewayService
Caused by: ElasticsearchException[java.io.IOException: failed to read [id:0, legacy:false, file:/data1/es_data/lc_gh_3/nodes/0/indices/rDD73n7GR7uw_vE0Fk1FrA/_state/state-0.st]]; nested: IOException[failed to read [id:0, legacy:false, file:/data1/es_data/lc_gh_3/nodes/0/indices/rDD73n7GR7uw_vE0Fk1FrA/_state/state-0.st]]; nested: CorruptStateException[org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file truncated?): actual footer=758728244 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(SimpleFSIndexInput(path="/data1/es_data/lc_gh_3/nodes/0/indices/rDD73n7GR7uw_vE0Fk1FrA/_state/state-0.st")))]; nested: CorruptIndexException[codec footer mismatch (file truncated?): actual footer=758728244 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(SimpleFSIndexInput(path="/data1/es_data/lc_gh_3/nodes/0/indices/rDD73n7GR7uw_vE0Fk1FrA/_state/state-0.st")))];
at org.elasticsearch.ExceptionsHelper.maybeThrowRuntimeAndSuppress(ExceptionsHelper.java:190)
at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestState(MetaDataStateFormat.java:334)
at org.elasticsearch.common.util.IndexFolderUpgrader.upgrade(IndexFolderUpgrader.java:90)
at org.elasticsearch.common.util.IndexFolderUpgrader.upgradeIndicesIfNeeded(IndexFolderUpgrader.java:128)
at org.elasticsearch.gateway.GatewayMetaState.<init>(GatewayMetaState.java:91)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.elasticsearch.common.inject.DefaultConstructionProxyFactory$1.newInstance(DefaultConstructionProxyFactory.java:49)
at org.elasticsearch.common.inject.ConstructorInjector.construct(ConstructorInjector.java:86)
at org.elasticsearch.common.inject.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:116)
at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter$1.call(ProviderToInternalFactoryAdapter.java:47)
at org.elasticsearch.common.inject.InjectorImpl.callInContext(InjectorImpl.java:825)
at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:43)
at org.elasticsearch.common.inject.Scopes$1$1.get(Scopes.java:59)
at org.elasticsearch.common.inject.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:50)
at org.elasticsearch.common.inject.SingleParameterInjector.inject(SingleParameterInjector.java:42)
at org.elasticsearch.common.inject.SingleParameterInjector.getAll(SingleParameterInjector.java:66)
at org.elasticsearch.common.inject.ConstructorInjector.construct(ConstructorInjector.java:85)
at org.elasticsearch.common.inject.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:116)
at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter$1.call(ProviderToInternalFactoryAdapter.java:47)
at org.elasticsearch.common.inject.InjectorImpl.callInContext(InjectorImpl.java:825)
at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:43)
at org.elasticsearch.common.inject.Scopes$1$1.get(Scopes.java:59)
at org.elasticsearch.common.inject.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:50)
at org.elasticsearch.common.inject.InjectorBuilder$1.call(InjectorBuilder.java:191)
at org.elasticsearch.common.inject.InjectorBuilder$1.call(InjectorBuilder.java:183)
at org.elasticsearch.common.inject.InjectorImpl.callInContext(InjectorImpl.java:818)
at org.elasticsearch.common.inject.InjectorBuilder.loadEagerSingletons(InjectorBuilder.java:183)
at org.elasticsearch.common.inject.InjectorBuilder.loadEagerSingletons(InjectorBuilder.java:173)
at org.elasticsearch.common.inject.InjectorBuilder.injectDynamically(InjectorBuilder.java:161)
at org.elasticsearch.common.inject.InjectorBuilder.build(InjectorBuilder.java:96)
at org.elasticsearch.common.inject.Guice.createInjector(Guice.java:96)
at org.elasticsearch.common.inject.Guice.createInjector(Guice.java:70)
at org.elasticsearch.common.inject.ModulesBuilder.createInjector(ModulesBuilder.java:42)
at org.elasticsearch.node.Node.<init>(Node.java:499)
at org.elasticsearch.node.Node.<init>(Node.java:245)
at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:233)
at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:233)
at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:342)
at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:132)
at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:123)
at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:70)
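For readers wondering what `codec footer mismatch (file truncated?)` is checking, the following Python sketch mimics the kind of footer validation that Lucene's `CodecUtil` performs, as far as I understand the format used by the Lucene versions of that era: a 16-byte trailer made of a big-endian magic int, an algorithm id of 0, and a CRC32 stored as a long. The layout is an assumption on my part; only the expected magic (`-1071082520`) is confirmed by the log above. The helper names (`check_footer`, `with_footer`, `to_signed32`) are made up for illustration.

```python
import struct
import zlib

# Assumption: footer magic is the bitwise complement of the codec header magic
# 0x3FD76C17. As a signed int this is -1071082520, matching the log's "expected footer".
FOOTER_MAGIC = ~0x3FD76C17 & 0xFFFFFFFF
FOOTER_LENGTH = 16  # magic (4 bytes) + algorithm id (4 bytes) + checksum (8 bytes)


def to_signed32(value):
    """Render an unsigned 32-bit value the way Java prints an int."""
    return value - (1 << 32) if value >= (1 << 31) else value


def check_footer(data):
    """Return None if the footer looks valid, otherwise a short reason string."""
    if len(data) < FOOTER_LENGTH:
        return "file too short to contain a footer"
    magic, algorithm, stored_crc = struct.unpack(">IIQ", data[-FOOTER_LENGTH:])
    if magic != FOOTER_MAGIC:
        return "codec footer mismatch: actual footer=%d vs expected footer=%d" % (
            to_signed32(magic), to_signed32(FOOTER_MAGIC))
    if algorithm != 0:
        return "unknown checksum algorithm id: %d" % algorithm
    # Assumed: the checksum covers every byte before the 8-byte checksum field.
    actual_crc = zlib.crc32(data[:-8]) & 0xFFFFFFFF
    if actual_crc != stored_crc:
        return "checksum mismatch"
    return None


def with_footer(payload):
    """Append a footer to payload (demo helper only, not a real state-file writer)."""
    body = payload + struct.pack(">II", FOOTER_MAGIC, 0)
    return body + struct.pack(">Q", zlib.crc32(body) & 0xFFFFFFFF)
```

With this sketch, `check_footer(with_footer(b"state"))` returns `None`, while cutting a few bytes off the end shifts unrelated bytes into the magic position and produces a footer mismatch. That is consistent with a file truncated or partially written when the machine lost power, although the log alone does not prove that cause.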