Failed data nodes due to Caused by: org.apache.lucene.store.AlreadyClosedException: translog is already closed #32526
Comments
Pinging @elastic/es-distributed
Thanks for reporting this. I'm not sure about the
@bleskes I've looked into how this would be possible, i.e., how we could end up with a translog that is closed while the engine is not closed yet and there is no tragic event on the index writer. I think one candidate which might violate this is
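To make the suspected invariant concrete, below is a minimal, self-contained sketch of the check that InternalEngine.failOnTragicEvent performs. The method name comes from the stack trace later in this issue; the TragicAware interface and the branch bodies are illustrative assumptions, not the actual Elasticsearch source.

```java
import org.apache.lucene.store.AlreadyClosedException;

// Hypothetical stand-in for the state that IndexWriter and Translog expose;
// not an Elasticsearch or Lucene interface.
interface TragicAware {
    boolean isOpen();
    Exception getTragicException();
}

final class EngineInvariantSketch {
    private final TragicAware indexWriter;
    private final TragicAware translog;

    EngineInvariantSketch(TragicAware indexWriter, TragicAware translog) {
        this.indexWriter = indexWriter;
        this.translog = translog;
    }

    // An AlreadyClosedException is only "expected" when one of the two
    // components was closed by a recorded tragic event. If something is
    // closed but no tragedy was recorded, the assertion trips, which is
    // exactly the AssertionError in the log in this issue.
    Exception failOnTragicEvent(AlreadyClosedException ex) {
        if (indexWriter.isOpen() == false && indexWriter.getTragicException() != null) {
            return ex; // IndexWriter closed itself after an unrecoverable error
        } else if (translog.isOpen() == false && translog.getTragicException() != null) {
            return ex; // Translog closed itself after an unrecoverable error
        }
        throw new AssertionError("Unexpected AlreadyClosedException", ex);
    }
}
```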
Two more nodes (different nodes than last time) have died with the same fatal thread error. Is there anything we can do? Do you need additional information or logs?
@TinLe this is pretty sneaky to fix and requires careful research (which we are doing). Thanks for offering to help. If there's anything we need, we'll surely reach out.
@bleskes Understood. We are seeing regular crashes of nodes here and there daily. Understandably, we are anxious to get this bug squashed :)
(#32674) We faced a nasty race condition; see #32526. The InternalEngine.failOnTragicEvent method threw an AssertionError. If you look carefully at the if branches in this method, you will spot that this is only possible if either the Lucene IndexWriter or the Translog was closed from the inside without the tragedy exception being set. For now, let us concentrate on the Translog class. We found out that there are two methods in Translog, namely rollGeneration and trimOperations, that close the Translog on an Exception without setting the tragedy exception. This commit fixes these two methods. To do so, we pull tragedyException up from TranslogWriter into the Translog class, because in these two methods the IndexWriter could be innocent while the Translog still needs to be closed. tragedyException is wrapped in a TragicExceptionHolder so that the CAS/addSuppressed functionality can be reused by both Translog and TranslogWriter. Also, to protect us in the future and make sure the close method is never called from inside Translog, a special assertion examining the stack trace is added. Since we are still targeting Java 8 at runtime, no StackWalker API is used in the implementation. In the stack-trace-checking method we treat not only Translog methods but also methods of Translog child classes as inner callers. This does not mean that Translog is meant to be extended; it is needed to be able to test this method. Closes #32526 (cherry picked from commit 0749b18)
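For readers following the commit message, here is a minimal sketch of the two mechanisms it describes. The class and method names (TragicExceptionHolder, closeOnTragicEvent) follow the message above, but the bodies are illustrative assumptions, not the actual Elasticsearch source.

```java
import java.util.concurrent.atomic.AtomicReference;

// First mechanism: a holder shared by Translog and TranslogWriter. The first
// tragic exception wins via CAS; later ones are attached as suppressed
// exceptions so that no failure is silently lost.
final class TragicExceptionHolder {
    private final AtomicReference<Exception> tragedy = new AtomicReference<>();

    void setTragicException(Exception ex) {
        if (tragedy.compareAndSet(null, ex) == false) {
            Exception existing = tragedy.get();
            if (existing != ex) {
                existing.addSuppressed(ex);
            }
        }
    }

    Exception get() {
        return tragedy.get();
    }
}

// Second mechanism: a Java 8-compatible assertion helper (no StackWalker)
// that checks close() is not being reached from inside Translog or one of
// its subclasses; internal failures must go through the tragedy path instead.
final class CloseCallerCheckSketch {
    static boolean calledFromOutsideOrViaTragedyClose() {
        boolean sawClose = false;
        for (StackTraceElement frame : Thread.currentThread().getStackTrace()) {
            if (sawClose) {
                // The frame right above close() is its direct caller: forbid
                // Translog (and subclass) methods other than the tragedy path.
                return frame.getClassName().contains("Translog") == false
                        || frame.getMethodName().equals("closeOnTragicEvent");
            }
            if (frame.getMethodName().equals("close")) {
                sawClose = true;
            }
        }
        return true; // no close() frame found; nothing to forbid
    }
}
```

With those two pieces in place, rollGeneration and trimOperations can record the failure in the holder and close the translog via the tragedy path instead of closing it silently.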
This happened in version 7.5.2, but it was due to a misconfiguration on the server: the ulimit on the Linux box was set too low. https://www.elastic.co/guide/en/elasticsearch/reference/current/file-descriptors.html
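For anyone hitting the same symptom for this reason: the linked page recommends a file descriptor limit of at least 65,535, and the effective limit can be checked via the `_nodes/stats/process` API. As an illustration, assuming Elasticsearch runs as the `elasticsearch` user (adjust paths and values for your setup):

```
# /etc/security/limits.conf (when starting Elasticsearch directly):
elasticsearch  -  nofile  65535

# Or under systemd, e.g. in
# /etc/systemd/system/elasticsearch.service.d/override.conf:
[Service]
LimitNOFILE=65535
```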
Elasticsearch version: 6.2.3
Plugins installed: [x-pack-deprecation,x-pack-graph,x-pack-logstash,x-pack-ml,x-pack-monitoring,x-pack-security,x-pack-upgrade,x-pack-watcher]
JVM version:
openjdk version "1.8.0_171"
OpenJDK Runtime Environment (build 1.8.0_171-b10)
OpenJDK 64-Bit Server VM (build 25.171-b10, mixed mode)
OS version:
3.10.0-862.3.3.el7.x86_64 #1 SMP Fri Jun 15 04:15:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
We have had 2 data nodes crash and restart today with the following in the log (see below).
I wonder if this is similar to #23099 and #25016
I do see a lot of these in the logs immediately preceding the crash, and several days' worth of them since the last time the node was restarted.
```
[2018-07-31T17:22:21,640][ERROR][o.e.x.m.c.n.NodeStatsCollector] [dl1-dn00141-d2] collector [node_stats] timed out when collecting data
[2018-07-31T17:23:11,641][ERROR][o.e.x.m.c.n.NodeStatsCollector] [dl1-dn00141-d2] collector [node_stats] timed out when collecting data
[2018-07-31T17:24:01,642][ERROR][o.e.x.m.c.n.NodeStatsCollector] [dl1-dn00141-d2] collector [node_stats] timed out when collecting data
[2018-07-31T17:24:31,642][ERROR][o.e.x.m.c.n.NodeStatsCollector] [dl1-dn00141-d2] collector [node_stats] timed out when collecting data
[2018-07-31T17:25:21,643][ERROR][o.e.x.m.c.n.NodeStatsCollector] [dl1-dn00141-d2] collector [node_stats] timed out when collecting data
```
Steps to reproduce:
The cluster has been running normally, no changes.
55 nodes
780 indices
2976 primary shards
5642 replica shards
Provide logs (if relevant):
```
[2018-07-31T17:26:41,991][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [dl1-dn00141-d2] fatal error in thread [elasticsearch[dl1-dn00141-d2][bulk][T#12]], exiting
java.lang.AssertionError: Unexpected AlreadyClosedException
    at org.elasticsearch.index.engine.InternalEngine.failOnTragicEvent(InternalEngine.java:1786) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.engine.InternalEngine.maybeFailEngine(InternalEngine.java:1803) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:905) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:738) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:707) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnReplica(IndexShard.java:680) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.bulk.TransportShardBulkAction.performOpOnReplica(TransportShardBulkAction.java:518) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.bulk.TransportShardBulkAction.performOnReplica(TransportShardBulkAction.java:480) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:466) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:72) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.onResponse(TransportReplicationAction.java:567) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.onResponse(TransportReplicationAction.java:530) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.shard.IndexShard$2.onResponse(IndexShard.java:2315) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.shard.IndexShard$2.onResponse(IndexShard.java:2293) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:238) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.shard.IndexShard.acquireReplicaOperationPermit(IndexShard.java:2292) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.doRun(TransportReplicationAction.java:641) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:513) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:493) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1555) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.2.3.jar:6.2.3]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_171]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_171]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]
Caused by: org.apache.lucene.store.AlreadyClosedException: translog is already closed
    at org.elasticsearch.index.translog.Translog.ensureOpen(Translog.java:1667) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.translog.Translog.add(Translog.java:508) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:887) ~[elasticsearch-6.2.3.jar:6.2.3]
    ... 24 more
[2018-07-31T23:13:30,678][INFO ][o.e.n.Node               ] [dl1-dn00141-d2] initializing ...
[2018-07-31T23:13:30,824][INFO ][o.e.e.NodeEnvironment    ] [dl1-dn00141-d2] using [4] data paths, mounts [[/data8 (/dev/sdh), /data6 (/dev/sdf), /data7 (/dev/sdg), /data5 (/dev/sde)]], net usable_space [32.2tb], net total_space [36.3tb], types [xfs]
[2018-07-31T23:13:30,824][INFO ][o.e.e.NodeEnvironment    ] [dl1-dn00141-d2] heap size [29gb], compressed ordinary object pointers [true]
[2018-07-31T23:13:34,154][INFO ][o.e.n.Node               ] [dl1-dn00141-d2] node name [dl1-dn00141-d2], node ID [MDYk1hFyS02s1Jxq6NNa7g]
[2018-07-31T23:13:34,154][INFO ][o.e.n.Node               ] [dl1-dn00141-d2] version[6.2.3], pid[27304], build[c59ff00/2018-03-13T10:06:29.741383Z], OS[Linux/3.10.0-862.3.3.el7.x86_64/amd64], JVM[Oracle Corporation/OpenJDK 64-Bit Server VM/1.8.0_171/25.171-b10]
[2018-07-31T23:13:34,154][INFO ][o.e.n.Node               ] [dl1-dn00141-d2] JVM arguments [-Xms30g, -Xmx30g, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=65, -XX:+UseCMSInitiatingOccupancyOnly, -XX:ParallelGCThreads=10, -XX:ConcGCThreads=5, -XX:+DisableExplicitGC, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -Djdk.io.permissionsUseCanonicalPath=true, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Dlog4j.skipJansi=true, -XX:+HeapDumpOnOutOfMemoryError, -Xloggc:/data5/elasticsearch/logs/datanode2/gc.log, -XX:-UsePerfData, -XX:SurvivorRatio=4, -XX:NewSize=6g, -XX:MaxNewSize=6g, -XX:+UnlockDiagnosticVMOptions, -XX:ParGCCardsPerStrideChunk=32768, -XX:MaxTenuringThreshold=8, -XX:+PrintGCDetails, -XX:+PrintGCDateStamps, -XX:+PrintClassHistogram, -XX:+PrintTenuringDistribution, -XX:+PrintGCApplicationStoppedTime, -XX:+PrintPromotionFailure, -XX:PrintFLSStatistics=2, -XX:+UseGCLogFileRotation, -XX:NumberOfGCLogFiles=10, -XX:GCLogFileSize=512M, -Des.path.home=/usr/share/elasticsearch, -Des.path.conf=/etc/elasticsearch/datanode2]
```