Failed data nodes due to Caused by: org.apache.lucene.store.AlreadyClosedException: translog is already closed #32526

Closed
TinLe opened this issue Aug 1, 2018 · 6 comments
Labels
:Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard.

Comments

TinLe commented Aug 1, 2018

Elasticsearch version: 6.2.3

Plugins installed: [x-pack-deprecation,x-pack-graph,x-pack-logstash,x-pack-ml,x-pack-monitoring,x-pack-security,x-pack-upgrade,x-pack-watcher]

JVM version :
openjdk version "1.8.0_171"
OpenJDK Runtime Environment (build 1.8.0_171-b10)
OpenJDK 64-Bit Server VM (build 25.171-b10, mixed mode)

OS version :
3.10.0-862.3.3.el7.x86_64 #1 SMP Fri Jun 15 04:15:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

We had 2 data nodes crash and restart today with the following in the log (see below).

I wonder if this is similar to #23099 and #25016

I do see a lot of these in the logs immediately preceding the crash, several days' worth since the last time the node was restarted.

[2018-07-31T17:22:21,640][ERROR][o.e.x.m.c.n.NodeStatsCollector] [dl1-dn00141-d2] collector [node_stats] timed out when collecting data
[2018-07-31T17:23:11,641][ERROR][o.e.x.m.c.n.NodeStatsCollector] [dl1-dn00141-d2] collector [node_stats] timed out when collecting data
[2018-07-31T17:24:01,642][ERROR][o.e.x.m.c.n.NodeStatsCollector] [dl1-dn00141-d2] collector [node_stats] timed out when collecting data
[2018-07-31T17:24:31,642][ERROR][o.e.x.m.c.n.NodeStatsCollector] [dl1-dn00141-d2] collector [node_stats] timed out when collecting data
[2018-07-31T17:25:21,643][ERROR][o.e.x.m.c.n.NodeStatsCollector] [dl1-dn00141-d2] collector [node_stats] timed out when collecting data

Steps to reproduce:

The cluster has been running normally, no changes.

55 nodes
780 indices
2976 primary shards
5642 replica shards

Provide logs (if relevant):

[2018-07-31T17:26:41,991][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [dl1-dn00141-d2] fatal error in thread [elasticsearch[dl1-dn00141-d2][bulk][T#12]], exiting
java.lang.AssertionError: Unexpected AlreadyClosedException
    at org.elasticsearch.index.engine.InternalEngine.failOnTragicEvent(InternalEngine.java:1786) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.engine.InternalEngine.maybeFailEngine(InternalEngine.java:1803) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:905) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:738) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:707) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnReplica(IndexShard.java:680) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.bulk.TransportShardBulkAction.performOpOnReplica(TransportShardBulkAction.java:518) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.bulk.TransportShardBulkAction.performOnReplica(TransportShardBulkAction.java:480) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:466) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:72) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.onResponse(TransportReplicationAction.java:567) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.onResponse(TransportReplicationAction.java:530) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.shard.IndexShard$2.onResponse(IndexShard.java:2315) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.shard.IndexShard$2.onResponse(IndexShard.java:2293) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:238) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.shard.IndexShard.acquireReplicaOperationPermit(IndexShard.java:2292) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.doRun(TransportReplicationAction.java:641) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:513) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:493) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1555) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.2.3.jar:6.2.3]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_171]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_171]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]
Caused by: org.apache.lucene.store.AlreadyClosedException: translog is already closed
    at org.elasticsearch.index.translog.Translog.ensureOpen(Translog.java:1667) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.translog.Translog.add(Translog.java:508) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:887) ~[elasticsearch-6.2.3.jar:6.2.3]
    ... 24 more

[2018-07-31T23:13:30,678][INFO ][o.e.n.Node ] [dl1-dn00141-d2] initializing ...
[2018-07-31T23:13:30,824][INFO ][o.e.e.NodeEnvironment ] [dl1-dn00141-d2] using [4] data paths, mounts [[/data8 (/dev/sdh), /data6 (/dev/sdf), /data7 (/dev/sdg), /data5 (/dev/sde)]], net usable_space [32.2tb], net total_space [36.3tb], types [xfs]
[2018-07-31T23:13:30,824][INFO ][o.e.e.NodeEnvironment ] [dl1-dn00141-d2] heap size [29gb], compressed ordinary object pointers [true]
[2018-07-31T23:13:34,154][INFO ][o.e.n.Node ] [dl1-dn00141-d2] node name [dl1-dn00141-d2], node ID [MDYk1hFyS02s1Jxq6NNa7g]
[2018-07-31T23:13:34,154][INFO ][o.e.n.Node ] [dl1-dn00141-d2] version[6.2.3], pid[27304], build[c59ff00/2018-03-13T10:06:29.741383Z], OS[Linux/3.10.0-862.3.3.el7.x86_64/amd64], JVM[Oracle Corporation/OpenJDK 64-Bit Server VM/1.8.0_171/25.171-b10]
[2018-07-31T23:13:34,154][INFO ][o.e.n.Node ] [dl1-dn00141-d2] JVM arguments [-Xms30g, -Xmx30g, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=65, -XX:+UseCMSInitiatingOccupancyOnly, -XX:ParallelGCThreads=10, -XX:ConcGCThreads=5, -XX:+DisableExplicitGC, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -Djdk.io.permissionsUseCanonicalPath=true, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Dlog4j.skipJansi=true, -XX:+HeapDumpOnOutOfMemoryError, -Xloggc:/data5/elasticsearch/logs/datanode2/gc.log, -XX:-UsePerfData, -XX:SurvivorRatio=4, -XX:NewSize=6g, -XX:MaxNewSize=6g, -XX:+UnlockDiagnosticVMOptions, -XX:ParGCCardsPerStrideChunk=32768, -XX:MaxTenuringThreshold=8, -XX:+PrintGCDetails, -XX:+PrintGCDateStamps, -XX:+PrintClassHistogram, -XX:+PrintTenuringDistribution, -XX:+PrintGCApplicationStoppedTime, -XX:+PrintPromotionFailure, -XX:PrintFLSStatistics=2, -XX:+UseGCLogFileRotation, -XX:NumberOfGCLogFiles=10, -XX:GCLogFileSize=512M, -Des.path.home=/usr/share/elasticsearch, -Des.path.conf=/etc/elasticsearch/datanode2]

ywelsch added the :Distributed Indexing/Engine label Aug 1, 2018
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed

ywelsch (Contributor) commented Aug 1, 2018

Thanks for reporting this. I'm not sure about the NodeStatsCollector errors, but the assertion error that tripped highlights an internal error.

@bleskes I've looked into how this would be possible, i.e., how we could end up with a translog that is closed while the engine is not closed yet and there is no tragic event on the index writer. I think one candidate that might violate this is rollGeneration (and also the new trimOperations, which is not in the above version yet).
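
To make that failure mode concrete, here is a minimal sketch (hypothetical class and helper names, not the actual Elasticsearch source) of how a translog can end up closed without a tragic exception being recorded: a maintenance operation such as rollGeneration closes the translog when it fails but never records why, so a concurrent indexing thread later hits the AlreadyClosedException and failOnTragicEvent finds no tragic cause.

```java
import java.io.Closeable;
import java.io.IOException;

import org.apache.lucene.store.AlreadyClosedException;

// Hypothetical simplification, not the real Translog class: a close-on-failure path
// that forgets to record the tragic exception.
class SimplifiedTranslog implements Closeable {

    private volatile boolean closed = false;
    private volatile Exception tragedy = null; // what failOnTragicEvent expects to find set

    void rollGeneration() throws IOException {
        try {
            createNewGeneration();
        } catch (IOException e) {
            // Pre-fix behaviour described above: close on failure, but never set `tragedy`.
            close();
            throw e;
        }
    }

    void add(Object operation) {
        if (closed) {
            // A concurrent bulk thread trips this; the engine then finds no tragic cause
            // on either the IndexWriter or the translog, and the assertion fails.
            throw new AlreadyClosedException("translog is already closed");
        }
        // ... append the operation to the current generation
    }

    private void createNewGeneration() throws IOException {
        // placeholder for creating the next generation file and syncing the checkpoint
        throw new IOException("simulated disk failure");
    }

    @Override
    public void close() {
        closed = true;
    }
}
```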

TinLe (Author) commented Aug 3, 2018

We have had two more nodes (different nodes than last time) die with the same fatal thread error.

Anything we can do? Do you need additional information, logs?

bleskes (Contributor) commented Aug 5, 2018

@TinLe this is pretty sneaky to fix and requires careful research (which we are doing). Thanks for offering to help. If there's anything we need, we'll surely reach out.

TinLe (Author) commented Aug 10, 2018

@bleskes Understood. We are seeing regular crashes of nodes here and there daily. Understandably, we are anxious to get this bug squashed :)

andrershov pushed a commit that referenced this issue Aug 21, 2018 (#32674)

We faced a nasty race condition. See #32526.
The InternalEngine.failOnTragicEvent method has thrown an AssertionError. If you look carefully at the if branches in this method, you will spot that this is only possible if either the Lucene IndexWriter or the Translog has been closed from the inside, but the tragedy exception is not set.
For now, let us concentrate on the Translog class.
We found out that there are two methods in Translog, namely rollGeneration and trimOperations, that close the Translog on exception without the tragedy exception being set.
This commit fixes these two methods. To do so, we pull tragedyException up from TranslogWriter into the Translog class, because in these two methods the IndexWriter could be innocent while the Translog still needs to be closed. Also, tragedyException is wrapped in a TragicExceptionHolder to reuse the CAS/addSuppressed functionality in Translog and TranslogWriter.
To protect us in the future and make sure the close method is never called from inside Translog, a special assertion examining the stack trace is added. Since we are still targeting Java 8 at runtime, the StackWalker API is not used in the implementation.
In the stack-trace-checking method, we consider as inner callers not only Translog methods but also methods of Translog subclasses. This does imply that Translog can be extended, but that is needed to be able to test this method.

Closes #32526

(cherry picked from commit 0749b18)
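
For readers following along, here is a minimal sketch of the holder pattern the commit message describes, assuming its shape roughly matches the real TragicExceptionHolder (the class name and details below are illustrative only): the first tragic exception wins via compare-and-set, and any later ones are attached via addSuppressed so no failure cause is lost.

```java
import java.util.concurrent.atomic.AtomicReference;

// Minimal sketch of the holder pattern described in the commit message (hypothetical
// name and details; the actual TragicExceptionHolder may differ): the first tragic
// exception wins via compare-and-set, later ones are attached as suppressed exceptions.
final class TragicExceptionHolderSketch {

    private final AtomicReference<Exception> tragedy = new AtomicReference<>();

    void setTragicException(Exception e) {
        if (!tragedy.compareAndSet(null, e)) {
            Exception existing = tragedy.get();
            if (existing != e) { // avoid self-suppression
                existing.addSuppressed(e);
            }
        }
    }

    Exception get() {
        return tragedy.get();
    }
}
```

Translog and TranslogWriter can then share one such holder, so whichever component fails first records the reason that the other later observes as a closed translog.
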
p0onage commented Jul 28, 2021

This happened in version 7.5.2, but it was due to a misconfiguration on the server: the ulimit on the Linux box was set too low.

https://www.elastic.co/guide/en/elasticsearch/reference/current/file-descriptors.html
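
As a quick sanity check, the limit that the JVM itself sees can be read via the JDK's UnixOperatingSystemMXBean; the snippet below is only an illustrative sketch, not an official tool, and the same numbers are also reported per node in the process section of the nodes stats API.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Quick sanity check of the file descriptor limit as seen by the running JVM.
public class FdLimitCheck {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unixOs =
                    (com.sun.management.UnixOperatingSystemMXBean) os;
            System.out.println("open file descriptors: " + unixOs.getOpenFileDescriptorCount());
            System.out.println("max file descriptors:  " + unixOs.getMaxFileDescriptorCount());
        } else {
            System.out.println("file descriptor counts are not available on this platform");
        }
    }
}
```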
