Failed data nodes due to Caused by: org.apache.lucene.store.AlreadyClosedException: translog is already closed #32526

Closed
TinLe opened this issue Aug 1, 2018 · 6 comments
Labels
:Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard.

Comments

TinLe commented Aug 1, 2018

Elasticsearch version: 6.2.3

Plugins installed: [x-pack-deprecation,x-pack-graph,x-pack-logstash,x-pack-ml,x-pack-monitoring,x-pack-security,x-pack-upgrade,x-pack-watcher]

JVM version :
openjdk version "1.8.0_171"
OpenJDK Runtime Environment (build 1.8.0_171-b10)
OpenJDK 64-Bit Server VM (build 25.171-b10, mixed mode)

OS version :
3.10.0-862.3.3.el7.x86_64 #1 SMP Fri Jun 15 04:15:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

We had 2 data nodes crash and restart today with the following in the log (see below).

I wonder if this is similar to #23099 and #25016

I do see a lot of these in the logs immediately preceding the crash, several days' worth since the last time the node was restarted.

[2018-07-31T17:22:21,640][ERROR][o.e.x.m.c.n.NodeStatsCollector] [dl1-dn00141-d2] collector [node_stats] timed out when collecting data
[2018-07-31T17:23:11,641][ERROR][o.e.x.m.c.n.NodeStatsCollector] [dl1-dn00141-d2] collector [node_stats] timed out when collecting data
[2018-07-31T17:24:01,642][ERROR][o.e.x.m.c.n.NodeStatsCollector] [dl1-dn00141-d2] collector [node_stats] timed out when collecting data
[2018-07-31T17:24:31,642][ERROR][o.e.x.m.c.n.NodeStatsCollector] [dl1-dn00141-d2] collector [node_stats] timed out when collecting data
[2018-07-31T17:25:21,643][ERROR][o.e.x.m.c.n.NodeStatsCollector] [dl1-dn00141-d2] collector [node_stats] timed out when collecting data

Steps to reproduce:

The cluster has been running normally, no changes.

55 nodes
780 indices
2976 primary shards
5642 replica shards

Provide logs (if relevant):

[2018-07-31T17:26:41,991][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [dl1-dn00141-d2] fatal error in thread [elasticsearch[dl1-dn00141-d2][bulk][T#12]], exiting
java.lang.AssertionError: Unexpected AlreadyClosedException
    at org.elasticsearch.index.engine.InternalEngine.failOnTragicEvent(InternalEngine.java:1786) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.engine.InternalEngine.maybeFailEngine(InternalEngine.java:1803) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:905) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:738) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:707) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnReplica(IndexShard.java:680) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.bulk.TransportShardBulkAction.performOpOnReplica(TransportShardBulkAction.java:518) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.bulk.TransportShardBulkAction.performOnReplica(TransportShardBulkAction.java:480) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:466) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:72) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.onResponse(TransportReplicationAction.java:567) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.onResponse(TransportReplicationAction.java:530) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.shard.IndexShard$2.onResponse(IndexShard.java:2315) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.shard.IndexShard$2.onResponse(IndexShard.java:2293) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:238) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.shard.IndexShard.acquireReplicaOperationPermit(IndexShard.java:2292) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.doRun(TransportReplicationAction.java:641) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:513) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:493) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1555) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.2.3.jar:6.2.3]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_171]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_171]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]
Caused by: org.apache.lucene.store.AlreadyClosedException: translog is already closed
    at org.elasticsearch.index.translog.Translog.ensureOpen(Translog.java:1667) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.translog.Translog.add(Translog.java:508) ~[elasticsearch-6.2.3.jar:6.2.3]
    at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:887) ~[elasticsearch-6.2.3.jar:6.2.3]
    ... 24 more

[2018-07-31T23:13:30,678][INFO ][o.e.n.Node ] [dl1-dn00141-d2] initializing ...
[2018-07-31T23:13:30,824][INFO ][o.e.e.NodeEnvironment ] [dl1-dn00141-d2] using [4] data paths, mounts [[/data8 (/dev/sdh), /data6 (/dev/sdf), /data7 (/dev/sdg), /data5 (/dev/sde)]], net usable_space [32.2tb], net total_space [36.3tb], types [xfs]
[2018-07-31T23:13:30,824][INFO ][o.e.e.NodeEnvironment ] [dl1-dn00141-d2] heap size [29gb], compressed ordinary object pointers [true]
[2018-07-31T23:13:34,154][INFO ][o.e.n.Node ] [dl1-dn00141-d2] node name [dl1-dn00141-d2], node ID [MDYk1hFyS02s1Jxq6NNa7g]
[2018-07-31T23:13:34,154][INFO ][o.e.n.Node ] [dl1-dn00141-d2] version[6.2.3], pid[27304], build[c59ff00/2018-03-13T10:06:29.741383Z], OS[Linux/3.10.0-862.3.3.el7.x86_64/amd64], JVM[Oracle Corporation/OpenJDK 64-Bit Server VM/1.8.0_171/25.171-b10]
[2018-07-31T23:13:34,154][INFO ][o.e.n.Node ] [dl1-dn00141-d2] JVM arguments [-Xms30g, -Xmx30g, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=65, -XX:+UseCMSInitiatingOccupancyOnly, -XX:ParallelGCThreads=10, -XX:ConcGCThreads=5, -XX:+DisableExplicitGC, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -Djdk.io.permissionsUseCanonicalPath=true, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Dlog4j.skipJansi=true, -XX:+HeapDumpOnOutOfMemoryError, -Xloggc:/data5/elasticsearch/logs/datanode2/gc.log, -XX:-UsePerfData, -XX:SurvivorRatio=4, -XX:NewSize=6g, -XX:MaxNewSize=6g, -XX:+UnlockDiagnosticVMOptions, -XX:ParGCCardsPerStrideChunk=32768, -XX:MaxTenuringThreshold=8, -XX:+PrintGCDetails, -XX:+PrintGCDateStamps, -XX:+PrintClassHistogram, -XX:+PrintTenuringDistribution, -XX:+PrintGCApplicationStoppedTime, -XX:+PrintPromotionFailure, -XX:PrintFLSStatistics=2, -XX:+UseGCLogFileRotation, -XX:NumberOfGCLogFiles=10, -XX:GCLogFileSize=512M, -Des.path.home=/usr/share/elasticsearch, -Des.path.conf=/etc/elasticsearch/datanode2]

ywelsch added the :Distributed Indexing/Engine label Aug 1, 2018
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed

ywelsch (Contributor) commented Aug 1, 2018

Thanks for reporting this. I'm not sure about the NodeStatsCollector errors, but the assertion error that tripped highlights an internal error.

@bleskes I've looked into how this would be possible, i.e., how we could end up with a translog that is closed while the engine is not closed yet and there is no tragic event on the index writer. I think one candidate that might violate this is rollGeneration (and also the new trimOperations, which is not in the above version yet).
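
To make that failure mode concrete, here is a minimal sketch (hypothetical class and helper names, not the actual Elasticsearch source) of how a translog can end up closed without a tragic exception being recorded: a maintenance operation such as rollGeneration closes the translog when it fails but never records why, so a concurrent indexing thread later hits the AlreadyClosedException and failOnTragicEvent finds no tragic cause.

```java
import java.io.Closeable;
import java.io.IOException;

import org.apache.lucene.store.AlreadyClosedException;

// Hypothetical simplification, not the real Translog class: a close-on-failure path
// that forgets to record the tragic exception.
class SimplifiedTranslog implements Closeable {

    private volatile boolean closed = false;
    private volatile Exception tragedy = null; // what failOnTragicEvent expects to find set

    void rollGeneration() throws IOException {
        try {
            createNewGeneration();
        } catch (IOException e) {
            // Pre-fix behaviour described above: close on failure, but never set `tragedy`.
            close();
            throw e;
        }
    }

    void add(Object operation) {
        if (closed) {
            // A concurrent bulk thread trips this; the engine then finds no tragic cause
            // on either the IndexWriter or the translog, and the assertion fails.
            throw new AlreadyClosedException("translog is already closed");
        }
        // ... append the operation to the current generation
    }

    private void createNewGeneration() throws IOException {
        // placeholder for creating the next generation file and syncing the checkpoint
        throw new IOException("simulated disk failure");
    }

    @Override
    public void close() {
        closed = true;
    }
}
```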

TinLe (Author) commented Aug 3, 2018

We have had two more nodes (different nodes than last time) die with the same fatal thread error.

Anything we can do? Do you need additional information, logs?

bleskes (Contributor) commented Aug 5, 2018

@TinLe this is pretty sneaky to fix and requires careful research (which we are doing). Thanks for offering to help. If there's anything we need, we'll surely reach out.

TinLe (Author) commented Aug 10, 2018

@bleskes Understood. We are seeing regular crashes of nodes here and there daily. Understandably, we are anxious to get this bug squashed :)

andrershov pushed a commit that referenced this issue Aug 21, 2018 (#32674)

We faced a nasty race condition. See #32526.
The InternalEngine.failOnTragicEvent method has thrown an AssertionError. If you look carefully at the if branches in this method, you will spot that this is only possible if either the Lucene IndexWriter or the Translog has been closed from the inside, but the tragedy exception is not set.
For now, let us concentrate on the Translog class.
We found out that there are two methods in Translog, namely rollGeneration and trimOperations, that close the Translog on exception without the tragedy exception being set.
This commit fixes these two methods. To do so, we pull tragedyException up from TranslogWriter into the Translog class, because in these two methods the IndexWriter could be innocent while the Translog still needs to be closed. Also, tragedyException is wrapped in a TragicExceptionHolder to reuse the CAS/addSuppressed functionality in Translog and TranslogWriter.
To protect us in the future and make sure the close method is never called from inside Translog, a special assertion examining the stack trace is added. Since we are still targeting Java 8 at runtime, the StackWalker API is not used in the implementation.
In the stack-trace-checking method, we consider as inner callers not only Translog methods but also methods of Translog subclasses. This does imply that Translog can be extended, but that is needed to be able to test this method.

Closes #32526

(cherry picked from commit 0749b18)
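
For readers following along, here is a minimal sketch of the holder pattern the commit message describes, assuming its shape roughly matches the real TragicExceptionHolder (the class name and details below are illustrative only): the first tragic exception wins via compare-and-set, and any later ones are attached via addSuppressed so no failure cause is lost.

```java
import java.util.concurrent.atomic.AtomicReference;

// Minimal sketch of the holder pattern described in the commit message (hypothetical
// name and details; the actual TragicExceptionHolder may differ): the first tragic
// exception wins via compare-and-set, later ones are attached as suppressed exceptions.
final class TragicExceptionHolderSketch {

    private final AtomicReference<Exception> tragedy = new AtomicReference<>();

    void setTragicException(Exception e) {
        if (!tragedy.compareAndSet(null, e)) {
            Exception existing = tragedy.get();
            if (existing != e) { // avoid self-suppression
                existing.addSuppressed(e);
            }
        }
    }

    Exception get() {
        return tragedy.get();
    }
}
```

Translog and TranslogWriter can then share one such holder, so whichever component fails first records the reason that the other later observes as a closed translog.
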
p0onage commented Jul 28, 2021

This happened in version 7.5.2, but it was due to a misconfiguration on the server: the ulimit on the Linux box was set too low.

https://www.elastic.co/guide/en/elasticsearch/reference/current/file-descriptors.html
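
As a quick sanity check, the limit that the JVM itself sees can be read via the JDK's UnixOperatingSystemMXBean; the snippet below is only an illustrative sketch, not an official tool, and the same numbers are also reported per node in the process section of the nodes stats API.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Quick sanity check of the file descriptor limit as seen by the running JVM.
public class FdLimitCheck {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unixOs =
                    (com.sun.management.UnixOperatingSystemMXBean) os;
            System.out.println("open file descriptors: " + unixOs.getOpenFileDescriptorCount());
            System.out.println("max file descriptors:  " + unixOs.getMaxFileDescriptorCount());
        } else {
            System.out.println("file descriptor counts are not available on this platform");
        }
    }
}
```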
