[CI] Test Failure in CloneSnapshotIT.testBackToBackClonesForIndexNotInCluster #64115

Closed
original-brownbear opened this issue Oct 25, 2020 · 4 comments · Fixed by #64116
Assignees
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. >test-failure Triaged test failures from CI

Comments

@original-brownbear
Member

original-brownbear commented Oct 25, 2020

This failed exactly once in 7.x but I can't really explain why and how (https://gradle-enterprise.elastic.co/s/jeugiua6ddqlc).

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.elasticsearch.snapshots.CloneSnapshotIT.testBackToBackClonesForIndexNotInCluster" -Dtests.seed=8166C04A329F955 -Dtests.security.manager=true -Dtests.locale=nl -Dtests.timezone=Canada/Pacific -Druntime.java=11

The test failed without any exception: it simply never made progress on one of the clone operations, leading to a suite timeout:

  1> [2020-10-23T21:24:03,703][INFO ][o.e.s.CloneSnapshotIT    ] [testBackToBackClonesForIndexNotInCluster] --> creating repository [test-repo] [mock]
  1> [2020-10-23T21:24:03,703][INFO ][o.e.p.PluginsService     ] [testBackToBackClonesForIndexNotInCluster] no modules loaded
  1> [2020-10-23T21:24:03,703][INFO ][o.e.p.PluginsService     ] [testBackToBackClonesForIndexNotInCluster] loaded plugin [org.elasticsearch.transport.nio.MockNioTransportPlugin]
  1> [2020-10-23T21:24:03,864][INFO ][o.e.s.m.MockRepository   ] [node_t0] starting mock repository with random prefix default
  1> [2020-10-23T21:24:03,866][INFO ][o.e.r.RepositoriesService] [node_t0] put repository [test-repo]
  1> [2020-10-23T21:24:03,889][INFO ][o.e.s.m.MockRepository   ] [node_t1] starting mock repository with random prefix default
  1> [2020-10-23T21:24:03,906][INFO ][o.e.s.m.MockRepository   ] [node_t0] starting mock repository with random prefix default
  1> [2020-10-23T21:24:03,980][INFO ][o.e.s.CloneSnapshotIT    ] [testBackToBackClonesForIndexNotInCluster] --> creating index [index-blocked]
  1> [2020-10-23T21:24:03,986][INFO ][o.e.c.m.MetadataCreateIndexService] [node_t0] [index-blocked] creating index, cause [api], templates [], shards [1]/[0]
  1> [2020-10-23T21:24:04,157][INFO ][o.e.c.r.a.AllocationService] [node_t0] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[index-blocked][0]]]).
  1> [2020-10-23T21:24:04,252][INFO ][o.e.c.m.MetadataMappingService] [node_t0] [index-blocked/7J6PbSNNQ9mjoNRHNyS7QA] create_mapping [_doc]
  1> [2020-10-23T21:24:04,367][INFO ][o.e.s.CloneSnapshotIT    ] [testBackToBackClonesForIndexNotInCluster] --> creating full snapshot [source-snapshot] in [test-repo]
  1> [2020-10-23T21:24:04,417][INFO ][o.e.s.SnapshotsService   ] [node_t0] snapshot [test-repo:source-snapshot/JPDXkiKjT_uG73g8ey_C5w] started
  1> [2020-10-23T21:24:04,679][INFO ][o.e.s.SnapshotsService   ] [node_t0] snapshot [test-repo:source-snapshot/JPDXkiKjT_uG73g8ey_C5w] completed with state [SUCCESS]
  1> [2020-10-23T21:24:04,681][INFO ][o.e.p.PluginsService     ] [testBackToBackClonesForIndexNotInCluster] no modules loaded
  1> [2020-10-23T21:24:04,681][INFO ][o.e.p.PluginsService     ] [testBackToBackClonesForIndexNotInCluster] loaded plugin [org.elasticsearch.transport.nio.MockNioTransportPlugin]
  1> [2020-10-23T21:24:04,826][INFO ][o.e.c.m.MetadataDeleteIndexService] [node_t0] [index-blocked/7J6PbSNNQ9mjoNRHNyS7QA] deleting index
  1> [2020-10-23T21:24:04,901][INFO ][o.e.s.CloneSnapshotIT    ] [testBackToBackClonesForIndexNotInCluster] --> waiting for [test-repo] to be blocked on node [node_t0]
  1> [2020-10-23T21:24:04,914][INFO ][o.e.s.SnapshotsService   ] [node_t0] snapshot clone [test-repo:target-snapshot/SJqFswO9Q_W68n9LQx42BA] started
  1> [2020-10-23T21:24:04,976][INFO ][o.e.s.m.MockRepository   ] [node_t0] [test-repo] blocking I/O operation for file [snap-SJqFswO9Q_W68n9LQx42BA.dat] at path [[indices][FYwjm9zlTIO5OcEDX_aOMA][0]]
  1> [2020-10-23T21:24:05,002][INFO ][o.e.s.CloneSnapshotIT    ] [testBackToBackClonesForIndexNotInCluster] --> wait for [3] snapshots to show up in the cluster state
  1> [2020-10-23T21:24:05,015][INFO ][o.e.s.SnapshotsService   ] [node_t0] snapshot clone [test-repo:target-snapshot-1/gFemqMWARE2xRT092DA9gg] started
  1> [2020-10-23T21:24:05,024][INFO ][o.e.s.CloneSnapshotIT    ] [testBackToBackClonesForIndexNotInCluster] --> wait for [3] snapshots to show up in the cluster state
  1> [2020-10-23T21:24:05,024][INFO ][o.e.s.SnapshotsService   ] [node_t0] snapshot clone [test-repo:target-snapshot-0/KGqbSozaSQyL17ul8iaAlA] started
  1> [2020-10-23T21:24:05,025][INFO ][o.e.s.CloneSnapshotIT    ] [testBackToBackClonesForIndexNotInCluster] --> unblocking [test-repo] on node [node_t0]
  1> [2020-10-23T21:24:05,134][INFO ][o.e.s.SnapshotsService   ] [node_t0] snapshot [test-repo:target-snapshot/SJqFswO9Q_W68n9LQx42BA] completed with state [SUCCESS]
  1> [2020-10-23T21:24:05,194][INFO ][o.e.s.SnapshotsService   ] [node_t0] snapshot [test-repo:target-snapshot-0/KGqbSozaSQyL17ul8iaAlA] completed with state [SUCCESS]
  2> okt 23, 2020 9:44:02 PM com.carrotsearch.randomizedtesting.ThreadLeakControl$2 evaluate
  2> WARNING: Suite execution timed out: org.elasticsearch.snapshots.CloneSnapshotIT
  2> ==== jstack at approximately timeout time ====
  2> "main" ID=1 WAITING on java.util.concurrent.CountDownLatch$Sync@393e85e7
  2> 	at [email protected]/jdk.internal.misc.Unsafe.park(Native Method)
  2> 	- waiting on java.util.concurrent.CountDownLatch$Sync@393e85e7
  2> 	at [email protected]/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
  2> 	at [email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885)
  2> 	at [email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1039)
  2> 	at [email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1345)
  2> 	at [email protected]/java.util.concurrent.CountDownLatch.await(CountDownLatch.java:232)
  2> 	at org.gradle.api.internal.tasks.testing.worker.TestWorker.execute(TestWorker.java:73)
  2> 	at org.gradle.api.internal.tasks.testing.worker.TestWorker.execute(TestWorker.java:47)
  2> 	at org.gradle.process.internal.worker.child.ActionExecutionWorker.execute(ActionExecutionWorker.java:56)
  2> 	at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:133)
  2> 	at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:71)
  2> 	at app//worker.org.gradle.process.internal.worker.GradleWorkerMain.run(GradleWorkerMain.java:69)
  2> 	at app//worker.org.gradle.process.internal.worker.GradleWorkerMain.main(GradleWorkerMain.java:74)

  2> "Reference Handler" ID=2 RUNNABLE
  2> 	at [email protected]/java.lang.ref.Reference.waitForReferencePendingList(Native Method)
  2> 	at [email protected]/java.lang.ref.Reference.processPendingReferences(Reference.java:241)
  2> 	at [email protected]/java.lang.ref.Reference$ReferenceHandler.run(Reference.java:213)

I'll try to reason about this a little more and add some logging to see if I can track it down.
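The hang in the jstack above boils down to a latch that is never counted down: the test thread waits for the clone to complete, and if the clone is silently dropped, the wait produces no exception, only a timeout. A minimal illustration of that failure mode (hypothetical sketch, not the actual test code):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Sketch of the observed failure mode: the waiting side sees no error,
// it just never unblocks. The latch stands in for clone completion.
public class StuckCloneDemo {
    public static void main(String[] args) throws InterruptedException {
        CountDownLatch cloneCompleted = new CountDownLatch(1);
        // No thread ever calls cloneCompleted.countDown(), standing in for
        // a clone operation that was dropped from the cluster state.
        boolean finished = cloneCompleted.await(200, TimeUnit.MILLISECONDS);
        System.out.println("clone finished: " + finished); // prints "clone finished: false"
    }
}
```

In the real test the wait has no short timeout, so the suite-level timeout is what eventually fires, exactly as in the jstack.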

@original-brownbear original-brownbear added :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI labels Oct 25, 2020
@original-brownbear original-brownbear self-assigned this Oct 25, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@elasticmachine elasticmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Oct 25, 2020
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Oct 25, 2020
We must not remove the snapshot from the initializing set
in the `timeout` getter. This was a plain oversight/mistake
and went unnoticed. It can lead to the removal of a valid
snapshot clone from the cluster state in rare circumstances
(e.g. when a node concurrently joins the cluster or a routing
change happens as it did in the linked test failure).

Closes elastic#64115
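The anti-pattern the commit message describes is a getter with a side effect: any caller that merely reads the timeout also mutates the initializing set. A rough illustration with hypothetical names (not the actual SnapshotsService code):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical reconstruction of the bug class: a getter that mutates
// state, so an unrelated cluster-state update (e.g. a node join or a
// routing change) that only reads the timeout drops a valid clone.
class CloneEntry {
    private final Set<String> initializing = new HashSet<>(Set.of("clone-1"));
    private final long timeoutMillis = 30_000L;

    // Buggy variant: reading the timeout removes the clone as a side effect.
    long buggyTimeout() {
        initializing.remove("clone-1");
        return timeoutMillis;
    }

    // Fixed variant: the getter is side-effect free.
    long timeout() {
        return timeoutMillis;
    }

    boolean isInitializing(String clone) {
        return initializing.contains(clone);
    }
}

public class GetterSideEffectDemo {
    public static void main(String[] args) {
        CloneEntry fixed = new CloneEntry();
        fixed.timeout();
        // prints "true": reading the timeout leaves the clone intact
        System.out.println(fixed.isInitializing("clone-1"));

        CloneEntry buggy = new CloneEntry();
        buggy.buggyTimeout();
        // prints "false": the clone was silently dropped by a mere read
        System.out.println(buggy.isInitializing("clone-1"));
    }
}
```

This is why the failure was so rare: it needed a concurrent event to hit the getter at the wrong moment.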
@original-brownbear
Member Author

Tracked it down in #64116

original-brownbear added a commit that referenced this issue Oct 26, 2020
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Oct 26, 2020
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Oct 26, 2020
original-brownbear added a commit that referenced this issue Oct 26, 2020
original-brownbear added a commit that referenced this issue Oct 26, 2020
@jimczi
Contributor

jimczi commented Dec 18, 2020

The failure happened again in 7.x so I am reopening:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+multijob+fast+part1/3372

Same symptom: the suite timed out waiting for threads to finish:

09:30:02   1> [2020-12-18T05:10:06,812][INFO ][o.e.s.CloneSnapshotIT    ] [testBackToBackClonesForIndexNotInCluster] --> wait for [3] snapshots to show up in the cluster state
09:30:02   1> [2020-12-18T05:10:06,814][INFO ][o.e.s.CloneSnapshotIT    ] [testBackToBackClonesForIndexNotInCluster] --> unblocking [test-repo] on node [node_t0]
09:30:02   1> [2020-12-18T05:10:07,006][INFO ][o.e.s.SnapshotsService   ] [node_t0] snapshot [test-repo:target-snapshot/nsste_b0QsOHLn_CE5qrOQ] completed with state [SUCCESS]
09:30:02   1> [2020-12-18T05:10:07,274][INFO ][o.e.s.SnapshotsService   ] [node_t0] snapshot [test-repo:target-snapshot-1/f2e19bB6Rm-JBM99Xig1Kg] completed with state [SUCCESS]
09:30:02   2> Des 18, 2020 5:29:54 AM com.carrotsearch.randomizedtesting.ThreadLeakControl$2 evaluate
09:30:02   2> WARNING: Suite execution timed out: org.elasticsearch.snapshots.CloneSnapshotIT
09:30:02   2> ==== jstack at approximately timeout time ====
09:30:02   2> "elasticsearch[node_t0][management][T#1]" ID=336 WAITING on org.elasticsearch.common.util.concurrent.EsExecutors$ExecutorScalingQueue@36ad022d
09:30:02   2> 	at sun.misc.Unsafe.park(Native Method)
09:30:02   2> 	- waiting on org.elasticsearch.common.util.concurrent.EsExecutors$ExecutorScalingQueue@36ad022d
09:30:02   2> 	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
09:30:02   2> 	at java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:737)
09:30:02   2> 	at java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:647)
09:30:02   2> 	at java.util.concurrent.LinkedTransferQueue.take(LinkedTransferQueue.java:1269)
09:30:02   2> 	at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
09:30:02   2> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
09:30:02   2> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
09:30:02   2> 	at java.lang.Thread.run(Thread.java:748)

@jimczi jimczi reopened this Dec 18, 2020
@original-brownbear
Member Author

Thanks for reopening Jim, this was/is caused by a missing backport (my fault). Merged now -> closing here.
