[CI] Test Failure in CloneSnapshotIT.testBackToBackClonesForIndexNotInCluster #64115

Closed
original-brownbear opened this issue Oct 25, 2020 · 4 comments · Fixed by #64116
Assignees
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. >test-failure Triaged test failures from CI

Comments

@original-brownbear
Member

original-brownbear commented Oct 25, 2020

This failed exactly once in 7.x but I can't really explain why and how (https://gradle-enterprise.elastic.co/s/jeugiua6ddqlc).

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.elasticsearch.snapshots.CloneSnapshotIT.testBackToBackClonesForIndexNotInCluster" -Dtests.seed=8166C04A329F955 -Dtests.security.manager=true -Dtests.locale=nl -Dtests.timezone=Canada/Pacific -Druntime.java=11

The test failed without any exception: it simply never made progress on one of the clone operations, leading to a suite timeout:

  1> [2020-10-23T21:24:03,703][INFO ][o.e.s.CloneSnapshotIT    ] [testBackToBackClonesForIndexNotInCluster] --> creating repository [test-repo] [mock]
  1> [2020-10-23T21:24:03,703][INFO ][o.e.p.PluginsService     ] [testBackToBackClonesForIndexNotInCluster] no modules loaded
  1> [2020-10-23T21:24:03,703][INFO ][o.e.p.PluginsService     ] [testBackToBackClonesForIndexNotInCluster] loaded plugin [org.elasticsearch.transport.nio.MockNioTransportPlugin]
  1> [2020-10-23T21:24:03,864][INFO ][o.e.s.m.MockRepository   ] [node_t0] starting mock repository with random prefix default
  1> [2020-10-23T21:24:03,866][INFO ][o.e.r.RepositoriesService] [node_t0] put repository [test-repo]
  1> [2020-10-23T21:24:03,889][INFO ][o.e.s.m.MockRepository   ] [node_t1] starting mock repository with random prefix default
  1> [2020-10-23T21:24:03,906][INFO ][o.e.s.m.MockRepository   ] [node_t0] starting mock repository with random prefix default
  1> [2020-10-23T21:24:03,980][INFO ][o.e.s.CloneSnapshotIT    ] [testBackToBackClonesForIndexNotInCluster] --> creating index [index-blocked]
  1> [2020-10-23T21:24:03,986][INFO ][o.e.c.m.MetadataCreateIndexService] [node_t0] [index-blocked] creating index, cause [api], templates [], shards [1]/[0]
  1> [2020-10-23T21:24:04,157][INFO ][o.e.c.r.a.AllocationService] [node_t0] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[index-blocked][0]]]).
  1> [2020-10-23T21:24:04,252][INFO ][o.e.c.m.MetadataMappingService] [node_t0] [index-blocked/7J6PbSNNQ9mjoNRHNyS7QA] create_mapping [_doc]
  1> [2020-10-23T21:24:04,367][INFO ][o.e.s.CloneSnapshotIT    ] [testBackToBackClonesForIndexNotInCluster] --> creating full snapshot [source-snapshot] in [test-repo]
  1> [2020-10-23T21:24:04,417][INFO ][o.e.s.SnapshotsService   ] [node_t0] snapshot [test-repo:source-snapshot/JPDXkiKjT_uG73g8ey_C5w] started
  1> [2020-10-23T21:24:04,679][INFO ][o.e.s.SnapshotsService   ] [node_t0] snapshot [test-repo:source-snapshot/JPDXkiKjT_uG73g8ey_C5w] completed with state [SUCCESS]
  1> [2020-10-23T21:24:04,681][INFO ][o.e.p.PluginsService     ] [testBackToBackClonesForIndexNotInCluster] no modules loaded
  1> [2020-10-23T21:24:04,681][INFO ][o.e.p.PluginsService     ] [testBackToBackClonesForIndexNotInCluster] loaded plugin [org.elasticsearch.transport.nio.MockNioTransportPlugin]
  1> [2020-10-23T21:24:04,826][INFO ][o.e.c.m.MetadataDeleteIndexService] [node_t0] [index-blocked/7J6PbSNNQ9mjoNRHNyS7QA] deleting index
  1> [2020-10-23T21:24:04,901][INFO ][o.e.s.CloneSnapshotIT    ] [testBackToBackClonesForIndexNotInCluster] --> waiting for [test-repo] to be blocked on node [node_t0]
  1> [2020-10-23T21:24:04,914][INFO ][o.e.s.SnapshotsService   ] [node_t0] snapshot clone [test-repo:target-snapshot/SJqFswO9Q_W68n9LQx42BA] started
  1> [2020-10-23T21:24:04,976][INFO ][o.e.s.m.MockRepository   ] [node_t0] [test-repo] blocking I/O operation for file [snap-SJqFswO9Q_W68n9LQx42BA.dat] at path [[indices][FYwjm9zlTIO5OcEDX_aOMA][0]]
  1> [2020-10-23T21:24:05,002][INFO ][o.e.s.CloneSnapshotIT    ] [testBackToBackClonesForIndexNotInCluster] --> wait for [3] snapshots to show up in the cluster state
  1> [2020-10-23T21:24:05,015][INFO ][o.e.s.SnapshotsService   ] [node_t0] snapshot clone [test-repo:target-snapshot-1/gFemqMWARE2xRT092DA9gg] started
  1> [2020-10-23T21:24:05,024][INFO ][o.e.s.CloneSnapshotIT    ] [testBackToBackClonesForIndexNotInCluster] --> wait for [3] snapshots to show up in the cluster state
  1> [2020-10-23T21:24:05,024][INFO ][o.e.s.SnapshotsService   ] [node_t0] snapshot clone [test-repo:target-snapshot-0/KGqbSozaSQyL17ul8iaAlA] started
  1> [2020-10-23T21:24:05,025][INFO ][o.e.s.CloneSnapshotIT    ] [testBackToBackClonesForIndexNotInCluster] --> unblocking [test-repo] on node [node_t0]
  1> [2020-10-23T21:24:05,134][INFO ][o.e.s.SnapshotsService   ] [node_t0] snapshot [test-repo:target-snapshot/SJqFswO9Q_W68n9LQx42BA] completed with state [SUCCESS]
  1> [2020-10-23T21:24:05,194][INFO ][o.e.s.SnapshotsService   ] [node_t0] snapshot [test-repo:target-snapshot-0/KGqbSozaSQyL17ul8iaAlA] completed with state [SUCCESS]
  2> okt 23, 2020 9:44:02 PM com.carrotsearch.randomizedtesting.ThreadLeakControl$2 evaluate
  2> WARNING: Suite execution timed out: org.elasticsearch.snapshots.CloneSnapshotIT
  2> ==== jstack at approximately timeout time ====
  2> "main" ID=1 WAITING on java.util.concurrent.CountDownLatch$Sync@393e85e7
  2> 	at [email protected]/jdk.internal.misc.Unsafe.park(Native Method)
  2> 	- waiting on java.util.concurrent.CountDownLatch$Sync@393e85e7
  2> 	at [email protected]/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
  2> 	at [email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885)
  2> 	at [email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1039)
  2> 	at [email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1345)
  2> 	at [email protected]/java.util.concurrent.CountDownLatch.await(CountDownLatch.java:232)
  2> 	at org.gradle.api.internal.tasks.testing.worker.TestWorker.execute(TestWorker.java:73)
  2> 	at org.gradle.api.internal.tasks.testing.worker.TestWorker.execute(TestWorker.java:47)
  2> 	at org.gradle.process.internal.worker.child.ActionExecutionWorker.execute(ActionExecutionWorker.java:56)
  2> 	at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:133)
  2> 	at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:71)
  2> 	at app//worker.org.gradle.process.internal.worker.GradleWorkerMain.run(GradleWorkerMain.java:69)
  2> 	at app//worker.org.gradle.process.internal.worker.GradleWorkerMain.main(GradleWorkerMain.java:74)

  2> "Reference Handler" ID=2 RUNNABLE
  2> 	at [email protected]/java.lang.ref.Reference.waitForReferencePendingList(Native Method)
  2> 	at [email protected]/java.lang.ref.Reference.processPendingReferences(Reference.java:241)
  2> 	at [email protected]/java.lang.ref.Reference$ReferenceHandler.run(Reference.java:213)

I'll try to reason about this a little more and add some logging to see if I can track it down.
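The hang in the jstack above boils down to a latch that is never counted down: the test thread waits for the clone to complete, and if the clone is silently dropped, the wait produces no exception, only a timeout. A minimal illustration of that failure mode (hypothetical sketch, not the actual test code):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Sketch of the observed failure mode: the waiting side sees no error,
// it just never unblocks. The latch stands in for clone completion.
public class StuckCloneDemo {
    public static void main(String[] args) throws InterruptedException {
        CountDownLatch cloneCompleted = new CountDownLatch(1);
        // No thread ever calls cloneCompleted.countDown(), standing in for
        // a clone operation that was dropped from the cluster state.
        boolean finished = cloneCompleted.await(200, TimeUnit.MILLISECONDS);
        System.out.println("clone finished: " + finished); // prints "clone finished: false"
    }
}
```

In the real test the wait has no short timeout, so the suite-level timeout is what eventually fires, exactly as in the jstack.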

@original-brownbear original-brownbear added :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI labels Oct 25, 2020
@original-brownbear original-brownbear self-assigned this Oct 25, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@elasticmachine elasticmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Oct 25, 2020
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Oct 25, 2020
We must not remove the snapshot from the initializing set
in the `timeout` getter. This was a plain oversight/mistake
and went unnoticed. It can lead to the removal of a valid
snapshot clone from the cluster state in rare circumstances
(e.g. when a node concurrently joins the cluster or a routing
change happens as it did in the linked test failure).

Closes elastic#64115
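The anti-pattern the commit message describes is a getter with a side effect: any caller that merely reads the timeout also mutates the initializing set. A rough illustration with hypothetical names (not the actual SnapshotsService code):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical reconstruction of the bug class: a getter that mutates
// state, so an unrelated cluster-state update (e.g. a node join or a
// routing change) that only reads the timeout drops a valid clone.
class CloneEntry {
    private final Set<String> initializing = new HashSet<>(Set.of("clone-1"));
    private final long timeoutMillis = 30_000L;

    // Buggy variant: reading the timeout removes the clone as a side effect.
    long buggyTimeout() {
        initializing.remove("clone-1");
        return timeoutMillis;
    }

    // Fixed variant: the getter is side-effect free.
    long timeout() {
        return timeoutMillis;
    }

    boolean isInitializing(String clone) {
        return initializing.contains(clone);
    }
}

public class GetterSideEffectDemo {
    public static void main(String[] args) {
        CloneEntry fixed = new CloneEntry();
        fixed.timeout();
        // prints "true": reading the timeout leaves the clone intact
        System.out.println(fixed.isInitializing("clone-1"));

        CloneEntry buggy = new CloneEntry();
        buggy.buggyTimeout();
        // prints "false": the clone was silently dropped by a mere read
        System.out.println(buggy.isInitializing("clone-1"));
    }
}
```

This is why the failure was so rare: it needed a concurrent event to hit the getter at the wrong moment.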
@original-brownbear
Member Author

Tracked it down in #64116

original-brownbear added a commit that referenced this issue Oct 26, 2020
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Oct 26, 2020
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Oct 26, 2020
original-brownbear added a commit that referenced this issue Oct 26, 2020
original-brownbear added a commit that referenced this issue Oct 26, 2020
@jimczi
Contributor

jimczi commented Dec 18, 2020

The failure happened again in 7.x so I am reopening:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+multijob+fast+part1/3372

Same symptom: the suite timed out waiting for threads to finish:

09:30:02   1> [2020-12-18T05:10:06,812][INFO ][o.e.s.CloneSnapshotIT    ] [testBackToBackClonesForIndexNotInCluster] --> wait for [3] snapshots to show up in the cluster state
09:30:02   1> [2020-12-18T05:10:06,814][INFO ][o.e.s.CloneSnapshotIT    ] [testBackToBackClonesForIndexNotInCluster] --> unblocking [test-repo] on node [node_t0]
09:30:02   1> [2020-12-18T05:10:07,006][INFO ][o.e.s.SnapshotsService   ] [node_t0] snapshot [test-repo:target-snapshot/nsste_b0QsOHLn_CE5qrOQ] completed with state [SUCCESS]
09:30:02   1> [2020-12-18T05:10:07,274][INFO ][o.e.s.SnapshotsService   ] [node_t0] snapshot [test-repo:target-snapshot-1/f2e19bB6Rm-JBM99Xig1Kg] completed with state [SUCCESS]
09:30:02   2> Des 18, 2020 5:29:54 AM com.carrotsearch.randomizedtesting.ThreadLeakControl$2 evaluate
09:30:02   2> WARNING: Suite execution timed out: org.elasticsearch.snapshots.CloneSnapshotIT
09:30:02   2> ==== jstack at approximately timeout time ====
09:30:02   2> "elasticsearch[node_t0][management][T#1]" ID=336 WAITING on org.elasticsearch.common.util.concurrent.EsExecutors$ExecutorScalingQueue@36ad022d
09:30:02   2> 	at sun.misc.Unsafe.park(Native Method)
09:30:02   2> 	- waiting on org.elasticsearch.common.util.concurrent.EsExecutors$ExecutorScalingQueue@36ad022d
09:30:02   2> 	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
09:30:02   2> 	at java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:737)
09:30:02   2> 	at java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:647)
09:30:02   2> 	at java.util.concurrent.LinkedTransferQueue.take(LinkedTransferQueue.java:1269)
09:30:02   2> 	at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
09:30:02   2> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
09:30:02   2> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
09:30:02   2> 	at java.lang.Thread.run(Thread.java:748)

@jimczi jimczi reopened this Dec 18, 2020
@original-brownbear
Member Author

Thanks for reopening Jim, this was/is caused by a missing backport (my fault). Merged now -> closing here.
