DatafeedJobsIT.testRealtime_multipleStopCalls failure on CI because task doesn't exist #45518

Closed
gwbrown opened this issue Aug 13, 2019 · 1 comment · Fixed by #46495
Labels
:ml Machine learning >test-failure Triaged test failures from CI

Comments

@gwbrown
Contributor

gwbrown commented Aug 13, 2019

Hit this on CI in an SLM PR that doesn't touch anything related to ML or tasks. Per build stats, this test has failed 7 times in the last 60 days, but this is the only time it failed with this specific error.

Build scan
Public Jenkins build

Reproduce line (does not reproduce locally):

./gradlew :x-pack:plugin:ml:qa:native-multi-node-tests:integTestRunner --tests "org.elasticsearch.xpack.ml.integration.DatafeedJobsIT.testRealtime_multipleStopCalls" -Dtests.seed=6D580ED560756B6A -Dtests.security.manager=true -Dtests.locale=es -Dtests.timezone=America/Indiana/Tell_City -Dcompiler.java=12 -Druntime.java=11

Stack trace:

com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=239, name=Thread-9, state=RUNNABLE, group=TGRP-DatafeedJobsIT]
Caused by: org.elasticsearch.ResourceNotFoundException: the task with id datafeed-realtime-job-multiple-stop-datafeed and allocation id 16 doesn't exist
at __randomizedtesting.SeedInfo.seed([6D580ED560756B6A]:0)
at org.elasticsearch.persistent.PersistentTasksClusterService$4.execute(PersistentTasksClusterService.java:234)
at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47)
at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:697)
at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:319)
at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:214)
at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151)
at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150)
at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:699)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.lang.Thread.run(Thread.java:834)
@gwbrown gwbrown added >test-failure Triaged test failures from CI :ml Machine learning labels Aug 13, 2019
@elasticmachine
Collaborator

Pinging @elastic/ml-core

@dimitris-athanasiou dimitris-athanasiou self-assigned this Sep 6, 2019
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this issue Sep 9, 2019
Investigating the test failure reported in elastic#45518 it appears that
the datafeed task was not found during a task state update. There
are only two places where such an update is performed: when we set
the state to `started` and when we set it to `stopping`. We handle
`ResourceNotFoundException` in the latter but not in the former.

Thus the test reveals a rare race condition where the datafeed gets
requested to stop before we managed to update its state to `started`.
I could not reproduce this scenario but it would be my best guess.

This commit catches `ResourceNotFoundException` while updating the
state to `started` and lets the task terminate smoothly.

Closes elastic#45518
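
The race described in the commit message above can be illustrated with a small standalone sketch. This is not the Elasticsearch code: `TaskRegistry`, `TaskNotFoundException`, and `markStarted` are hypothetical stand-ins for the persistent task state and `ResourceNotFoundException`, and the sketch assumes that a stop request removes the task before the `started` update arrives.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class DatafeedStartSketch {

    /** Hypothetical stand-in for org.elasticsearch.ResourceNotFoundException. */
    static class TaskNotFoundException extends RuntimeException {
        TaskNotFoundException(String message) {
            super(message);
        }
    }

    /** Toy registry standing in for the persistent tasks tracked by the cluster (assumption). */
    static class TaskRegistry {
        private final Map<String, String> states = new ConcurrentHashMap<>();

        void create(String taskId) {
            states.put(taskId, "starting");
        }

        void remove(String taskId) {
            states.remove(taskId);
        }

        void updateState(String taskId, String newState) {
            // computeIfPresent returns null when the task is no longer registered.
            if (states.computeIfPresent(taskId, (id, old) -> newState) == null) {
                throw new TaskNotFoundException("the task with id " + taskId + " doesn't exist");
            }
        }

        String stateOf(String taskId) {
            return states.get(taskId);
        }
    }

    /**
     * Tries to mark the task as started. Returns false instead of letting the
     * exception escape when the task has already been removed by a stop request.
     */
    static boolean markStarted(TaskRegistry registry, String taskId) {
        try {
            registry.updateState(taskId, "started");
            return true;
        } catch (TaskNotFoundException e) {
            // The stop won the race: nothing is left to start, so terminate quietly.
            return false;
        }
    }

    public static void main(String[] args) {
        TaskRegistry registry = new TaskRegistry();

        registry.create("datafeed-a");
        System.out.println(markStarted(registry, "datafeed-a")); // normal path

        registry.create("datafeed-b");
        registry.remove("datafeed-b"); // stop request arrives first
        System.out.println(markStarted(registry, "datafeed-b")); // race path
    }
}
```

Without the `catch`, the second call would throw from whichever thread performs the update, which is roughly the shape of the uncaught exception in the stack trace above.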
dimitris-athanasiou added a commit that referenced this issue Sep 10, 2019
Investigating the test failure reported in #45518 it appears that
the datafeed task was not found during a task state update. There
are only two places where such an update is performed: when we set
the state to `started` and when we set it to `stopping`. We handle
`ResourceNotFoundException` in the latter but not in the former.

Thus the test reveals a rare race condition where the datafeed gets
requested to stop before we managed to update its state to `started`.
I could not reproduce this scenario but it would be my best guess.

This commit catches `ResourceNotFoundException` while updating the
state to `started` and lets the task terminate smoothly.

Closes #45518
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this issue Sep 10, 2019
…astic#46495)

Investigating the test failure reported in elastic#45518 it appears that
the datafeed task was not found during a task state update. There
are only two places where such an update is performed: when we set
the state to `started` and when we set it to `stopping`. We handle
`ResourceNotFoundException` in the latter but not in the former.

Thus the test reveals a rare race condition where the datafeed gets
requested to stop before we managed to update its state to `started`.
I could not reproduce this scenario but it would be my best guess.

This commit catches `ResourceNotFoundException` while updating the
state to `started` and lets the task terminate smoothly.

Closes elastic#45518

Backport of elastic#46495
dimitris-athanasiou added a commit that referenced this issue Sep 11, 2019
…6495) (#46542)

Investigating the test failure reported in #45518 it appears that
the datafeed task was not found during a task state update. There
are only two places where such an update is performed: when we set
the state to `started` and when we set it to `stopping`. We handle
`ResourceNotFoundException` in the latter but not in the former.

Thus the test reveals a rare race condition where the datafeed gets
requested to stop before we managed to update its state to `started`.
I could not reproduce this scenario but it would be my best guess.

This commit catches `ResourceNotFoundException` while updating the
state to `started` and lets the task terminate smoothly.

Closes #45518

Backport of #46495