DatafeedJobsIT.testRealtime_multipleStopCalls failure on CI because task doesn't exist #45518

gwbrown · 2019-08-13T21:33:08Z

Hit this on CI in an SLM PR that doesn't touch anything related to ML or tasks. Per build stats, this test has failed 7 times in the last 60 days, but this is the only occurrence of this specific failure.

Build scan
Public Jenkins build

Reproduce line (does not reproduce locally):

./gradlew :x-pack:plugin:ml:qa:native-multi-node-tests:integTestRunner --tests "org.elasticsearch.xpack.ml.integration.DatafeedJobsIT.testRealtime_multipleStopCalls" -Dtests.seed=6D580ED560756B6A -Dtests.security.manager=true -Dtests.locale=es -Dtests.timezone=America/Indiana/Tell_City -Dcompiler.java=12 -Druntime.java=11

Stack trace:

com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=239, name=Thread-9, state=RUNNABLE, group=TGRP-DatafeedJobsIT]
Caused by: org.elasticsearch.ResourceNotFoundException: the task with id datafeed-realtime-job-multiple-stop-datafeed and allocation id 16 doesn't exist
at __randomizedtesting.SeedInfo.seed([6D580ED560756B6A]:0)
at org.elasticsearch.persistent.PersistentTasksClusterService$4.execute(PersistentTasksClusterService.java:234)
at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47)
at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:697)
at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:319)
at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:214)
at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151)
at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150)
at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:699)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.lang.Thread.run(Thread.java:834)

elasticmachine · 2019-08-13T21:33:10Z

Pinging @elastic/ml-core

Investigating the test failure reported in elastic#45518, it appears that the datafeed task was not found during a task state update. There are only two places where such an update is performed: when we set the state to `started` and when we set it to `stopping`. We handle `ResourceNotFoundException` in the latter but not in the former. Thus the test reveals a rare race condition where the datafeed is requested to stop before we have managed to update its state to `started`. I could not reproduce this scenario, but it is my best guess. This commit catches `ResourceNotFoundException` while updating the state to `started` and lets the task terminate smoothly. Closes elastic#45518
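
For readers unfamiliar with the race described in this commit message, here is a minimal, self-contained sketch of the idea. It is not the actual change: `TaskRegistry`, `TaskNotFoundException`, and `markStarted` are hypothetical stand-ins invented for illustration (the real code uses `ResourceNotFoundException` and the persistent tasks service), and the "stop wins the race" case is simulated by removing the task before the update runs.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative sketch only; none of these types are Elasticsearch classes. */
public class DatafeedStartRaceSketch {

    /** Hypothetical stand-in for org.elasticsearch.ResourceNotFoundException. */
    static class TaskNotFoundException extends RuntimeException {
        TaskNotFoundException(String message) {
            super(message);
        }
    }

    /** Hypothetical stand-in for the task bookkeeping kept in cluster state. */
    static class TaskRegistry {
        private final Map<String, String> states = new ConcurrentHashMap<>();

        void create(String taskId) {
            states.put(taskId, "starting");
        }

        void remove(String taskId) {
            states.remove(taskId);
        }

        void updateState(String taskId, String newState) {
            // computeIfPresent returns null when the task no longer exists.
            if (states.computeIfPresent(taskId, (id, old) -> newState) == null) {
                throw new TaskNotFoundException("the task with id " + taskId + " doesn't exist");
            }
        }

        String stateOf(String taskId) {
            return states.get(taskId);
        }
    }

    /**
     * Tries to move the task to "started". Returns false, instead of letting
     * the exception escape, when a stop request already removed the task.
     */
    static boolean markStarted(TaskRegistry registry, String taskId) {
        try {
            registry.updateState(taskId, "started");
            return true;
        } catch (TaskNotFoundException e) {
            // The datafeed was stopped before the update ran; terminate quietly.
            return false;
        }
    }

    public static void main(String[] args) {
        TaskRegistry registry = new TaskRegistry();

        registry.create("datafeed-a");
        System.out.println(markStarted(registry, "datafeed-a")); // normal path

        registry.create("datafeed-b");
        registry.remove("datafeed-b"); // simulated stop request winning the race
        System.out.println(markStarted(registry, "datafeed-b")); // handled, no uncaught exception
    }
}
```

Without the `catch` in `markStarted`, the second call would throw from whichever thread performs the update, which matches the uncaught `ResourceNotFoundException` in the stack trace above.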

gwbrown added the >test-failure (Triaged test failures from CI) and :ml (Machine learning) labels Aug 13, 2019

gwbrown mentioned this issue Aug 13, 2019

Record history of SLM retention actions #45513

Merged

dimitris-athanasiou self-assigned this Sep 6, 2019

dimitris-athanasiou mentioned this issue Sep 9, 2019

[ML] No error when datafeed stops during updating to started #46495

Merged

dimitris-athanasiou closed this as completed in #46495 Sep 10, 2019

dimitris-athanasiou mentioned this issue Sep 10, 2019

[7.x][ML] No error when datafeed stops during updating to started #46542

Merged
