Unexpected job state [failed] while waiting for job to be opened #37545
Comments
Pinging @elastic/ml-core
Note: this might be related to #30300, since various tests resort to force-killing jobs when they are in certain states.
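For reference, the force-close those tests fall back to corresponds to the close-job API with `force=true`. A minimal sketch of that call (Python with `requests`, assuming a local 6.x cluster on `localhost:9200`; the actual cleanup code in the test framework may differ):

```python
import requests

ES = "http://localhost:9200"

# Force-close every anomaly detection job, similar to the cleanup fallback the
# tests use when a job is stuck. force=true bypasses the normal graceful
# shutdown of the autodetect process.
resp = requests.post(
    f"{ES}/_xpack/ml/anomaly_detectors/_all/_close",
    params={"force": "true"},
)
resp.raise_for_status()
print(resp.json())  # {"closed": true} on success
```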
https://github.com/elastic/elasticsearch/pull/32615/files

That PR addressed a bug we had a while back where the temp dir was being cleaned up by the OS; the error about the file not being created is similar. I would have thought that these tests would have that fix... I am reaching out to others who would know more.
This is because of the change to move the ML config out of the cluster state. Previously a job-not-found error was returned, but now it is difficult to check the job's existence in the transport action. The problem is described in #34747.
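For comparison, from a client's point of view a missing job is still detectable up front via the get-jobs API, which returns 404 for an unknown job id. A minimal sketch of that check (Python with `requests`, assuming a local 6.x cluster on `localhost:9200`); it only illustrates the REST-level behaviour, not the transport-action internals discussed above:

```python
import requests

ES = "http://localhost:9200"

def job_exists(job_id: str) -> bool:
    # The get-jobs API answers 404 (resource_not_found_exception) for an
    # unknown job id, i.e. the "job not found" signal that was easy to
    # produce when the config lived in the cluster state.
    resp = requests.get(f"{ES}/_xpack/ml/anomaly_detectors/{job_id}")
    if resp.status_code == 404:
        return False
    resp.raise_for_status()
    return True

print(job_exists("job-stats-test"))
```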
The error message is the same, but the cause of the problem is almost certainly different. A huge number of tests running in GCP failed yesterday due to the extreme slowness of the VMs. We expect the named pipes to open within 10 seconds of requesting that the autodetect process be started, and on these extremely slow VMs that evidently didn't happen. The problem that #32615 was working around was sub-directories of the temp directory being cleaned up by the OS, which is a different cause.
#34747 is describing the problem with error reporting, but it seems to me that there's a more fundamental issue here now that evaluation of the job configuration no longer happens against the cluster state.
It looks like the other ML tests in the list failed for the same reason as the test that's analyzed in detail. The security and watcher ones seem to have just taken that long for reasons unrelated to ML. When I saw the list I initially thought ML had left a hung task behind that the other suites were waiting on, but their logs just show, for example:
and:
Maybe those tests don't usually take that long, but I suspect they were slower in this build due to the unexplained general slowness in GCP yesterday.

Summary

The actual test failure reported in this issue is due to the unreasonable slowness of the VMs that caused a huge number of test failures yesterday. BUT the problem that occurred during the force-close triggered by the spurious failure is a problem that could affect real users and that we should try to address. It's not going to be very common, so it shouldn't be considered a blocker for 6.6.0, but it's clearly not right for an `_all` force-close to fail in the way it did here.
Thanks for the detailed analysis @droberts195! Makes sense re: the slowness... I was thinking this was a legit failure because the autodetect failure didn't look like a timeout on the surface, but in retrospect it does make sense that it was just a different manifestation of the timeout issues we're seeing elsewhere. Going to add this symptom to the infra ticket as well, since these named-pipe failures sound like a pretty extreme example of the issue (even more so than the other timeouts). Thanks!
I have raised a dedicated issue, #37959, for the underlying problem that this test failure revealed, because it's hard to quickly see the thing that needs to be fixed within this long issue.
Not entirely sure what's going on here, but I have a rough timeline based on the Gradle and server logs.
It seems that when `job_get_stats.yml` is setting up `job-stats-test`, something goes wrong with the autodetect process and the job is failed while opening. This has the knock-on effect of killing a bunch of tests because the job is stuck in the `failed` state: `Unexpected job state [failed] while waiting for job to be opened`.
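The failing step can be mimicked from outside the test framework. A rough sketch (Python with `requests`, assuming a local 6.x cluster on `localhost:9200` and that a job named `job-stats-test` has already been created); the open call waits server-side for the job to reach `opened`, which is the wait that produces the error quoted above:

```python
import requests

ES = "http://localhost:9200"
JOB = "job-stats-test"  # job id used by the YAML test

# Ask for the job to be opened. The call waits for the job to reach the
# "opened" state and returns an error if it ends up "failed" instead.
resp = requests.post(f"{ES}/_xpack/ml/anomaly_detectors/{JOB}/_open")
print(resp.status_code, resp.json())

# Independently check the job's current state via the stats API.
stats = requests.get(f"{ES}/_xpack/ml/anomaly_detectors/{JOB}/_stats").json()
print(stats["jobs"][0]["state"])  # "opened" normally, "failed" in this issue
```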
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.5+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=java11,nodes=virtual&&linux/166/
Appears to happen somewhat consistently over the last six months, although sporadically. Does not reproduce for me:
The full log is below, but the salient error is:
which causes other tests to either fail explicitly because they are waiting on that job to open, or because they notice there are unexpected persistent tasks:
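For context, the persistent tasks the cleanup complains about live in the cluster state metadata. A minimal sketch of listing them (Python with `requests`, assuming a local cluster on `localhost:9200`):

```python
import requests

ES = "http://localhost:9200"

# Open ML jobs are backed by persistent tasks stored in the cluster state
# metadata; anything left here after a test finishes is what the cleanup
# flags as an unexpected persistent task.
state = requests.get(f"{ES}/_cluster/state/metadata").json()
tasks = state.get("metadata", {}).get("persistent_tasks", {}).get("tasks", [])
for task in tasks:
    print(task.get("id"), task.get("allocation_id"))
```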
Or even more interesting, the cleanup code tries to kill the job via `_all`, but the job doesn't exist... which sounds like either the state is only partially cleaned up, or there is a mismatch between a local node's and the master's cluster state? Note this is for a different job (`job-post-data-job`), but it seems like a similar situation:

Full logs
And finally, a bunch of these tests start to stall right when the autodetect process throws those errors: