[CI] XPackRestIT test {p0=ml/set_upgrade_mode/Setting upgrade mode to disabled from enabled} failing #74141
Pinging @elastic/ml-core (Team:ML)
The 3 tests that have failed in https://gradle-enterprise.elastic.co/s/cu5xpsi2gbnkg all fail because of this error in the server-side logs:
This is happening because the job's C++ process is still running from a previous invocation, but the Java code has lost track of it. The root cause is this:
Since we are seeing this problem for the first time after adding a call to reset in the node assignment chain, it seems very likely that something needs changing there. Please can you have a look, @dimitris-athanasiou? I think this will affect users in 7.14 if we don't fix it before release.
I took some time looking into this. I believe the problem is caused by the fact that something triggered a feature reset. Before the job reset started, we can see in the logs:
This is what later kills the job process and, probably, causes the job to fail to connect to the process. What could be triggering a feature reset in the middle of the test? cc @benwtrent
It doesn't help that none of the 3 tests that failed ran at the time the root cause of the problem occurred. Looking at the full log of test start and stop times from the console log we see that the tests running around the time of the problem were:
Therefore, I think that a feature reset didn't run in the middle of a test; rather, a feature reset was cleaning up at the end of one of those tests. That test is a bit unusual in that it ends with upgrade mode enabled, so our end-of-test cleanup then has to work while upgrade mode is enabled.
Here's a bit more of the server side log (the bit that comes immediately before what's included in #74141 (comment)):
So actually that statement was wrong. After the
We discussed this more on Slack. The problem appears to be that feature reset kills the job while the job persistent task executor is doing job reset before starting the process. That kill request seems like it is reporting success - fair enough, as no process was running at the time. But then, after the job reset completes, the job persistent task executor moves on to start the C++ process. This then becomes an invisible ninja process that the Java side has no knowledge of, but which causes problems later on. If a kill request is received while the job persistent task executor is going through its open sequence but before the process is running, then it must not start the process. It needs to remember that the process was "killed" before it was even running and not start it in this situation.
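A rough sketch of that desired behaviour, using hypothetical class and method names rather than the actual Elasticsearch ML code: the task records a kill even when no native process exists yet, and the open sequence checks that flag before launching anything.

```java
// Hypothetical sketch, not the real Elasticsearch implementation: a task that remembers
// a kill which arrived before its native process was started, so the open sequence can
// abort instead of launching a process the Java side would not know about.
final class OpeningJobTaskSketch {

    private volatile boolean killed; // set by the kill-process request, even pre-start

    /** Called by the kill-process handler; there may be no native process yet. */
    void markKilled() {
        killed = true;
    }

    /**
     * Called by the open sequence when it is about to launch the native process.
     * A kill that arrived earlier aborts the launch.
     */
    void startProcessIfNotKilled(Runnable startNativeProcess) {
        if (killed) {
            return; // the process was "killed" before it ever ran; do not start it
        }
        startNativeProcess.run();
    }
}
```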
A couple more failures:
Muted on master in #74351
Although it was the first test in the suite to fail when this issue was first opened, I think
What happens is that the cleanup after |
This commit checks if the job has been requested to close after the reset action completes as part of allocating the job to a new node. This ensures we do not proceed to start the job process even though the job had been requested to close. Closes elastic#74141
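A minimal sketch of that check, assuming illustrative names (the real logic lives in the ML job persistent task executor): once the reset completes, re-check whether a close or kill has arrived before launching the native process.

```java
// Hypothetical sketch of the check described in the commit above; names are illustrative.
// After the job reset finishes, the open sequence re-checks whether a close or kill
// arrived in the meantime and, if so, abandons the open instead of starting the process.
import java.util.concurrent.CompletableFuture;

final class OpenSequenceSketch {

    interface JobTaskView {
        boolean isClosing(); // true once a close or kill request has been received
    }

    CompletableFuture<Void> openJob(JobTaskView jobTask,
                                    CompletableFuture<Void> resetJob,
                                    Runnable startNativeProcess) {
        return resetJob.thenRun(() -> {
            if (jobTask.isClosing()) {
                return; // a close/kill raced with the reset: do not start the process
            }
            startNativeProcess.run();
        });
    }
}
```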
While the job is opening it is possible that the kill process action is called. If the kill process action is received before the job process has started, we currently start the process anyway. The process will eventually time out trying to connect to anything and will exit. However, it may cause an unexpected failure if the job is opened again, as it won't be able to launch a process because one would already exist. This commit ensures that JobTask.isClosing() reports true when the kill process action has been called, in order to abort opening the process. Closes #74141
…#74441) Closes #74141. Backport of #74415
There's still a problem here. The failure in https://gradle-enterprise.elastic.co/s/eaysuxbqtxyp4 is an evolution of the original race condition following the fix. The server-side log shows this:
What has happened is that the test code of
So it appears that we violate the "it's not an error to close a job twice" rule when we go through the "Cannot close job [xxx] as it has already been closed or is closing" code path.
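A minimal sketch of the idempotence that rule implies, using hypothetical names rather than the real close-job action: a repeat close of a job that is already closing or closed should report success, not the error quoted above.

```java
// Hypothetical sketch: closing a job twice is treated as success, never as an error,
// regardless of whether the job is opened, already closing, or already closed.
final class CloseJobSketch {

    enum JobState { OPENED, CLOSING, CLOSED }

    private JobState state = JobState.OPENED;

    /** Returns once the job is closed; a repeat call while closing/closed is a no-op. */
    synchronized void close() {
        switch (state) {
            case CLOSED:
            case CLOSING:
                return; // closing twice is not an error
            case OPENED:
            default:
                state = JobState.CLOSING;
                // ... shut down the native process here ...
                state = JobState.CLOSED;
        }
    }
}
```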
There was a point during the job opening sequence where performing a feature reset could hang. This happened when the kill request issued by feature reset was executed after the job's persistent task was assigned but before the job's native process was started. The persistent task was incorrectly left running in this situation, yet the job opening sequence was aborted, which meant the subsequent close request issued by feature reset would wait for a very long time for the persistent task to disappear. The fix is to make the kill process request cancel the persistent task consistently, based on its request parameters and not on the current state of the task. Fixes elastic#74141
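The essence of that fix, sketched with hypothetical names (not the actual Elasticsearch transport action): whether the persistent task gets cancelled is decided by the kill request's own parameters, so a kill that lands before the native process starts still removes the task instead of leaving it for a later close request to wait on.

```java
// Hypothetical sketch of the fix described above; the names are illustrative only.
// Cancellation of the persistent task depends on the request, not on whether a native
// process happened to be running when the kill arrived.
final class KillProcessSketch {

    static final class KillRequest {
        final boolean removeTask; // hypothetical parameter controlling cancellation
        KillRequest(boolean removeTask) {
            this.removeTask = removeTask;
        }
    }

    interface PersistentTaskHandle {
        boolean hasRunningProcess();
        void killProcess();
        void cancelTask();
    }

    void handle(KillRequest request, PersistentTaskHandle task) {
        if (task.hasRunningProcess()) {
            task.killProcess(); // only possible if a process actually exists
        }
        if (request.removeTask) {
            // Cancel based on the request parameter alone, not on the task's current state.
            task.cancelTask();
        }
    }
}
```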
I am pretty sure this is a side effect of #73908. Resetting jobs on assignment to a new node doesn't seem to interact reliably with ML upgrade mode.
Build scan:
https://gradle-enterprise.elastic.co/s/cu5xpsi2gbnkg/tests/:x-pack:plugin:yamlRestTest/org.elasticsearch.xpack.test.rest.XPackRestIT/test%20%7Bp0=ml%2Fset_upgrade_mode%2FSetting%20upgrade%20mode%20to%20disabled%20from%20enabled%7D
Reproduction line:
gradlew ':x-pack:plugin:yamlRestTest' --tests "org.elasticsearch.xpack.test.rest.XPackRestIT.test {p0=ml/set_upgrade_mode/Setting upgrade mode to disabled from enabled}" -Dtests.seed=1E2DD794B9477D7B -Dtests.locale=th-TH -Dtests.timezone=America/St_Barthelemy -Druntime.java=11
Applicable branches:
master, 7.x
Reproduces locally?:
No
Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.xpack.test.rest.XPackRestIT&tests.test=test%20%7Bp0%3Dml/set_upgrade_mode/Setting%20upgrade%20mode%20to%20disabled%20from%20enabled%7D
Failure excerpt: