MlDistributedFailureIT testJobRelocationIsMemoryAware failure #64171
Comments
Pinging @elastic/ml-core (:ml)
The log shows this:
This makes it look like the jobs were assigned by job count (they alternate between the two available nodes) rather than by memory. Finding the small jobs split across 2 nodes rather than 1 means the test is failing exactly the way it was designed to fail when the code doesn't work as expected. It also means that some of the time, when an ML node leaves the cluster, we are probably not balancing the jobs that need to be redistributed as well as we could.

This "expected:<1> but was:<2>" error was first seen in 7.7 on 1st May 2020 in build https://gradle-enterprise.elastic.co/s/5dmfgujbmccvq. (There have been many other failures of this test, but those were due to suite timeouts, failure to form a cluster, or other test infrastructure problems that are not relevant to this particular failure.)

So it is probably worth looking at changes to the node assignment code in the 6 weeks prior to 1st May to see if anything might have introduced a loophole into the logic.
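To make the difference between the two strategies concrete, here is a minimal sketch, not the actual Elasticsearch node selection code. The class `AssignmentSketch`, its methods, the job sizes, and the node capacities are all hypothetical and chosen only for illustration. It assumes a memory-aware selector picks the node with the most free memory that still fits the job, while a count-based selector picks the node running the fewest jobs.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only: a simplified model, not Elasticsearch's real selector.
public class AssignmentSketch {

    /** Count-based: each job goes to the node running the fewest jobs (ties go to the lowest index). */
    static int[] assignByCount(long[] jobMemoryMb, int nodeCount) {
        int[] jobsPerNode = new int[nodeCount];
        int[] assignment = new int[jobMemoryMb.length];
        for (int j = 0; j < jobMemoryMb.length; j++) {
            int best = 0;
            for (int n = 1; n < nodeCount; n++) {
                if (jobsPerNode[n] < jobsPerNode[best]) {
                    best = n;
                }
            }
            jobsPerNode[best]++;
            assignment[j] = best;
        }
        return assignment;
    }

    /** Memory-aware: each job goes to the node with the most free memory that can still fit it. */
    static int[] assignByMemory(long[] jobMemoryMb, long[] nodeFreeMemoryMb) {
        long[] free = nodeFreeMemoryMb.clone(); // do not mutate the caller's array
        int[] assignment = new int[jobMemoryMb.length];
        for (int j = 0; j < jobMemoryMb.length; j++) {
            int best = -1;
            for (int n = 0; n < free.length; n++) {
                if (free[n] >= jobMemoryMb[j] && (best == -1 || free[n] > free[best])) {
                    best = n;
                }
            }
            if (best == -1) {
                throw new IllegalStateException("no node has room for job " + j);
            }
            free[best] -= jobMemoryMb[j];
            assignment[j] = best;
        }
        return assignment;
    }

    /** Number of distinct nodes used by the jobs from index fromJob onwards. */
    static int distinctNodes(int[] assignment, int fromJob) {
        Set<Integer> nodes = new HashSet<>();
        for (int j = fromJob; j < assignment.length; j++) {
            nodes.add(assignment[j]);
        }
        return nodes.size();
    }

    public static void main(String[] args) {
        // One big job followed by four small ones (made-up sizes, in MB).
        long[] jobs = {900, 50, 50, 50, 50};
        int byCount = distinctNodes(assignByCount(jobs, 2), 1);
        int byMemory = distinctNodes(assignByMemory(jobs, new long[] {1000, 1000}), 1);
        System.out.println("small jobs spread over " + byCount + " node(s) by count");
        System.out.println("small jobs spread over " + byMemory + " node(s) by memory");
    }
}
```

With these made-up numbers the count-based strategy alternates between the two nodes and leaves the small jobs on 2 nodes, whereas the memory-aware strategy packs them all onto the node that does not hold the big job, which is the shape of the "expected:<1> but was:<2>" assertion.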
This test is still failing intermittently. A failure from just now is https://gradle-enterprise.elastic.co/s/aomqzdw7hle46
I just hit another failure: https://gradle-enterprise.elastic.co/s/wykwy5ai5rcty
The above build scan has other failures, which are tracked at #66885
#68685 is a duplicate of this, and there is less noise in that issue, so I will consolidate the knowledge from this issue into that one.
Build scan:
https://gradle-enterprise.elastic.co/s/jblhglq5uutpq
Repro line:
./gradlew ':x-pack:plugin:ml:internalClusterTest' --tests "org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.testJobRelocationIsMemoryAware" -Dtests.seed=B23A5FB64496149 -Dtests.security.manager=true -Dtests.locale=hr-HR -Dtests.timezone=America/Scoresbysund -Druntime.java=14
Reproduces locally?:
No
Applicable branches:
Master
Failure history:
Not many, 3 in the last 30 days
https://build-stats.elastic.co/app/kibana#/discover?_g=(refreshInterval:(pause:!t,value:0),time:(from:now-30d,mode:quick,to:now))&_a=(columns:!(build.branch,test.failed-testcases),index:b646ed00-7efc-11e8-bf69-63c8ef516157,interval:auto,query:(language:lucene,query:MlDistributedFailureIT),sort:!(process.time-start,desc))
Failure excerpt: