
MlDistributedFailureIT testJobRelocationIsMemoryAware failure #64171

Closed
davidkyle opened this issue Oct 26, 2020 · 7 comments
Labels
:ml Machine learning >test-failure Triaged test failures from CI

Comments

@davidkyle
Member

Build scan:
https://gradle-enterprise.elastic.co/s/jblhglq5uutpq

Repro line:
./gradlew ':x-pack:plugin:ml:internalClusterTest' --tests "org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.testJobRelocationIsMemoryAware"
-Dtests.seed=B23A5FB64496149
-Dtests.security.manager=true
-Dtests.locale=hr-HR
-Dtests.timezone=America/Scoresbysund
-Druntime.java=14

Reproduces locally?:
No

Applicable branches:
Master

Failure history:
Not many, 3 in the last 30 days

https://build-stats.elastic.co/app/kibana#/discover?_g=(refreshInterval:(pause:!t,value:0),time:(from:now-30d,mode:quick,to:now))&_a=(columns:!(build.branch,test.failed-testcases),index:b646ed00-7efc-11e8-bf69-63c8ef516157,interval:auto,query:(language:lucene,query:MlDistributedFailureIT),sort:!(process.time-start,desc))

Failure excerpt:

java.lang.AssertionError: expected:<1> but was:<2>
	at __randomizedtesting.SeedInfo.seed([B23A5FB64496149:98664C72A06272E7]:0)
	at org.junit.Assert.fail(Assert.java:88)
	at org.junit.Assert.failNotEquals(Assert.java:834)
	at org.junit.Assert.assertEquals(Assert.java:645)
	at org.junit.Assert.assertEquals(Assert.java:631)
	at org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.lambda$testJobRelocationIsMemoryAware$14(MlDistributedFailureIT.java:413)
	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:939)
	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:912)
	at org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.testJobRelocationIsMemoryAware(MlDistributedFailureIT.java:400)
davidkyle added the >test-failure Triaged test failures from CI and :ml Machine learning labels on Oct 26, 2020
@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

@droberts195
Contributor

The log shows this:

small job nodes: [node_t2, node_t1, node_t2, node_t1], big job nodes: [node_t2] 

This makes it look like the jobs got assigned using job counts (since they alternate between the two available nodes) rather than memory. Because the small jobs ended up split across 2 nodes rather than 1, the test is failing in exactly the way it was designed to fail when the code does not work as expected.
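
For illustration only (this is not the actual test source), the failing assertion at MlDistributedFailureIT.java:413 presumably amounts to something like the sketch below, where smallJobIds and nodeRunningJob are hypothetical names; it relies on java.util.HashSet/Set and org.junit.Assert.assertEquals.

	// After one ML node is stopped, collect the nodes the small jobs were
	// reassigned to. Memory-aware allocation should pack all the small jobs
	// onto a single node, leaving the other node free for the big job.
	Set<String> smallJobNodes = new HashSet<>();
	for (String jobId : smallJobIds) {            // hypothetical list of small job IDs
	    smallJobNodes.add(nodeRunningJob(jobId)); // hypothetical helper returning the executor node name
	}
	assertEquals(1, smallJobNodes.size());        // the failing check: expected <1> but was <2>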

This suggests that, some of the time when an ML node leaves the cluster, we are probably not balancing the jobs that need to be redistributed as well as we could.

This "expected:<1> but was:<2>" error was first seen in 7.7 on 1st May 2020 in build https://gradle-enterprise.elastic.co/s/5dmfgujbmccvq. (There have been many other failures of this test but due to suite timeouts or failure to form a cluster or other test infrastructure problems that are not interesting as far as this particular failure is concerned.) So it is probably worth looking at changes to the node assignment code in the 6 weeks prior to 1st May to see if anything was changed that might have introduced a loophole into the logic.

@droberts195
Contributor

This test is still failing intermittently. A failure from just now is https://gradle-enterprise.elastic.co/s/aomqzdw7hle46

@jaymode
Member

jaymode commented Dec 14, 2020

I just hit another failure: https://gradle-enterprise.elastic.co/s/wykwy5ai5rcty

@ywangd
Member

ywangd commented Jan 5, 2021

@droberts195
Contributor

Like #66885, the most recent failure will have #66629 as its root cause: if we cannot determine the amount of memory on a machine then several ML tests will fail. The earlier intermittent failures are caused by something else, maybe a race condition in the ML code.

@droberts195
Contributor

#68685 is a duplicate of this, and there is less noise in that issue, so I will consolidate the knowledge from this issue into that one.
