
MlDistributedFailureIT testJobRelocationIsMemoryAware failure #64171

Closed
davidkyle opened this issue Oct 26, 2020 · 7 comments
Labels
:ml Machine learning >test-failure Triaged test failures from CI

Comments

@davidkyle
Member

Build scan:
https://gradle-enterprise.elastic.co/s/jblhglq5uutpq

Repro line:
./gradlew ':x-pack:plugin:ml:internalClusterTest' --tests "org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.testJobRelocationIsMemoryAware"
-Dtests.seed=B23A5FB64496149
-Dtests.security.manager=true
-Dtests.locale=hr-HR
-Dtests.timezone=America/Scoresbysund
-Druntime.java=14

Reproduces locally?:
No

Applicable branches:
Master

Failure history:
Not many, 3 in the last 30 days

https://build-stats.elastic.co/app/kibana#/discover?_g=(refreshInterval:(pause:!t,value:0),time:(from:now-30d,mode:quick,to:now))&_a=(columns:!(build.branch,test.failed-testcases),index:b646ed00-7efc-11e8-bf69-63c8ef516157,interval:auto,query:(language:lucene,query:MlDistributedFailureIT),sort:!(process.time-start,desc))

Failure excerpt:

java.lang.AssertionError: expected:<1> but was:<2>
	at __randomizedtesting.SeedInfo.seed([B23A5FB64496149:98664C72A06272E7]:0)
	at org.junit.Assert.fail(Assert.java:88)
	at org.junit.Assert.failNotEquals(Assert.java:834)
	at org.junit.Assert.assertEquals(Assert.java:645)
	at org.junit.Assert.assertEquals(Assert.java:631)
	at org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.lambda$testJobRelocationIsMemoryAware$14(MlDistributedFailureIT.java:413)
	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:939)
	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:912)
	at org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.testJobRelocationIsMemoryAware(MlDistributedFailureIT.java:400)
davidkyle added the >test-failure Triaged test failures from CI and :ml Machine learning labels on Oct 26, 2020
@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

@droberts195
Contributor

The log shows this:

small job nodes: [node_t2, node_t1, node_t2, node_t1], big job nodes: [node_t2] 

This makes it look like the jobs got assigned using job counts (since they alternate between the two available nodes) rather than memory. Because the small jobs ended up split across 2 nodes rather than 1, the test is failing in exactly the way it was designed to fail when the code does not work as expected.
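
For illustration only (this is not the actual test source), the failing assertion at MlDistributedFailureIT.java:413 presumably amounts to something like the sketch below, where smallJobIds and nodeRunningJob are hypothetical names; it relies on java.util.HashSet/Set and org.junit.Assert.assertEquals.

	// After one ML node is stopped, collect the nodes the small jobs were
	// reassigned to. Memory-aware allocation should pack all the small jobs
	// onto a single node, leaving the other node free for the big job.
	Set<String> smallJobNodes = new HashSet<>();
	for (String jobId : smallJobIds) {            // hypothetical list of small job IDs
	    smallJobNodes.add(nodeRunningJob(jobId)); // hypothetical helper returning the executor node name
	}
	assertEquals(1, smallJobNodes.size());        // the failing check: expected <1> but was <2>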

This suggests that, some of the time when an ML node leaves the cluster, we are probably not balancing the jobs that need to be redistributed as well as we could.

This "expected:<1> but was:<2>" error was first seen in 7.7 on 1st May 2020 in build https://gradle-enterprise.elastic.co/s/5dmfgujbmccvq. (There have been many other failures of this test but due to suite timeouts or failure to form a cluster or other test infrastructure problems that are not interesting as far as this particular failure is concerned.) So it is probably worth looking at changes to the node assignment code in the 6 weeks prior to 1st May to see if anything was changed that might have introduced a loophole into the logic.

@droberts195
Contributor

This test is still failing intermittently. A failure from just now is https://gradle-enterprise.elastic.co/s/aomqzdw7hle46

@jaymode
Member

jaymode commented Dec 14, 2020

I just hit another failure: https://gradle-enterprise.elastic.co/s/wykwy5ai5rcty

@ywangd
Member

ywangd commented Jan 5, 2021

@droberts195
Contributor

Like #66885, the most recent failure will have #66629 as its root cause: if we cannot determine the amount of memory on a machine then several ML tests will fail. The earlier intermittent failures are caused by something else, maybe a race condition in the ML code.

@droberts195
Contributor

#68685 is a duplicate of this, and there is less noise in that issue, so I will consolidate the knowledge from this issue into that one.
