MlDistributedFailureIT.testFullClusterRestart failed #58532

Closed
pugnascotia opened this issue Jun 25, 2020 · 7 comments · Fixed by #58997
Labels
:ml (Machine learning), >test-failure (Triaged test failures from CI)

Comments

@pugnascotia
Contributor

Build scan:

https://gradle-enterprise.elastic.co/s/jeoql6ppg2iku/console-log?task=:x-pack:plugin:ml:internalClusterTest

Repro line:

./gradlew ':x-pack:plugin:ml:internalClusterTest' --tests "org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.testFullClusterRestart" -Dtests.seed=DD96E9D2F4821BAF -Dtests.security.manager=true -Dtests.locale=fr-BE -Dtests.timezone=US/Mountain -Druntime.java=11

Reproduces locally?:

No.

Applicable branches:

master

Failure history:

Failure excerpt:

`java.lang.AssertionError: failed to reach a stable cluster of [3] nodes. Tried via [node_t0].`

org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT > testFullClusterRestart FAILED
java.lang.AssertionError: failed to reach a stable cluster of [3] nodes. Tried via [node_t0]. last cluster state:
cluster uuid: DIXb_u8PSQe4QvzUrb8E9g [committed: true]
version: 64
state uuid: PbzLzbRtTVuznkxxS6LdXA
from_diff: false
meta data version: 56
coordination_metadata:
term: 4
last_committed_config: VotingConfiguration{libhdxGrQDCted5Sub97WQ,grRXwwwHR8WYarJGUllUpA,GOWk8mE8RXyF1-L_okWoSw}
last_accepted_config: VotingConfiguration{libhdxGrQDCted5Sub97WQ,grRXwwwHR8WYarJGUllUpA,GOWk8mE8RXyF1-L_okWoSw}
voting tombstones: []
[.ml-config/MZW2ZlLOT56WBeI8jDxqQQ]: v[9], mv[1], sv[2], av[1]
0: p_term [2], isa_ids [JYNICHv3RA2sxmU54Mvkvg, B82JZA2iSNWq0CWK5oIlJw]
[.ml-anomalies-shared/rwyxayFaQeaEmX4Rd2huSQ]: v[12], mv[1], sv[2], av[2]
0: p_term [2], isa_ids [NAd8QNw5QNaZs82sR6s2lQ, 7SI08RWpSz6p0BomCjOCyQ]
[.ml-annotations-6/QFh701rrRFusRV3p5L89oQ]: v[12], mv[1], sv[2], av[2]
0: p_term [2], isa_ids [Y7AXS8N8RnGJ3cdI_PhqcA, RkqtUhAASkaLrEGWWiNZBA]
[data/Oo5baM63RoCSadULUGgDUw]: v[6], mv[1], sv[1], av[1]
0: p_term [2], isa_ids [mYVzh39-RKiSq8VPNqg8wg, GSSkfs9gQ-K-JYH0R24v_w]
[.ml-state-000001/UW6lrZRYSK2DXNHrGY9B6A]: v[13], mv[1], sv[2], av[1]
0: p_term [2], isa_ids [yT6j-nKiS--cSbuTF-uoiw, GCzg69rwTgmxnqmN33HZ8A]
[.ml-notifications-000001/JHJYeK8STb6HnWQBLr73MQ]: v[10], mv[1], sv[2], av[1]
0: p_term [2], isa_ids [O_mgdTFnTf2Iun6JdEOk8Q, hNuzXg5PQUqEdC_fIXMyLQ]
metadata customs:
index_lifecycle: {
"policies" : {
"ml-size-based-ilm-policy" : {
"policy" : {
"phases" : {
"hot" : {
"min_age" : "0ms",
"actions" : {
"rollover" : {
"max_size" : "50gb"
}
}
}
}
},
"headers" : { },
"version" : 1,
"modified_date" : 1593076621794,
"modified_date_string" : "2020-06-25T09:17:01.794Z"
},
"slm-history-ilm-policy" : {
"policy" : {
"phases" : {
"hot" : {
"min_age" : "0ms",
"actions" : {
"rollover" : {
"max_size" : "50gb",
"max_age" : "30d"
}
}
},
"delete" : {
"min_age" : "90d",
"actions" : {
"delete" : {
"delete_searchable_snapshot" : true
}
}
}
}
},
"headers" : { },
"version" : 1,
"modified_date" : 1593076620773,
"modified_date_string" : "2020-06-25T09:17:00.773Z"
}
},
"operation_mode" : "RUNNING"
} ingest: org.elasticsearch.ingest.IngestMetadata@398db12a index-graveyard: IndexGraveyard[[]] persistent_tasks: {"last_allocation_id":6,"tasks":[{"id":"job-full-cluster-restart-job","task":{"xpack/ml/job":{"params":{"job_id":"full-cluster-restart-job","timeout":"1800000ms","job":{"job_id":"full-cluster-restart-job","job_type":"anomaly_detector","job_version":"8.0.0","create_time":1593076638347,"analysis_config":{"bucket_span":"1h","detectors":[{"detector_description":"count","function":"count","detector_index":0}],"influencers":[]},"analysis_limits":{"model_memory_limit":"1024mb","categorization_examples_limit":4},"data_description":{"time_field":"time","time_format":"yyyy-MM-dd HH:mm:ss"},"model_snapshot_retention_days":10,"daily_model_snapshot_retention_after_days":1,"results_index_name":"shared","allow_lazy_open":false}},"state":{"state":"opened","allocation_id":5}}},"allocation_id":5,"assignment":{"executor_node":"GOWk8mE8RXyF1-L_okWoSw","explanation":""},"allocation_id_on_last_status_update":5},{"id":"datafeed-data_feed_id","task":{"xpack/ml/datafeed":{"params":{"datafeed_id":"data_feed_id","start":"0","timeout":"20000ms","job_id":"full-cluster-restart-job","indices":["data"],"indices_options":{"expand_wildcards":["open"],"ignore_unavailable":false,"allow_no_indices":true,"ignore_throttled":true}},"state":{"state":"started"}}},"allocation_id":6,"assignment":{"executor_node":"GOWk8mE8RXyF1-L_okWoSw","explanation":""},"allocation_id_on_last_status_update":6}]}
nodes:
{node_t1}{GOWk8mE8RXyF1-L_okWoSw}{FiJjhUD0SJ2bV6I07B82hQ}{127.0.0.1}{127.0.0.1:36407}{dilmr}{ml.machine_memory=101094690816, ml.max_open_jobs=20, xpack.installed=true}
{node_t0}{grRXwwwHR8WYarJGUllUpA}{oOGpoxPQQYa_HCc1HL1TDQ}{127.0.0.1}{127.0.0.1:41819}{dilmr}{ml.machine_memory=101094690816, ml.max_open_jobs=20, xpack.installed=true}
{node_t2}{libhdxGrQDCted5Sub97WQ}{OYVU1TxHS_muQKUQVNPnoA}{127.0.0.1}{127.0.0.1:41017}{dilmr}{ml.machine_memory=101094690816, ml.max_open_jobs=20, xpack.installed=true}, master
routing_table (version 8):
-- index [[.ml-config/MZW2ZlLOT56WBeI8jDxqQQ]]
----shard_id [.ml-config][0]
--------[.ml-config][0], node[libhdxGrQDCted5Sub97WQ], [P], s[STARTED], a[id=B82JZA2iSNWq0CWK5oIlJw]
--------[.ml-config][0], node[grRXwwwHR8WYarJGUllUpA], [R], s[STARTED], a[id=JYNICHv3RA2sxmU54Mvkvg]

-- index [[.ml-annotations-6/QFh701rrRFusRV3p5L89oQ]]
----shard_id [.ml-annotations-6][0]
--------[.ml-annotations-6][0], node[GOWk8mE8RXyF1-L_okWoSw], [P], s[STARTED], a[id=RkqtUhAASkaLrEGWWiNZBA]
--------[.ml-annotations-6][0], node[grRXwwwHR8WYarJGUllUpA], [R], s[STARTED], a[id=Y7AXS8N8RnGJ3cdI_PhqcA]

-- index [[.ml-anomalies-shared/rwyxayFaQeaEmX4Rd2huSQ]]
----shard_id [.ml-anomalies-shared][0]
--------[.ml-anomalies-shared][0], node[GOWk8mE8RXyF1-L_okWoSw], [R], s[STARTED], a[id=NAd8QNw5QNaZs82sR6s2lQ]
--------[.ml-anomalies-shared][0], node[grRXwwwHR8WYarJGUllUpA], [P], s[STARTED], a[id=7SI08RWpSz6p0BomCjOCyQ]

-- index [[data/Oo5baM63RoCSadULUGgDUw]]
----shard_id [data][0]
--------[data][0], node[GOWk8mE8RXyF1-L_okWoSw], [P], s[STARTED], a[id=GSSkfs9gQ-K-JYH0R24v_w]
--------[data][0], node[libhdxGrQDCted5Sub97WQ], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=SxzR0FYRQf-YI2FykImH2A], unassigned_info[[reason=CLUSTER_RECOVERED], at[2020-06-25T09:18:18.429Z], delayed=false, allocation_status[no_attempt]]

-- index [[.ml-state-000001/UW6lrZRYSK2DXNHrGY9B6A]]
----shard_id [.ml-state-000001][0]
--------[.ml-state-000001][0], node[GOWk8mE8RXyF1-L_okWoSw], [P], s[STARTED], a[id=yT6j-nKiS--cSbuTF-uoiw]
--------[.ml-state-000001][0], node[libhdxGrQDCted5Sub97WQ], [R], s[STARTED], a[id=GCzg69rwTgmxnqmN33HZ8A]

-- index [[.ml-notifications-000001/JHJYeK8STb6HnWQBLr73MQ]]
----shard_id [.ml-notifications-000001][0]
--------[.ml-notifications-000001][0], node[libhdxGrQDCted5Sub97WQ], [P], s[STARTED], a[id=hNuzXg5PQUqEdC_fIXMyLQ]
--------[.ml-notifications-000001][0], node[grRXwwwHR8WYarJGUllUpA], [R], s[STARTED], a[id=O_mgdTFnTf2Iun6JdEOk8Q]

routing_nodes:
-----node_id[GOWk8mE8RXyF1-L_okWoSw][V]
--------[.ml-annotations-6][0], node[GOWk8mE8RXyF1-L_okWoSw], [P], s[STARTED], a[id=RkqtUhAASkaLrEGWWiNZBA]
--------[.ml-anomalies-shared][0], node[GOWk8mE8RXyF1-L_okWoSw], [R], s[STARTED], a[id=NAd8QNw5QNaZs82sR6s2lQ]
--------[data][0], node[GOWk8mE8RXyF1-L_okWoSw], [P], s[STARTED], a[id=GSSkfs9gQ-K-JYH0R24v_w]
--------[.ml-state-000001][0], node[GOWk8mE8RXyF1-L_okWoSw], [P], s[STARTED], a[id=yT6j-nKiS--cSbuTF-uoiw]
-----node_id[libhdxGrQDCted5Sub97WQ][V]
--------[.ml-config][0], node[libhdxGrQDCted5Sub97WQ], [P], s[STARTED], a[id=B82JZA2iSNWq0CWK5oIlJw]
--------[data][0], node[libhdxGrQDCted5Sub97WQ], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=SxzR0FYRQf-YI2FykImH2A], unassigned_info[[reason=CLUSTER_RECOVERED], at[2020-06-25T09:18:18.429Z], delayed=false, allocation_status[no_attempt]]
--------[.ml-state-000001][0], node[libhdxGrQDCted5Sub97WQ], [R], s[STARTED], a[id=GCzg69rwTgmxnqmN33HZ8A]
--------[.ml-notifications-000001][0], node[libhdxGrQDCted5Sub97WQ], [P], s[STARTED], a[id=hNuzXg5PQUqEdC_fIXMyLQ]
-----node_id[grRXwwwHR8WYarJGUllUpA][V]
--------[.ml-config][0], node[grRXwwwHR8WYarJGUllUpA], [R], s[STARTED], a[id=JYNICHv3RA2sxmU54Mvkvg]
--------[.ml-annotations-6][0], node[grRXwwwHR8WYarJGUllUpA], [R], s[STARTED], a[id=Y7AXS8N8RnGJ3cdI_PhqcA]
--------[.ml-anomalies-shared][0], node[grRXwwwHR8WYarJGUllUpA], [P], s[STARTED], a[id=7SI08RWpSz6p0BomCjOCyQ]
--------[.ml-notifications-000001][0], node[grRXwwwHR8WYarJGUllUpA], [R], s[STARTED], a[id=O_mgdTFnTf2Iun6JdEOk8Q]
---- unassigned
    at __randomizedtesting.SeedInfo.seed([DD96E9D2F4821BAF:8CB9F8C301AE5300]:0)
    at org.junit.Assert.fail(Assert.java:88)
    at org.elasticsearch.test.ESIntegTestCase.ensureStableCluster(ESIntegTestCase.java:1162)
    at org.elasticsearch.test.ESIntegTestCase.ensureStableCluster(ESIntegTestCase.java:1145)
    at org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.ensureStableClusterOnAllNodes(MlDistributedFailureIT.java:546)
    at org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.lambda$testFullClusterRestart$4(MlDistributedFailureIT.java:126)
    at org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.run(MlDistributedFailureIT.java:470)
    at org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.testFullClusterRestart(MlDistributedFailureIT.java:122)

pugnascotia added the >test-failure (Triaged test failures from CI) and :ml (Machine learning) labels on Jun 25, 2020
@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

@davidkyle
Member

The test performs a full cluster restart (as you would imagine), then logs this message and waits for a stable cluster:

[2020-07-02T13:38:56,971][INFO ][o.e.x.m.i.MlDistributedFailureIT] [testFullClusterRestart] Restarted all nodes

The error is that ESIntegTestCase.ensureStableCluster(...) times out after 30 seconds.

Within that 30-second window a cluster health GREEN message is logged (28 seconds after the restart; in the other failure the same message appeared 27 seconds later):

[2020-07-02T13:39:24,570][INFO ][o.e.c.r.a.AllocationService] [node_t1] current.health="GREEN" message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[data][0]]])." previous.health="YELLOW" reason="shards started [[data][0]]"

ensureStableCluster also sets waitForNoRelocatingShards to true on its cluster health request.
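
For context, here is a rough sketch of the health check that ensureStableCluster performs, paraphrased from memory of the test framework rather than copied from the source. The builder methods shown are real, but the method name ensureStableClusterSketch and the exact wiring are illustrative:

```java
import org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse;
import org.elasticsearch.common.Priority;
import org.elasticsearch.common.unit.TimeValue;

// Sketch only: assumes it lives in an ESIntegTestCase subclass, where client(...)
// and fail(...) are in scope.
protected void ensureStableClusterSketch(int nodeCount, TimeValue timeout, String viaNode) {
    ClusterHealthResponse health = client(viaNode).admin().cluster().prepareHealth()
        .setWaitForEvents(Priority.LANGUID)           // let pending cluster state tasks drain first
        .setWaitForNodes(Integer.toString(nodeCount)) // every restarted node must have rejoined
        .setWaitForNoRelocatingShards(true)           // the condition highlighted above
        .setTimeout(timeout)                          // 30 seconds for the short overloads
        .get();
    if (health.isTimedOut()) {
        fail("failed to reach a stable cluster of [" + nodeCount + "] nodes. Tried via [" + viaNode + "].");
    }
}
```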

And sure enough, in the last cluster state we can see that the data index's replica shard is still recovering. I think this is why ensureStableCluster timed out:

--------[data][0], node[Mrm365-qTBaAyCuhPZHNKQ], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=uu_2AYoWTIefDl6z1l-2uQ], unassigned_info[[reason=CLUSTER_RECOVERED], at[2020-07-02T09:08:56.982Z], delayed=false, allocation_status[no_attempt]]

In both of the failures here it was the data index that was recovering, but build stats has a failure where the .ml-config index was recovering:

https://build-stats.elastic.co/app/kibana#/doc/e58bf320-7efd-11e8-bf69-63c8ef516157/failure-1531848256632/t?id=dgvAh3IBRctK8eeC9eTw&_g=()

This has been failing fairly regularly since 6th June:

https://build-stats.elastic.co/app/kibana#/discover?_g=(refreshInterval:(pause:!t,value:0),time:(from:now-60d,mode:quick,to:now))&_a=(columns:!(_source),index:e58bf320-7efd-11e8-bf69-63c8ef516157,interval:auto,query:(language:lucene,query:MlDistributedFailureIT),sort:!(time,desc))

@davidkyle
Member

Pinging @elastic/es-distributed: we have a full cluster restart test that is failing because ESIntegTestCase.ensureStableCluster() times out after the restart. According to build stats this has been failing regularly since 6th June. Is there any change you can think of that may be causing the problems? Thanks.

@ywelsch
Contributor

ywelsch commented Jul 2, 2020

I think this might just be the effect of the test being particularly slow at times (given that we execute a lot of other stuff concurrently with Gradle). Looking at the failed test linked above, it already spends a lot of time before it gets to the point where the nodes are restarted (indexing started at 13:37:57 and the nodes were restarted at 13:38:45, i.e. nearly 50 seconds later), indicating that the 30-second timeout was simply not good enough for this particularly slow CI run: the restart completed at 13:38:56, the cluster only went yellow at 13:39:04 (about 10 seconds later), and it became green at 13:39:24 (after 28 seconds). The shards might be rebalanced after that, costing an extra few seconds, which is just enough to push this over the 30 seconds.

I see two possibilities: rework the test so that it runs faster, or increase the timeout.

@droberts195
Contributor

The Linux CI workers now run 16 tasks in parallel, as the timeline for https://gradle-enterprise.elastic.co/s/jeoql6ppg2iku shows:

[Screenshot: Gradle Enterprise build timeline for this run]

You can see that :x-pack:plugin:ml:internalClusterTest overlapped with the main X-Pack YAML tests, the ML YAML tests that run 3 times with different levels of security permissions, the CCR internal cluster tests and the EQL integration tests.

For around 20 minutes the load on the machine was over 60 processes wanting to run, while the machine has only 32 cores:

[Screenshot: CI worker load during the run]

As we push parallelism to its limits we're probably going to have to increase quite a few timeouts.

@davidkyle
Member

Thanks for the comments, @ywelsch and @droberts195.

I was curious because there was a marked increase in failures from 6th June, but I'm satisfied this is most likely due to CI infra changes, and I can see that in all cases the machine was under heavy load when failures occurred.

I bumped the timeout to 60s, as it was right on the edge.
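
For illustration, a hypothetical sketch of what bumping the timeout inside MlDistributedFailureIT could look like; the helper name matches the one in the stack trace above, but the body is illustrative and not taken from PR #58997:

```java
import org.elasticsearch.common.unit.TimeValue;

// Hypothetical sketch: pass an explicit 60-second timeout to the framework's
// ensureStableCluster overload instead of relying on the 30-second default.
private void ensureStableClusterOnAllNodes(int nodeCount) {
    for (String nodeName : internalCluster().getNodeNames()) {
        ensureStableCluster(nodeCount, TimeValue.timeValueSeconds(60), false, nodeName);
    }
}
```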
