MlDistributedFailureIT.testFullClusterRestart failed #58532

Closed
pugnascotia opened this issue Jun 25, 2020 · 7 comments · Fixed by #58997
Labels
:ml (Machine learning), >test-failure (Triaged test failures from CI)

Comments

@pugnascotia
Contributor

Build scan:

https://gradle-enterprise.elastic.co/s/jeoql6ppg2iku/console-log?task=:x-pack:plugin:ml:internalClusterTest

Repro line:

./gradlew ':x-pack:plugin:ml:internalClusterTest' --tests "org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.testFullClusterRestart" -Dtests.seed=DD96E9D2F4821BAF -Dtests.security.manager=true -Dtests.locale=fr-BE -Dtests.timezone=US/Mountain -Druntime.java=11

Reproduces locally?:

No.

Applicable branches:

master

Failure history:

Failure excerpt:

`java.lang.AssertionError: failed to reach a stable cluster of [3] nodes. Tried via [node_t0].`

org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT > testFullClusterRestart FAILED
java.lang.AssertionError: failed to reach a stable cluster of [3] nodes. Tried via [node_t0]. last cluster state:
cluster uuid: DIXb_u8PSQe4QvzUrb8E9g [committed: true]
version: 64
state uuid: PbzLzbRtTVuznkxxS6LdXA
from_diff: false
meta data version: 56
coordination_metadata:
term: 4
last_committed_config: VotingConfiguration{libhdxGrQDCted5Sub97WQ,grRXwwwHR8WYarJGUllUpA,GOWk8mE8RXyF1-L_okWoSw}
last_accepted_config: VotingConfiguration{libhdxGrQDCted5Sub97WQ,grRXwwwHR8WYarJGUllUpA,GOWk8mE8RXyF1-L_okWoSw}
voting tombstones: []
[.ml-config/MZW2ZlLOT56WBeI8jDxqQQ]: v[9], mv[1], sv[2], av[1]
0: p_term [2], isa_ids [JYNICHv3RA2sxmU54Mvkvg, B82JZA2iSNWq0CWK5oIlJw]
[.ml-anomalies-shared/rwyxayFaQeaEmX4Rd2huSQ]: v[12], mv[1], sv[2], av[2]
0: p_term [2], isa_ids [NAd8QNw5QNaZs82sR6s2lQ, 7SI08RWpSz6p0BomCjOCyQ]
[.ml-annotations-6/QFh701rrRFusRV3p5L89oQ]: v[12], mv[1], sv[2], av[2]
0: p_term [2], isa_ids [Y7AXS8N8RnGJ3cdI_PhqcA, RkqtUhAASkaLrEGWWiNZBA]
[data/Oo5baM63RoCSadULUGgDUw]: v[6], mv[1], sv[1], av[1]
0: p_term [2], isa_ids [mYVzh39-RKiSq8VPNqg8wg, GSSkfs9gQ-K-JYH0R24v_w]
[.ml-state-000001/UW6lrZRYSK2DXNHrGY9B6A]: v[13], mv[1], sv[2], av[1]
0: p_term [2], isa_ids [yT6j-nKiS--cSbuTF-uoiw, GCzg69rwTgmxnqmN33HZ8A]
[.ml-notifications-000001/JHJYeK8STb6HnWQBLr73MQ]: v[10], mv[1], sv[2], av[1]
0: p_term [2], isa_ids [O_mgdTFnTf2Iun6JdEOk8Q, hNuzXg5PQUqEdC_fIXMyLQ]
metadata customs:
index_lifecycle: {
"policies" : {
"ml-size-based-ilm-policy" : {
"policy" : {
"phases" : {
"hot" : {
"min_age" : "0ms",
"actions" : {
"rollover" : {
"max_size" : "50gb"
}
}
}
}
},
"headers" : { },
"version" : 1,
"modified_date" : 1593076621794,
"modified_date_string" : "2020-06-25T09:17:01.794Z"
},
"slm-history-ilm-policy" : {
"policy" : {
"phases" : {
"hot" : {
"min_age" : "0ms",
"actions" : {
"rollover" : {
"max_size" : "50gb",
"max_age" : "30d"
}
}
},
"delete" : {
"min_age" : "90d",
"actions" : {
"delete" : {
"delete_searchable_snapshot" : true
}
}
}
}
},
"headers" : { },
"version" : 1,
"modified_date" : 1593076620773,
"modified_date_string" : "2020-06-25T09:17:00.773Z"
}
},
"operation_mode" : "RUNNING"
} ingest: org.elasticsearch.ingest.IngestMetadata@398db12a index-graveyard: IndexGraveyard[[]] persistent_tasks: {"last_allocation_id":6,"tasks":[{"id":"job-full-cluster-restart-job","task":{"xpack/ml/job":{"params":{"job_id":"full-cluster-restart-job","timeout":"1800000ms","job":{"job_id":"full-cluster-restart-job","job_type":"anomaly_detector","job_version":"8.0.0","create_time":1593076638347,"analysis_config":{"bucket_span":"1h","detectors":[{"detector_description":"count","function":"count","detector_index":0}],"influencers":[]},"analysis_limits":{"model_memory_limit":"1024mb","categorization_examples_limit":4},"data_description":{"time_field":"time","time_format":"yyyy-MM-dd HH:mm:ss"},"model_snapshot_retention_days":10,"daily_model_snapshot_retention_after_days":1,"results_index_name":"shared","allow_lazy_open":false}},"state":{"state":"opened","allocation_id":5}}},"allocation_id":5,"assignment":{"executor_node":"GOWk8mE8RXyF1-L_okWoSw","explanation":""},"allocation_id_on_last_status_update":5},{"id":"datafeed-data_feed_id","task":{"xpack/ml/datafeed":{"params":{"datafeed_id":"data_feed_id","start":"0","timeout":"20000ms","job_id":"full-cluster-restart-job","indices":["data"],"indices_options":{"expand_wildcards":["open"],"ignore_unavailable":false,"allow_no_indices":true,"ignore_throttled":true}},"state":{"state":"started"}}},"allocation_id":6,"assignment":{"executor_node":"GOWk8mE8RXyF1-L_okWoSw","explanation":""},"allocation_id_on_last_status_update":6}]}
nodes:
{node_t1}{GOWk8mE8RXyF1-L_okWoSw}{FiJjhUD0SJ2bV6I07B82hQ}{127.0.0.1}{127.0.0.1:36407}{dilmr}{ml.machine_memory=101094690816, ml.max_open_jobs=20, xpack.installed=true}
{node_t0}{grRXwwwHR8WYarJGUllUpA}{oOGpoxPQQYa_HCc1HL1TDQ}{127.0.0.1}{127.0.0.1:41819}{dilmr}{ml.machine_memory=101094690816, ml.max_open_jobs=20, xpack.installed=true}
{node_t2}{libhdxGrQDCted5Sub97WQ}{OYVU1TxHS_muQKUQVNPnoA}{127.0.0.1}{127.0.0.1:41017}{dilmr}{ml.machine_memory=101094690816, ml.max_open_jobs=20, xpack.installed=true}, master
routing_table (version 8):
-- index [[.ml-config/MZW2ZlLOT56WBeI8jDxqQQ]]
----shard_id [.ml-config][0]
--------[.ml-config][0], node[libhdxGrQDCted5Sub97WQ], [P], s[STARTED], a[id=B82JZA2iSNWq0CWK5oIlJw]
--------[.ml-config][0], node[grRXwwwHR8WYarJGUllUpA], [R], s[STARTED], a[id=JYNICHv3RA2sxmU54Mvkvg]

-- index [[.ml-annotations-6/QFh701rrRFusRV3p5L89oQ]]
----shard_id [.ml-annotations-6][0]
--------[.ml-annotations-6][0], node[GOWk8mE8RXyF1-L_okWoSw], [P], s[STARTED], a[id=RkqtUhAASkaLrEGWWiNZBA]
--------[.ml-annotations-6][0], node[grRXwwwHR8WYarJGUllUpA], [R], s[STARTED], a[id=Y7AXS8N8RnGJ3cdI_PhqcA]

-- index [[.ml-anomalies-shared/rwyxayFaQeaEmX4Rd2huSQ]]
----shard_id [.ml-anomalies-shared][0]
--------[.ml-anomalies-shared][0], node[GOWk8mE8RXyF1-L_okWoSw], [R], s[STARTED], a[id=NAd8QNw5QNaZs82sR6s2lQ]
--------[.ml-anomalies-shared][0], node[grRXwwwHR8WYarJGUllUpA], [P], s[STARTED], a[id=7SI08RWpSz6p0BomCjOCyQ]

-- index [[data/Oo5baM63RoCSadULUGgDUw]]
----shard_id [data][0]
--------[data][0], node[GOWk8mE8RXyF1-L_okWoSw], [P], s[STARTED], a[id=GSSkfs9gQ-K-JYH0R24v_w]
--------[data][0], node[libhdxGrQDCted5Sub97WQ], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=SxzR0FYRQf-YI2FykImH2A], unassigned_info[[reason=CLUSTER_RECOVERED], at[2020-06-25T09:18:18.429Z], delayed=false, allocation_status[no_attempt]]

-- index [[.ml-state-000001/UW6lrZRYSK2DXNHrGY9B6A]]
----shard_id [.ml-state-000001][0]
--------[.ml-state-000001][0], node[GOWk8mE8RXyF1-L_okWoSw], [P], s[STARTED], a[id=yT6j-nKiS--cSbuTF-uoiw]
--------[.ml-state-000001][0], node[libhdxGrQDCted5Sub97WQ], [R], s[STARTED], a[id=GCzg69rwTgmxnqmN33HZ8A]

-- index [[.ml-notifications-000001/JHJYeK8STb6HnWQBLr73MQ]]
----shard_id [.ml-notifications-000001][0]
--------[.ml-notifications-000001][0], node[libhdxGrQDCted5Sub97WQ], [P], s[STARTED], a[id=hNuzXg5PQUqEdC_fIXMyLQ]
--------[.ml-notifications-000001][0], node[grRXwwwHR8WYarJGUllUpA], [R], s[STARTED], a[id=O_mgdTFnTf2Iun6JdEOk8Q]

routing_nodes:
-----node_id[GOWk8mE8RXyF1-L_okWoSw][V]
--------[.ml-annotations-6][0], node[GOWk8mE8RXyF1-L_okWoSw], [P], s[STARTED], a[id=RkqtUhAASkaLrEGWWiNZBA]
--------[.ml-anomalies-shared][0], node[GOWk8mE8RXyF1-L_okWoSw], [R], s[STARTED], a[id=NAd8QNw5QNaZs82sR6s2lQ]
--------[data][0], node[GOWk8mE8RXyF1-L_okWoSw], [P], s[STARTED], a[id=GSSkfs9gQ-K-JYH0R24v_w]
--------[.ml-state-000001][0], node[GOWk8mE8RXyF1-L_okWoSw], [P], s[STARTED], a[id=yT6j-nKiS--cSbuTF-uoiw]
-----node_id[libhdxGrQDCted5Sub97WQ][V]
--------[.ml-config][0], node[libhdxGrQDCted5Sub97WQ], [P], s[STARTED], a[id=B82JZA2iSNWq0CWK5oIlJw]
--------[data][0], node[libhdxGrQDCted5Sub97WQ], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=SxzR0FYRQf-YI2FykImH2A], unassigned_info[[reason=CLUSTER_RECOVERED], at[2020-06-25T09:18:18.429Z], delayed=false, allocation_status[no_attempt]]
--------[.ml-state-000001][0], node[libhdxGrQDCted5Sub97WQ], [R], s[STARTED], a[id=GCzg69rwTgmxnqmN33HZ8A]
--------[.ml-notifications-000001][0], node[libhdxGrQDCted5Sub97WQ], [P], s[STARTED], a[id=hNuzXg5PQUqEdC_fIXMyLQ]
-----node_id[grRXwwwHR8WYarJGUllUpA][V]
--------[.ml-config][0], node[grRXwwwHR8WYarJGUllUpA], [R], s[STARTED], a[id=JYNICHv3RA2sxmU54Mvkvg]
--------[.ml-annotations-6][0], node[grRXwwwHR8WYarJGUllUpA], [R], s[STARTED], a[id=Y7AXS8N8RnGJ3cdI_PhqcA]
--------[.ml-anomalies-shared][0], node[grRXwwwHR8WYarJGUllUpA], [P], s[STARTED], a[id=7SI08RWpSz6p0BomCjOCyQ]
--------[.ml-notifications-000001][0], node[grRXwwwHR8WYarJGUllUpA], [R], s[STARTED], a[id=O_mgdTFnTf2Iun6JdEOk8Q]
---- unassigned
    at __randomizedtesting.SeedInfo.seed([DD96E9D2F4821BAF:8CB9F8C301AE5300]:0)
    at org.junit.Assert.fail(Assert.java:88)
    at org.elasticsearch.test.ESIntegTestCase.ensureStableCluster(ESIntegTestCase.java:1162)
    at org.elasticsearch.test.ESIntegTestCase.ensureStableCluster(ESIntegTestCase.java:1145)
    at org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.ensureStableClusterOnAllNodes(MlDistributedFailureIT.java:546)
    at org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.lambda$testFullClusterRestart$4(MlDistributedFailureIT.java:126)
    at org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.run(MlDistributedFailureIT.java:470)
    at org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.testFullClusterRestart(MlDistributedFailureIT.java:122)

pugnascotia added the >test-failure (Triaged test failures from CI) and :ml (Machine learning) labels on Jun 25, 2020
@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

@davidkyle
Member

The test performs a full cluster restart (as you would imagine), then logs this message and waits for a stable cluster:

[2020-07-02T13:38:56,971][INFO ][o.e.x.m.i.MlDistributedFailureIT] [testFullClusterRestart] Restarted all nodes

The error is that ESIntegTestCase.ensureStableCluster(...) times out after 30 seconds.

Within that 30-second window a cluster health GREEN message is logged (28 seconds after the restart; in the other failure the same message appeared 27 seconds later):

[2020-07-02T13:39:24,570][INFO ][o.e.c.r.a.AllocationService] [node_t1] current.health="GREEN" message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[data][0]]])." previous.health="YELLOW" reason="shards started [[data][0]]"

ensureStableCluster also sets waitForNoRelocatingShards to true on its cluster health request.
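
For context, here is a rough sketch of the health check that ensureStableCluster performs, paraphrased from memory of the test framework rather than copied from the source. The builder methods shown are real, but the method name ensureStableClusterSketch and the exact wiring are illustrative:

```java
import org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse;
import org.elasticsearch.common.Priority;
import org.elasticsearch.common.unit.TimeValue;

// Sketch only: assumes it lives in an ESIntegTestCase subclass, where client(...)
// and fail(...) are in scope.
protected void ensureStableClusterSketch(int nodeCount, TimeValue timeout, String viaNode) {
    ClusterHealthResponse health = client(viaNode).admin().cluster().prepareHealth()
        .setWaitForEvents(Priority.LANGUID)           // let pending cluster state tasks drain first
        .setWaitForNodes(Integer.toString(nodeCount)) // every restarted node must have rejoined
        .setWaitForNoRelocatingShards(true)           // the condition highlighted above
        .setTimeout(timeout)                          // 30 seconds for the short overloads
        .get();
    if (health.isTimedOut()) {
        fail("failed to reach a stable cluster of [" + nodeCount + "] nodes. Tried via [" + viaNode + "].");
    }
}
```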

And sure enough, in the last cluster state we can see that the data index's replica shard is still recovering. I think this is why ensureStableCluster timed out:

--------[data][0], node[Mrm365-qTBaAyCuhPZHNKQ], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=uu_2AYoWTIefDl6z1l-2uQ], unassigned_info[[reason=CLUSTER_RECOVERED], at[2020-07-02T09:08:56.982Z], delayed=false, allocation_status[no_attempt]]

In both of the failures here it was the data index that was recovering, but build stats has a failure where the .ml-config index was recovering:

https://build-stats.elastic.co/app/kibana#/doc/e58bf320-7efd-11e8-bf69-63c8ef516157/failure-1531848256632/t?id=dgvAh3IBRctK8eeC9eTw&_g=()

This has been failing fairly regularly since 6th June:

https://build-stats.elastic.co/app/kibana#/discover?_g=(refreshInterval:(pause:!t,value:0),time:(from:now-60d,mode:quick,to:now))&_a=(columns:!(_source),index:e58bf320-7efd-11e8-bf69-63c8ef516157,interval:auto,query:(language:lucene,query:MlDistributedFailureIT),sort:!(time,desc))

@davidkyle
Member

Pinging @elastic/es-distributed: we have a full cluster restart test that is failing because ESIntegTestCase.ensureStableCluster() times out after the restart. According to build stats this has been failing regularly since 6th June. Is there any change you can think of that may be causing the problems? Thanks.

@ywelsch
Contributor

ywelsch commented Jul 2, 2020

I think this might just be the effect of the test being particularly slow at times (given that we execute a lot of other stuff concurrently with Gradle). Looking at the failed test linked above, it already spends a lot of time before it gets to the point where the nodes are restarted (indexing started at 13:37:57 and the nodes were restarted at 13:38:45, i.e. nearly 50 seconds later), indicating that the 30-second timeout was simply not good enough for this particularly slow CI run: the restart completed at 13:38:56, the cluster only went yellow at 13:39:04 (about 10 seconds later), and it became green at 13:39:24 (after 28 seconds). The shards might be rebalanced after that, costing an extra few seconds, which is just enough to push this over the 30 seconds.

I see two possibilities: rework the test so that it runs faster, or increase the timeout.

@droberts195
Contributor

The Linux CI workers now run 16 tasks in parallel, as the timeline for https://gradle-enterprise.elastic.co/s/jeoql6ppg2iku shows:

[Screenshot: Gradle Enterprise build timeline for this run]

You can see that :x-pack:plugin:ml:internalClusterTest overlapped with the main X-Pack YAML tests, the ML YAML tests that run 3 times with different levels of security permissions, the CCR internal cluster tests and the EQL integration tests.

For around 20 minutes the load on the machine was over 60 processes wanting to run, while the machine has only 32 cores:

[Screenshot: CI worker load during the run]

As we push parallelism to its limits we're probably going to have to increase quite a few timeouts.

@davidkyle
Member

Thanks for the comments, @ywelsch and @droberts195.

I was curious because there was a marked increase in failures from 6th June, but I'm satisfied this is most likely due to CI infra changes, and I can see that in all cases the machine was under heavy load when failures occurred.

I bumped the timeout to 60s, as it was right on the edge.
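
For illustration, a hypothetical sketch of what bumping the timeout inside MlDistributedFailureIT could look like; the helper name matches the one in the stack trace above, but the body is illustrative and not taken from PR #58997:

```java
import org.elasticsearch.common.unit.TimeValue;

// Hypothetical sketch: pass an explicit 60-second timeout to the framework's
// ensureStableCluster overload instead of relying on the 30-second default.
private void ensureStableClusterOnAllNodes(int nodeCount) {
    for (String nodeName : internalCluster().getNodeNames()) {
        ensureStableCluster(nodeCount, TimeValue.timeValueSeconds(60), false, nodeName);
    }
}
```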
