MlDistributedFailureIT.testFullClusterRestart failed #58532
Pinging @elastic/ml-core (:ml)
The test performs a full cluster restart (as you would imagine), then it logs this message and waits for a stable cluster.
The error is that the test times out waiting for the cluster to stabilise. Within that 30 second window a cluster state green message is logged (this is 28 seconds later; in the other failure the same message occurred 27 seconds later).
test/framework/src/main/java/org/elasticsearch/test/ESIntegTestCase.java, line 1158 in acf031c
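For context, here is a minimal sketch of what that stable-cluster wait boils down to, assuming ensureStableCluster in ESIntegTestCase issues a cluster health request that waits for the expected node count (the 30s window and the assertion message are taken from the failure excerpt below; the real framework code may differ in the details):

```java
import org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse;
import org.elasticsearch.common.Priority;
import org.elasticsearch.common.unit.TimeValue;

// Illustrative only: assumes a test class extending ESIntegTestCase, so client()
// and assertFalse() are inherited. Waits until the cluster reports the expected
// number of nodes, or fails with the message seen in the excerpt below.
private void waitForStableCluster(int nodeCount, String viaNode) {
    ClusterHealthResponse resp = client(viaNode).admin().cluster()
        .prepareHealth()
        .setWaitForEvents(Priority.LANGUID)           // let pending cluster state updates drain first
        .setWaitForNodes(Integer.toString(nodeCount)) // "3" in this test
        .setTimeout(TimeValue.timeValueSeconds(30))   // the window that proves too tight on a slow CI worker
        .get();
    assertFalse("failed to reach a stable cluster of [" + nodeCount + "] nodes. Tried via [" + viaNode + "]",
        resp.isTimedOut());
}
```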
And sure enough, in the last cluster state we see the data index still recovering.
In both the failures here it was the data index that was recovering, but build stats has a failure where the … This has been failing fairly regularly since 6th June.
Pinging @elastic/es-distributed: we have a full cluster restart test that is failing due to `failed to reach a stable cluster of [3] nodes`.
I think this might just be the effect of the test being particularly slow at times (given that we execute a lot of other stuff concurrently with gradle). Looking at the failed test linked to above, it already spends a lot of time before it gets to the point where the nodes are restarted (indexing started at 13:37:57, the nodes are restarted at 13:38:45, i.e. nearly 50 seconds later), indicating that the timeout of 30 seconds was just not good enough for this particularly slow CI run: the restart completed at 13:38:56, the cluster went yellow only at 13:39:04 (10 seconds later) and green at 13:39:24 (28 seconds later). The shards might be rebalanced after that, adding an extra few seconds, which just brings this over the 30 seconds. I see two possibilities: rework the test so that it runs less slowly, or increase the timeout.
The Linux CI workers now run 16 tasks in parallel, as the timeline for https://gradle-enterprise.elastic.co/s/jeoql6ppg2iku shows. You can see that the load on the machine was, for around 20 minutes, over 60 processes wanting to do stuff, when the machine has 32 cores. As we push parallelism to its limits we're probably going to have to increase quite a few timeouts.
Thanks for the comments @ywelsch and @droberts195. I was curious because there was a marked increase in failures from 6th June, but I'm satisfied this is most likely due to the CI infra changes, and I can see that in all cases the machine was under heavy load when failures occurred. I bumped the timeout to 60s as it was on the edge.
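For reference, a hedged sketch of the kind of change described here, assuming the fix simply passes an explicit timeout to ensureStableCluster after the restart instead of relying on the 30s default (the actual call site in MlDistributedFailureIT may look different):

```java
// Sketch only: give the post-restart stability check more headroom on heavily
// loaded CI workers by waiting up to 60 seconds instead of the default 30.
internalCluster().fullRestart();
ensureStableCluster(3, TimeValue.timeValueSeconds(60));
ensureGreen();
```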
Build scan:
https://gradle-enterprise.elastic.co/s/jeoql6ppg2iku/console-log?task=:x-pack:plugin:ml:internalClusterTest
Repro line:
./gradlew ':x-pack:plugin:ml:internalClusterTest' --tests "org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.testFullClusterRestart" -Dtests.seed=DD96E9D2F4821BAF -Dtests.security.manager=true -Dtests.locale=fr-BE -Dtests.timezone=US/Mountain -Druntime.java=11
Reproduces locally?:
No.
Applicable branches:
master
Failure history:
Failure excerpt:
`java.lang.AssertionError: failed to reach a stable cluster of [3] nodes. Tried via [node_t0].`
org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT > testFullClusterRestart FAILED
java.lang.AssertionError: failed to reach a stable cluster of [3] nodes. Tried via [node_t0]. last cluster state:
cluster uuid: DIXb_u8PSQe4QvzUrb8E9g [committed: true]
version: 64
state uuid: PbzLzbRtTVuznkxxS6LdXA
from_diff: false
meta data version: 56
coordination_metadata:
term: 4
last_committed_config: VotingConfiguration{libhdxGrQDCted5Sub97WQ,grRXwwwHR8WYarJGUllUpA,GOWk8mE8RXyF1-L_okWoSw}
last_accepted_config: VotingConfiguration{libhdxGrQDCted5Sub97WQ,grRXwwwHR8WYarJGUllUpA,GOWk8mE8RXyF1-L_okWoSw}
voting tombstones: []
[.ml-config/MZW2ZlLOT56WBeI8jDxqQQ]: v[9], mv[1], sv[2], av[1]
0: p_term [2], isa_ids [JYNICHv3RA2sxmU54Mvkvg, B82JZA2iSNWq0CWK5oIlJw]
[.ml-anomalies-shared/rwyxayFaQeaEmX4Rd2huSQ]: v[12], mv[1], sv[2], av[2]
0: p_term [2], isa_ids [NAd8QNw5QNaZs82sR6s2lQ, 7SI08RWpSz6p0BomCjOCyQ]
[.ml-annotations-6/QFh701rrRFusRV3p5L89oQ]: v[12], mv[1], sv[2], av[2]
0: p_term [2], isa_ids [Y7AXS8N8RnGJ3cdI_PhqcA, RkqtUhAASkaLrEGWWiNZBA]
[data/Oo5baM63RoCSadULUGgDUw]: v[6], mv[1], sv[1], av[1]
0: p_term [2], isa_ids [mYVzh39-RKiSq8VPNqg8wg, GSSkfs9gQ-K-JYH0R24v_w]
[.ml-state-000001/UW6lrZRYSK2DXNHrGY9B6A]: v[13], mv[1], sv[2], av[1]
0: p_term [2], isa_ids [yT6j-nKiS--cSbuTF-uoiw, GCzg69rwTgmxnqmN33HZ8A]
[.ml-notifications-000001/JHJYeK8STb6HnWQBLr73MQ]: v[10], mv[1], sv[2], av[1]
0: p_term [2], isa_ids [O_mgdTFnTf2Iun6JdEOk8Q, hNuzXg5PQUqEdC_fIXMyLQ]
metadata customs:
index_lifecycle: {
"policies" : {
"ml-size-based-ilm-policy" : {
"policy" : {
"phases" : {
"hot" : {
"min_age" : "0ms",
"actions" : {
"rollover" : {
"max_size" : "50gb"
}
}
}
}
},
"headers" : { },
"version" : 1,
"modified_date" : 1593076621794,
"modified_date_string" : "2020-06-25T09:17:01.794Z"
},
"slm-history-ilm-policy" : {
"policy" : {
"phases" : {
"hot" : {
"min_age" : "0ms",
"actions" : {
"rollover" : {
"max_size" : "50gb",
"max_age" : "30d"
}
}
},
"delete" : {
"min_age" : "90d",
"actions" : {
"delete" : {
"delete_searchable_snapshot" : true
}
}
}
}
},
"headers" : { },
"version" : 1,
"modified_date" : 1593076620773,
"modified_date_string" : "2020-06-25T09:17:00.773Z"
}
},
"operation_mode" : "RUNNING"
} ingest: org.elasticsearch.ingest.IngestMetadata@398db12a index-graveyard: IndexGraveyard[[]] persistent_tasks: {"last_allocation_id":6,"tasks":[{"id":"job-full-cluster-restart-job","task":{"xpack/ml/job":{"params":{"job_id":"full-cluster-restart-job","timeout":"1800000ms","job":{"job_id":"full-cluster-restart-job","job_type":"anomaly_detector","job_version":"8.0.0","create_time":1593076638347,"analysis_config":{"bucket_span":"1h","detectors":[{"detector_description":"count","function":"count","detector_index":0}],"influencers":[]},"analysis_limits":{"model_memory_limit":"1024mb","categorization_examples_limit":4},"data_description":{"time_field":"time","time_format":"yyyy-MM-dd HH:mm:ss"},"model_snapshot_retention_days":10,"daily_model_snapshot_retention_after_days":1,"results_index_name":"shared","allow_lazy_open":false}},"state":{"state":"opened","allocation_id":5}}},"allocation_id":5,"assignment":{"executor_node":"GOWk8mE8RXyF1-L_okWoSw","explanation":""},"allocation_id_on_last_status_update":5},{"id":"datafeed-data_feed_id","task":{"xpack/ml/datafeed":{"params":{"datafeed_id":"data_feed_id","start":"0","timeout":"20000ms","job_id":"full-cluster-restart-job","indices":["data"],"indices_options":{"expand_wildcards":["open"],"ignore_unavailable":false,"allow_no_indices":true,"ignore_throttled":true}},"state":{"state":"started"}}},"allocation_id":6,"assignment":{"executor_node":"GOWk8mE8RXyF1-L_okWoSw","explanation":""},"allocation_id_on_last_status_update":6}]}
nodes:
{node_t1}{GOWk8mE8RXyF1-L_okWoSw}{FiJjhUD0SJ2bV6I07B82hQ}{127.0.0.1}{127.0.0.1:36407}{dilmr}{ml.machine_memory=101094690816, ml.max_open_jobs=20, xpack.installed=true}
{node_t0}{grRXwwwHR8WYarJGUllUpA}{oOGpoxPQQYa_HCc1HL1TDQ}{127.0.0.1}{127.0.0.1:41819}{dilmr}{ml.machine_memory=101094690816, ml.max_open_jobs=20, xpack.installed=true}
{node_t2}{libhdxGrQDCted5Sub97WQ}{OYVU1TxHS_muQKUQVNPnoA}{127.0.0.1}{127.0.0.1:41017}{dilmr}{ml.machine_memory=101094690816, ml.max_open_jobs=20, xpack.installed=true}, master
routing_table (version 8):
-- index [[.ml-config/MZW2ZlLOT56WBeI8jDxqQQ]]
----shard_id [.ml-config][0]
--------[.ml-config][0], node[libhdxGrQDCted5Sub97WQ], [P], s[STARTED], a[id=B82JZA2iSNWq0CWK5oIlJw]
--------[.ml-config][0], node[grRXwwwHR8WYarJGUllUpA], [R], s[STARTED], a[id=JYNICHv3RA2sxmU54Mvkvg]