-
Notifications
You must be signed in to change notification settings - Fork 25.2k
[ML] adding ml autoscaling integration test #65638
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ML] adding ml autoscaling integration test #65638
Conversation
Pinging @elastic/ml-core (:ml) |
Pinging @elastic/es-distributed (Team:Distributed) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding the test.
I have one question, plus there is something to be careful about in the Cloud-side PR with the removal of the setting maximum.
@@ -449,7 +449,7 @@ | |||
false, | |||
Property.NodeScope); | |||
public static final Setting<Integer> MAX_LAZY_ML_NODES = | |||
Setting.intSetting("xpack.ml.max_lazy_ml_nodes", 0, 0, 3, Property.Dynamic, Property.NodeScope); | |||
Setting.intSetting("xpack.ml.max_lazy_ml_nodes", 0, 0, Property.Dynamic, Property.NodeScope); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will have to be careful that the Cloud infrastructure never tries to set this setting higher than 3 in a pre-7.11/post-7.11 mixed version cluster. This should be possible for autoscaling, as we've previously said that autoscaling won't attempt to do anything in such a cluster.
This ties in with the Cloud-side PR to set the settings required for ML autoscaling. It is probably best if this one doesn't go in elasticsearch.yml
, but instead gets set using a cluster settings API call once the entire cluster has been upgraded to 7.11 or higher.
Also, the docs state that the maximum value is 3. I think it's probably best to leave the docs like this for the time being. Maybe in a year or two we can adjust them, but while there is a risk of problems in mixed version clusters it's probably best that we don't.
new GetAutoscalingCapacityAction.Request() | ||
).actionGet(), | ||
"requesting scale up as number of jobs in queues exceeded configured limit", | ||
380991001934L, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see where the 1328196267L
comes from in the previous assertion: ceil((100 + 40 + 200 + 40) * 1024^2 * 100 / 30)
.
But why is this one 380991001934L
? ceil((100 + 40 + 200 + 40 + 20000 + 40 + 10000 + 40 + 30000 + 40) * 1024^2 * 100 / 30) = 211462826667
I think it would be good to put the expected formula that's been evaluated in a comment so that if ever there are changes to the constants used in the code then the person who edits the expected result here doesn't just mindlessly paste in whatever makes the test pass but can use the formula to manually check the result makes sense given their new constants.
.build(); | ||
} | ||
|
||
public void testMLAutoscalingCapacity() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is running on the assumption that xpack.ml.max_machine_memory_percent
is set to the default of 30
and xpack.ml.use_auto_machine_memory_percent
is false
right? I think it's worth adding a comment to say that these are the expectations for this test and it will need modifying if either of those defaults is ever changed. It will make it clearer for future maintainers where the numbers have come from.
…aling-integration-test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
run elasticsearch-ci/2 |
…aling-integration-test
This adds ml autoscaling integration tests. The test verifies that the scaling requirements adjust according to the current real load on the cluster given machine learning jobs of various sizes. Additionally, there was a bug in the ml scaling service settings. This commit addresses the bug.
* [ML] adding ml autoscaling integration test (#65638) This adds ml autoscaling integration tests. The test verifies that the scaling requirements adjust according to the current real load on the cluster given machine learning jobs of various sizes. Additionally, there was a bug in the ml scaling service settings. This commit addresses the bug.
This adds ml autoscaling integration tests.
The test verifies that the scaling requirements adjust according to the current real load
on the cluster given machine learning jobs of various sizes.
Additionally, there was a bug in the ml scaling service settings. This commit addresses the bug.