[ML] Fix PyTorchModelIT::testDeploymentStats #81161

Merged (2 commits) on Dec 1, 2021

@davidkyle (Member) commented Nov 30, 2021

PyTorchModelIT::testDeploymentStats has been failing, as reported in #80819, due to missing fields in the GET stats response. The problem is in the test: it has the wrong expectations about what is returned while a deployment is starting.

A stats response can only be constructed like this when there are no task responses from the individual nodes, and only the nodes with started models are included in the GET stats request:

https://github.com/elastic/elasticsearch/blob/master/x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/action/TransportGetDeploymentStatsAction.java#L123

The task request is never sent to a node hosting a single model in the `starting` state. This is by design, as those responses are built on the coordinating node. The test passed most of the time because the response is valid if the model is started on at least one node.
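The behaviour described above can be sketched in plain Java. This is only an illustration under stated assumptions, not the code in TransportGetDeploymentStatsAction: the class `DeploymentStatsSketch`, the `RoutingState` enum, the `NodeStats` type and its `inferenceCount` field are all hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DeploymentStatsSketch {

    // Hypothetical routing states; only the two discussed in this PR are modelled.
    enum RoutingState { STARTING, STARTED }

    // Hypothetical per-node entry. inferenceCount is null when the node was never
    // sent a task request, which is the "missing fields" case the test tripped over.
    static final class NodeStats {
        final String nodeId;
        final RoutingState state;
        final Long inferenceCount;

        NodeStats(String nodeId, RoutingState state, Long inferenceCount) {
            this.nodeId = nodeId;
            this.state = state;
            this.inferenceCount = inferenceCount;
        }

        boolean hasTaskStats() {
            return inferenceCount != null;
        }

        @Override
        public String toString() {
            String count = hasTaskStats() ? String.valueOf(inferenceCount) : "<absent>";
            return nodeId + " " + state + " inferenceCount=" + count;
        }
    }

    // Only nodes where the model has started are sent the task request.
    static List<String> nodesToQuery(Map<String, RoutingState> routing) {
        List<String> targets = new ArrayList<>();
        for (Map.Entry<String, RoutingState> entry : routing.entrySet()) {
            if (entry.getValue() == RoutingState.STARTED) {
                targets.add(entry.getKey());
            }
        }
        return targets;
    }

    // Built on the coordinating node: nodes without a task response still get an
    // entry, but it carries no task-level fields.
    static List<NodeStats> buildResponse(Map<String, RoutingState> routing, Map<String, Long> taskResponses) {
        List<NodeStats> result = new ArrayList<>();
        for (Map.Entry<String, RoutingState> entry : routing.entrySet()) {
            Long count = taskResponses.get(entry.getKey());
            result.add(new NodeStats(entry.getKey(), entry.getValue(), count));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, RoutingState> routing = new LinkedHashMap<>();
        routing.put("node-1", RoutingState.STARTED);
        routing.put("node-2", RoutingState.STARTING);

        // Simulate task responses arriving only from the queried nodes.
        Map<String, Long> taskResponses = new HashMap<>();
        for (String nodeId : nodesToQuery(routing)) {
            taskResponses.put(nodeId, 0L);
        }

        for (NodeStats stats : buildResponse(routing, taskResponses)) {
            System.out.println(stats);
        }
    }
}
```

In this model, a test that asserts on task-level fields for every node will fail whenever a node is still in the `starting` state, because that node's entry never had a task response to draw on.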

The second GET stats call sometimes failed because the two ML nodes have different views of the trained model allocation: one node knows the model is started there but the other doesn't. GET stats is not a master node action, but the responses will be eventually consistent.
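One general way for a caller to tolerate an eventually consistent response is to re-read it until the expected state appears. The sketch below is plain Java and is not the change made in this PR (which adjusts the test's expectations); `EventuallyConsistentPoll` and `pollUntil` are hypothetical names used only for illustration.

```java
import java.util.function.BooleanSupplier;

public class EventuallyConsistentPoll {

    // Re-evaluates the check until it passes or the attempts run out.
    // Returns false if the check never passed or the thread was interrupted.
    static boolean pollUntil(BooleanSupplier check, int maxAttempts, long sleepMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (check.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag for the caller and give up.
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Simulated node view: the model is reported as started from the third read onward.
        int[] reads = {0};
        boolean started = pollUntil(() -> ++reads[0] >= 3, 10, 10);
        System.out.println("started=" + started + " after " + reads[0] + " reads");
    }
}
```

Bounding the number of attempts keeps a genuinely broken deployment from hanging the caller indefinitely.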

Closes #80819

@davidkyle davidkyle added >test Issues or PRs that are addressing/adding tests :ml Machine learning v8.0.0 v8.1.0 labels Nov 30, 2021
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Nov 30, 2021
@elasticmachine (Collaborator) commented

Pinging @elastic/ml-core (Team:ML)

@davidkyle davidkyle changed the title from "[ML] Can't test starting models" to "[ML] Fix PyTorchModelIT::testDeploymentStats" Nov 30, 2021
@benwtrent (Member) left a comment

good catch!

@davidkyle davidkyle merged commit 29d17c0 into elastic:master Dec 1, 2021
@elasticsearchmachine (Collaborator) commented

💔 Backport failed

Branch 8.0: Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running `backport --upstream elastic/elasticsearch --pr 81161`

davidkyle added a commit to davidkyle/elasticsearch that referenced this pull request Dec 1, 2021
Adjusts the test's expectations about the information available
when deployments are in the `starting` state
elasticsearchmachine pushed a commit that referenced this pull request Dec 1, 2021
Adjusts the test's expectations about the information available
when deployments are in the `starting` state
Labels
:ml Machine learning Team:ML Meta label for the ML team >test Issues or PRs that are addressing/adding tests v8.0.0-rc1 v8.1.0
Development

Successfully merging this pull request may close these issues.

[CI] PyTorchModelIT testDeploymentStats failing
5 participants