
[CI] InferenceIngestIT.testPipelineIngest and testPathologicalPipelineCreationAndDeletion failures #61564


Closed
ywangd opened this issue Aug 26, 2020 · 3 comments · Fixed by #65774
Assignees
Labels
:ml Machine learning >test-failure Triaged test failures from CI

Comments

@ywangd
Member

ywangd commented Aug 26, 2020

Build scan:
https://gradle-enterprise.elastic.co/s/2kdpwkkrjc6r4

Repro line:

./gradlew ':x-pack:plugin:ml:qa:native-multi-node-tests:integTest' --tests "org.elasticsearch.xpack.ml.integration.InferenceIngestIT.testPipelineIngest" -Dtests.seed=F132D0E49ECDF4B3 -Dtests.security.manager=true -Dtests.locale=sr-ME -Dtests.timezone=Pacific/Midway -Druntime.java=8

./gradlew ':x-pack:plugin:ml:qa:native-multi-node-tests:integTest' --tests "org.elasticsearch.xpack.ml.integration.InferenceIngestIT.testPathologicalPipelineCreationAndDeletion" -Dtests.seed=F132D0E49ECDF4B3 -Dtests.security.manager=true -Dtests.locale=sr-ME -Dtests.timezone=Pacific/Midway -Druntime.java=8

Reproduces locally?:
No

Applicable branches:

  • 7.x
  • 7.9

Failure history:

These two tests seem to fail only on the 7.x branches. According to build-stats, they have failed 8 times within the last 60 days.

When they fail, the same build scan always contains `ClassificationIT` failures with a "ClusterHealthResponse has timed out" error, so the two may be related, with one being the cause of the other.

I also noticed there is a previous issue (#54786) for testPipelineIngest, but the failure message is different, so I am opening a new issue.

Failure excerpt:


java.lang.AssertionError:
Expected: a string containing "\"cache_miss_count\":3"
     but: was "{"count":1,"trained_model_stats":[{"model_id":"test_classification","pipeline_count":0,"inference_stats":{"failure_count":0,"inference_count":10,"cache_miss_count":2,"missing_all_fields_count":0,"timestamp":1598400435615}}]}"

at __randomizedtesting.SeedInfo.seed([F132D0E49ECDF4B3:536E46D26C7D13F0]:0)
  ...
  at org.elasticsearch.xpack.ml.integration.InferenceIngestIT.lambda$testPipelineIngest$1(InferenceIngestIT.java:178)
  ...
  at org.elasticsearch.xpack.ml.integration.InferenceIngestIT.testPipelineIngest(InferenceIngestIT.java:172)
  ...
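
For context, the check that trips here is an exact substring match against the serialized stats response. The following is a minimal sketch of that style of assertion, a reconstruction for illustration only and not the actual InferenceIngestIT source; statsJson stands in for the trained model stats response body shown above.

import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.containsString;

public class ExactCacheMissCheckSketch {
    // Illustrative only: an exact substring match on the serialized stats, the
    // style of check that makes an off-by-one cache_miss_count (2 vs. 3 above)
    // fail the assertion even though inference clearly ran.
    static void assertExactCacheMissCount(String statsJson, int expectedCacheMissCount) {
        assertThat(statsJson, containsString("\"cache_miss_count\":" + expectedCacheMissCount));
    }
}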


@ywangd ywangd added >test-failure Triaged test failures from CI :ml Machine learning labels Aug 26, 2020
@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

@droberts195
Contributor

This is still failing in the same way:

Expected: a string containing "\"cache_miss_count\":30"	
	     but: was "{"count":1,"trained_model_stats":[{"model_id":"test_pathological_classification","pipeline_count":0,"inference_stats":{"failure_count":0,"inference_count":10,"cache_miss_count":29,"missing_all_fields_count":0,"timestamp":1601791172638}}]}"

The build scan for that is: https://gradle-enterprise.elastic.co/s/pk6orftvu2us2

@pgomulka
Contributor

pgomulka commented Nov 3, 2020

another one https://gradle-enterprise.elastic.co/s/h5omv5t3fhc36
failed on master

@benwtrent benwtrent self-assigned this Dec 2, 2020
benwtrent added a commit that referenced this issue Dec 3, 2020
…nts (#65774)

Looking over the failure history, it is always the cache miss count that is off. This is mostly OK, as all the failures indicated that there were indeed cache misses, and every one of them was a fence-post error.

Opting to make the cache miss count check lenient, as the other stats that are checked verify consistency.

closes #61564
benwtrent added a commit to benwtrent/elasticsearch that referenced this issue Dec 3, 2020
…nts (elastic#65774)

Looking over the failure history, it is always the cache miss count that is off. This is mostly OK, as all the failures indicated that there were indeed cache misses, and every one of them was a fence-post error.

Opting to make the cache miss count check lenient, as the other stats that are checked verify consistency.

closes elastic#61564
benwtrent added a commit that referenced this issue Dec 3, 2020
…nts (#65774) (#65815)

Looking over the failure history, it is always the cache miss count that is off. This is mostly OK, as all the failures indicated that there were indeed cache misses, and every one of them was a fence-post error.

Opting to make the cache miss count check lenient, as the other stats that are checked verify consistency.

closes #61564
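
For readers skimming the fix, here is a hedged sketch of what a more lenient check could look like. It is not the actual diff in #65774; statsJson again stands in for the trained model stats response body. The exact checks on the deterministic stats are kept, while cache_miss_count only gets a lower bound, since every observed failure was off by one.

import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.containsString;
import static org.hamcrest.Matchers.greaterThanOrEqualTo;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LenientCacheMissCheckSketch {

    private static final Pattern CACHE_MISS_COUNT = Pattern.compile("\"cache_miss_count\":(\\d+)");

    // Illustrative only: keep strict substring checks on the stats that are
    // deterministic, but only require a minimum cache_miss_count instead of an
    // exact value.
    static void assertStats(String statsJson, int expectedInferenceCount, int minCacheMissCount) {
        assertThat(statsJson, containsString("\"failure_count\":0"));
        assertThat(statsJson, containsString("\"inference_count\":" + expectedInferenceCount));

        Matcher matcher = CACHE_MISS_COUNT.matcher(statsJson);
        assertThat("expected a cache_miss_count in: " + statsJson, matcher.find());
        assertThat(Integer.parseInt(matcher.group(1)), greaterThanOrEqualTo(minCacheMissCount));
    }
}

A lower bound keeps the assertion meaningful (cache misses definitely happened) without being sensitive to the fence-post behaviour called out in the commit message above.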