
[CI] InferenceIngestIT.testPipelineIngest occasionally fails #54786

Closed
dimitris-athanasiou opened this issue Apr 6, 2020 · 3 comments · Fixed by #54752 or #55163
Assignees
Labels
:ml Machine learning, >test-failure Triaged test failures from CI

Comments

@dimitris-athanasiou
Contributor

dimitris-athanasiou commented Apr 6, 2020

Jenkins: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+matrix-java-periodic/ES_RUNTIME_JAVA=zulu11,nodes=general-purpose/614/console

Build scan: https://gradle-enterprise.elastic.co/s/q4hcgpbmvcn3g

Failure:

java.lang.AssertionError: 

Expected: a string containing "\"inference_count\":10"
     but: was "{"count":1,"trained_model_stats":[{"model_id":"test_classification","pipeline_count":0,"inference_stats":{"failure_count":0,"inference_count":11,"missing_all_fields_count":0,"time_stamp":-9223372036854775808}}]}"

at __randomizedtesting.SeedInfo.seed([3274046E5719A7DD:90289258A5A9409E]:0)
at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
at org.junit.Assert.assertThat(Assert.java:956)
at org.junit.Assert.assertThat(Assert.java:923)
at org.elasticsearch.xpack.ml.integration.InferenceIngestIT.lambda$testPipelineIngest$0(InferenceIngestIT.java:135)

Reproduce with:

./gradlew ':x-pack:plugin:ml:qa:native-multi-node-tests:integTestRunner' --tests "org.elasticsearch.xpack.ml.integration.InferenceIngestIT.testPipelineIngest" -Dtests.seed=3274046E5719A7DD -Dtests.security.manager=true -Dtests.locale=es-CR -Dtests.timezone=Asia/Jayapura -Dcompiler.java=13

I couldn't reproduce locally.
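
For context, the assertion above is a Hamcrest string check against the trained model stats response: the test expects exactly 10 inferences, but the stats occasionally report 11. A minimal sketch of that kind of check, with a hypothetical class name, not the actual test code:

```java
import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.containsString;

// Hypothetical illustration of the failing check (not the actual test code): the
// trained model stats response body is expected to report exactly 10 inferences,
// but the race condition occasionally produced 11 because stats were double counted.
public class InferenceCountAssertionSketch {
    static void assertInferenceCount(String statsResponseBody) {
        assertThat(statsResponseBody, containsString("\"inference_count\":10"));
    }
}
```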

dimitris-athanasiou added the >test-failure (Triaged test failures from CI) and :ml (Machine learning) labels Apr 6, 2020
@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this issue Apr 6, 2020
benwtrent added a commit that referenced this issue Apr 13, 2020
We needlessly send documents to be persisted. If no stats have been added, we should not attempt to persist them.

Also, this PR fixes the race condition that caused issue #54786.
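
A minimal sketch of the guard described above, with hypothetical names rather than the actual Elasticsearch code: a stats document is only built and indexed when at least one counter changed since the last flush.

```java
import java.util.function.Consumer;

// Hypothetical sketch of the guard described in the commit message above;
// names are illustrative, not the actual Elasticsearch implementation.
final class StatsPersistenceGuard {
    static void maybePersist(long inferenceCount, long failureCount, long missingAllFieldsCount,
                             Consumer<String> indexStatsDocument) {
        if (inferenceCount == 0 && failureCount == 0 && missingAllFieldsCount == 0) {
            return; // no stats were added since the last flush, so skip the needless write
        }
        // Only build and index a stats document when there is something to report.
        indexStatsDocument.accept("{\"inference_count\":" + inferenceCount
            + ",\"failure_count\":" + failureCount
            + ",\"missing_all_fields_count\":" + missingAllFieldsCount + "}");
    }
}
```
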
benwtrent added a commit to benwtrent/elasticsearch that referenced this issue Apr 13, 2020
benwtrent added a commit that referenced this issue Apr 13, 2020
mark-vieira reopened this Apr 13, 2020
@mark-vieira
Contributor

I've re-muted this on master with fbda31dafb667a0842b32f3f406f668caf39daea

benwtrent added a commit that referenced this issue Apr 20, 2020
`updateAndGet` can actually call the update function more than once under contention.
The JavaDoc says:
```* @param updateFunction a side-effect-free function```
So the function may be applied multiple times under contention, creating a race condition where stats are double counted.

To fix this, I am going to use a `ReadWriteLock`. The `LongAdder` objects allow fast, thread-safe writes in high-contention environments. These can be protected by `ReadWriteLock::readLock`.

When stats are persisted, I need to call reset on all these adders. This is NOT thread safe if additions are taking place concurrently, so I am going to protect it with `ReadWriteLock::writeLock`.

This should prevent race conditions while still allowing high(ish) throughput on the highly contended paths in inference.

I did some simple throughput tests and this change is not significantly slower, and it is simpler to grok (IMO).

closes #54786
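
A minimal sketch of the pattern described in the commit message above, using hypothetical class and field names (the actual Elasticsearch implementation may differ): increments go through `LongAdder` under the shared read lock so they can proceed in parallel, while the reset-and-persist step takes the exclusive write lock so no increments are lost or double counted.

```java
import java.util.concurrent.atomic.LongAdder;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for the inference stats accumulator; not the actual
// Elasticsearch class.
class InferenceStatsAccumulatorSketch {
    private final LongAdder inferenceCount = new LongAdder();
    private final LongAdder failureCount = new LongAdder();
    private final LongAdder missingAllFieldsCount = new LongAdder();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    // Hot path: many ingest threads may increment concurrently. The read lock is
    // used only as a shared lock so increments run in parallel; LongAdder absorbs
    // the contention on the counters themselves.
    void incInference() {
        lock.readLock().lock();
        try {
            inferenceCount.increment();
        } finally {
            lock.readLock().unlock();
        }
    }

    // Persistence path: resetting the adders is not atomic with respect to
    // concurrent increments, so take the exclusive write lock while snapshotting
    // and resetting. This prevents the double counting seen in the test failure.
    long[] snapshotAndReset() {
        lock.writeLock().lock();
        try {
            return new long[] {
                inferenceCount.sumThenReset(),
                failureCount.sumThenReset(),
                missingAllFieldsCount.sumThenReset()
            };
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

Note that the read/write roles are deliberately inverted relative to the usual data-read/data-write split: increments take the shared ("read") side and the reset takes the exclusive ("write") side, which is the trade-off the commit message describes.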
benwtrent added a commit to benwtrent/elasticsearch that referenced this issue Apr 20, 2020
benwtrent added a commit that referenced this issue Apr 20, 2020