
[CI] InferenceIngestIT.testPipelineIngest occasionally fails #54786

Closed
dimitris-athanasiou opened this issue Apr 6, 2020 · 3 comments · Fixed by #54752 or #55163
Assignees
Labels
:ml Machine learning, >test-failure Triaged test failures from CI

Comments

@dimitris-athanasiou
Contributor

dimitris-athanasiou commented Apr 6, 2020

Jenkins: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+matrix-java-periodic/ES_RUNTIME_JAVA=zulu11,nodes=general-purpose/614/console

Build scan: https://gradle-enterprise.elastic.co/s/q4hcgpbmvcn3g

Failure:

java.lang.AssertionError: 

Expected: a string containing "\"inference_count\":10"
     but: was "{"count":1,"trained_model_stats":[{"model_id":"test_classification","pipeline_count":0,"inference_stats":{"failure_count":0,"inference_count":11,"missing_all_fields_count":0,"time_stamp":-9223372036854775808}}]}"

at __randomizedtesting.SeedInfo.seed([3274046E5719A7DD:90289258A5A9409E]:0)
at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
at org.junit.Assert.assertThat(Assert.java:956)
at org.junit.Assert.assertThat(Assert.java:923)
at org.elasticsearch.xpack.ml.integration.InferenceIngestIT.lambda$testPipelineIngest$0(InferenceIngestIT.java:135)

Reproduce with:

./gradlew ':x-pack:plugin:ml:qa:native-multi-node-tests:integTestRunner' --tests "org.elasticsearch.xpack.ml.integration.InferenceIngestIT.testPipelineIngest" -Dtests.seed=3274046E5719A7DD -Dtests.security.manager=true -Dtests.locale=es-CR -Dtests.timezone=Asia/Jayapura -Dcompiler.java=13

I couldn't reproduce locally.
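
For context, the assertion above is a Hamcrest string check against the trained model stats response: the test expects exactly 10 inferences, but the stats occasionally report 11. A minimal sketch of that kind of check, with a hypothetical class name, not the actual test code:

```java
import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.containsString;

// Hypothetical illustration of the failing check (not the actual test code): the
// trained model stats response body is expected to report exactly 10 inferences,
// but the race condition occasionally produced 11 because stats were double counted.
public class InferenceCountAssertionSketch {
    static void assertInferenceCount(String statsResponseBody) {
        assertThat(statsResponseBody, containsString("\"inference_count\":10"));
    }
}
```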

dimitris-athanasiou added the >test-failure (Triaged test failures from CI) and :ml (Machine learning) labels Apr 6, 2020
@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this issue Apr 6, 2020
benwtrent added a commit that referenced this issue Apr 13, 2020
We needlessly send documents to be persisted. If no stats have been added, we should not attempt to persist them.

Also, this PR fixes the race condition that caused issue #54786.
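
A minimal sketch of the guard described above, with hypothetical names rather than the actual Elasticsearch code: a stats document is only built and indexed when at least one counter changed since the last flush.

```java
import java.util.function.Consumer;

// Hypothetical sketch of the guard described in the commit message above;
// names are illustrative, not the actual Elasticsearch implementation.
final class StatsPersistenceGuard {
    static void maybePersist(long inferenceCount, long failureCount, long missingAllFieldsCount,
                             Consumer<String> indexStatsDocument) {
        if (inferenceCount == 0 && failureCount == 0 && missingAllFieldsCount == 0) {
            return; // no stats were added since the last flush, so skip the needless write
        }
        // Only build and index a stats document when there is something to report.
        indexStatsDocument.accept("{\"inference_count\":" + inferenceCount
            + ",\"failure_count\":" + failureCount
            + ",\"missing_all_fields_count\":" + missingAllFieldsCount + "}");
    }
}
```
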
benwtrent added a commit to benwtrent/elasticsearch that referenced this issue Apr 13, 2020
benwtrent added a commit that referenced this issue Apr 13, 2020
mark-vieira reopened this Apr 13, 2020
@mark-vieira
Contributor

I've re-muted this on master with fbda31dafb667a0842b32f3f406f668caf39daea

benwtrent added a commit that referenced this issue Apr 20, 2020
`updateAndGet` can actually call the update function more than once under contention.
The JavaDoc says:
```* @param updateFunction a side-effect-free function```
So the function may be applied multiple times under contention, creating a race condition where stats are double counted.

To fix this, I am going to use a `ReadWriteLock`. The `LongAdder` objects allow fast, thread-safe writes in high-contention environments. These can be protected by `ReadWriteLock::readLock`.

When stats are persisted, I need to call reset on all these adders. This is NOT thread safe if additions are taking place concurrently, so I am going to protect it with `ReadWriteLock::writeLock`.

This should prevent race conditions while still allowing high(ish) throughput on the highly contended paths in inference.

I did some simple throughput tests and this change is not significantly slower, and it is simpler to grok (IMO).

closes #54786
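
A minimal sketch of the pattern described in the commit message above, using hypothetical class and field names (the actual Elasticsearch implementation may differ): increments go through `LongAdder` under the shared read lock so they can proceed in parallel, while the reset-and-persist step takes the exclusive write lock so no increments are lost or double counted.

```java
import java.util.concurrent.atomic.LongAdder;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for the inference stats accumulator; not the actual
// Elasticsearch class.
class InferenceStatsAccumulatorSketch {
    private final LongAdder inferenceCount = new LongAdder();
    private final LongAdder failureCount = new LongAdder();
    private final LongAdder missingAllFieldsCount = new LongAdder();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    // Hot path: many ingest threads may increment concurrently. The read lock is
    // used only as a shared lock so increments run in parallel; LongAdder absorbs
    // the contention on the counters themselves.
    void incInference() {
        lock.readLock().lock();
        try {
            inferenceCount.increment();
        } finally {
            lock.readLock().unlock();
        }
    }

    // Persistence path: resetting the adders is not atomic with respect to
    // concurrent increments, so take the exclusive write lock while snapshotting
    // and resetting. This prevents the double counting seen in the test failure.
    long[] snapshotAndReset() {
        lock.writeLock().lock();
        try {
            return new long[] {
                inferenceCount.sumThenReset(),
                failureCount.sumThenReset(),
                missingAllFieldsCount.sumThenReset()
            };
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

Note that the read/write roles are deliberately inverted relative to the usual data-read/data-write split: increments take the shared ("read") side and the reset takes the exclusive ("write") side, which is the trade-off the commit message describes.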
benwtrent added a commit to benwtrent/elasticsearch that referenced this issue Apr 20, 2020
benwtrent added a commit that referenced this issue Apr 20, 2020