-
Notifications
You must be signed in to change notification settings - Fork 25.2k
ES|QL SAMPLE aggregation function #127629
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
55fe96f
to
62de767
Compare
Pinging @elastic/ml-core (Team:ML) |
Hi @jan-elastic, I've created a changelog YAML for you. |
...te/src/main/generated-src/org/elasticsearch/compute/aggregation/SampleBooleanAggregator.java
Outdated
Show resolved
Hide resolved
"version" }, | ||
description = "Collects sample values for a field.", | ||
type = FunctionType.AGGREGATE, | ||
examples = @Example(file = "stats_sample", tag = "doc") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should have an example of the output in the docs. I'm not entirely sure the right way to hack that one up because it's non-deterministic. Maybe it's hand rolled.
I think we want that example because my first question when reading this is "can I get duplicates or do those count as distinct samples?" Mostly because I'm not good at statistics.
I do think it's interesting that SAMPLE(bool)
is strictly more work than VALUES(bool)
. It feels like sampling shouldn't be, but it makes some sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes sense that SAMPLE(bool)
is more work. VALUES(bool)
just keeps track of two boolean values: does true
exist and does false
exist. SAMPLE(bool)
does more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Obv, I prefer some example output too. I didn't know how to achieve that, but I'll think of something. Should've left a TODO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I've hacked up something. Not particularly proud of it, but it gets the job done.
717536b
to
3c4bae7
Compare
this.breaker = bigArrays.breakerService().getBreaker(CircuitBreaker.REQUEST); | ||
this.sort = new BytesRefBucketedSort(breaker, "sample", bigArrays, SortOrder.ASC, limit); | ||
this.bytesRefBuilder = new BreakingBytesRefBuilder(breaker, "sample"); | ||
this.random = new SplittableRandom(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some notes on this SplittableRandom
:
- If I replace it by
Random
, I get the precommit error
Forbidden method invocation: java.util.Random#() [Use org.elasticsearch.common.Randomness#get for reproducible sources of randomness]
Using SplittableRandom
instead works around this by not being on the blacklist, but that's not in the spirit of what's intended.
- If I replace it by
Randomness.get()
in the constructor, I get:
java.lang.IllegalStateException: This Random was created for/by another thread (Thread[#39,TEST-SampleLongAggregatorFunctionTests.testManyInitialManyPartialFinalRunner-seed#[B3B51719A90700AD],5,TGRP-SampleLongAggregatorFunctionTests]). Random instances must not be shared (acquire per-thread). Current thread: Thread[#51,elasticsearch[test][esql_test_executor][T#1],5,TGRP-SampleLongAggregatorFunctionTests]
Even though the Aggregator
is used on a single thread (I hope; otherwise there are more issues), it's created on a different thread then the thread that's actively using it.
- If I replace it by
Randomness.get()
inside theadd
method, the testSampleLongAggregatorFunctionTests::testDistribution
fails. It looks like each iteration instantiates the same random generator (same seed), leading to the statistics being completely wrong.
I'm still looking into these issues. If you have any thoughts, let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK! The Lucene stuff has stuff like:
if (Thread.currentThread() != prevThread) {
prevThread = Thread.currentThread();
random = Randomness.get();
}
That might do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what you exactly mean by this. Where's this stuff exactly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also whipped up a different fix. Let me know what you think...
3c4bae7
to
d46b18f
Compare
d46b18f
to
520087d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Added some questions and things to check 👀
...in/esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/aggregate/Sample.java
Outdated
Show resolved
Hide resolved
...in/esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/aggregate/Sample.java
Outdated
Show resolved
Hide resolved
...c/test/java/org/elasticsearch/compute/aggregation/SampleBytesRefAggregatorFunctionTests.java
Show resolved
Hide resolved
.../esql/compute/src/main/java/org/elasticsearch/compute/aggregation/X-SampleAggregator.java.st
Show resolved
Hide resolved
520087d
to
940b6f4
Compare
Pinging @elastic/kibana-esql (ES|QL-ui) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚀
test/framework/src/main/java/org/elasticsearch/test/MixWithIncrement.java
Outdated
Show resolved
Hide resolved
865c90c
to
c1dc208
Compare
.../esql/compute/src/main/java/org/elasticsearch/compute/aggregation/X-SampleAggregator.java.st
Show resolved
Hide resolved
* ES|QL SAMPLE aggregation function * [CI] Auto commit changes from spotless * ThreadLocalRandom -> SplittableRandom * Update docs/changelog/127629.yaml * fix yaml test * Add SampleTests * docs + example * polish code * mark generated imports * comment with algorith description * use Randomness.get() * close properly * type checks * reuse hash * regen some files * [CI] Auto commit changes from spotless --------- Co-authored-by: elasticsearchmachine <[email protected]>
* ES|QL SAMPLE aggregation function * [CI] Auto commit changes from spotless * ThreadLocalRandom -> SplittableRandom * Update docs/changelog/127629.yaml * fix yaml test * Add SampleTests * docs + example * polish code * mark generated imports * comment with algorith description * use Randomness.get() * close properly * type checks * reuse hash * regen some files * [CI] Auto commit changes from spotless --------- Co-authored-by: elasticsearchmachine <[email protected]>
No description provided.