ES|QL SAMPLE aggregation function #127629

Merged 16 commits on May 8, 2025

Conversation

jan-elastic
Contributor

No description provided.

@elasticsearchmachine elasticsearchmachine added the needs:triage (Requires assignment of a team area label) and v9.1.0 labels May 2, 2025
@jan-elastic jan-elastic force-pushed the esql-sample-agg-2 branch from 55fe96f to 62de767 Compare May 2, 2025 13:23
@jan-elastic jan-elastic added the >feature, :ml (Machine learning), and Team:ML (Meta label for the ML team) labels May 2, 2025
@elasticsearchmachine elasticsearchmachine removed the needs:triage (Requires assignment of a team area label) label May 2, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@elasticsearchmachine
Collaborator

Hi @jan-elastic, I've created a changelog YAML for you.

@jan-elastic jan-elastic requested a review from alex-spies May 2, 2025 15:36
"version" },
description = "Collects sample values for a field.",
type = FunctionType.AGGREGATE,
examples = @Example(file = "stats_sample", tag = "doc")
Member

I think we should have an example of the output in the docs. I'm not entirely sure the right way to hack that one up because it's non-deterministic. Maybe it's hand rolled.

I think we want that example because my first question when reading this is "can I get duplicates or do those count as distinct samples?" Mostly because I'm not good at statistics.

I do think it's interesting that SAMPLE(bool) is strictly more work than VALUES(bool). It feels like sampling shouldn't be, but it makes some sense.

Contributor Author

I think it makes sense that SAMPLE(bool) is more work. VALUES(bool) just keeps track of two boolean values: does true exist and does false exist. SAMPLE(bool) does more.
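
A rough sketch of that difference, not the code in this PR: `BoolValues` needs only two flags, while a sampler has to hold up to `limit` entries plus a counter and a random source. `BoolSample` uses textbook reservoir sampling (Algorithm R) as a stand-in, and it assumes duplicates are kept as separate samples. The class names, the algorithm choice, and the duplicate handling are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SplittableRandom;

public class BoolValuesVsSample {

    /** VALUES-style state: two flags suffice, however many rows arrive. */
    static final class BoolValues {
        private boolean seenFalse;
        private boolean seenTrue;

        void add(boolean value) {
            if (value) {
                seenTrue = true;
            } else {
                seenFalse = true;
            }
        }

        List<Boolean> result() {
            List<Boolean> out = new ArrayList<>();
            if (seenFalse) out.add(false);
            if (seenTrue) out.add(true);
            return out;
        }
    }

    /** SAMPLE-style state (hypothetical): a bounded reservoir that keeps duplicates. */
    static final class BoolSample {
        private final boolean[] reservoir;
        private final SplittableRandom random;
        private long seen; // number of values offered so far

        BoolSample(int limit, long seed) {
            this.reservoir = new boolean[limit];
            this.random = new SplittableRandom(seed);
        }

        void add(boolean value) {
            if (seen < reservoir.length) {
                // Still filling the reservoir: keep every value.
                reservoir[(int) seen] = value;
            } else {
                // Algorithm R: pick a slot in [0, seen]; replace only if it falls inside the reservoir.
                long slot = random.nextLong(seen + 1);
                if (slot < reservoir.length) {
                    reservoir[(int) slot] = value;
                }
            }
            seen++;
        }

        List<Boolean> result() {
            int size = (int) Math.min(seen, reservoir.length);
            List<Boolean> out = new ArrayList<>(size);
            for (int i = 0; i < size; i++) out.add(reservoir[i]);
            return out;
        }
    }

    public static void main(String[] args) {
        BoolValues values = new BoolValues();
        BoolSample sample = new BoolSample(3, 42L);
        for (int i = 0; i < 5; i++) {
            values.add(true);
            sample.add(true);
        }
        System.out.println(values.result()); // [true]
        System.out.println(sample.result()); // [true, true, true]
    }
}
```

Under these assumptions the sampler does per-row random draws and stores repeated values, which is where the extra work relative to the two-flag state comes from.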

Contributor Author

@jan-elastic jan-elastic May 6, 2025

Obv, I prefer some example output too. I didn't know how to achieve that, but I'll think of something. Should've left a TODO.

Contributor Author

OK, I've hacked up something. Not particularly proud of it, but it gets the job done.

@jan-elastic jan-elastic force-pushed the esql-sample-agg-2 branch 2 times, most recently from 717536b to 3c4bae7 Compare May 6, 2025 10:42
this.breaker = bigArrays.breakerService().getBreaker(CircuitBreaker.REQUEST);
this.sort = new BytesRefBucketedSort(breaker, "sample", bigArrays, SortOrder.ASC, limit);
this.bytesRefBuilder = new BreakingBytesRefBuilder(breaker, "sample");
this.random = new SplittableRandom();
Contributor Author

Some notes on this SplittableRandom:

  • If I replace it by Random, I get the precommit error

Forbidden method invocation: java.util.Random#<init>() [Use org.elasticsearch.common.Randomness#get for reproducible sources of randomness]

Using SplittableRandom instead works around this by not being on the blacklist, but that's not in the spirit of what's intended.

  • If I replace it by Randomness.get() in the constructor, I get:

java.lang.IllegalStateException: This Random was created for/by another thread (Thread[#39,TEST-SampleLongAggregatorFunctionTests.testManyInitialManyPartialFinalRunner-seed#[B3B51719A90700AD],5,TGRP-SampleLongAggregatorFunctionTests]). Random instances must not be shared (acquire per-thread). Current thread: Thread[#51,elasticsearch[test][esql_test_executor][T#1],5,TGRP-SampleLongAggregatorFunctionTests]

Even though the Aggregator is used on a single thread (I hope; otherwise there are more issues), it's created on a different thread than the one that's actively using it.

  • If I replace it by Randomness.get() inside the add method, the test SampleLongAggregatorFunctionTests::testDistribution fails. It looks like each iteration instantiates the same random generator (same seed), leading to the statistics being completely wrong.

I'm still looking into these issues. If you have any thoughts, let me know.
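
The failure mode suspected in the last bullet can be reproduced outside Elasticsearch. This is a minimal sketch using `java.util.SplittableRandom` with a fixed seed as a stand-in; whether `Randomness.get()` actually hands back a same-seeded generator on every call in that test is the hypothesis above, not something this snippet proves. The class and method names are made up for the demo.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.SplittableRandom;

public class SeedReuseDemo {

    /** Creates a fresh generator with the same seed for every draw. */
    static int distinctWhenReseeding(long seed, int draws) {
        Set<Long> seen = new HashSet<>();
        for (int i = 0; i < draws; i++) {
            // Same seed each time, so the first value is always the same.
            seen.add(new SplittableRandom(seed).nextLong());
        }
        return seen.size();
    }

    /** Creates one generator and keeps drawing from it. */
    static int distinctWhenReusing(long seed, int draws) {
        SplittableRandom random = new SplittableRandom(seed);
        Set<Long> seen = new HashSet<>();
        for (int i = 0; i < draws; i++) {
            seen.add(random.nextLong());
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println("reseeding: " + distinctWhenReseeding(42L, 1000) + " distinct value(s)");
        System.out.println("reusing:   " + distinctWhenReusing(42L, 1000) + " distinct value(s)");
    }
}
```

With reseeding, every "random" draw is the same number, so any sampling decision based on it is no longer random and a distribution test would be expected to fail.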

Member

OK! The Lucene stuff has stuff like:

        if (Thread.currentThread() != prevThread) {
            prevThread = Thread.currentThread();
            random = Randomness.get();
        }

That might do.
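
For reference, a self-contained sketch of that re-acquire-on-thread-change pattern. `ThreadBoundRandom` is a hypothetical helper, and the `Supplier<Random>` stands in for `Randomness.get()` so the snippet runs without Elasticsearch on the classpath; the demo passes `ThreadLocalRandom::current` only because it is a per-thread source available in the JDK.

```java
import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

public class ThreadBoundRandom {
    private final Supplier<Random> source; // stand-in for Randomness.get()
    private Thread prevThread;
    private Random random;
    private int acquisitions; // only for the demo below

    public ThreadBoundRandom(Supplier<Random> source) {
        this.source = source;
    }

    /** Returns a Random acquired on the calling thread, re-acquiring after a hand-off. */
    public Random current() {
        Thread now = Thread.currentThread();
        if (now != prevThread) {
            prevThread = now;
            random = source.get();
            acquisitions++;
        }
        return random;
    }

    public int acquisitions() {
        return acquisitions;
    }

    /** Uses one holder from two threads in sequence and reports how often it re-acquired. */
    static int acquisitionsAcrossTwoThreads() {
        ThreadBoundRandom holder = new ThreadBoundRandom(ThreadLocalRandom::current);
        holder.current().nextInt();
        holder.current().nextInt(); // same thread, no re-acquire
        Thread worker = new Thread(() -> holder.current().nextInt());
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return holder.acquisitions();
    }

    public static void main(String[] args) {
        System.out.println(acquisitionsAcrossTwoThreads()); // 2
    }
}
```

The holder is not thread-safe. It only covers the case described above, where an object is created on one thread and then used by a single other thread.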

Contributor Author

Not sure what you exactly mean by this. Where's this stuff exactly?

Contributor Author

I also whipped up a different fix. Let me know what you think...

@jan-elastic jan-elastic force-pushed the esql-sample-agg-2 branch from 3c4bae7 to d46b18f Compare May 6, 2025 11:21
@jan-elastic jan-elastic requested a review from nik9000 May 6, 2025 12:20
@jan-elastic jan-elastic force-pushed the esql-sample-agg-2 branch from d46b18f to 520087d Compare May 6, 2025 14:51
@jan-elastic jan-elastic requested review from ivancea and removed request for alex-spies May 7, 2025 08:17
Contributor

@ivancea ivancea left a comment

LGTM! Added some questions and things to check 👀

@jan-elastic jan-elastic force-pushed the esql-sample-agg-2 branch from 520087d to 940b6f4 Compare May 7, 2025 12:20
@jan-elastic jan-elastic requested a review from ivancea May 7, 2025 12:26
@jan-elastic jan-elastic added the ES|QL-ui (Impacts ES|QL UI) label May 7, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/kibana-esql (ES|QL-ui)

Contributor

@ivancea ivancea left a comment

🚀

@jan-elastic jan-elastic force-pushed the esql-sample-agg-2 branch from 865c90c to c1dc208 Compare May 7, 2025 14:13
@jan-elastic jan-elastic requested a review from a team as a code owner May 7, 2025 14:13
@jan-elastic jan-elastic merged commit 9cf2a64 into elastic:main May 8, 2025
17 checks passed
@jan-elastic jan-elastic deleted the esql-sample-agg-2 branch May 8, 2025 06:02
ywangd pushed a commit to ywangd/elasticsearch that referenced this pull request May 9, 2025
* ES|QL SAMPLE aggregation function

* [CI] Auto commit changes from spotless

* ThreadLocalRandom -> SplittableRandom

* Update docs/changelog/127629.yaml

* fix yaml test

* Add SampleTests

* docs + example

* polish code

* mark generated imports

* comment with algorith description

* use Randomness.get()

* close properly

* type checks

* reuse hash

* regen some files

* [CI] Auto commit changes from spotless

---------

Co-authored-by: elasticsearchmachine <[email protected]>
jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request May 12, 2025
Labels
ES|QL-ui (Impacts ES|QL UI), >feature, :ml (Machine learning), Team:ML (Meta label for the ML team), v9.1.0
4 participants