Add a simple random sampling query #25561

Closed
wants to merge 2 commits

Conversation

polyfractal
Contributor

This adds a query that randomly matches documents with a user-defined probability. The user can also specify a seed for reproducible randomness.

This query uses a trick to accelerate sampling. Instead of checking the RNG on each document, it approximates the gap to the next matching document and skips forward. This should allow Lucene to shave off a fair amount of execution time, making the collection process sub-linear instead of linear.

Gap approximation is only used for segments with >100 documents, since empirically the approximation doesn't seem to work well with small document counts.

It also skips the approximation if the probability is very high; in that case many documents are returned, so there is less benefit to "skipping" forward, and the math involved (two logs) is a bit more expensive than a plain RNG check. Probably... this has not been benchmarked. If we think it's worthwhile I can try to work up a benchmark for that. Maybe the overhead of collecting a document is high enough that gap sampling always wins?

This is the first time I've implemented a query, so hopefully I got all the bits right :) If this is accepted, I'd like to implement a Reservoir Sampling version too.
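(For context, here is a minimal standalone sketch of the gap-sampling idea; class and variable names are illustrative rather than the PR's actual code, but the gap formula mirrors the getGap method quoted later in the review: the distance to the next match is drawn from a geometric distribution instead of testing every document.)

```java
import java.util.Random;

// Sketch of geometric "gap" sampling: each doc matches with probability p, and
// instead of one RNG check per doc we draw the distance to the next match.
final class GapSamplingSketch {
    private static final double EPSILON = 1e-10; // avoid log(0) when the RNG returns 0.0

    // gap ~ Geometric(p): floor(log(u) / log(1 - p)) + 1, with u uniform in (0, 1]
    static int nextGap(Random rng, double logInverseP) {
        double u = Math.max(rng.nextDouble(), EPSILON);
        return (int) (Math.log(u) / logInverseP) + 1; // int cast == floor for non-negative values
    }

    public static void main(String[] args) {
        double p = 0.1;
        double logInverseP = Math.log(1 - p);
        Random rng = new Random(42);
        int maxDoc = 1_000_000;
        int matches = 0;
        for (int doc = nextGap(rng, logInverseP) - 1; doc < maxDoc; doc += nextGap(rng, logInverseP)) {
            matches++; // "collect" doc
        }
        System.out.printf("matched %d of %d docs (expected ~%.0f)%n", matches, maxDoc, p * maxDoc);
    }
}
```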

polyfractal added the :Search/Search, >feature, and review labels on Jul 5, 2017
polyfractal requested a review from jpountz on Jul 5, 2017
Contributor

jpountz left a comment
I left some comments.

}
doc = doc + 1;
}
return NO_MORE_DOCS;
Contributor
one issue I have with this is that the next matching document depends on the target. Would it be possible to fix it so that the set of matching documents does not depend on how the scorer was advanced?

Contributor
ie. could the random number be only a function of seed and doc?

Member
@rjernst Jul 6, 2017
And can docid be the default, but the value be optionally pulled from a field? See #25240 (comment). Otherwise there is no way to have real consistency with a seed (and IMO seed should only be allowed when a field is specified).

Contributor Author
@jpountz Not entirely sure I understand. Is the issue that the same seeded rng may match different documents depending on how many times it was invoked (due to different scorer positions)? If that's the case... does it matter? Every document has the same probability p of being matched, independent of other documents. Technically it's influenced by the prior docs due to not being a truly random generator, but the result is essentially the same from a sampling perspective.

@rjernst Are you saying to re-seed the rng on each document (using its docid or a field, etc.)? Wouldn't that be rather expensive?

Member
See RandomScoreFunction (it does not depend on the order in which docs are visited); just substitute the uid field it uses with either the docid or a user-supplied field.

Contributor Author
Gotcha, I see how it works now (not using Random at all, just hashing/mixing). Will make changes similar to @jpountz's recent PR for the random_score :)
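(A rough sketch of that hashing/mixing approach, for reference. mix32 below is the MurmurHash3 fmix32 finalizer standing in for BitMixer.mix, and the 24-bit float conversion is similar in spirit to what random_score does; this is illustrative, not the PR's code.)

```java
// Order-independent randomness: the value for a doc depends only on (seed, docId),
// never on how many times a shared Random was consumed beforehand.
final class PerDocRandom {
    // MurmurHash3 fmix32 finalizer, standing in for BitMixer.mix
    static int mix32(int h) {
        h ^= h >>> 16; h *= 0x85ebca6b;
        h ^= h >>> 13; h *= 0xc2b2ae35;
        return h ^ (h >>> 16);
    }

    /** Uniform float in [0, 1) derived only from the seed and the doc ID. */
    static float randomForDoc(int seed, int docId) {
        int hash = mix32(seed ^ docId);
        return (hash & 0x00FFFFFF) / (float) (1 << 24); // keep 24 bits -> exactly representable in [0, 1)
    }
}
```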

Contributor
@polyfractal what I meant is that the matches of a given query should be the same regardless of how the query is consumed. For instance, for a given source of randomness and seed, documents 5 and 12 should match but no other documents; then calling advance(10) or advance(11) should both return 12. However the impl seems a bit broken here since the next matching document is computed using the target.
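(To illustrate the semantics being asked for, a minimal sketch of an advance() whose matches do not depend on the target, assuming a per-doc predicate like the hash-to-float trick sketched above; the field names here are hypothetical and this is not the PR's implementation.)

```java
// Membership is a pure function of (seed, doc); advance() only scans forward
// from the target, so advance(10) and advance(11) both land on the same next match.
int advance(int target) {
    for (int d = target; d < maxDoc; d++) {
        if (PerDocRandom.randomForDoc(seed, d) < p) { // depends on d only, not on target
            return doc = d;
        }
    }
    return doc = NO_MORE_DOCS;
}
```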

assert(p > 0.0 && p <= 1.0);
this.p = p;
this.seed = seed;
LOG_INVERSE_P = Math.log(1-p);
Contributor
Assigning a static variable in a constructor looks buggy to me, maybe logInverseP should not be static?

Contributor Author
Oops, yeah. I just wanted to pre-calculate it since it doesn't change, no need for static.


static int getGap(Random rng) {
    double u = Math.max(rng.nextDouble(), EPSILON);
    return (int)(Math.floor(Math.log(u) / LOG_INVERSE_P)) + 1;
Contributor
we don't need to call floor if we cast to an int?

// Keep approximating gaps until we hit or surpass the target
while (doc <= target) {
    doc += getGap(rng);
}
Contributor
I'm slightly worried this could be slow if the percentage of docs to match is rather high and target-doc is large?

Contributor Author
There is a threshold, PROBABILITY_THRESHOLD, that is currently used to cut over to a "naive" random sample if the probability is high (e.g. if the sampler is supposed to match a large fraction of documents). In that case it falls back to standard sampling with the relatively cheaper per-document RNG check. I set it rather high (0.7) but we could set it closer to 0.5?

And when p is small, the gaps should be relatively large and I think it'll always be faster than the naive approach.

Should I work up some benchmarks for this to make sure?
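(A sketch of the cut-over decision being discussed, with hypothetical names; per the description above, the PR uses PROBABILITY_THRESHOLD = 0.7 and only gap-samples segments with >100 docs.)

```java
// Use gap sampling only when it is likely to pay off: low match probability
// (big expected gaps) and a segment large enough for the approximation to behave.
static boolean useGapSampling(double p, int maxDoc) {
    final double PROBABILITY_THRESHOLD = 0.5; // 0.7 in the PR; 0.5 floated in this thread
    final int MIN_DOCS_FOR_GAP = 100;
    return p <= PROBABILITY_THRESHOLD && maxDoc > MIN_DOCS_FOR_GAP;
}
```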

deterministic operation of the query so that the same documents are matched each
time.

This can be accomplished by adding the `seed` parameter:
Contributor
maybe the seed is a bit trappy since it only reproduces matches on a per point-in-time view of a shard. So if you go to a replica or refresh, you could match different documents?

Contributor Author
Ugh yeah, good point. I'm fine getting rid of the seed, it sounds like a mess to deal with

@polyfractal
Contributor Author

Alright, updated to use the doc's ID or a seed + field as the source of randomness. Also tweaked the tests to make them a bit more robust.

I was somewhat concerned about how the IDs generate random values, and how that interacts with the gap sampling. E.g. small random values generate large gaps, so the first document would have BitMixer.mix(context.docBase + docId) == BitMixer.mix(0 + 0) == 0 as its random value, which generates a gap of 200-something and essentially skips all small segments. I padded the mix with 1 (BitMixer.mix(context.docBase + docId + 1);) so this doesn't happen, but was still concerned about the randomness of the output.
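(For a sense of scale — my numbers, not from the PR: if the hash comes out as 0 it gets clamped to EPSILON = 1e-10, and at p = 0.1 the resulting gap is floor(ln(1e-10) / ln(0.9)) + 1 = floor(218.5) + 1 = 219, i.e. the "200-something" above, which is enough to jump clean over a small segment.)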

Naive sampling

So, I ran some tests: indices sized 1,000-1m docs, probabilities from 10-90%, and pure naive sampling (i.e. checking the RNG on each document). Here is the relative error against the theoretical values:

[screenshot: relative error vs. index size, naive sampling]

Not really surprising, the sampling accuracy improves as the index grows. Sub-1000 documents and you're looking at 5-30% error.

Gap Sampling

If we do 100% gap sampling, we get worse behavior pretty much across the board, but noticeably worse on the small indices:
[screenshot: relative error vs. index size, pure gap sampling]

50:50 mix

And if we do a 50:50 mix, where we use gap sampling when p<0.5 but naive sampling when p>0.5:

[screenshot: relative error vs. index size, 50:50 gap/naive mix]

Conclusion?

I'm not really sure there is a strong takeaway from these tests, other than to say that it appears behavior is relatively consistent. Gap sampling seems to perform better with high p, while naive sampling gives more uniform error. Mixing the two doesn't appear to change the results dramatically (it may clean up the low end, but that may just be noise).

I think the only big takeaway is that you shouldn't sample small indices but that's part of the deal with sampling :)

Contributor

jpountz left a comment
I'm curious what your use-case is and whether queries are the right entry point or whether it should be e.g. a collector.

Something that makes me a bit unhappy about both implementations is that deciding whether a document matches or not depends on how nextDoc and advance are called in sequence, which should not be the case for queries. I'm wondering whether it would be worth dropping the quality of the randomness for something that allows us to build better-behaving queries. For instance if p=0.125 we could pick a random number between 0 and 7 and consider that all doc IDs whose lower 3 bits are equal to that number match. It would be less random but implementing advance would be very efficient and you would always get the same matches regardless of how you consume the iterator. It's just an idea, I don't know much about random number generation but maybe there are better algorithms that would yield matching sets of better quality while addressing my concerns.

I don't think it is possible to address my concerns for the field-based approach since we can't predict what the next values for the field will be, but I'm wondering whether we actually need reproducibility?
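(A sketch of the bit-mask idea for p = 1/2^k, with illustrative names: membership depends only on the doc ID, and advance() can compute the next match in constant time.)

```java
// For p = 1 / 2^k: pick a residue r in [0, 2^k) once per query; a doc matches
// iff its k low bits equal r. No per-doc RNG, and no dependence on iteration order.
final class BitMaskSampleSketch {
    final int mask;    // 2^k - 1
    final int residue; // fixed low-bit pattern derived from the seed

    BitMaskSampleSketch(int k, int seed) {
        this.mask = (1 << k) - 1;
        this.residue = seed & mask;
    }

    boolean matches(int doc) {
        return (doc & mask) == residue;
    }

    /** Smallest doc >= target whose low k bits equal the residue, in O(1). */
    int advance(int target) {
        int candidate = (target & ~mask) | residue;
        return candidate >= target ? candidate : candidate + mask + 1;
    }
}
```

For example, with k = 3 and residue 4, advance(10) and advance(11) both return 12, which is the behaviour asked for earlier in the review.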

* Creates a random sampler query
*
* @param probability The probability that a document is randomly sampled
* @param seed A seed to use with the random generator
Contributor
maybe also document that the field is used as a source of randomness

private final int seed;
private final int salt;
private final IndexFieldData<?> fieldData;
private float LOG_INVERSE_P;
Contributor
instance variables should use camelcase

Contributor
can you make it final?

private static final float EPSILON = 1e-10f;

RandomSampleQuery(double p, int seed, int salt, IndexFieldData<?> fieldData) {
assert(p > 0.0 && p < 1.0);
Contributor
we should use actual exceptions instead of assertions since we are validating some user input


}
@Override
public BulkScorer bulkScorer(LeafReaderContext context) throws IOException {
Contributor
I'm wondering why you decided to override this optional API. Is this impl expected to be faster than pulling the iterator and calling next in a loop? (this is what the default impl does)

}
RandomSampleQuery other = (RandomSampleQuery)o;
return Objects.equals(this.p, other.p) &&
Objects.equals(this.seed, other.seed);
Contributor
I think we need to compare the salt and fielddata instances as well?

float getFloat(int docId) throws IOException {
    int hash;
    if (values == null) {
        hash = BitMixer.mix(context.docBase + docId + 1);
Contributor
not really wrong since we do not require things to be reproducible in that case, but I'd rather like to use context.reader().maxDoc() instead of context.docBase so that matches only depend on the current segment

@andyb-elastic
Contributor

@polyfractal should this stay open? Looks like a cool feature

@elastic/es-search-aggs

@polyfractal
Contributor Author

Missed the ping on this, sorry for the delay. I'm going to tag this as stalled, since there is some theoretical work we need to do to make the sampling play nicely with how Lucene works, or perhaps change it into a Collector.

Not sure... need to do some deep thinking on it a bit :)

@s1monw
Contributor

s1monw commented Nov 15, 2018

@polyfractal are you still working on this or wanna close?

@polyfractal
Contributor Author

Not actively working on it, I'll close. I still think it's a super useful feature, but I also think the current method either needs some serious tweaking or needs to be converted to a collector. Will open a new PR if/when I find time to revisit.

If anyone else wants to take it up please do!

@hconceic

hconceic commented Jun 27, 2019

There's a good use case for a random sampling collector. For example, in e-commerce you often want to display counters for the aggregations in the left sidebar (e.g., for query X, we have n results for this brand, m results for that other brand, etc.). When the number of results is big, queries for aggregations get prohibitively slow. As such, one common strategy is to estimate those counters based on a sample of the search result.

The problem with the above is that, if you only consider the N top docs (as with the sampler ES provides), you may lose some aggregations; e.g., if the top N results are all of the same brand, you'll only see that brand in the aggregations.

The solution, as discussed here: https://archive.nyu.edu/bitstream/2451/14953/2/USEDBOOK11.pdf, is to use a random sampler that, statistically, can include results from many brands (to continue the example).

Achieving this in Elasticsearch currently is a bit tricky and requires at least two different queries and some logic in between.

Labels
>feature, :Search/Search, stalled