Add a simple random sampling query #25561

Closed
wants to merge 2 commits

Conversation

polyfractal
Contributor

This adds a query that randomly matches documents with a user-defined probability. The user can also specify a seed for reproducible randomness.

This query uses a trick to accelerate sampling. Instead of checking the RNG on each document, it approximates the gap to the next matching document and skips forward. This should allow Lucene to shave off a fair amount of execution time, making the collection process sub-linear instead of linear.

Gap approximation is only used for segments with >100 documents, since empirically the approximation doesn't seem to work well with small document counts.

It also skips the approximation if the probability is very high; in that case many documents are returned, so there is less benefit to "skipping" forward, and the math involved (two logs) is a bit more expensive than a plain RNG check. Probably... this has not been benchmarked. If we think it's worthwhile I can try to work up a benchmark for that. Maybe the overhead of collecting a document is high enough that gap sampling always wins?

This is the first time I've implemented a query, so hopefully I got all the bits right :) If this is accepted, I'd like to implement a Reservoir Sampling version too.
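(For context, here is a minimal standalone sketch of the gap-sampling idea; class and variable names are illustrative rather than the PR's actual code, but the gap formula mirrors the getGap method quoted later in the review: the distance to the next match is drawn from a geometric distribution instead of testing every document.)

```java
import java.util.Random;

// Sketch of geometric "gap" sampling: each doc matches with probability p, and
// instead of one RNG check per doc we draw the distance to the next match.
final class GapSamplingSketch {
    private static final double EPSILON = 1e-10; // avoid log(0) when the RNG returns 0.0

    // gap ~ Geometric(p): floor(log(u) / log(1 - p)) + 1, with u uniform in (0, 1]
    static int nextGap(Random rng, double logInverseP) {
        double u = Math.max(rng.nextDouble(), EPSILON);
        return (int) (Math.log(u) / logInverseP) + 1; // int cast == floor for non-negative values
    }

    public static void main(String[] args) {
        double p = 0.1;
        double logInverseP = Math.log(1 - p);
        Random rng = new Random(42);
        int maxDoc = 1_000_000;
        int matches = 0;
        for (int doc = nextGap(rng, logInverseP) - 1; doc < maxDoc; doc += nextGap(rng, logInverseP)) {
            matches++; // "collect" doc
        }
        System.out.printf("matched %d of %d docs (expected ~%.0f)%n", matches, maxDoc, p * maxDoc);
    }
}
```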

polyfractal added the :Search/Search, >feature, and review labels on Jul 5, 2017
polyfractal requested a review from jpountz on Jul 5, 2017
Contributor

jpountz left a comment
I left some comments.

}
doc = doc + 1;
}
return NO_MORE_DOCS;
Contributor
one issue I have with this is that the next matching document depends on the target. Would it be possible to fix it so that the set of matching documents does not depend on how the scorer was advanced?

Contributor
ie. could the random number be only a function of seed and doc?

Member
@rjernst Jul 6, 2017
And can docid be the default, but the value be optionally pulled from a field? See #25240 (comment). Otherwise there is no way to have real consistency with a seed (and IMO seed should only be allowed when a field is specified).

Contributor Author
@jpountz Not entirely sure I understand. Is the issue that the same seeded rng may match different documents depending on how many times it was invoked (due to different scorer positions)? If that's the case... does it matter? Every document has the same probability p of being matched, independent of other documents. Technically it's influenced by the prior docs due to not being a truly random generator, but the result is essentially the same from a sampling perspective.

@rjernst Are you saying to re-seed the rng on each document (using its docid or a field, etc.)? Wouldn't that be rather expensive?

Member
See RandomScoreFunction (it does not depend on the order in which docs are visited); just substitute the uid field it uses with either the docid or a user-supplied field.

Contributor Author
Gotcha, I see how it works now (not using Random at all, just hashing/mixing). Will make changes similar to @jpountz's recent PR for the random_score :)
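(A rough sketch of that hashing/mixing approach, for reference. mix32 below is the MurmurHash3 fmix32 finalizer standing in for BitMixer.mix, and the 24-bit float conversion is similar in spirit to what random_score does; this is illustrative, not the PR's code.)

```java
// Order-independent randomness: the value for a doc depends only on (seed, docId),
// never on how many times a shared Random was consumed beforehand.
final class PerDocRandom {
    // MurmurHash3 fmix32 finalizer, standing in for BitMixer.mix
    static int mix32(int h) {
        h ^= h >>> 16; h *= 0x85ebca6b;
        h ^= h >>> 13; h *= 0xc2b2ae35;
        return h ^ (h >>> 16);
    }

    /** Uniform float in [0, 1) derived only from the seed and the doc ID. */
    static float randomForDoc(int seed, int docId) {
        int hash = mix32(seed ^ docId);
        return (hash & 0x00FFFFFF) / (float) (1 << 24); // keep 24 bits -> exactly representable in [0, 1)
    }
}
```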

Contributor
@polyfractal what I meant is that the matches of a given query should be the same regardless of how the query is consumed. For instance, for a given source of randomness and seed, documents 5 and 12 should match but no other documents; then calling advance(10) or advance(11) should both return 12. However the impl seems a bit broken here since the next matching document is computed using the target.
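(To illustrate the semantics being asked for, a minimal sketch of an advance() whose matches do not depend on the target, assuming a per-doc predicate like the hash-to-float trick sketched above; the field names here are hypothetical and this is not the PR's implementation.)

```java
// Membership is a pure function of (seed, doc); advance() only scans forward
// from the target, so advance(10) and advance(11) both land on the same next match.
int advance(int target) {
    for (int d = target; d < maxDoc; d++) {
        if (PerDocRandom.randomForDoc(seed, d) < p) { // depends on d only, not on target
            return doc = d;
        }
    }
    return doc = NO_MORE_DOCS;
}
```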

assert(p > 0.0 && p <= 1.0);
this.p = p;
this.seed = seed;
LOG_INVERSE_P = Math.log(1-p);
Contributor
Assigning a static variable in a constructor looks buggy to me, maybe logInverseP should not be static?

Contributor Author
Oops, yeah. I just wanted to pre-calculate it since it doesn't change, no need for static.


static int getGap(Random rng) {
    double u = Math.max(rng.nextDouble(), EPSILON);
    return (int)(Math.floor(Math.log(u) / LOG_INVERSE_P)) + 1;
Contributor
we don't need to call floor if we cast to an int?

// Keep approximating gaps until we hit or surpass the target
while (doc <= target) {
    doc += getGap(rng);
}
Contributor
I'm slightly worried this could be slow if the percentage of docs to match is rather high and target-doc is large?

Contributor Author
There is a threshold, PROBABILITY_THRESHOLD, that is currently used to cut over to a "naive" random sample if the probability is high (e.g. if the sampler is supposed to match a large fraction of documents). In that case it falls back to standard sampling with the relatively cheaper per-document RNG check. I set it rather high (0.7) but we could set it closer to 0.5?

And when p is small, the gaps should be relatively large and I think it'll always be faster than the naive approach.

Should I work up some benchmarks for this to make sure?
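(A sketch of the cut-over decision being discussed, with hypothetical names; per the description above, the PR uses PROBABILITY_THRESHOLD = 0.7 and only gap-samples segments with >100 docs.)

```java
// Use gap sampling only when it is likely to pay off: low match probability
// (big expected gaps) and a segment large enough for the approximation to behave.
static boolean useGapSampling(double p, int maxDoc) {
    final double PROBABILITY_THRESHOLD = 0.5; // 0.7 in the PR; 0.5 floated in this thread
    final int MIN_DOCS_FOR_GAP = 100;
    return p <= PROBABILITY_THRESHOLD && maxDoc > MIN_DOCS_FOR_GAP;
}
```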

deterministic operation of the query so that the same documents are matched each
time.

This can be accomplished by adding the `seed` parameter:
Contributor
maybe the seed is a bit trappy since it only reproduces matches on a per point-in-time view of a shard. So if you go to a replica or refresh, you could match different documents?

Contributor Author
Ugh yeah, good point. I'm fine getting rid of the seed, it sounds like a mess to deal with

@polyfractal
Contributor Author

Alright, updated to use the doc's ID or a seed + field as the source of randomness. Also tweaked the tests to make them a bit more robust.

I was somewhat concerned about how the IDs generate random values, and how that interacts with the gap sampling. E.g. small random values generate large gaps, so the first document would have BitMixer.mix(context.docBase + docId) == BitMixer.mix(0 + 0) == 0 as its random value, which generates a gap of 200-something and essentially skips all small segments. I padded the mix with 1 (BitMixer.mix(context.docBase + docId + 1);) so this doesn't happen, but was still concerned about the randomness of the output.
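(For a sense of scale — my numbers, not from the PR: if the hash comes out as 0 it gets clamped to EPSILON = 1e-10, and at p = 0.1 the resulting gap is floor(ln(1e-10) / ln(0.9)) + 1 = floor(218.5) + 1 = 219, i.e. the "200-something" above, which is enough to jump clean over a small segment.)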

Naive sampling

So, I ran some tests: indices sized 1,000-1m docs, probabilities from 10-90%, and pure naive sampling (i.e. checking the RNG on each document). Here is the relative error against the theoretical values:

[screenshot: relative error vs. index size, naive sampling]

Not really surprising, the sampling accuracy improves as the index grows. Sub-1000 documents and you're looking at 5-30% error.

Gap Sampling

If we do 100% gap sampling, we get worse behavior pretty much across the board, but noticeably worse on the small indices:
[screenshot: relative error vs. index size, pure gap sampling]

50:50 mix

And if we do a 50:50 mix, where we use gap sampling when p<0.5 but naive sampling when p>0.5:

[screenshot: relative error vs. index size, 50:50 gap/naive mix]

Conclusion?

I'm not really sure there is a strong takeaway from these tests, other than to say that it appears behavior is relatively consistent. Gap sampling seems to perform better with high p, while naive sampling gives more uniform error. Mixing the two doesn't appear to change the results dramatically (it may clean up the low end, but that may just be noise).

I think the only big takeaway is that you shouldn't sample small indices but that's part of the deal with sampling :)

Contributor

jpountz left a comment
I'm curious what your use-case is and whether queries are the right entry point or whether it should be e.g. a collector.

Something that makes me a bit unhappy about both implementations is that deciding whether a document matches or not depends on how nextDoc and advance are called in sequence, which should not be the case for queries. I'm wondering whether it would be worth dropping the quality of the randomness for something that allows us to build better-behaving queries. For instance if p=0.125 we could pick a random number between 0 and 7 and consider that all doc IDs whose lower 3 bits are equal to that number match. It would be less random but implementing advance would be very efficient and you would always get the same matches regardless of how you consume the iterator. It's just an idea, I don't know much about random number generation but maybe there are better algorithms that would yield matching sets of better quality while addressing my concerns.

I don't think it is possible to address my concerns for the field-based approach since we can't predict what the next values for the field will be, but I'm wondering whether we actually need reproducibility?
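(A sketch of the bit-mask idea for p = 1/2^k, with illustrative names: membership depends only on the doc ID, and advance() can compute the next match in constant time.)

```java
// For p = 1 / 2^k: pick a residue r in [0, 2^k) once per query; a doc matches
// iff its k low bits equal r. No per-doc RNG, and no dependence on iteration order.
final class BitMaskSampleSketch {
    final int mask;    // 2^k - 1
    final int residue; // fixed low-bit pattern derived from the seed

    BitMaskSampleSketch(int k, int seed) {
        this.mask = (1 << k) - 1;
        this.residue = seed & mask;
    }

    boolean matches(int doc) {
        return (doc & mask) == residue;
    }

    /** Smallest doc >= target whose low k bits equal the residue, in O(1). */
    int advance(int target) {
        int candidate = (target & ~mask) | residue;
        return candidate >= target ? candidate : candidate + mask + 1;
    }
}
```

For example, with k = 3 and residue 4, advance(10) and advance(11) both return 12, which is the behaviour asked for earlier in the review.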

* Creates a random sampler query
*
* @param probability The probability that a document is randomly sampled
* @param seed A seed to use with the random generator
Contributor
maybe also document that the field is used as a source of randomness

private final int seed;
private final int salt;
private final IndexFieldData<?> fieldData;
private float LOG_INVERSE_P;
Contributor
instance variables should use camelcase

Contributor
can you make it final?

private static final float EPSILON = 1e-10f;

RandomSampleQuery(double p, int seed, int salt, IndexFieldData<?> fieldData) {
assert(p > 0.0 && p < 1.0);
Contributor
we should use actual exceptions instead of assertions since we are validating some user input


}
@Override
public BulkScorer bulkScorer(LeafReaderContext context) throws IOException {
Contributor
I'm wondering why you decided to override this optional API. Is this impl expected to be faster than pulling the iterator and calling next in a loop? (this is what the default impl does)

}
RandomSampleQuery other = (RandomSampleQuery)o;
return Objects.equals(this.p, other.p) &&
Objects.equals(this.seed, other.seed);
Contributor
I think we need to compare the salt and fielddata instances as well?

float getFloat(int docId) throws IOException {
    int hash;
    if (values == null) {
        hash = BitMixer.mix(context.docBase + docId + 1);
Contributor
not really wrong since we do not require things to be reproducible in that case, but I'd rather like to use context.reader().maxDoc() instead of context.docBase so that matches only depend on the current segment

@andyb-elastic
Contributor

@polyfractal should this stay open? Looks like a cool feature

@elastic/es-search-aggs

@polyfractal
Contributor Author

Missed the ping on this, sorry for the delay. I'm going to tag this as stalled, since there is some theoretical work we need to do to make the sampling play nicely with how Lucene works, or perhaps change it into a Collector.

Not sure... need to do some deep thinking on it a bit :)

@s1monw
Contributor

s1monw commented Nov 15, 2018

@polyfractal are you still working on this or wanna close?

@polyfractal
Contributor Author

Not actively working on it, I'll close. I still think it's a super useful feature, but I also think the current method either needs some serious tweaking or needs to be converted to a collector. Will open a new PR if/when I find time to revisit.

If anyone else wants to take it up please do!

@hconceic

hconceic commented Jun 27, 2019

There's a good use case for a random sampling collector. For example, in e-commerce you often want to display counters for the aggregations in the left sidebar (e.g., for query X, we have n results for this brand, m results for that other brand, etc.). When the number of results is big, queries for aggregations get prohibitively slow. As such, one common strategy is to estimate those counters based on a sample of the search result.

The problem with the above is that, if you only consider the N top docs (as with the sampler ES provides), you may lose some aggregations; e.g., if the top N results are all of the same brand, you'll only see that brand in the aggregations.

The solution, as discussed here: https://archive.nyu.edu/bitstream/2451/14953/2/USEDBOOK11.pdf, is to use a random sampler that, statistically, can include results from many brands (to continue the example).

Achieving this in Elasticsearch currently is a bit tricky and requires at least two different queries and some logic in between.

Labels
>feature, :Search/Search, stalled