Add a simple random sampling query #25561
Conversation
This adds a query that randomly matches documents with a user-defined probability. The user can also specify a seed for reproducible randomness.

This query uses a trick to accelerate sampling. Instead of checking the rng on each document, it approximates the gap between documents and skips forward. This should allow Lucene to shave off a fair amount of execution time, making the collection process sub-linear instead of linear.

Gap approximation is only used for segments with >100 documents, since empirically approximation doesn't seem to work well with small document counts.

It also skips approximating if the probability is very high; in that case many documents are returned, so there is less benefit to "skipping" forward, and the math involved (two logs) is a bit more expensive than just the RNG. Probably... this has not been benchmarked. If we think it's worthwhile I can try to work up a benchmark for that. Maybe the overhead of collecting the document is high enough to always do gap sampling?

This is the first time I've implemented a query, so hopefully I got all the bits right :) If this is accepted, I'd like to implement a Reservoir Sampling version too.
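For reference, a minimal sketch of the gap-sampling trick described above (class, field, and method names here are illustrative only; the PR's actual code differs):

```java
import java.util.Random;

/**
 * Sketch of geometric "gap" sampling: instead of testing the RNG once per doc,
 * draw the distance to the next sampled doc from a geometric distribution
 * with success probability p.
 */
class GapSamplerSketch {
    private static final double EPSILON = 1e-10;
    private final double logInverseP; // log(1 - p), precomputed once
    private final Random rng;

    GapSamplerSketch(double p, long seed) {
        if (p <= 0.0 || p >= 1.0) {
            throw new IllegalArgumentException("p must be in (0, 1)");
        }
        this.logInverseP = Math.log(1 - p);
        this.rng = new Random(seed);
    }

    /** Distance to the next sampled doc, always >= 1. */
    int nextGap() {
        double u = Math.max(rng.nextDouble(), EPSILON); // guard against log(0)
        return (int) (Math.log(u) / logInverseP) + 1;
    }
}
```

A consumer advances with `doc += nextGap()` rather than testing every doc, which is where the sub-linear behaviour comes from.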
I left some comments.
}
doc = doc + 1;
}
return NO_MORE_DOCS;
one issue I have with this is that the next matching document depends on the target. Would it be possible to fix it so that the set of matching documents does not depend on how the scorer was advanced?
i.e. could the random number be only a function of `seed` and `doc`?
And can docid be the default, but the value be optionally pulled from a field? See #25240 (comment). Otherwise there is no way to have real consistency with a seed (and IMO seed should only be allowed when a field is specified).
@jpountz Not entirely sure I understand. Is the issue that the same seeded rng may match different documents depending on how many times it was invoked (due to different scorer positions)? If that's the case... does it matter? Every document has the same probability `p` of being matched, independent of other documents. Technically it's influenced by the prior docs due to not being a truly random generator, but the result is essentially the same from a sampling perspective.

@rjernst Are you saying to re-seed the rng on each document (using its docid or field, etc.)? Wouldn't that be rather expensive?
See RandomScoreFunction (it does not depend on the order of visiting docs); just substitute the uid field it uses with either the docid or a user-supplied field.
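For illustration, a rough sketch of that kind of order-independent randomness (the helper name and mixing constants are made up for this example, not taken from RandomScoreFunction):

```java
/**
 * Sketch: the random value for a doc is a pure function of (seed, docValue),
 * so it never depends on the order in which docs are visited.
 */
static double randomForDoc(int seed, long docValue) {
    long h = docValue + seed;
    h = (h ^ (h >>> 33)) * 0xFF51AFD7ED558CCDL; // murmur-style finalizer, illustrative only
    h = (h ^ (h >>> 33)) * 0xC4CEB9FE1A85EC53L;
    h ^= (h >>> 33);
    return (h >>> 11) * 0x1.0p-53; // map to [0, 1)
}
```

A document would then match when `randomForDoc(seed, value) < p`, where `value` is the docid or a user-supplied field value.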
Gotcha, I see how it works now (not using Random at all, just hashing/mixing). Will make changes similar to @jpountz's recent PR for the `random_score` :)
@polyfractal what I meant is that the matches of a given query should be the same regardless of how the query is consumed. For instance, for a given source of randomness and seed, documents 5 and 12 should match but no other documents. Then calling advance(10) or advance(11) should both return 12. However the impl seems a bit broken here since the next matching document is computed using the target.
assert(p > 0.0 && p <= 1.0);
this.p = p;
this.seed = seed;
LOG_INVERSE_P = Math.log(1-p);
Assigning a static variable in a constructor looks buggy to me, maybe logInverseP should not be static?
Oops, yeah. I just wanted to pre-calculate it since it doesn't change, no need for static.
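i.e. something along these lines (sketch only, also folding in the camel-case and `final` comments below):

```java
private final double logInverseP; // instance field, computed once

RandomSampleQuery(double p, int seed, int salt, IndexFieldData<?> fieldData) {
    // ... other assignments ...
    this.logInverseP = Math.log(1 - p);
}
```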
static int getGap(Random rng) {
double u = Math.max(rng.nextDouble(), EPSILON);
return (int)(Math.floor(Math.log(u) / LOG_INVERSE_P)) + 1;
we don't need to call `floor` if we cast to an `int`?
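(Side note: since `u <= 1` and `0 < p < 1`, both logs are non-positive and the quotient is non-negative; for non-negative doubles a plain `(int)` cast truncates toward zero, which gives the same result as `Math.floor` here.)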
// Keep approximating gaps until we hit or surpass the target
while (doc <= target) {
doc += getGap(rng);
}
I'm slightly worried this could be slow if the percentage of docs to match is rather high and `target - doc` is large?
There is a threshold, `PROBABILITY_THRESHOLD`, that is currently used to cut over to a "naive" random sample if the probability is high (e.g. if the sampler is supposed to match a large number of documents). In that case it falls back to standard per-document sampling, which just uses the relatively cheap rng check. I set it rather high (0.7) but we could set it closer to 0.5?

And when `p` is small, the gaps should be relatively large and I think it'll always be faster than the naive approach.

Should I work up some benchmarks for this to make sure?
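(For reference, the gaps drawn by `getGap` follow a geometric distribution with mean 1/p: at p = 0.01 the expected skip is about 100 docs, while at p = 0.7 it is only about 1.4 docs, which is roughly the intuition behind the cut-over threshold.)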
deterministic operation of the query so that the same documents are matched each
time.

This can be accomplished by adding the `seed` parameter:
maybe the seed is a bit trappy since it only reproduces matches on a per point-in-time view of a shard. So if you go to a replica or refresh, you could match different documents?
Ugh yeah, good point. I'm fine getting rid of the seed, it sounds like a mess to deal with
I'm curious what your use-case is and whether queries are the right entry point or whether it should be eg. a collector.
Something that makes me a bit unhappy about both implementations is that deciding whether a document matches or not depends on how nextDoc and advance are called in sequence, which should not be the case for queries. I'm wondering whether it would be worth dropping the quality of the randomness for something that allows us to build better-behaving queries. For instance if p=0.125 we could pick a random number between 0 and 7 and consider that all doc IDs whose lower 3 bits are equal to that number match. It would be less random but implementing advance would be very efficient and you would always get the same matches regardless of how you consume the iterator. It's just an idea, I don't know much about random number generation but maybe there are better algorithms that would yield matching sets of better quality while addressing my concerns.
I don't think it is possible to address my concerns for the field-based approach since we can't predict what the next values for the field will be, but I'm wondering whether we actually need reproducibility?
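For what it's worth, here is a rough sketch of the lower-bits idea (it only handles probabilities of the form 1/2^k, and the helper name is made up):

```java
/**
 * Sketch: for p = 1 / 2^k, pick a residue in [0, 2^k) once per query; a doc
 * matches iff its low k bits equal that residue. advance() can then jump
 * straight to the next matching docid with no RNG state to consume.
 */
static int nextMatch(int target, int k, int residue) {
    int mask = (1 << k) - 1;
    int candidate = (target & ~mask) | residue; // candidate in the same 2^k block as target
    return candidate >= target ? candidate : candidate + (1 << k);
}
```

With k = 3 (p = 0.125) and residue 4, for example, `nextMatch(10, 3, 4)` and `nextMatch(11, 3, 4)` both return 12, so the match set stays fixed no matter how the iterator is consumed.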
* Creates a random sampler query
*
* @param probability The probability that a document is randomly sampled
* @param seed A seed to use with the random generator
maybe also document that the field is used as a source of randomness
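e.g. something along these lines (the wording and parameter name are only a suggestion):

```java
/**
 * Creates a random sampler query
 *
 * @param probability The probability that a document is randomly sampled
 * @param seed        A seed to use with the random generator
 * @param field       Optional field whose values are used as the per-document
 *                    source of randomness (the docid is used when absent)
 */
```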
private final int seed;
private final int salt;
private final IndexFieldData<?> fieldData;
private float LOG_INVERSE_P;
instance variables should use camelcase
can you make it final?
private static final float EPSILON = 1e-10f;

RandomSampleQuery(double p, int seed, int salt, IndexFieldData<?> fieldData) {
assert(p > 0.0 && p < 1.0);
we should use actual exceptions instead of assertions since we are validating some user input
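e.g. roughly (the exact message wording is just illustrative):

```java
if (p <= 0.0 || p >= 1.0) {
    throw new IllegalArgumentException(
        "[probability] must be strictly between 0 and 1, got [" + p + "]");
}
```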
}

@Override
public BulkScorer bulkScorer(LeafReaderContext context) throws IOException {
I'm wondering why you decided to override this optional API. Is this impl expected to be faster than pulling the iterator and calling next in a loop? (this is what the default impl does)
}
RandomSampleQuery other = (RandomSampleQuery)o;
return Objects.equals(this.p, other.p) &&
Objects.equals(this.seed, other.seed);
I think we need to compare the salt and fieldData instances as well?
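i.e. roughly:

```java
return Objects.equals(this.p, other.p)
    && Objects.equals(this.seed, other.seed)
    && Objects.equals(this.salt, other.salt)
    && Objects.equals(this.fieldData, other.fieldData);
```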
float getFloat(int docId) throws IOException {
int hash;
if (values == null) {
hash = BitMixer.mix(context.docBase + docId + 1);
not really wrong since we do not require things to be reproducible in that case, but I'd rather like to use context.reader().maxDoc() instead of context.docBase so that matches only depend on the current segment
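i.e. something like:

```java
hash = BitMixer.mix(context.reader().maxDoc() + docId + 1);
```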
@polyfractal should this stay open? Looks like a cool feature @elastic/es-search-aggs
Missed the ping on this, sorry for the delay. I'm going to tag this as Not sure... need to do some deep thinking on it a bit :)
@polyfractal are you still working on this or wanna close?
Not actively working on it, I'll close. I still think it's a super useful feature, but I also think the current method either needs some serious tweaking or needs to be converted to a collector. Will re-open a new PR if/when I ever find time to revisit. If anyone else wants to take it up, please do!
There's a good use case for a Random Sampling collector. For example, in e-commerce you often want to display the counters for the aggregations in the left sidebar (e.g., for query X, we have n results for this brand, m results for this other brand, etc.). When the number of results is big, queries for aggregations get prohibitively slow. As such, one common strategy is to estimate those counters based on a sample of the search result.

The problem with the above is that, if you only consider the N top docs (like the sampler ES provides), you may lose some aggregations; e.g., if the top N results are all of the same brand, you'll only see that brand in the aggregations. The solution, as discussed here: https://archive.nyu.edu/bitstream/2451/14953/2/USEDBOOK11.pdf, is to use a random sampler that, statistically, can include results of many brands (to follow on the example). Achieving this in elasticsearch currently is a bit tricky and requires at least two different queries and some logic in between.