Skip to content

Store _doc_count field as custom term frequency #65776

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Dec 3, 2020

Conversation

csoulios
Copy link
Contributor

@csoulios csoulios commented Dec 2, 2020

A while back, Lucene introduced the ability to index custom term frequencies, ie. giving users the ability to provide a numeric value that should be indexed as a term frequency rather than letting Lucene compute the term frequency by itself based on the number of occurrences of a term.

This PR modifies the _doc_count field so that it is stored as Lucene custom term frequency.

A benefit of moving to custom term frequencies is that Lucene will automatically compute global term statistics like totalTermFreq which will let us know the sum of the values of the _doc_count field across an entire shard. This could in-turn be useful to generalize optimizations to rollup indices, e.g. buckets aggregations where all documents fall into the same bucket.

Relates to #64503

@csoulios csoulios added >enhancement :Search Foundations/Mapping Index mappings, including merging and defining field types v8.0.0 v7.11.0 labels Dec 2, 2020
@csoulios csoulios requested review from jpountz and jimczi December 2, 2020 20:28
@elasticmachine elasticmachine added the Team:Search Meta label for search team label Dec 2, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good.

I wonder if we should have a special field, e.g. _term_freq that we could use whenever we want to index a field as a term frequency, like rank_feature does by always indexing into the _feature Lucene field. And this _term_freq field could have several terms including _doc_count. This way we would only use one Lucene field under the hood?

}

@Override
public Query existsQuery(QueryShardContext context) {
return new DocValuesFieldExistsQuery(NAME);
throw new IllegalArgumentException("Field [" + name() + "] of type [" + typeName() + "] does not support exists queries");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe use the same exception type as termQuery?

@@ -33,17 +33,25 @@
*/
public class DocCountProvider {

private NumericDocValues docCountValues;
public static final int defaultValue = DocCountFieldMapper.DocCountFieldType.defaultValue;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it be DEFAULT_VALUE since it is a constant?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally, I had it DEFAULT_VALUE, but then changed it so that is inline with DocCountFieldMapper.DocCountFieldType.defaultValue which is also a constant.

Changing it for both fields

@jpountz
Copy link
Contributor

jpountz commented Dec 2, 2020

Maybe another question: by using NumericDocValuesField, Lucene would enforce for us that this _doc_count field be single-valued. Do we need to check this explicitly now?

@csoulios
Copy link
Contributor Author

csoulios commented Dec 3, 2020

Maybe another question: by using NumericDocValuesField, Lucene would enforce for us that this _doc_count field be single-valued. Do we need to check this explicitly now?

Good catch. I added a check in the parseCreateField() method and a test.

@csoulios
Copy link
Contributor Author

csoulios commented Dec 3, 2020

I wonder if we should have a special field, e.g. _term_freq that we could use whenever we want to index a field as a term frequency, like rank_feature does by always indexing into the _feature Lucene field. And this _term_freq field could have several terms including _doc_count. This way we would only use one Lucene field under the hood?

For a start, I want to say that I don't know if having a single special field (e.g. _term_freq) for all term freq stored fields would result in any Lucene optimization, as I am not aware of the internals.

Conceptually, I would prefer _doc_count field being stored in Lucene as _doc_count and every other field should have their own Lucene field, as I believe this is more intuitive. Moreover, it would be simpler to add another term under the _doc_count field if we ever wanted to store another value for _doc_count as term freq.

@jpountz
Copy link
Contributor

jpountz commented Dec 3, 2020

Yeah the bw compat logic shouldn't be too complicated anyway if we wanted to move to a shared field in the future.

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left two minor comments, otherwise LGTM.


public CustomTermFreqField(String fieldName, int fieldValue) {
this(fieldName, fieldName, fieldValue);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'd rather drop this constructor and require callers of this class to think about what is the right field name and term they should use

@@ -96,17 +93,19 @@ protected void parseCreateField(ParseContext context) throws IOException {
XContentParser parser = context.parser();
XContentParserUtils.ensureExpectedToken(XContentParser.Token.VALUE_NUMBER, parser.currentToken(), parser);

long value = parser.longValue(false);
// Check that _doc_count is a single value and not an array
if (context.doc().getField(NAME) != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is a bit trappy as it does a linear scan of the fields of the document. Can you use addWithKey like BinaryFieldMapper does?

@csoulios csoulios merged commit b554d5b into elastic:master Dec 3, 2020
@csoulios csoulios deleted the doc_count_term_freq branch December 3, 2020 14:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :Search Foundations/Mapping Index mappings, including merging and defining field types Team:Search Meta label for search team v7.11.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants