-
Notifications
You must be signed in to change notification settings - Fork 25.2k
Store _doc_count field as custom term frequency #65776
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Pinging @elastic/es-search (Team:Search) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good.
I wonder if we should have a special field, e.g. _term_freq
that we could use whenever we want to index a field as a term frequency, like rank_feature
does by always indexing into the _feature
Lucene field. And this _term_freq
field could have several terms including _doc_count
. This way we would only use one Lucene field under the hood?
} | ||
|
||
@Override | ||
public Query existsQuery(QueryShardContext context) { | ||
return new DocValuesFieldExistsQuery(NAME); | ||
throw new IllegalArgumentException("Field [" + name() + "] of type [" + typeName() + "] does not support exists queries"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe use the same exception type as termQuery
?
@@ -33,17 +33,25 @@ | |||
*/ | |||
public class DocCountProvider { | |||
|
|||
private NumericDocValues docCountValues; | |||
public static final int defaultValue = DocCountFieldMapper.DocCountFieldType.defaultValue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should it be DEFAULT_VALUE
since it is a constant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Originally, I had it DEFAULT_VALUE
, but then changed it so that is inline with DocCountFieldMapper.DocCountFieldType.defaultValue
which is also a constant.
Changing it for both fields
Maybe another question: by using NumericDocValuesField, Lucene would enforce for us that this _doc_count field be single-valued. Do we need to check this explicitly now? |
Good catch. I added a check in the |
For a start, I want to say that I don't know if having a single special field (e.g. Conceptually, I would prefer |
Yeah the bw compat logic shouldn't be too complicated anyway if we wanted to move to a shared field in the future. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left two minor comments, otherwise LGTM.
|
||
public CustomTermFreqField(String fieldName, int fieldValue) { | ||
this(fieldName, fieldName, fieldValue); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I'd rather drop this constructor and require callers of this class to think about what is the right field name and term they should use
@@ -96,17 +93,19 @@ protected void parseCreateField(ParseContext context) throws IOException { | |||
XContentParser parser = context.parser(); | |||
XContentParserUtils.ensureExpectedToken(XContentParser.Token.VALUE_NUMBER, parser.currentToken(), parser); | |||
|
|||
long value = parser.longValue(false); | |||
// Check that _doc_count is a single value and not an array | |||
if (context.doc().getField(NAME) != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is a bit trappy as it does a linear scan of the fields of the document. Can you use addWithKey
like BinaryFieldMapper
does?
A while back, Lucene introduced the ability to index custom term frequencies, ie. giving users the ability to provide a numeric value that should be indexed as a term frequency rather than letting Lucene compute the term frequency by itself based on the number of occurrences of a term.
This PR modifies the
_doc_count
field so that it is stored as Lucene custom term frequency.A benefit of moving to custom term frequencies is that Lucene will automatically compute global term statistics like
totalTermFreq
which will let us know the sum of the values of the_doc_count
field across an entire shard. This could in-turn be useful to generalize optimizations to rollup indices, e.g. buckets aggregations where all documents fall into the same bucket.Relates to #64503