When parsing JSON fields, also create tokens prefixed with the field key. #34207
Conversation
Pinging @elastic/es-search-aggs
public class JsonFieldParser {
    private static final String SEPARATOR = "\0";
The null character `\0` seemed like a reasonable choice for a separator, as (1) it shouldn't show up too often in field keys, and (2) there is already precedent for it, as we use it when storing percolator queries (`PercolatorFieldMapper#FIELD_VALUE_SEPARATOR`).
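To make the token shape concrete, here is a minimal sketch (the helper name is an assumption for illustration, not the PR's actual code):

```java
// Sketch only: a keyed token joins the dotted key path and the value
// with the null-character separator defined above.
private static String keyedToken(String path, String value) {
    return path + SEPARATOR + value;  // e.g. "key2.key3" + "\0" + "true"
}
```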
String currentName,
String value,
List<IndexableField> fields) {
    if (value.length() > ignoreAbove) {
For prefixed values, the alternative option here would be to check the whole length of the prefixed token, as opposed to just the value. I think that this behavior is more intuitive (and I also don't think we're as concerned about field keys being really long, as opposed to values?)
I agree. We should probably put in some kind of soft limit on the depth of these objects at some point, and `ignore_above` plus that soft limit will give us an upper bound on the term lengths here anyway.
Makes sense to me, I'll make a note on the meta-issue to add a limit.
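As a rough sketch of the semantics discussed above (simplified signature and hypothetical body -- only the position of the length check is the point):

```java
// Hedged sketch, not the PR's exact code: ignore_above is checked against
// the raw value, so a long dotted key path never pushes a short value
// over the limit. Assumes Lucene's StringField and the SEPARATOR above.
void addField(String path, String value, List<IndexableField> fields) {
    if (value.length() > ignoreAbove) {
        return;  // skip both the root token and the keyed token
    }
    // Root token: the bare value, indexed on the field itself.
    fields.add(new StringField(fieldName, value, Field.Store.NO));
    // Keyed token: "path\0value", indexed on the companion field.
    fields.add(new StringField(fieldName + "._prefixed",
            path + SEPARATOR + value, Field.Store.NO));
}
```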
I left some comments, but they are very minor.
JsonFieldParser(MappedFieldType fieldType,
                int ignoreAbove) {
    this.fieldType = fieldType;
    this.ignoreAbove = ignoreAbove;

    this.fieldName = fieldType.name();
Maybe we should call this `rootFieldName` or something, so it's clear everywhere exactly which field is being used?
👍
  } else if (token == XContentParser.Token.VALUE_NULL) {
      String value = fieldType.nullValueAsString();
      if (value != null) {
-         addField(value, fields);
+         addField(path, currentName, value, fields);
      }
  }
Should we throw an exception in an `else` block here? If we encounter something like an array of arrays we should probably reject the document rather than silently ignoring it? The same probably applies for `parseObject` above?
Oops, I think this is actually a bug, since an array of arrays is valid JSON and should be accepted. Will fix.

I agree it's a good idea to add an `else` with an exception, so we'll fail fast when encountering something unexpected rather than attempt to proceed in a potentially wrong state.
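Sketched against the hunk above, the proposed `else` branch might look like this (the exception type and message are assumptions):

```java
} else if (token == XContentParser.Token.VALUE_NULL) {
    String value = fieldType.nullValueAsString();
    if (value != null) {
        addField(path, currentName, value, fields);
    }
} else {
    // Fail fast on token types we don't explicitly handle, rather than
    // silently dropping them and indexing the document in a wrong state.
    throw new IllegalStateException("Unexpected token [" + token + "].");
}
```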
I think this looks good. My only nit is that I don't like the name 'prefixed', as this seems to me to be an implementation detail. Something like 'keyed' might be better?
Thanks both of you for taking a look. I also like the 'keyed' naming better, will update to that.
LGTM, thanks for making the changes.
LGTM too, thanks Julie!
This PR updates the parsing for `json` fields to emit tokens prefixed by the field key, in addition to 'root' tokens containing the unprefixed value. As an example, given a `json` field called 'json_field' and the input shown below, the mapper will produce untokenized string fields with the values `some value`, `key\0some value`, `true`, and `key2.key3\0true`.
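An input along these lines would yield exactly those values (this example document is inferred from the listed tokens, which name the keys `key` and `key2.key3`):

```json
{
  "json_field": {
    "key": "some value",
    "key2": {
      "key3": true
    }
  }
}
```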
An important note about this change: the behavior we want is for searches on the 'root' JSON field to only consider raw values, and not prefixed tokens. For example, given the JSON field `"headers": { "content-type": "application/json" }`, it should not match the query `{ "prefix": { "headers": "content" } }` just because one of the indexed tokens is `content-type\0application/json`. I think this behavior would be confusing, and that we should try to keep the token prefixing as an internal implementation detail. To avoid this issue, I chose to add the prefixed tokens to a new Lucene field called `<field name>._prefixed`. This won't affect how searches are done (the user will still query using the key `headers` or `headers.content-type`).

Finally, this PR just updates the indexing process -- these prefixed tokens can't be searched yet. The search side will be implemented in a subsequent PR (I'm just a fan of keeping PRs small!)
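To spell out the headers example, the indexed tokens would be split across the two Lucene fields roughly like this (layout per the scheme described above):

```
headers            -> "application/json"
headers._prefixed  -> "content-type\0application/json"
```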