Skip to content

When parsing JSON fields, also create tokens prefixed with the field key. #34207

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Oct 4, 2018

Conversation

jtibshirani
Copy link
Contributor

@jtibshirani jtibshirani commented Oct 1, 2018

This PR updates the parsing for json fields to emit tokens prefixed by the field key, in addition to 'root' tokens containing the unprefixed value.

As an example, given a json field called 'json_field' and the following input

{
  "json_field: {
    "key1": "some value",
    "key2": {
       "key3": true
    }
  }
}

the mapper will produce untokenized string fields with the values some value, key\0some value, true, and key2.key3\0true.

An important note about this change: the behavior we want is for searches on the 'root' JSON field to only consider raw values, and not prefixed tokens. For example, given the JSON field "headers": { "content-type": "application/json”"}, it should not match the query "prefix": { "headers": "content" } } just because one of the indexed tokens is content-type\0application/json. I think this behavior would be confusing, and that we should try to keep the token prefixing as an internal implementation detail. To avoid this issue, I chose to add the prefixed tokens to a new lucene field called <field name>._prefixed. This won't affect how searches are done (the user will still query using the key headers, or headers.content-type).

Finally, this PR just updates the indexing process -- these prefixed tokens can’t be searched yet. The search side will be implemented in a subsequent PR (I’m just a fan of keeping PRs small!)

@jtibshirani jtibshirani added >feature :Search Foundations/Mapping Index mappings, including merging and defining field types labels Oct 1, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search-aggs

@jtibshirani jtibshirani changed the title Prefixed json fields When parsing JSON fields, also create tokens prefixed with the field key. Oct 1, 2018
*/
public class JsonFieldParser {
private static final String SEPARATOR = "\0";
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The null character \0 seemed like a reasonable choice for a separator, as (1) it shouldn’t show up too often in field keys, and (2) there is already precedent for it, as we use it when storing percolator queries (PercolatorFieldMapper#FIELD_VALUE_SEPARATOR).

String currentName,
String value,
List<IndexableField> fields) {
if (value.length() > ignoreAbove) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For prefixed values, the alternative option here would be to check the whole length of the prefixed token, as opposed to just the value. I think that this behavior is more intuitive (and I also don't think we're as concerned about field keys being really long, as opposed to values?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, we should probably put in some kind of soft limit on the depth of these objects at some point and the ignore above plus that soft limit will give us an upper bound on the term lengths here anyway

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me, I'll make a note on the meta-issue to add a limit.

Copy link
Contributor

@colings86 colings86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments but they are very minor

JsonFieldParser(MappedFieldType fieldType,
int ignoreAbove) {
this.fieldType = fieldType;
this.ignoreAbove = ignoreAbove;

this.fieldName = fieldType.name();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should call this rootFieldName or something so its clear everywhere exact which field is being used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

} else if (token == XContentParser.Token.VALUE_NULL) {
String value = fieldType.nullValueAsString();
if (value != null) {
addField(value, fields);
addField(path, currentName, value, fields);
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we throw an exception in an else block here? If we encounter something like an array of arrays we should probably reject the document rather than silently ignoring it? The same probably applies for parseObject above?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, I think this is actually a bug, since an array of arrays is valid JSON and should be accepted. Will fix.

I agree it's a good idea to add an else with an exception, so we'll fail fast when encountering something unexpected rather than attempt to proceed in a potentially wrong state.

String currentName,
String value,
List<IndexableField> fields) {
if (value.length() > ignoreAbove) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, we should probably put in some kind of soft limit on the depth of these objects at some point and the ignore above plus that soft limit will give us an upper bound on the term lengths here anyway

Copy link
Contributor

@romseygeek romseygeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good. My only nit is that I don't like the name 'prefixed', as this seems to me to be an implementation detail. Something like 'keyed' might be better?

@jtibshirani
Copy link
Contributor Author

Thanks both of you for taking a look. I also like the 'keyed' naming better, will update to that.

Copy link
Contributor

@colings86 colings86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks for making the changes

Copy link
Contributor

@romseygeek romseygeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM too, thanks Julie!

@jtibshirani jtibshirani force-pushed the object-fields branch 3 times, most recently from 402b7ac to c0ea6d4 Compare October 3, 2018 20:58
@jtibshirani jtibshirani force-pushed the prefixed-json-fields branch from cce97f2 to 21534d9 Compare October 3, 2018 20:58
@jtibshirani jtibshirani merged commit 8ed75db into elastic:object-fields Oct 4, 2018
@jtibshirani jtibshirani deleted the prefixed-json-fields branch October 4, 2018 06:29
jtibshirani added a commit to jtibshirani/elasticsearch that referenced this pull request Nov 8, 2018
jtibshirani added a commit to jtibshirani/elasticsearch that referenced this pull request Nov 9, 2018
jtibshirani added a commit to jtibshirani/elasticsearch that referenced this pull request Mar 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>feature :Search Foundations/Mapping Index mappings, including merging and defining field types
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants