Skip to content

Disable graph analysis at query time for shingle and cjk filters producing tokens of different size #23920

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Apr 6, 2017
Merged

Conversation

jimczi
Copy link
Contributor

@jimczi jimczi commented Apr 5, 2017

Shingle filters that produce shingles of different size and CJK filters that produce bigram AND unigram are problematic when
we analyze the graph they produce. The position for each shingle size are not aligned so each position has at least two side paths.
So in order to avoid paths explosion this change disables the graph analysis at query time for field analyzers that contain these filters
with a problematic configuration.

Closes #23918

…ucing tokens of different size

Shingle filters that produce shingles of different size and CJK filters that produce bigram AND unigram are problematic when
 we analyze the graph they produce. The position for each shingle size are not aligned so each position has at least two side paths.
 So in order to avoid paths explosion this change disables the graph analysis at query time for field analyzers that contain these filters
 with a problematic configuration.

Closes #23918
@jimczi jimczi added :Search/Search Search-related issues that do not fall into other categories >bug v5.3.1 v5.4.0 v6.0.0-alpha1 labels Apr 5, 2017
@jimczi jimczi requested a review from jpountz April 5, 2017 15:49
@jimczi jimczi changed the title Disable graph analysis at query time for shingle and cjk filters prod… Disable graph analysis at query time for shingle and cjk filters producing tokens of different size Apr 5, 2017
Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good. I'm wondering we should also test that we keep processing the query like before if the shingles are sane?

builder.splitOnWhitespace(false);
Query query = builder.doToQuery(shardContext);
assertThat(expectedQuery, equalTo(query));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we also check that the correct query is built if the analyzer is sane and does not output unigrams?

@jimczi
Copy link
Contributor Author

jimczi commented Apr 5, 2017

@jpountz I pushed some changes to address your comment.
I also moved the logic to disable graph analysis at a higher level in order to handle phrase queries too.

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jimczi jimczi merged commit 38009ef into elastic:master Apr 6, 2017
@jimczi jimczi deleted the disable_graph_filter branch April 6, 2017 06:55
jimczi added a commit that referenced this pull request Apr 6, 2017
…ucing tokens of different size (#23920)

This change disables graph analysis of token streams containing a shingle or a cjk filters that produce shingle or ngram of different size. The graph analysis is disabled for phrase and boolean queries.

Closes #23918
jimczi added a commit that referenced this pull request Apr 6, 2017
…ucing tokens of different size (#23920)

This change disables graph analysis of token streams containing a shingle or a cjk filters that produce shingle or ngram of different size. The graph analysis is disabled for phrase and boolean queries.

Closes #23918
jimczi added a commit that referenced this pull request Apr 6, 2017
…ers producing tokens of different size (#23920)"

This reverts commit 6cc7df7.
jimczi added a commit that referenced this pull request Apr 6, 2017
…ucing tokens of different size (#23920)

This change disables graph analysis of token streams containing a shingle or a cjk filters that produce shingle or ngram of different size. The graph analysis is disabled for phrase and boolean queries.

Closes #23918
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Search/Search Search-related issues that do not fall into other categories v5.3.1 v5.4.0 v6.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants