Allow all token/char filters in normalizers #43803
Conversation
Pinging @elastic/es-search
I took a first look and left a few comments. One additional question: this PR looks like it mostly changes the internal implementation of normalizers, so what changes from the user's side? Sorry if I missed that part when reading through #43758; I'm not that familiar with the details mentioned there yet. I also looked at NormalizingCharFilterFactory and NormalizingTokenFilterFactory mentioned in that issue; they look like pure marker interfaces at this point and might be candidates for removal. If so, we could probably add that to this PR; otherwise let me know what I'm missing.
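For readers unfamiliar with the pattern being discussed: a pure marker interface declares no methods and exists only so other code can test for it with instanceof. The sketch below is illustrative only, not the actual Elasticsearch source; the class names are borrowed from the comment above, but their bodies here are hypothetical.

```java
public class MarkerInterfaceDemo {
    public static void main(String[] args) {
        Object factory = new LowercaseFilterFactory();
        // Elsewhere, normalizer construction would gate on a check like this:
        System.out.println(factory instanceof NormalizingTokenFilterFactory);
    }
}

// A marker interface: no methods, only a type for instanceof checks,
// which is what the review comment suggests these factories had become.
interface NormalizingTokenFilterFactory { }

// A hypothetical filter factory that opts in to normalizer support
// simply by implementing the marker.
class LowercaseFilterFactory implements NormalizingTokenFilterFactory { }
```

If such interfaces carry no behavior and nothing checks for them anymore, removing them is usually safe, which is the reviewer's point.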
import static java.util.Collections.singletonList;
import static java.util.Collections.singletonMap;

public class CustomNormalizerTests extends ESTokenStreamTestCase {
Do we still need this test when CustomNormalizerProvider is gone? I saw there is also no CustomAnalyzerTests, maybe this could be renamed and a few tests added that test the CustomAnalyzer parts that are not covered here?
There's still a distinction in settings between analyzers and normalizers, so I think this test is still useful?
Map<String, TokenFilterFactory> tokenFilters,
Map<String, CharFilterFactory> charFilters) {
    if (tokenizerFactory == null) {
        throw new IllegalStateException("keyword tokenizer factory is null, normalizers require analysis-common module");
Should we keep this check and move it inside the function further down that calls tokenizerSupplier.get()? I'm not really sure it's useful, but it looks like it might provide a useful hint about a misconfigured cluster (that's the only way I can see analysis-common going missing). If you re-add it, it should get a test too.
If we're missing analysis-common then I don't think anything is going to work? The error message here is out of date anyway, as we build normalizers with both keyword and whitespace tokenizers, but I'll change the caller to pass KeywordTokenizer::new
rather than calling into the tokenizer map to ensure that this would never be a problem.
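The change described above can be sketched as follows. This is a hypothetical illustration, not the PR's actual code: a stand-in Tokenizer interface replaces Lucene's class, and the point is only to contrast handing the builder a constructor reference directly with resolving it from a registry map that might be incomplete.

```java
import java.util.Map;
import java.util.function.Supplier;

public class TokenizerSupplierDemo {
    // Stand-in for Lucene's Tokenizer; hypothetical, for illustration only.
    interface Tokenizer { String name(); }

    // The approach described above: the caller passes a constructor
    // reference (analogous to KeywordTokenizer::new), so a missing map
    // entry can never make normalizer construction fail.
    static String buildNormalizer(Supplier<Tokenizer> tokenizerSupplier) {
        return "normalizer using " + tokenizerSupplier.get().name();
    }

    // The fragile alternative: resolving "keyword" from a registry map
    // that might not contain it if a module failed to load.
    static String buildFromMap(Map<String, Supplier<Tokenizer>> tokenizers) {
        Supplier<Tokenizer> supplier = tokenizers.get("keyword");
        if (supplier == null) {
            throw new IllegalStateException("keyword tokenizer factory is null");
        }
        return "normalizer using " + supplier.get().name();
    }

    public static void main(String[] args) {
        // Passing the tokenizer directly; no registry lookup involved.
        System.out.println(buildNormalizer(() -> () -> "keyword"));
    }
}
```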
final Map<String, CharFilterFactory> charFilters, final Map<String, TokenFilterFactory> tokenFilters) {
String tokenizerName = analyzerSettings.get("tokenizer");
if (tokenizerName == null) {
    throw new IllegalArgumentException("Custom Analyzer [" + name + "] must be configured with a tokenizer");
Do we make sure the name is set elsewhere, or is this not necessary anymore? Otherwise it might be good to keep; I haven't checked what the follow-up exception would be if the name is null when the function is called. It might be more cryptic.
The check has been moved to AnalysisRegistry#buildMapping, line 414.
I'm closing this, as @jimczi points out that we need to prevent stacked tokens appearing in search-time normalizers. We still need to distinguish between partial-term normalization and whole-term normalization, but just allowing everything isn't going to help here.
Normalizers are analyzers defined on keyword fields, using either keyword
or whitespace tokenizers. There is currently an additional restriction, in that
only character-by-character char_filters and token filters are permitted as
part of a normalizer tokenization chain; however, this restriction only really
makes sense for wildcard normalization, not for keyword fields. This commit
allows all char_filter and token filter types to be used in a normalizer.
Closes #43758
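For context, a normalizer is declared in index settings much like a custom analyzer, just without a tokenizer entry, and is then referenced from a keyword field's mapping. The request body below follows the shape documented for Elasticsearch custom normalizers; the names my_normalizer and quote_strip are illustrative:

```json
{
  "settings": {
    "analysis": {
      "char_filter": {
        "quote_strip": {
          "type": "mapping",
          "mappings": ["« => \"", "» => \""]
        }
      },
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": ["quote_strip"],
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "foo": {
        "type": "keyword",
        "normalizer": "my_normalizer"
      }
    }
  }
}
```

Under the restriction this PR set out to lift, only filters that operate character by character (such as lowercase or asciifolding) were accepted in the filter list above.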