Allow all token/char filters in normalizers #43803
Conversation
Pinging @elastic/es-search
I took a first look and left a few comments. One additional question: this PR looks like it mostly changes the internal implementation of normalizers, so what changes from the user's side? Sorry if I missed that part when reading through #43758; I'm not that familiar with the details mentioned there yet. I also looked at NormalizingCharFilterFactory and NormalizingTokenFilterFactory mentioned in that issue; they look like pure marker interfaces at this point and might be candidates for removal. If so, we could probably add that to this PR; otherwise let me know what I'm missing.
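For readers unfamiliar with the pattern being discussed: a pure marker interface declares no methods and exists only so other code can test for it with instanceof. The sketch below is illustrative only, not the actual Elasticsearch source; the class names are borrowed from the comment above, but their bodies here are hypothetical.

```java
public class MarkerInterfaceDemo {
    public static void main(String[] args) {
        Object factory = new LowercaseFilterFactory();
        // Elsewhere, normalizer construction would gate on a check like this:
        System.out.println(factory instanceof NormalizingTokenFilterFactory);
    }
}

// A marker interface: no methods, only a type for instanceof checks,
// which is what the review comment suggests these factories had become.
interface NormalizingTokenFilterFactory { }

// A hypothetical filter factory that opts in to normalizer support
// simply by implementing the marker.
class LowercaseFilterFactory implements NormalizingTokenFilterFactory { }
```

If such interfaces carry no behavior and nothing checks for them anymore, removing them is usually safe, which is the reviewer's point.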
import static java.util.Collections.singletonList;
import static java.util.Collections.singletonMap;

public class CustomNormalizerTests extends ESTokenStreamTestCase {
Do we still need this test when CustomNormalizerProvider is gone? I saw there is also no CustomAnalyzerTests, maybe this could be renamed and a few tests added that test the CustomAnalyzer parts that are not covered here?
There's still a distinction in settings between analyzers and normalizers, so I think this test is still useful?
Map<String, TokenFilterFactory> tokenFilters,
Map<String, CharFilterFactory> charFilters) {
    if (tokenizerFactory == null) {
        throw new IllegalStateException("keyword tokenizer factory is null, normalizers require analysis-common module");
Should we keep this check and move it inside the function further down that calls tokenizerSupplier.get()? I'm not really sure it's useful, but it looks like it might provide a useful hint about a misconfigured cluster (that's the only way I can see analysis-common going missing). If you re-add it, it should get a test too.
If we're missing analysis-common then I don't think anything is going to work? The error message here is out of date anyway, as we build normalizers with both keyword and whitespace tokenizers, but I'll change the caller to pass KeywordTokenizer::new
rather than calling into the tokenizer map to ensure that this would never be a problem.
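The change described above can be sketched as follows. This is a hypothetical illustration, not the PR's actual code: a stand-in Tokenizer interface replaces Lucene's class, and the point is only to contrast handing the builder a constructor reference directly with resolving it from a registry map that might be incomplete.

```java
import java.util.Map;
import java.util.function.Supplier;

public class TokenizerSupplierDemo {
    // Stand-in for Lucene's Tokenizer; hypothetical, for illustration only.
    interface Tokenizer { String name(); }

    // The approach described above: the caller passes a constructor
    // reference (analogous to KeywordTokenizer::new), so a missing map
    // entry can never make normalizer construction fail.
    static String buildNormalizer(Supplier<Tokenizer> tokenizerSupplier) {
        return "normalizer using " + tokenizerSupplier.get().name();
    }

    // The fragile alternative: resolving "keyword" from a registry map
    // that might not contain it if a module failed to load.
    static String buildFromMap(Map<String, Supplier<Tokenizer>> tokenizers) {
        Supplier<Tokenizer> supplier = tokenizers.get("keyword");
        if (supplier == null) {
            throw new IllegalStateException("keyword tokenizer factory is null");
        }
        return "normalizer using " + supplier.get().name();
    }

    public static void main(String[] args) {
        // Passing the tokenizer directly; no registry lookup involved.
        System.out.println(buildNormalizer(() -> () -> "keyword"));
    }
}
```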
final Map<String, CharFilterFactory> charFilters, final Map<String, TokenFilterFactory> tokenFilters) {
String tokenizerName = analyzerSettings.get("tokenizer");
if (tokenizerName == null) {
    throw new IllegalArgumentException("Custom Analyzer [" + name + "] must be configured with a tokenizer");
Do we make sure the name is set elsewhere, or is this not necessary anymore? Otherwise it might be good to keep; I haven't checked what the follow-up exception would be if the name is null when the function is called. It might be more cryptic.
The check has been moved to AnalysisRegistry#buildMapping, line 414.
I'm closing this, as @jimczi points out that we need to prevent stacked tokens appearing in search-time normalizers. We still need to distinguish between partial-term normalization and whole-term normalization, but just allowing everything isn't going to help here.
Normalizers are analyzers defined on keyword fields, using either keyword
or whitespace tokenizers. There is currently an additional restriction, in that
only character-by-character char_filters and token filters are permitted as
part of a normalizer tokenization chain; however, this restriction only really
makes sense for wildcard normalization, not for keyword fields. This commit
allows all char_filter and token filter types to be used in a normalizer.
Closes #43758
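For context, a normalizer is declared in index settings much like a custom analyzer, just without a tokenizer entry, and is then referenced from a keyword field's mapping. The request body below follows the shape documented for Elasticsearch custom normalizers; the names my_normalizer and quote_strip are illustrative:

```json
{
  "settings": {
    "analysis": {
      "char_filter": {
        "quote_strip": {
          "type": "mapping",
          "mappings": ["« => \"", "» => \""]
        }
      },
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": ["quote_strip"],
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "foo": {
        "type": "keyword",
        "normalizer": "my_normalizer"
      }
    }
  }
}
```

Under the restriction this PR set out to lift, only filters that operate character by character (such as lowercase or asciifolding) were accepted in the filter list above.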