
Allow all token/char filters in normalizers #43803

Closed · wants to merge 3 commits
docs/reference/analysis/normalizers.asciidoc (1 addition, 10 deletions)
@@ -2,16 +2,7 @@
 == Normalizers
 
 Normalizers are similar to analyzers except that they may only emit a single
-token. As a consequence, they do not have a tokenizer and only accept a subset
-of the available char filters and token filters. Only the filters that work on
-a per-character basis are allowed. For instance a lowercasing filter would be
-allowed, but not a stemming filter, which needs to look at the keyword as a
-whole. The current list of filters that can be used in a normalizer is
-following: `arabic_normalization`, `asciifolding`, `bengali_normalization`,
-`cjk_width`, `decimal_digit`, `elision`, `german_normalization`,
-`hindi_normalization`, `indic_normalization`, `lowercase`,
-`persian_normalization`, `scandinavian_folding`, `serbian_normalization`,
-`sorani_normalization`, `uppercase`.
+token. As a consequence, they do not have a tokenizer.
 
 [float]
 === Custom normalizers
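For context, here is a minimal sketch of a custom normalizer definition after this change, using the same index settings keys as the tests later in this PR; the normalizer name and filter choices are illustrative, not part of the change itself.

```java
import org.elasticsearch.common.settings.Settings;

public class NormalizerSettingsSketch {
    public static void main(String[] args) {
        // Illustrative names: with the whitelist gone, any registered char
        // filter or token filter may be listed on a custom normalizer. The
        // normalizer still has no tokenizer and still emits a single token.
        Settings settings = Settings.builder()
                .put("index.analysis.normalizer.my_normalizer.type", "custom")
                .putList("index.analysis.normalizer.my_normalizer.filter", "lowercase", "asciifolding")
                .build();
        System.out.println(settings);
    }
}
```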
server/src/main/java/org/elasticsearch/index/analysis/AnalysisRegistry.java

@@ -19,6 +19,7 @@
 package org.elasticsearch.index.analysis;
 
 import org.apache.lucene.analysis.Analyzer;
+import org.apache.lucene.analysis.core.KeywordTokenizer;
 import org.apache.lucene.analysis.core.WhitespaceTokenizer;
 import org.elasticsearch.ElasticsearchException;
 import org.elasticsearch.Version;
@@ -44,6 +45,7 @@
 import java.util.concurrent.ConcurrentHashMap;
 import java.util.function.BiFunction;
 import java.util.function.Function;
+import java.util.function.Supplier;
 import java.util.stream.Collectors;
 
 import static java.util.Collections.unmodifiableMap;
@@ -409,8 +411,11 @@ private <T> Map<String, T> buildMapping(Component component, IndexSettings setti
                 continue;
             }
         } else if (component == Component.NORMALIZER) {
+            if (currentSettings.hasValue("tokenizer")) {
+                throw new IllegalArgumentException("Custom normalizer [" + name + "] cannot configure a tokenizer");
+            }
             if (typeName == null || typeName.equals("custom")) {
-                T factory = (T) new CustomNormalizerProvider(settings, name, currentSettings);
+                T factory = (T) new CustomAnalyzerProvider(settings, name, currentSettings);
                 factories.put(name, factory);
                 continue;
             }
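A hedged sketch of what the new guard catches: normalizer settings that configure a tokenizer now fail fast with the exception above. The normalizer name and the standalone check below are illustrative; the real check runs inside buildMapping.

```java
import org.elasticsearch.common.settings.Settings;

public class NormalizerTokenizerGuardSketch {
    public static void main(String[] args) {
        // Illustrative per-normalizer settings, as buildMapping would see them.
        Settings currentSettings = Settings.builder()
                .put("type", "custom")
                .put("tokenizer", "standard") // not allowed on a normalizer
                .putList("filter", "lowercase")
                .build();
        // Mirror of the new guard: reject any configured tokenizer up front.
        if (currentSettings.hasValue("tokenizer")) {
            throw new IllegalArgumentException(
                    "Custom normalizer [my_normalizer] cannot configure a tokenizer");
        }
    }
}
```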
@@ -531,10 +536,10 @@ public IndexAnalyzers build(IndexSettings indexSettings,
             });
         }
         for (Map.Entry<String, AnalyzerProvider<?>> entry : normalizerProviders.entrySet()) {
-            processNormalizerFactory(entry.getKey(), entry.getValue(), normalizers, "keyword",
-                tokenizerFactoryFactories.get("keyword"), tokenFilterFactoryFactories, charFilterFactoryFactories);
+            processNormalizerFactory(entry.getKey(), entry.getValue(), normalizers,
+                () -> KeywordTokenizer::new, tokenFilterFactoryFactories, charFilterFactoryFactories);
             processNormalizerFactory(entry.getKey(), entry.getValue(), whitespaceNormalizers,
-                "whitespace", () -> new WhitespaceTokenizer(), tokenFilterFactoryFactories, charFilterFactoryFactories);
+                () -> WhitespaceTokenizer::new, tokenFilterFactoryFactories, charFilterFactoryFactories);
         }
 
         if (!analyzers.containsKey(DEFAULT_ANALYZER_NAME)) {
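A sketch of the wiring above, assuming (as the diff itself implies) that TokenizerFactory is a functional interface with a single create() method here: each normalizer now draws a fresh tokenizer from a fixed Supplier instead of looking "keyword" up in the tokenizer map, so normalizers no longer depend on analysis-common having registered that name.

```java
import java.util.function.Supplier;

import org.apache.lucene.analysis.core.KeywordTokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.elasticsearch.index.analysis.TokenizerFactory;

public class NormalizerTokenizerWiringSketch {
    public static void main(String[] args) {
        // Normalizers proper: the keyword tokenizer emits the input as one token.
        Supplier<TokenizerFactory> keyword = () -> KeywordTokenizer::new;
        // The parallel whitespace normalizers used for lenient matching.
        Supplier<TokenizerFactory> whitespace = () -> WhitespaceTokenizer::new;
        // Each get() hands back a factory that creates a brand-new Tokenizer.
        System.out.println(keyword.get().create());
        System.out.println(whitespace.get().create());
    }
}
```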
@@ -575,7 +580,7 @@ private static NamedAnalyzer produceAnalyzer(String name,
          */
         int overridePositionIncrementGap = TextFieldMapper.Defaults.POSITION_INCREMENT_GAP;
         if (analyzerFactory instanceof CustomAnalyzerProvider) {
-            ((CustomAnalyzerProvider) analyzerFactory).build(tokenizers, charFilters, tokenFilters);
+            ((CustomAnalyzerProvider) analyzerFactory).build(tokenizers::get, charFilters, tokenFilters);
             /*
              * Custom analyzers already default to the correct, version
              * dependent positionIncrementGap and the user is able to
@@ -603,20 +608,16 @@ private static NamedAnalyzer produceAnalyzer(String name,
         return analyzer;
     }
 
-    private void processNormalizerFactory(
+    private static void processNormalizerFactory(
         String name,
         AnalyzerProvider<?> normalizerFactory,
         Map<String, NamedAnalyzer> normalizers,
-        String tokenizerName,
-        TokenizerFactory tokenizerFactory,
+        Supplier<TokenizerFactory> tokenizerSupplier,
         Map<String, TokenFilterFactory> tokenFilters,
         Map<String, CharFilterFactory> charFilters) {
-        if (tokenizerFactory == null) {
-            throw new IllegalStateException("keyword tokenizer factory is null, normalizers require analysis-common module");
Member: Should we keep this check and move it inside the function further down that calls tokenizerSupplier.get()? I'm not really sure it's useful, but it looks like it might provide a useful hint about a misconfigured cluster (that's the only way I can see analysis-common going missing). In case you re-add it, it should get a test too.

Contributor (author): If we're missing analysis-common then I don't think anything is going to work. The error message here is out of date anyway, as we build normalizers with both keyword and whitespace tokenizers, but I'll change the caller to pass KeywordTokenizer::new rather than calling into the tokenizer map to ensure that this would never be a problem.
-        }
 
-        if (normalizerFactory instanceof CustomNormalizerProvider) {
-            ((CustomNormalizerProvider) normalizerFactory).build(tokenizerName, tokenizerFactory, charFilters, tokenFilters);
+        if (normalizerFactory instanceof CustomAnalyzerProvider) {
+            ((CustomAnalyzerProvider) normalizerFactory).build(n -> tokenizerSupplier.get(), charFilters, tokenFilters);
         }
         if (normalizers.containsKey(name)) {
             throw new IllegalStateException("already registered analyzer with name: " + name);
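For contrast, a short sketch of the two call shapes that now feed CustomAnalyzerProvider#build (the helper method names here are illustrative): analyzers resolve their configured tokenizer by name from the registry map, while normalizers ignore the looked-up name and always receive the fixed keyword or whitespace tokenizer.

```java
import java.util.Map;
import java.util.function.Function;
import java.util.function.Supplier;

import org.elasticsearch.index.analysis.TokenizerFactory;

public class TokenizerResolutionSketch {
    // Analyzer path: a genuine name-based lookup backed by the registry map.
    static Function<String, TokenizerFactory> forAnalyzers(Map<String, TokenizerFactory> tokenizers) {
        return tokenizers::get;
    }

    // Normalizer path: the requested name is ignored and the fixed
    // keyword/whitespace tokenizer is returned regardless.
    static Function<String, TokenizerFactory> forNormalizers(Supplier<TokenizerFactory> tokenizerSupplier) {
        return n -> tokenizerSupplier.get();
    }
}
```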
server/src/main/java/org/elasticsearch/index/analysis/AnalyzerComponents.java

@@ -24,6 +24,7 @@
 import java.util.ArrayList;
 import java.util.List;
 import java.util.Map;
+import java.util.function.Function;
 
 /**
  * A class that groups analysis components necessary to produce a custom analyzer.
@@ -49,14 +50,11 @@ public final class AnalyzerComponents {
         this.analysisMode = mode;
     }
 
-    static AnalyzerComponents createComponents(String name, Settings analyzerSettings, final Map<String, TokenizerFactory> tokenizers,
+    static AnalyzerComponents createComponents(String name, Settings analyzerSettings, final Function<String, TokenizerFactory> tokenizers,
             final Map<String, CharFilterFactory> charFilters, final Map<String, TokenFilterFactory> tokenFilters) {
         String tokenizerName = analyzerSettings.get("tokenizer");
-        if (tokenizerName == null) {
-            throw new IllegalArgumentException("Custom Analyzer [" + name + "] must be configured with a tokenizer");
Member: Do we make sure the name is set elsewhere, or is this not necessary anymore? Otherwise it might be good to keep; I haven't checked what the follow-up exception would be if the name is null and the function is called. It might be more cryptic.

Contributor (author): The check has been moved to AnalysisRegistry#buildMapping, line 414.
-        }
 
-        TokenizerFactory tokenizer = tokenizers.get(tokenizerName);
+        TokenizerFactory tokenizer = tokenizers.apply(tokenizerName);
         if (tokenizer == null) {
             throw new IllegalArgumentException(
                 "Custom Analyzer [" + name + "] failed to find tokenizer under name " + "[" + tokenizerName + "]");
@@ -108,4 +106,4 @@ public CharFilterFactory[] getCharFilters() {
     public AnalysisMode analysisMode() {
         return this.analysisMode;
     }
-}
+}
server/src/main/java/org/elasticsearch/index/analysis/CustomAnalyzerProvider.java

@@ -25,6 +25,7 @@
 import org.elasticsearch.index.mapper.TextFieldMapper;
 
 import java.util.Map;
+import java.util.function.Function;
 
 import static org.elasticsearch.index.analysis.AnalyzerComponents.createComponents;
 
@@ -44,7 +45,7 @@ public CustomAnalyzerProvider(IndexSettings indexSettings,
         this.analyzerSettings = settings;
     }
 
-    void build(final Map<String, TokenizerFactory> tokenizers,
+    void build(final Function<String, TokenizerFactory> tokenizers,
                final Map<String, CharFilterFactory> charFilters,
                final Map<String, TokenFilterFactory> tokenFilters) {
         customAnalyzer = create(name(), analyzerSettings, tokenizers, charFilters, tokenFilters);
@@ -54,7 +55,7 @@ void build(final Map<String, TokenizerFactory> tokenizers,
     /**
      * Factory method that either returns a plain {@link CustomAnalyzer} if the components used for creation support both index
      * and search time use, or a {@link ReloadableCustomAnalyzer} if the components are intended for search time use only.
      */
-    private static Analyzer create(String name, Settings analyzerSettings, Map<String, TokenizerFactory> tokenizers,
+    private static Analyzer create(String name, Settings analyzerSettings, Function<String, TokenizerFactory> tokenizers,
                                    Map<String, CharFilterFactory> charFilters,
                                    Map<String, TokenFilterFactory> tokenFilters) {
         int positionIncrementGap = TextFieldMapper.Defaults.POSITION_INCREMENT_GAP;

server/src/main/java/org/elasticsearch/index/analysis/CustomNormalizerProvider.java

This file was deleted.

server/src/main/java/org/elasticsearch/index/analysis/ReloadableCustomAnalyzer.java

@@ -120,7 +120,7 @@ public synchronized void reload(String name,
                                      final Map<String, TokenizerFactory> tokenizers,
                                      final Map<String, CharFilterFactory> charFilters,
                                      final Map<String, TokenFilterFactory> tokenFilters) {
-        AnalyzerComponents components = AnalyzerComponents.createComponents(name, settings, tokenizers, charFilters, tokenFilters);
+        AnalyzerComponents components = AnalyzerComponents.createComponents(name, settings, tokenizers::get, charFilters, tokenFilters);
         this.components = components;
     }
 
server/src/test/java/org/elasticsearch/index/analysis/CustomNormalizerTests.java

@@ -19,7 +19,6 @@

 package org.elasticsearch.index.analysis;
 
-import org.apache.lucene.analysis.MockLowerCaseFilter;
 import org.apache.lucene.analysis.MockTokenizer;
 import org.apache.lucene.util.BytesRef;
 import org.elasticsearch.common.settings.Settings;
@@ -31,11 +30,8 @@
 
 import java.io.IOException;
 import java.io.Reader;
-import java.util.List;
 import java.util.Map;
-import java.util.function.Function;
 
-import static java.util.Collections.singletonList;
 import static java.util.Collections.singletonMap;
 
 public class CustomNormalizerTests extends ESTokenStreamTestCase {
Member: Do we still need this test now that CustomNormalizerProvider is gone? I saw there is also no CustomAnalyzerTests; maybe this could be renamed, with a few tests added to cover the CustomAnalyzer parts that are not covered here?

Contributor (author): There's still a distinction in settings between analyzers and normalizers, so I think this test is still useful?

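A sketch of the settings-level distinction the author is pointing at (names illustrative): both kinds are now built by CustomAnalyzerProvider, but they live under different prefixes, and only analyzers require a tokenizer.

```java
import org.elasticsearch.common.settings.Settings;

public class AnalyzerVsNormalizerSettingsSketch {
    public static void main(String[] args) {
        Settings settings = Settings.builder()
                // Analyzer: must configure a tokenizer.
                .put("index.analysis.analyzer.my_analyzer.type", "custom")
                .put("index.analysis.analyzer.my_analyzer.tokenizer", "standard")
                .putList("index.analysis.analyzer.my_analyzer.filter", "lowercase")
                // Normalizer: may not configure a tokenizer; filters only.
                .put("index.analysis.normalizer.my_normalizer.type", "custom")
                .putList("index.analysis.normalizer.my_normalizer.filter", "lowercase")
                .build();
        System.out.println(settings);
    }
}
```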
@@ -103,36 +99,7 @@ public void testCharFilters() throws IOException {
         assertEquals(new BytesRef("zbc"), normalizer.normalize("foo", "abc"));
     }
 
-    public void testIllegalFilters() throws IOException {
-        Settings settings = Settings.builder()
-            .putList("index.analysis.normalizer.my_normalizer.filter", "mock_forbidden")
-            .put(Environment.PATH_HOME_SETTING.getKey(), createTempDir().toString())
-            .build();
-        IllegalArgumentException e = expectThrows(IllegalArgumentException.class,
-            () -> AnalysisTestsHelper.createTestAnalysisFromSettings(settings, MOCK_ANALYSIS_PLUGIN));
-        assertEquals("Custom normalizer [my_normalizer] may not use filter [mock_forbidden]", e.getMessage());
-    }
-
-    public void testIllegalCharFilters() throws IOException {
-        Settings settings = Settings.builder()
-            .putList("index.analysis.normalizer.my_normalizer.char_filter", "mock_forbidden")
-            .put(Environment.PATH_HOME_SETTING.getKey(), createTempDir().toString())
-            .build();
-        IllegalArgumentException e = expectThrows(IllegalArgumentException.class,
-            () -> AnalysisTestsHelper.createTestAnalysisFromSettings(settings, MOCK_ANALYSIS_PLUGIN));
-        assertEquals("Custom normalizer [my_normalizer] may not use char filter [mock_forbidden]", e.getMessage());
-    }
-
     private static class MockAnalysisPlugin implements AnalysisPlugin {
-        @Override
-        public List<PreConfiguredTokenFilter> getPreConfiguredTokenFilters() {
-            return singletonList(PreConfiguredTokenFilter.singleton("mock_forbidden", false, MockLowerCaseFilter::new));
-        }
-
-        @Override
-        public List<PreConfiguredCharFilter> getPreConfiguredCharFilters() {
-            return singletonList(PreConfiguredCharFilter.singleton("mock_forbidden", false, Function.identity()));
-        }
-
         @Override
         public Map<String, AnalysisProvider<CharFilterFactory>> getCharFilters() {
server/src/test/java/org/elasticsearch/index/analysis/ReloadableCustomAnalyzerTests.java

@@ -88,8 +88,8 @@ public void testBasicCtor() {
             .putList("filter", "my_filter")
             .build();
 
-        AnalyzerComponents components = createComponents("my_analyzer", analyzerSettings, testAnalysis.tokenizer, testAnalysis.charFilter,
-            Collections.singletonMap("my_filter", NO_OP_SEARCH_TIME_FILTER));
+        AnalyzerComponents components = createComponents("my_analyzer", analyzerSettings, testAnalysis.tokenizer::get,
+            testAnalysis.charFilter, Collections.singletonMap("my_filter", NO_OP_SEARCH_TIME_FILTER));
 
         try (ReloadableCustomAnalyzer analyzer = new ReloadableCustomAnalyzer(components, positionIncrementGap, offsetGap)) {
             assertEquals(positionIncrementGap, analyzer.getPositionIncrementGap(randomAlphaOfLength(5)));
@@ -106,8 +106,8 @@ public void testBasicCtor() {
             .put("tokenizer", "standard")
             .putList("filter", "lowercase")
             .build();
-        AnalyzerComponents indexAnalyzerComponents = createComponents("my_analyzer", indexAnalyzerSettings, testAnalysis.tokenizer,
-            testAnalysis.charFilter, testAnalysis.tokenFilter);
+        AnalyzerComponents indexAnalyzerComponents = createComponents("my_analyzer", indexAnalyzerSettings,
+            testAnalysis.tokenizer::get, testAnalysis.charFilter, testAnalysis.tokenFilter);
         IllegalArgumentException ex = expectThrows(IllegalArgumentException.class,
             () -> new ReloadableCustomAnalyzer(indexAnalyzerComponents, positionIncrementGap, offsetGap));
         assertEquals("ReloadableCustomAnalyzer must only be initialized with analysis components in AnalysisMode.SEARCH_TIME mode",
@@ -123,8 +123,8 @@ public void testReloading() throws IOException, InterruptedException {
             .putList("filter", "my_filter")
             .build();
 
-        AnalyzerComponents components = createComponents("my_analyzer", analyzerSettings, testAnalysis.tokenizer, testAnalysis.charFilter,
-            Collections.singletonMap("my_filter", NO_OP_SEARCH_TIME_FILTER));
+        AnalyzerComponents components = createComponents("my_analyzer", analyzerSettings, testAnalysis.tokenizer::get,
+            testAnalysis.charFilter, Collections.singletonMap("my_filter", NO_OP_SEARCH_TIME_FILTER));
         int numThreads = randomIntBetween(5, 10);
 
         ExecutorService executorService = Executors.newFixedThreadPool(numThreads);