
Commit 63e45c8

sohaibiftikhar authored and mayya-sharipova committed
Added lenient flag for synonym token filter (#31484) (#31970)
* Added lenient flag for synonym-tokenfilter. Relates to #30968
* Added docs for synonym-graph-tokenfilter
  - Also made lenient final
  - Changed from !lenient to lenient == false
* Changes after review (1)
  - Renamed to ElasticsearchSynonymParser
  - Added explanation for ElasticsearchSynonymParser::add method
  - Changed ElasticsearchSynonymParser::logger instance to static
* Added lenient option for WordnetSynonymParser
  - Also added more documentation
* Added additional documentation
* Improved documentation

(cherry picked from commit 88c270d)
1 parent a09fc17 commit 63e45c8

File tree

8 files changed: +400 −15 lines


docs/reference/analysis/tokenfilters/synonym-graph-tokenfilter.asciidoc

Lines changed: 43 additions & 1 deletion
@@ -50,7 +50,49 @@ PUT /test_index
 The above configures a `search_synonyms` filter, with a path of
 `analysis/synonym.txt` (relative to the `config` location). The
 `search_synonyms` analyzer is then configured with the filter.
-Additional settings are: `expand` (defaults to `true`).
+
+Additional settings are:
+
+* `expand` (defaults to `true`).
+* `lenient` (defaults to `false`). If `true` ignores exceptions while parsing the synonym configuration. It is important
+to note that only those synonym rules which cannot get parsed are ignored. For instance consider the following request:
+
+[source,js]
+--------------------------------------------------
+PUT /test_index
+{
+    "settings": {
+        "index" : {
+            "analysis" : {
+                "analyzer" : {
+                    "synonym" : {
+                        "tokenizer" : "standard",
+                        "filter" : ["my_stop", "synonym_graph"]
+                    }
+                },
+                "filter" : {
+                    "my_stop": {
+                        "type" : "stop",
+                        "stopwords": ["bar"]
+                    },
+                    "synonym_graph" : {
+                        "type" : "synonym_graph",
+                        "lenient": true,
+                        "synonyms" : ["foo, bar => baz"]
+                    }
+                }
+            }
+        }
+    }
+}
+--------------------------------------------------
+// CONSOLE
+With the above request the word `bar` gets skipped but a mapping `foo => baz` is still added. However, if the mapping
+being added was "foo, baz => bar" nothing would get added to the synonym list. This is because the target word for the
+mapping is itself eliminated because it was a stop word. Similarly, if the mapping was "bar, foo, baz" and `expand` was
+set to `false` no mapping would get added as when `expand=false` the target mapping is the first word. However, if
+`expand=true` then the mappings added would be equivalent to `foo, baz => foo, baz` i.e, all mappings other than the
+stop word.
 
 [float]
 ==== `tokenizer` and `ignore_case` are deprecated
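The stop-word interaction described in the doc change above (a stop filter ahead of the synonym filter removes terms from a rule, and `expand` decides the target) can be sketched as a standalone model. This is a hypothetical simplification for illustration only — `LenientRuleModel` is not part of the commit and not the actual Lucene parser:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical model of how stop-word removal affects which synonym
// mappings survive, per the explanation in the doc change above.
class LenientRuleModel {

    // Returns the surviving mapping as "lhs => rhs", or "" if nothing is added.
    static String survivingMapping(String rule, Set<String> stopwords, boolean expand) {
        if (rule.contains("=>")) {
            String[] parts = rule.split("=>");
            List<String> inputs = keep(parts[0], stopwords);
            List<String> outputs = keep(parts[1], stopwords);
            // If either side is emptied (e.g. the target was a stop word),
            // the whole rule is dropped.
            if (inputs.isEmpty() || outputs.isEmpty()) {
                return "";
            }
            return String.join(", ", inputs) + " => " + String.join(", ", outputs);
        }
        List<String> words = keep(rule, stopwords);
        if (words.isEmpty()) {
            return "";
        }
        if (expand) {
            // expand=true: all surviving words map to all surviving words.
            String all = String.join(", ", words);
            return all + " => " + all;
        }
        // expand=false: the first word of the original rule is the target;
        // if it was a stop word, nothing is added.
        String target = rule.split(",")[0].trim();
        if (stopwords.contains(target)) {
            return "";
        }
        return String.join(", ", words) + " => " + target;
    }

    private static List<String> keep(String csv, Set<String> stopwords) {
        List<String> out = new ArrayList<>();
        for (String w : csv.split(",")) {
            w = w.trim();
            if (!w.isEmpty() && !stopwords.contains(w)) {
                out.add(w);
            }
        }
        return out;
    }
}
```

With `bar` as a stop word, the model reproduces the cases in the text: `"foo, bar => baz"` survives as `foo => baz`, `"foo, baz => bar"` is dropped entirely, and `"bar, foo, baz"` is dropped with `expand=false` but kept as `foo, baz => foo, baz` with `expand=true`.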

docs/reference/analysis/tokenfilters/synonym-tokenfilter.asciidoc

Lines changed: 45 additions & 2 deletions
@@ -33,12 +33,55 @@ PUT /test_index
 
 The above configures a `synonym` filter, with a path of
 `analysis/synonym.txt` (relative to the `config` location). The
-`synonym` analyzer is then configured with the filter. Additional
-settings is: `expand` (defaults to `true`).
+`synonym` analyzer is then configured with the filter.
 
 This filter tokenize synonyms with whatever tokenizer and token filters
 appear before it in the chain.
 
+Additional settings are:
+
+* `expand` (defaults to `true`).
+* `lenient` (defaults to `false`). If `true` ignores exceptions while parsing the synonym configuration. It is important
+to note that only those synonym rules which cannot get parsed are ignored. For instance consider the following request:
+
+[source,js]
+--------------------------------------------------
+PUT /test_index
+{
+    "settings": {
+        "index" : {
+            "analysis" : {
+                "analyzer" : {
+                    "synonym" : {
+                        "tokenizer" : "standard",
+                        "filter" : ["my_stop", "synonym"]
+                    }
+                },
+                "filter" : {
+                    "my_stop": {
+                        "type" : "stop",
+                        "stopwords": ["bar"]
+                    },
+                    "synonym" : {
+                        "type" : "synonym",
+                        "lenient": true,
+                        "synonyms" : ["foo, bar => baz"]
+                    }
+                }
+            }
+        }
+    }
+}
+--------------------------------------------------
+// CONSOLE
+With the above request the word `bar` gets skipped but a mapping `foo => baz` is still added. However, if the mapping
+being added was "foo, baz => bar" nothing would get added to the synonym list. This is because the target word for the
+mapping is itself eliminated because it was a stop word. Similarly, if the mapping was "bar, foo, baz" and `expand` was
+set to `false` no mapping would get added as when `expand=false` the target mapping is the first word. However, if
+`expand=true` then the mappings added would be equivalent to `foo, baz => foo, baz` i.e, all mappings other than the
+stop word.
+
+
 [float]
 ==== `tokenizer` and `ignore_case` are deprecated
 
server/src/main/java/org/elasticsearch/index/analysis/ESSolrSynonymParser.java

Lines changed: 68 additions & 0 deletions (new file)

/*
 * Licensed to Elasticsearch under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */

package org.elasticsearch.index.analysis;

import org.apache.logging.log4j.Logger;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.synonym.SolrSynonymParser;
import org.apache.lucene.util.CharsRef;
import org.apache.lucene.util.CharsRefBuilder;
import org.elasticsearch.common.logging.Loggers;

import java.io.IOException;

public class ESSolrSynonymParser extends SolrSynonymParser {

    private final boolean lenient;
    private static final Logger logger =
        Loggers.getLogger(ESSolrSynonymParser.class, "ESSolrSynonymParser");

    public ESSolrSynonymParser(boolean dedup, boolean expand, boolean lenient, Analyzer analyzer) {
        super(dedup, expand, analyzer);
        this.lenient = lenient;
    }

    @Override
    public void add(CharsRef input, CharsRef output, boolean includeOrig) {
        // This condition follows up on the overridden analyze method. In case lenient was set to true and there was an
        // exception during super.analyze we return a zero-length CharsRef for that word which caused an exception. When
        // the synonym mappings for the words are added using the add method we skip the ones that were left empty by
        // analyze i.e., in the case when lenient is set we only add those combinations which are non-zero-length. The
        // else would happen only in the case when the input or output is empty and lenient is set, in which case we
        // quietly ignore it. For more details on the control-flow see SolrSynonymParser::addInternal.
        if (lenient == false || (input.length > 0 && output.length > 0)) {
            super.add(input, output, includeOrig);
        }
    }

    @Override
    public CharsRef analyze(String text, CharsRefBuilder reuse) throws IOException {
        try {
            return super.analyze(text, reuse);
        } catch (IllegalArgumentException ex) {
            if (lenient) {
                logger.info("Synonym rule for [" + text + "] was ignored");
                return new CharsRef("");
            } else {
                throw ex;
            }
        }
    }
}
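The lenient pattern in ESSolrSynonymParser above — catch the parse failure in `analyze`, return a zero-length sentinel, and have `add` skip any mapping containing a sentinel — can be sketched independently of Lucene. This is a hedged, generic illustration; `LenientCollector` and its `rejectingAmpersand` analyzer are hypothetical stand-ins, not the commit's API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Generic sketch of the lenient pattern: wrap a strict analyze step so that,
// when lenient, a failing term becomes an empty sentinel, and pairs
// containing a sentinel are skipped on add.
class LenientCollector {

    private final boolean lenient;
    private final Function<String, String> strictAnalyze;
    private final List<String> added = new ArrayList<>();

    LenientCollector(boolean lenient, Function<String, String> strictAnalyze) {
        this.lenient = lenient;
        this.strictAnalyze = strictAnalyze;
    }

    // Mirrors the overridden analyze: swallow the failure only in lenient
    // mode and return an empty sentinel (the analogue of new CharsRef("")).
    String analyze(String text) {
        try {
            return strictAnalyze.apply(text);
        } catch (IllegalArgumentException ex) {
            if (lenient) {
                return "";
            }
            throw ex;
        }
    }

    // Mirrors the overridden add: skip mappings where either side was emptied.
    void add(String input, String output) {
        if (lenient == false || (!input.isEmpty() && !output.isEmpty())) {
            added.add(input + " => " + output);
        }
    }

    List<String> added() {
        return added;
    }

    // Example strict analyzer (hypothetical): rejects terms containing '&',
    // like the "&,and" rule exercised by the tests below.
    static Function<String, String> rejectingAmpersand() {
        return s -> {
            if (s.contains("&")) {
                throw new IllegalArgumentException("bad term: " + s);
            }
            return s;
        };
    }
}
```

In lenient mode a rule like `& => and` is silently dropped while well-formed rules are still collected; in non-lenient mode the first bad term propagates the exception, matching the `analyze` override above.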
server/src/main/java/org/elasticsearch/index/analysis/ESWordnetSynonymParser.java

Lines changed: 68 additions & 0 deletions (new file)

/*
 * Licensed to Elasticsearch under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */

package org.elasticsearch.index.analysis;

import org.apache.logging.log4j.Logger;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.synonym.WordnetSynonymParser;
import org.apache.lucene.util.CharsRef;
import org.apache.lucene.util.CharsRefBuilder;
import org.elasticsearch.common.logging.Loggers;

import java.io.IOException;

public class ESWordnetSynonymParser extends WordnetSynonymParser {

    private final boolean lenient;
    private static final Logger logger =
        Loggers.getLogger(ESWordnetSynonymParser.class, "ESWordnetSynonymParser");

    public ESWordnetSynonymParser(boolean dedup, boolean expand, boolean lenient, Analyzer analyzer) {
        super(dedup, expand, analyzer);
        this.lenient = lenient;
    }

    @Override
    public void add(CharsRef input, CharsRef output, boolean includeOrig) {
        // This condition follows up on the overridden analyze method. In case lenient was set to true and there was an
        // exception during super.analyze we return a zero-length CharsRef for that word which caused an exception. When
        // the synonym mappings for the words are added using the add method we skip the ones that were left empty by
        // analyze i.e., in the case when lenient is set we only add those combinations which are non-zero-length. The
        // else would happen only in the case when the input or output is empty and lenient is set, in which case we
        // quietly ignore it. For more details on the control-flow see SolrSynonymParser::addInternal.
        if (lenient == false || (input.length > 0 && output.length > 0)) {
            super.add(input, output, includeOrig);
        }
    }

    @Override
    public CharsRef analyze(String text, CharsRefBuilder reuse) throws IOException {
        try {
            return super.analyze(text, reuse);
        } catch (IllegalArgumentException ex) {
            if (lenient) {
                logger.info("Synonym rule for [" + text + "] was ignored");
                return new CharsRef("");
            } else {
                throw ex;
            }
        }
    }
}

server/src/main/java/org/elasticsearch/index/analysis/SynonymGraphTokenFilterFactory.java

Lines changed: 4 additions & 6 deletions
@@ -21,10 +21,8 @@
 
 import org.apache.lucene.analysis.Analyzer;
 import org.apache.lucene.analysis.TokenStream;
-import org.apache.lucene.analysis.synonym.SolrSynonymParser;
 import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
 import org.apache.lucene.analysis.synonym.SynonymMap;
-import org.apache.lucene.analysis.synonym.WordnetSynonymParser;
 import org.elasticsearch.common.settings.Settings;
 import org.elasticsearch.env.Environment;
 import org.elasticsearch.index.IndexSettings;
@@ -58,11 +56,11 @@ public Factory(String name, final Analyzer analyzerForParseSynonym, Reader rules
             try {
                 SynonymMap.Builder parser;
                 if ("wordnet".equalsIgnoreCase(format)) {
-                    parser = new WordnetSynonymParser(true, expand, analyzerForParseSynonym);
-                    ((WordnetSynonymParser) parser).parse(rulesReader);
+                    parser = new ESWordnetSynonymParser(true, expand, lenient, analyzerForParseSynonym);
+                    ((ESWordnetSynonymParser) parser).parse(rulesReader);
                 } else {
-                    parser = new SolrSynonymParser(true, expand, analyzerForParseSynonym);
-                    ((SolrSynonymParser) parser).parse(rulesReader);
+                    parser = new ESSolrSynonymParser(true, expand, lenient, analyzerForParseSynonym);
+                    ((ESSolrSynonymParser) parser).parse(rulesReader);
                 }
                 synonymMap = parser.build();
             } catch (Exception e) {
server/src/main/java/org/elasticsearch/index/analysis/SynonymTokenFilterFactory.java

Lines changed: 6 additions & 6 deletions
@@ -23,10 +23,8 @@
 import org.apache.lucene.analysis.LowerCaseFilter;
 import org.apache.lucene.analysis.TokenStream;
 import org.apache.lucene.analysis.Tokenizer;
-import org.apache.lucene.analysis.synonym.SolrSynonymParser;
 import org.apache.lucene.analysis.synonym.SynonymFilter;
 import org.apache.lucene.analysis.synonym.SynonymMap;
-import org.apache.lucene.analysis.synonym.WordnetSynonymParser;
 import org.elasticsearch.Version;
 import org.elasticsearch.common.settings.Settings;
 import org.elasticsearch.env.Environment;
@@ -47,6 +45,7 @@ public class SynonymTokenFilterFactory extends AbstractTokenFilterFactory {
     protected final boolean ignoreCase;
     protected final String format;
     protected final boolean expand;
+    protected final boolean lenient;
     protected final Settings settings;
 
     public SynonymTokenFilterFactory(IndexSettings indexSettings, Environment env, AnalysisRegistry analysisRegistry,
@@ -81,6 +80,7 @@ public SynonymTokenFilterFactory(IndexSettings indexSettings, Environment env, A
             this.tokenizerFactory = null;
         }
 
+        this.lenient = settings.getAsBoolean("lenient", false);
         this.format = settings.get("format", "");
     }
 
@@ -143,11 +143,11 @@ protected TokenStreamComponents createComponents(String fieldName) {
             try {
                 SynonymMap.Builder parser;
                 if ("wordnet".equalsIgnoreCase(format)) {
-                    parser = new WordnetSynonymParser(true, expand, analyzer);
-                    ((WordnetSynonymParser) parser).parse(rulesReader);
+                    parser = new ESWordnetSynonymParser(true, expand, lenient, analyzerForParseSynonym);
+                    ((ESWordnetSynonymParser) parser).parse(rulesReader);
                 } else {
-                    parser = new SolrSynonymParser(true, expand, analyzer);
-                    ((SolrSynonymParser) parser).parse(rulesReader);
+                    parser = new ESSolrSynonymParser(true, expand, lenient, analyzerForParseSynonym);
+                    ((ESSolrSynonymParser) parser).parse(rulesReader);
                 }
                 synonymMap = parser.build();
             } catch (Exception e) {
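The new field above is populated with `settings.getAsBoolean("lenient", false)`, so leniency is off unless the filter configuration sets it explicitly. A minimal stand-in for that defaulted lookup — `FlatSettings` is a toy sketch, not the real Elasticsearch `Settings` API:

```java
import java.util.Map;

// Toy stand-in for the settings.getAsBoolean("lenient", false) lookup above:
// absent keys fall back to the supplied default, so lenient stays false
// unless the filter definition configures it.
class FlatSettings {
    private final Map<String, String> map;

    FlatSettings(Map<String, String> map) {
        this.map = map;
    }

    boolean getAsBoolean(String key, boolean defaultValue) {
        String value = map.get(key);
        return value == null ? defaultValue : Boolean.parseBoolean(value);
    }
}
```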
server/src/test/java/org/elasticsearch/index/analysis/ESSolrSynonymParserTests.java

Lines changed: 78 additions & 0 deletions (new file)

/*
 * Licensed to Elasticsearch under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied. See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */

package org.elasticsearch.index.analysis;

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.synonym.SynonymFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.elasticsearch.test.ESTokenStreamTestCase;

import java.io.IOException;
import java.io.StringReader;
import java.text.ParseException;

import static org.hamcrest.Matchers.containsString;

public class ESSolrSynonymParserTests extends ESTokenStreamTestCase {

    public void testLenientParser() throws IOException, ParseException {
        ESSolrSynonymParser parser = new ESSolrSynonymParser(true, false, true, new StandardAnalyzer());
        String rules =
            "&,and\n" +
            "come,advance,approach\n";
        StringReader rulesReader = new StringReader(rules);
        parser.parse(rulesReader);
        SynonymMap synonymMap = parser.build();
        Tokenizer tokenizer = new StandardTokenizer();
        tokenizer.setReader(new StringReader("approach quietly then advance & destroy"));
        TokenStream ts = new SynonymFilter(tokenizer, synonymMap, false);
        assertTokenStreamContents(ts, new String[]{"come", "quietly", "then", "come", "destroy"});
    }

    public void testLenientParserWithSomeIncorrectLines() throws IOException, ParseException {
        CharArraySet stopSet = new CharArraySet(1, true);
        stopSet.add("bar");
        ESSolrSynonymParser parser =
            new ESSolrSynonymParser(true, false, true, new StandardAnalyzer(stopSet));
        String rules = "foo,bar,baz";
        StringReader rulesReader = new StringReader(rules);
        parser.parse(rulesReader);
        SynonymMap synonymMap = parser.build();
        Tokenizer tokenizer = new StandardTokenizer();
        tokenizer.setReader(new StringReader("first word is foo, then bar and lastly baz"));
        TokenStream ts = new SynonymFilter(new StopFilter(tokenizer, stopSet), synonymMap, false);
        assertTokenStreamContents(ts, new String[]{"first", "word", "is", "foo", "then", "and", "lastly", "foo"});
    }

    public void testNonLenientParser() {
        ESSolrSynonymParser parser = new ESSolrSynonymParser(true, false, false, new StandardAnalyzer());
        String rules =
            "&,and=>and\n" +
            "come,advance,approach\n";
        StringReader rulesReader = new StringReader(rules);
        ParseException ex = expectThrows(ParseException.class, () -> parser.parse(rulesReader));
        assertThat(ex.getMessage(), containsString("Invalid synonym rule at line 1"));
    }
}
