Add query param to limit highlighting to specified length #67325

Merged (30 commits, Feb 16, 2021)

Commits:
- ed824b0  Add option to limit highlighting to max offset (matriv, Jan 12, 2021)
- 2625d29  fix bwc (matriv, Jan 12, 2021)
- adef3e7  fix version check for bwc (matriv, Jan 12, 2021)
- e93d428  skip for previous versions (matriv, Jan 12, 2021)
- 1b5404a  Add suggestion to use new query param to the exception message (matriv, Jan 14, 2021)
- be553cf  Merge remote-tracking branch 'upstream/master' into limitHighlighting (matriv, Jan 14, 2021)
- 7de3245  reword error message and adjust tests (matriv, Jan 14, 2021)
- e06e510  change wrapping method, fix AnnotatedTextHighlighter (matriv, Jan 14, 2021)
- b5482e3  remove uneccessary override of method (matriv, Jan 14, 2021)
- 29606b0  Merge remote-tracking branch 'upstream/master' into limitHighlighting (matriv, Jan 14, 2021)
- e3e6a19  remove unused import (matriv, Jan 14, 2021)
- 9af770e  Merge remote-tracking branch 'upstream/master' into limitHighlighting (matriv, Jan 14, 2021)
- ce3b7ef  revert remove mode (matriv, Jan 14, 2021)
- 7bffec8  Merge remote-tracking branch 'upstream/master' into limitHighlighting (matriv, Feb 1, 2021)
- 294b1ec  Merge remote-tracking branch 'upstream/master' into limitHighlighting (matriv, Feb 4, 2021)
- 18bdb67  Merge remote-tracking branch 'upstream/master' into limitHighlighting (matriv, Feb 5, 2021)
- af1b108  Merge remote-tracking branch 'upstream/master' into limitHighlighting (matriv, Feb 5, 2021)
- f533369  Merge remote-tracking branch 'upstream/master' into limitHighlighting (matriv, Feb 11, 2021)
- 1e51405  fix licence (matriv, Feb 11, 2021)
- c0fbd91  Change semantics of the new setting. (matriv, Feb 11, 2021)
- 91ca994  revert import order change (matriv, Feb 11, 2021)
- 75e3d42  fix checkstyle (matriv, Feb 11, 2021)
- 95246cf  fix checkstyle (matriv, Feb 11, 2021)
- 410d914  add more tests (matriv, Feb 11, 2021)
- 7a53828  Merge remote-tracking branch 'upstream/master' into limitHighlighting (matriv, Feb 11, 2021)
- 79cb5ac  fix checkstyle (matriv, Feb 12, 2021)
- c9a82a9  Merge remote-tracking branch 'upstream/master' into limitHighlighting (matriv, Feb 12, 2021)
- b7ae8cc  address comments (matriv, Feb 15, 2021)
- c547825  Merge remote-tracking branch 'upstream/master' into limitHighlighting (matriv, Feb 15, 2021)
- edb2bd8  rephrase error message (matriv, Feb 15, 2021)
docs/reference/index-modules.asciidoc (1 addition, 0 deletions)

@@ -223,6 +223,7 @@ specific index module:
The maximum number of tokens that can be produced using the `_analyze` API.
Defaults to `10000`.

[[index-max-analyzed-offset]]
`index.highlight.max_analyzed_offset`::

The maximum number of characters that will be analyzed for a highlight request.
docs/reference/search/search-your-data/highlighting.asciidoc (10 additions, 2 deletions)

@@ -117,7 +117,7 @@ needs highlighting. The `plain` highlighter always uses plain highlighting.
Plain highlighting for large texts may require a substantial amount of time and memory.
To protect against this, the maximum number of text characters that will be analyzed is
limited to 1000000 by default. This limit can be changed
for a particular index with the index setting <<index-max-analyzed-offset,`index.highlight.max_analyzed_offset`>>.
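For instance, raising the limit on a single index is a one-setting update (the index name `my-index` is illustrative):

```console
PUT /my-index/_settings
{
  "index.highlight.max_analyzed_offset": 2000000
}
```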

[discrete]
[[highlighting-settings]]
@@ -242,6 +242,14 @@ require_field_match:: By default, only fields that contain a query match are
highlighted. Set `require_field_match` to `false` to highlight all fields.
Defaults to `true`.

limit_to_max_analyzed_offset:: By default, the number of characters analyzed
for a highlight request is bounded by the
<<index-max-analyzed-offset,`index.highlight.max_analyzed_offset`>> index
setting, and an error is returned when a field exceeds this limit. If this
option is set to `true`, analysis for highlighting instead stops at the
configured maximum: the rest of the text is not processed and therefore not
highlighted, and no error is returned.
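A request body that opts into truncation instead of an error could look like this (the `comment` field name is illustrative):

```json
{
  "query": { "match": { "comment": "fox" } },
  "highlight": {
    "limit_to_max_analyzed_offset": true,
    "fields": { "comment": {} }
  }
}
```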

tags_schema:: Set to `styled` to use the built-in tag schema. The `styled`
schema defines the following `pre_tags` and defines `post_tags` as
`</em>`.
@@ -1119,4 +1127,4 @@ using the passage's `matchStarts` and `matchEnds` information:
I'll be the <em>only</em> <em>fox</em> in the world for you.

These formatted strings are the final result of the highlighter, returned
to the user.
@@ -42,6 +42,7 @@
import org.apache.lucene.search.uhighlight.CustomUnifiedHighlighter;
import org.apache.lucene.search.uhighlight.Snippet;
import org.apache.lucene.search.uhighlight.SplittingBreakIterator;
import org.apache.lucene.search.uhighlight.UnifiedHighlighter;
import org.apache.lucene.store.Directory;
import org.elasticsearch.common.Strings;
import org.elasticsearch.index.mapper.annotatedtext.AnnotatedTextFieldMapper.AnnotatedHighlighterAnalyzer;
@@ -59,78 +60,97 @@

public class AnnotatedTextHighlighterTests extends ESTestCase {

private void assertHighlightOneDoc(String fieldName, String[] markedUpInputs,
Query query, Locale locale, BreakIterator breakIterator,
int noMatchSize, String[] expectedPassages) throws Exception {

assertHighlightOneDoc(fieldName, markedUpInputs, query, locale, breakIterator, noMatchSize, expectedPassages,
Integer.MAX_VALUE, false);
}

private void assertHighlightOneDoc(String fieldName, String[] markedUpInputs,
                                   Query query, Locale locale, BreakIterator breakIterator,
                                   int noMatchSize, String[] expectedPassages,
                                   int maxAnalyzedOffset, boolean limitToMaxAnalyzedOffset) throws Exception {

Directory dir = null;
DirectoryReader reader = null;
try {
dir = newDirectory();

// Annotated fields wrap the usual analyzer with one that injects extra tokens
Analyzer wrapperAnalyzer = new AnnotationAnalyzerWrapper(new StandardAnalyzer());
IndexWriterConfig iwc = newIndexWriterConfig(wrapperAnalyzer);
iwc.setMergePolicy(newTieredMergePolicy(random()));
RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwc);
FieldType ft = new FieldType(TextField.TYPE_STORED);
if (randomBoolean()) {
ft.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
} else {
ft.setIndexOptions(IndexOptions.DOCS_AND_FREQS);
}
ft.freeze();
Document doc = new Document();
for (String input : markedUpInputs) {
Field field = new Field(fieldName, "", ft);
field.setStringValue(input);
doc.add(field);
}
iw.addDocument(doc);
reader = iw.getReader();
IndexSearcher searcher = newSearcher(reader);
iw.close();

AnnotatedText[] annotations = new AnnotatedText[markedUpInputs.length];
for (int i = 0; i < markedUpInputs.length; i++) {
annotations[i] = AnnotatedText.parse(markedUpInputs[i]);
}
AnnotatedHighlighterAnalyzer hiliteAnalyzer = new AnnotatedHighlighterAnalyzer(wrapperAnalyzer);
hiliteAnalyzer.setAnnotations(annotations);
AnnotatedPassageFormatter passageFormatter = new AnnotatedPassageFormatter(new DefaultEncoder());
passageFormatter.setAnnotations(annotations);

ArrayList<Object> plainTextForHighlighter = new ArrayList<>(annotations.length);
for (int i = 0; i < annotations.length; i++) {
plainTextForHighlighter.add(annotations[i].textMinusMarkup);
}

TopDocs topDocs = searcher.search(new MatchAllDocsQuery(), 1, Sort.INDEXORDER);
assertThat(topDocs.totalHits.value, equalTo(1L));
String rawValue = Strings.collectionToDelimitedString(plainTextForHighlighter, String.valueOf(MULTIVAL_SEP_CHAR));
CustomUnifiedHighlighter highlighter = new CustomUnifiedHighlighter(
searcher,
hiliteAnalyzer,
UnifiedHighlighter.OffsetSource.ANALYSIS,
passageFormatter,
locale,
breakIterator,
"index",
"text",
query,
noMatchSize,
expectedPassages.length,
name -> "text".equals(name),
maxAnalyzedOffset,
limitToMaxAnalyzedOffset
);
highlighter.setFieldMatcher((name) -> "text".equals(name));
final Snippet[] snippets = highlighter.highlightField(getOnlyLeafReader(reader), topDocs.scoreDocs[0].doc, () -> rawValue);
assertEquals(expectedPassages.length, snippets.length);
for (int i = 0; i < snippets.length; i++) {
assertEquals(expectedPassages[i], snippets[i].getText());
}
} finally {
[Review comment, Contributor] You can use try-with-resources here I think?

[matriv, Contributor Author, Jan 12, 2021] Thought about it, but we would need
two tries, something like this:
https://gist.github.com/matriv/ca6ca5fe191c8e0b1e49af282ec90669
so I preferred the traditional way, but happy to change if you prefer
try-with-resources.

if (reader != null) {
reader.close();
}
if (dir != null) {
dir.close();
}
}
}

public void testAnnotatedTextStructuredMatch() throws Exception {
// Check that a structured token eg a URL can be highlighted in a query
// on marked-up
@@ -202,4 +222,33 @@ public void testBadAnnotation() throws Exception {
assertHighlightOneDoc("text", markedUpInputs, query, Locale.ROOT, breakIterator, 0, expectedPassages);
}

public void testExceedMaxAnalyzedOffset() throws Exception {
TermQuery query = new TermQuery(new Term("text", "exceeds"));
BreakIterator breakIterator = new CustomSeparatorBreakIterator(MULTIVAL_SEP_CHAR);
assertHighlightOneDoc("text", new String[] { "[Short Text](Short+Text)" }, query, Locale.ROOT, breakIterator, 0, new String[] {});

IllegalArgumentException e = expectThrows(
IllegalArgumentException.class,
() -> assertHighlightOneDoc(
"text",
new String[] { "[Long Text exceeds](Long+Text+exceeds) MAX analyzed offset)" },
query,
Locale.ROOT,
breakIterator,
0,
new String[] {},
15,
false
)
);
assertEquals(
"The length of [text] field of [0] doc of [index] index has exceeded [15] - maximum allowed to be analyzed for "
+ "highlighting. This maximum can be set by changing the [index.highlight.max_analyzed_offset] index level setting. "
+ "For large texts, indexing with offsets or term vectors is recommended!",
e.getMessage()
);

assertHighlightOneDoc("text", new String[] { "[Long Text exceeds](Long+Text+exceeds) MAX analyzed offset" },
query, Locale.ROOT, breakIterator, 0, new String[] {"Long Text [exceeds](_hit_term=exceeds) MAX analyzed offset"}, 15, true);
}
}
@@ -6,7 +6,7 @@ setup:
body:
settings:
number_of_shards: 1
index.highlight.max_analyzed_offset: 10
index.highlight.max_analyzed_offset: 20
mappings:
properties:
field1:
@@ -37,6 +37,16 @@ setup:
body: {"query" : {"match" : {"field1" : "fox"}}, "highlight" : {"type" : "unified", "fields" : {"field1" : {}}}}
- match: { error.root_cause.0.type: "illegal_argument_exception" }

---
"Unified highlighter on a field WITHOUT OFFSETS exceeding index.highlight.max_analyzed_offset with limit_to_max_analyzed_offset=true should SUCCEED":

- do:
search:
rest_total_hits_as_int: true
index: test1
body: {"query" : {"match" : {"field1" : "fox"}}, "highlight" : {"type" : "unified", "fields" : {"field1" : {}}, "limit_to_max_analyzed_offset": "true"}}
- match: {hits.hits.0.highlight.field1.0: "The quick brown <em>fox</em> went to the forest and saw another fox."}


---
"Plain highlighter on a field WITHOUT OFFSETS exceeding index.highlight.max_analyzed_offset should FAIL":
@@ -46,9 +56,19 @@ setup:
search:
rest_total_hits_as_int: true
index: test1
body: {"query" : {"match" : {"field1" : "fox"}}, "highlight" : {"type" : "plain", "fields" : {"field1" : {}}}}
- match: { error.root_cause.0.type: "illegal_argument_exception" }

---
"Plain highlighter on a field WITHOUT OFFSETS exceeding index.highlight.max_analyzed_offset with limit_to_max_analyzed_offset=true should SUCCEED":

- do:
search:
rest_total_hits_as_int: true
index: test1
body: {"query" : {"match" : {"field1" : "fox"}}, "highlight" : {"type" : "plain", "fields" : {"field1" : {}}, "limit_to_max_analyzed_offset": "true"}}
- match: {hits.hits.0.highlight.field1.0: "The quick brown <em>fox</em> went to the forest and saw another fox."}


---
"Unified highlighter on a field WITH OFFSETS exceeding index.highlight.max_analyzed_offset should SUCCEED":
@@ -71,3 +91,13 @@ setup:
index: test1
body: {"query" : {"match" : {"field2" : "fox"}}, "highlight" : {"type" : "plain", "fields" : {"field2" : {}}}}
- match: { error.root_cause.0.type: "illegal_argument_exception" }

---
"Plain highlighter on a field WITH OFFSETS exceeding index.highlight.max_analyzed_offset with limit_to_max_analyzed_offset=true should SUCCEED":

- do:
search:
rest_total_hits_as_int: true
index: test1
body: {"query" : {"match" : {"field2" : "fox"}}, "highlight" : {"type" : "plain", "fields" : {"field2" : {}}, "limit_to_max_analyzed_offset": "true"}}
- match: {hits.hits.0.highlight.field2.0: "The quick brown <em>fox</em> went to the forest and saw another fox."}
@@ -65,6 +65,7 @@ public class CustomUnifiedHighlighter extends UnifiedHighlighter {
private final int noMatchSize;
private final FieldHighlighter fieldHighlighter;
private final int maxAnalyzedOffset;
private final boolean limitToMaxAnalyzedOffset;

/**
* Creates a new instance of {@link CustomUnifiedHighlighter}
@@ -96,7 +97,8 @@ public CustomUnifiedHighlighter(IndexSearcher searcher,
int noMatchSize,
int maxPassages,
Predicate<String> fieldMatcher,
int maxAnalyzedOffset,
boolean limitToMaxAnalyzedOffset) throws IOException {
super(searcher, analyzer);
this.offsetSource = offsetSource;
this.breakIterator = breakIterator;
@@ -107,6 +109,7 @@ public CustomUnifiedHighlighter(IndexSearcher searcher,
this.noMatchSize = noMatchSize;
this.setFieldMatcher(fieldMatcher);
this.maxAnalyzedOffset = maxAnalyzedOffset;
this.limitToMaxAnalyzedOffset = limitToMaxAnalyzedOffset;
fieldHighlighter = getFieldHighlighter(field, query, extractTerms(query), maxPassages);
}

@@ -123,7 +126,7 @@ public Snippet[] highlightField(LeafReader reader, int docId, CheckedSupplier<St
return null;
}
int fieldValueLength = fieldValue.length();
if (limitToMaxAnalyzedOffset == false && offsetSource == OffsetSource.ANALYSIS && fieldValueLength > maxAnalyzedOffset) {
throw new IllegalArgumentException(
"The length of ["
+ field
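Stripped of the Lucene details, the guard that `limitToMaxAnalyzedOffset` changes can be sketched in Python. The names mirror the Java fields, and modeling the limit as plain string truncation is an assumption: the Java change only skips the error and lets analysis stop at the configured offset.

```python
# Sketch of the guard in CustomUnifiedHighlighter.highlightField (assumed
# semantics): with the flag off, fields that require analysis and exceed the
# limit are rejected; with it on, analysis simply stops at the limit.

ANALYSIS = "analysis"  # stands in for UnifiedHighlighter.OffsetSource.ANALYSIS

def check_highlight(field_value: str, offset_source: str,
                    max_analyzed_offset: int,
                    limit_to_max_analyzed_offset: bool) -> str:
    if (not limit_to_max_analyzed_offset
            and offset_source == ANALYSIS
            and len(field_value) > max_analyzed_offset):
        raise ValueError(
            f"The length of the field has exceeded [{max_analyzed_offset}] - "
            "maximum allowed to be analyzed for highlighting"
        )
    # With the flag set, only the first max_analyzed_offset characters are fed
    # to analysis; the remainder is left unprocessed (and unhighlighted).
    if limit_to_max_analyzed_offset and len(field_value) > max_analyzed_offset:
        return field_value[:max_analyzed_offset]
    return field_value
```

This mirrors the three YAML test cases above: short fields pass, long analyzed fields fail by default, and the same long fields succeed with truncated analysis once the flag is set.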