Skip to content

Add query param to limit highlighting to specified length #67325

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 30 commits into from
Feb 16, 2021

Conversation

matriv
Copy link
Contributor

@matriv matriv commented Jan 12, 2021

Add a max_analyzed_offset query parameter to allow users
to limit the highlighting of text fields to a value less than or equal to the
index.highlight.max_analyzed_offset, thus avoiding an exception when
the length of the text field exceeds the limit. The highlighting still takes place,
but stops at the length defined by the new parameter.

Closes: #52155

Add a query parameter `limit_to_max_analyzed_offset` to allow users
to limit the highlighting of text fields to the value of the
`index.highlight.max_analyzed_offset`, thus preventing from throwing
an exception when the length of the text field exceeds the limit.
The highlighting still takes place but up to the length set by the
setting.

Relates to: elastic#52155
@matriv matriv force-pushed the limitHighlighting branch from 37978d1 to ed824b0 Compare January 12, 2021 11:24
@jimczi jimczi removed their request for review January 12, 2021 11:45
Copy link
Contributor

@romseygeek romseygeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Analyzer implementation LGTM, I'll defer to Jim and/or Mark on the API. Does it make sense to have a separate query parameter, rather than doing everything with the index setting? That is, would it be easier to use if we just changed things so that you always get the limiting filter added, and you don't get exceptions any more?

for (int i = 0; i < snippets.length; i++) {
assertEquals(expectedPassages[i], snippets[i].getText());
}
} finally {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use try-with-resources here I think?

Copy link
Contributor Author

@matriv matriv Jan 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought about it but we need to tries since, something like that: https://gist.github.com/matriv/ca6ca5fe191c8e0b1e49af282ec90669

so I preferred the traditional way, but happy to change if you prefer the try-with-resources.

@@ -206,6 +211,7 @@ public final void writeTo(StreamOutput out) throws IOException {
out.writeMap(options);
}
out.writeOptionalBoolean(requireFieldMatch);
out.writeOptionalBoolean(limitToMaxAnalyzedOffset);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you'll need to do a version check here so that mixed-cluster highlighting works

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done it already, thx.

@matriv matriv marked this pull request as ready for review January 12, 2021 12:07
@elasticmachine elasticmachine added the Team:Search Meta label for search team label Jan 12, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

@jimczi
Copy link
Contributor

jimczi commented Jan 13, 2021

, I'll defer to Jim and/or Mark on the API. Does it make sense to have a separate query parameter, rather than doing everything with the index setting? That is, would it be easier to use if we just changed things so that you always get the limiting filter added, and you don't get exceptions any more?

+1 for a separate query parameter. I'd also like to remove that setting but not sure if we need to do it in the same PR. Maybe we can consider this new option independently of the fact that the setting to limit highlighting is not useful ?

@matriv
Copy link
Contributor Author

matriv commented Jan 13, 2021

, I'll defer to Jim and/or Mark on the API. Does it make sense to have a separate query parameter, rather than doing everything with the index setting? That is, would it be easier to use if we just changed things so that you always get the limiting filter added, and you don't get exceptions any more?

+1 for a separate query parameter. I'd also like to remove that setting but not sure if we need to do it in the same PR. Maybe we can consider this new option independently of the fact that the setting to limit highlighting is not useful ?

Sorry, I'm a bit confused @jimczi . We should add this query param as is already in this PR, correct?
I don't get the rest, could you please explain a bit more about the the setting to limit highlighting is not useful ?

@jimczi
Copy link
Contributor

jimczi commented Jan 13, 2021

Can you describe what's implemented with an example ? What's the relation between the query parameter and the setting ? Do we need both ? My +1 was for a dedicated query parameter that is independent of the setting.

@markharwood
Copy link
Contributor

would it be easier to use if we just changed things so that you always get the limiting filter added, and you don't get exceptions any more?

Taking the perspectives of the various parties involved and what they want:

  1. The es cluster admin wants to shut down abusive apps with noisy failures (the point of the existing index setting).
  2. The client app developer has a new setting to write a more responsible app which doesn't blow up.
  3. The end user doing a search on big docs no longer gets errors if the app developer has used the flag but doesn't get a "truncated" warning.

The existing error message below for exceeding the index.highlight.max_analyzed_offset is not ideal.

 "The length of [text] field of [0] doc of [index] index has exceeded [xx] - maximum allowed to be analyzed for "
                + "highlighting. This maximum can be set by changing the [index.highlight.max_analyzed_offset] index level setting. "
                + "For large texts, indexing with offsets or term vectors is recommended!",

It should probably offer an alternative remedy. This is the opportunity for the admin to tell the client app developer to write a better app so we should advertise the new limit_to_max_analyzed_offset flag in this message rather than only suggesting upping the index.highlight.max_analyzed_offset limit.

With regards to point 3 - the client app developer might want to tell users when content is truncated. This could be done by testing if the length of the highlighted string is greater than the limit_to_max_analyzed_offset setting but this is not a reliable test. A truncated flag in the response would be more robust but maybe warning users that text has been truncated is not too important given the hit-or-miss nature of highlighters anyway.

@matriv
Copy link
Contributor Author

matriv commented Jan 13, 2021

@markharwood I had in mind to have a follow up PR to have this warning/flag regarding the "truncation" of the highlighting, since I agree this is useful to avoid confusion for the user. For "simple" users, I guess they will first try to highlight a large doc, get the error, and then use the flag to highlight up to the limit, so they are aware of what's going on, therefore, of course I need to add the new param to the error message as a suggestion.

As you mentioned in point 1. seem we want to have both setting on the index and the per-query param.

Maybe, if we have a query param that has a similar name with the setting like: max_analyzed_offset which overrides the index setting and uses the filter to stop highlighting and avoid exceptions? (@jimczi probably what you're suggesting?). To me this doesn't sound bad, but I think we need to consider that in this case, the query param doesn't only override the limit of the index setting, but also changes the behaviour (query now succeeds with truncated highlighting).

@matriv
Copy link
Contributor Author

matriv commented Jan 14, 2021

Reworded the exception message with the help of @romseygeek, and removed the mention to index offsets and term vectors.

matriv added a commit that referenced this pull request Feb 16, 2021
…69016)

Add a `max_analyzed_offset` query parameter to allow users
to limit the highlighting of text fields to a value less than or equal to the
`index.highlight.max_analyzed_offset`, thus avoiding an exception when
the length of the text field exceeds the limit. The highlighting still takes place,
but stops at the length defined by the new parameter.

Closes: #52155
(cherry picked from commit f9af60b)
matriv added a commit to matriv/elasticsearch that referenced this pull request Feb 16, 2021
Since elastic#69016 is merged versions for serialisation/deserialisation
and YAML tests are updated.

Follows: elastic#67325
matriv added a commit that referenced this pull request Feb 16, 2021
Since #69016 is merged versions for serialisation/deserialisation
and YAML tests are updated.

Follows: #67325
matriv added a commit to matriv/elasticsearch that referenced this pull request Feb 16, 2021
With the newly introduced `max_analyzed_offset` the analyzer of
`AnnotatedTextHighlighter` was wrapped twice with the
`LimitTokenOffsetAnalyzer` by mistake.

Follows: elastic#67325
matriv added a commit that referenced this pull request Feb 16, 2021
With the newly introduced `max_analyzed_offset` the analyzer of
`AnnotatedTextHighlighter` was wrapped twice with the
`LimitTokenOffsetAnalyzer` by mistake.

Follows: #67325
matriv added a commit that referenced this pull request Feb 16, 2021
…) (#69058)

With the newly introduced `max_analyzed_offset` the analyzer of
`AnnotatedTextHighlighter` was wrapped twice with the
`LimitTokenOffsetAnalyzer` by mistake.

Follows: #67325
matriv added a commit to matriv/elasticsearch that referenced this pull request Feb 19, 2021
Assign a value different that the original when testing
mutation of the builder.

Follows: elastic#67325
matriv added a commit that referenced this pull request Feb 19, 2021
Assign a value different that the original when testing
mutation of the builder.

Follows: #67325
matriv added a commit that referenced this pull request Feb 19, 2021
Assign a value different that the original when testing
mutation of the builder.

Follows: #67325
(cherry picked from commit 4a6c957)
matriv added a commit that referenced this pull request Feb 19, 2021
Assign a value different that the original when testing
mutation of the builder.

Follows: #67325
(cherry picked from commit 4a6c957)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Highlighters shouldn't error on big documents
7 participants