-
Notifications
You must be signed in to change notification settings - Fork 25.2k
New Annotated_text field type #30364
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Pinging @elastic/es-search-aggs |
run gradle build tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the approach, the format for the annotation is simple and can be extended later. I am not totally convinced by the removal of the annotations for highlighting though. There are valuable informations in the annotations that can be used for highlighting that are lost if we just remove them. Would it be possible to keep the annotations in the text and mark the entire annotations when it needs to be highlighted:
So <b>[Madonna](type=person&value=Madonna)<b>
?
return fielddata; | ||
} | ||
|
||
public void setFielddata(boolean fielddata) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it be allowed to use fielddata ? This field can be used in the significant text aggregation and the annotated text should be indexed in a doc values field so maybe we can disallow fielddata entirely ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we can disallow fielddata entirely ?
I wouldn't be opposed to that - we're looking to phase it out so this makes sense.
private static String readToString(Reader reader) throws IOException{ | ||
//TODO FIXME I think this loses any end-of-text markers that are used | ||
// internally in Lucene to turn arrays of strings into single strings. | ||
// I think the positionIncrementGap logic looks for these special markers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic for the position increment gap is not based on markers, multi-valued fields are indexed in multiple fields that share the same field name, the index writer registers the last position for each field and applies the posInc gap to the last position seen for the field being processed. Is this comment related to highlighting ? We don't apply the posInc gap during highlighting because positions are not important (highlighting is based only on offsets).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm currently working on an issue with highlighting on fields that have arrays of text rather than a single value. There's all sorts of "magic" going on in unified highighter to assemble and disassemble strings to and from arrays using special marker characters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The unified
highlighter uses U+0000
to separate the values but it retrieves the original text first so this should not clash with the annotations? What is the issue with multi-values ? Can you explain the problem you're seeing ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally invalid offsets. I need to give the highlighter both marked-up and non-marked-up versions of the text. Marked-up text is required for any re-analysis to get the annotations but plain text is used for BreakIterator and PassageFormatter. I think the confusion comes when content.length()
is used to denote end of a passage and the highlighter is using the marked-up version so is too long. I had code to reset the last passage's endOffset properly but that doesn't work properly in a multi-value because there's multiple passages with the wrong offsets. Maybe the better approach is to smuggle the marked-up version to the choice of analyzer and keep everything else in the highlighter (PassageFormatters, BreakIterators etc) working with just the plain-text.
* This code is largely a copy of TextFieldMapper which is less than ideal - | ||
* my attempts to subclass TextFieldMapper faild but we can revisit this. | ||
**/ | ||
public class AnnotatedTextFieldMapper extends FieldMapper { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To simplify the usage of this field type I think we should automatically (or add an option to activate it) index the annotated text in a doc_values
field. This field would be available for aggregations and the terms bucket can be used to generate queries for highlighting or matching. I know it's possible to just index the annotations in a different field but I think it would be beneficial to have to link these two fields internally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a number of questions around that:
- What if users want the source JSON to include a structured "person" array?
- How do we define the "copy_to" fields in the annotation markup?
- Are there any complexities if the annotated text field appears under a
nested
object?
Could we save that feature for a phase 2 and keep this initial PR simpler for the moment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have additional fields for other use cases (parent/child, prefix text, ...) so it can work here too. The name of the field should be derived from the main field name, it can be hardcoded or the suffix can be added as an option in the field.
We can add this feature to phase 2 but I thought that it would be easier to have a clear link between the two fields in the first iteration. Let's leave this for a follow up.
* This code is largely a copy of TextFieldMapper which is less than ideal - | ||
* my attempts to subclass TextFieldMapper faild but we can revisit this. | ||
**/ | ||
public class AnnotatedTextFieldMapper extends FieldMapper { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to extend TextFieldMapper ? If not can we change TextFieldMapper to allow it ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did try subclassing initially and tweaking base class visibility where required but still was hitting issues so opted to copy code while I was proving the concept. I can cycle back around this as an internal optimization once we're happy with the functionality
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect the option of subclassing just got harder now that I've removed the FieldData support in AnnotatedTextFieldMapper and the TextFieldMapper base class still has support.
Thanks for the review, @jimczi
The simple case with "Madonna" looks feasible but gets tricky if the query is for the phrase
The highlight begin/end tags would then overlap with the annotation begin/end markers. |
Since we highlight terms individually in the unified highlighter, the tags would be around |
run gradle build tests |
run gradle build tests |
TL;DR: I think it's going to be too messy/complex for the I'd ideally reuse classes like the PhraseFieldType from TextFieldMapper but the way Analyzers are constructed means it is not possible to strip the annotations markup from text before feeding to the shingle filter used in the phrase indexing. |
There is a workaround for the above unwrappabale analyzer problem but it involves the use of ThreadLocal to pass annotations from Reader to the annotations TokenFilter rather than relying on a custom TokenStreamComponents. |
annotationsInjector.setAnnotations(at); | ||
super.setReader(new StringReader(at.textMinusMarkup)); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you do this instead in AnnotatedHighlighterAnalyzer.wrapReader()? That would solve your issues with further wrapping, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I use wrapReader in the main AnnotationAnalyzerWrapper which is now a wrappable Analyzer (courtesy of some ThreadLocal ickyness).
The line you've highlighted is from a special Analyzer designed specifically for use with highlighting annotated fields. The challenge there is a different one - the UnifiedHighlighter calls an Analyzer with fragments of the original plain text (eg sentences) and the AnnotatedHighlighterAnalyzer is responsible for mapping the plain-text string sentence presented by the highlighter to it's marked-up counterpart so that it can produce the Annotation tokens.
if (fieldType.typeName().equals (AnnotatedTextFieldMapper.CONTENT_TYPE)){ | ||
AnnotatedHighlighterAnalyzer annotatedHighlighter = new AnnotatedHighlighterAnalyzer(fieldValues, analyzer); | ||
fieldValues = annotatedHighlighter.getPlainTextValuesForHighlighter(); | ||
// use the wrapped analyzer that holds marked-up text |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@romseygeek Not sure what you've got cooking with new highlighters but I made need to avoid reusing TextFieldMapper's "phrase" field type and introduce a fork "annotations_phrase" because that's a similar special-case scenario like the plain "annotations" type. Before we feed the original JSON content into the highlighter's passage fragmentation logic we must strip the text of annotation markup using getPlainTextValuesForHighlighter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it in general, that looks useful. Some general comments:
- I see it as an expert feature so I'm wondering whether we want to have it in a plugin.
- I think we should include the type in the term instead of (or in addition to) the type attribute since the latter is not indexed, eg.
person=First Last
as a term rather than justFirst Last
? It should help avoid matching these named entities when trying to match regular text? - Should we try to detect whether the analyzer splits on punctuation? I suspect behavior could be surprising if you used this field eg. with a
keyword
analyzer? - I'd vote to remove indexing of prefixes and phrases to this new field so that it focuses on the indexing of named entities rather than querying of text?
- I'm still debating with myself whether highlighting should retain the annotations or not.
- Can we find another way to propagate annotations to the token stream that doesn't need threadlocals? We should be able to let the annotation filter know about annotations from the reader somehow via TokenStreamComponents.setReader?
- The key part of this PR is the analysis of annotated text so I'd like to see dedicated tests for this that do not initialize a node or mappings and do call things like
BaseTokenStreamTestCase#checkAnalysisConsistency
to make sure it behaves well.
That choice might have impact on the following "core" pieces of software:
Kibana can always build in support for annotated text even when the Java plugin is not installed but the core Java parts needing special-casing for annotations mean we'd have to drag some fork/subclass of that core functionality into the plugin.
I have struggled with this decision. The trade off is readability vs strict matching. I opted for no types for readability - when it comes to strict matching, the majority of named entities are generally mixed case or multi-word so don't over-match.
For "strict" matching you could then search for the token
I got it working with a ThreadLocal and I do see general search for entities near free text as useful but it's questionable whether phrase/prefix are required too. Highlighting for phrase/prefix I suspect is a bit of a mess currently with mixes of fields but I think @romseygeek is working on something new there.
I have an impl working for basic highlighting and it's certainly useful - just need to check how it behaves with various highlighter options.
If we don't support phrase/prefix etc sub-fields and the related "analyzer wrapping" we can use a subclass of TokenStreamComponents (but I have that Lucene issue on this being a generally bad approach). |
We should avoid making highlighters aware of specificities of some field types. Of course there will always be exceptions, but this feature doesn't sound like a common need enough to trade maintenability.
Can this be addressed with an include regex that only matches terms that start with
Strict matching feels important to me: if some terms are both common terms and annotations, how do you filter on the annotation specifically? Your idea to require tokens at the same position would work but I'm a bit unhappy about corner-cases such as the fact that
+1 to this trade-off |
OK I'll look into bundling a highlighter subclass in a plugin
I'm not sure that's enough. The annotations represent the bits of the text where we are certain about what is in there so would generally prefer to take the annotation, not the underlying text string but for the rest of the text we still want significant words. So the policy options I thought might be required for discovery were:
I guess users who want that could also always choose their own choice of uniquifying prefix and put that in their values? That would provide a workaround for those that want strictness but there would be no workaround for users who prefer readability if we always use token type as a prefix for values. |
@@ -56,6 +57,7 @@ | |||
} | |||
ALLOWED_QUERY_MAPPER_TYPES.add("scaled_float"); | |||
ALLOWED_QUERY_MAPPER_TYPES.add(TextFieldMapper.CONTENT_TYPE); | |||
ALLOWED_QUERY_MAPPER_TYPES.add(AnnotatedTextFieldMapper.CONTENT_TYPE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jpountz this is currently a bit of a blocker for moving annotated_text to a plugin.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this one is annoying. I started a discussion at #31655.
@@ -90,7 +91,7 @@ | |||
public static final boolean DEFAULT_FAIL_ON_UNSUPPORTED_FIELDS = true; | |||
|
|||
private static final Set<Class<? extends MappedFieldType>> SUPPORTED_FIELD_TYPES = new HashSet<>( | |||
Arrays.asList(TextFieldType.class, KeywordFieldType.class)); | |||
Arrays.asList(TextFieldType.class, KeywordFieldType.class, AnnotatedTextFieldType.class)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another example of core hard-coding text types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree supporting more-like-this would be nice ultimately, but let's just skip it for now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was trying to workaround the lack of MLT using the TermVectorsService but encountered a similar hard-coded road-block:
elasticsearch/server/src/main/java/org/elasticsearch/index/termvectors/TermVectorsService.java
Line 165 in 00b0e10
if (fieldType instanceof KeywordFieldMapper.KeywordFieldType == false |
04f4287
to
5ac2935
Compare
1c75bf5
to
e8acd14
Compare
e8acd14
to
f93ae0f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I like the guidance that you provided in the docs around avoiding clashes with regular tokens.
Largely a copy of `text` field type but adds ability to include markdown-like syntax in the text. The “AnnotatedText” class parses text+markup and converts into plain text and AnnotationTokens. The annotation token values are injected unchanged alongside the regular text tokens to provide a form of additional indexed overlay useful in positional searches and highlighting. Annotated_text fields do not support fielddata as we want to phase this out. Also includes a new "annotated" highlighter type that retains annotations and merges in search hits as additional annotation markup. Closes elastic#29467
…ng (originally required for index_phrase/prefix which we no longer support)
…ed fields, remove `typed` annotations from doc examples
…sticsearchParseException if a document uses these
5bfc69f
to
6dd2085
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. A couple of formatting nits, and one question that can be done in a follow-up if you prefer.
public AnnotatedTextFieldType clone() { | ||
return new AnnotatedTextFieldType(this); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extra newline?
return mpqb.build(); | ||
} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and again
import static org.apache.lucene.search.uhighlight.CustomUnifiedHighlighter.MULTIVAL_SEP_CHAR; | ||
|
||
public class AnnotatedTextHighlighter extends UnifiedHighlighter { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extra newline
} | ||
return null; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot of this is just a copy of the parent method; is it worth trying to break it up a bit more and use inheritance to avoid copying so much code? The only differences are in the Analyzer
and PassageFormatter
types, I think?
…otatedTextHighlighter duplicate code by opening up UnifiedHighlighter for necessary overrides of Analyzer, field loading and PassageFormatter
protected Analyzer wrap(Analyzer analyzer) { | ||
return analyzer; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we merge wrap() into getAnalyzer()? It feels wrong to have both methods
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getAnalyzer is static so can't be overriden?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I mean change it so that it isn't static. Or have a different, non-static getAnalyzer() method that calls out to the static version in the base class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Static method prob needs to stay as already called from PlainHighlighter and can't have instance method with same name so least disruptive change is a new getThisAnalyzer()
(not crazy about that name) that calls existing static getAnalyzer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about moving the static method to HighlightUtils
? That seems to me to be a better place for it anyway, if it's being called from multiple highlighters.
Thanks, that looks much better. I left one comment, but LGTM otherwise. |
+1 |
Largely a copy of
text
field type but adds ability to include markdown-like syntax in the text.The “AnnotatedText” class parses text+markup and converts into plain text and AnnotationTokens.
The annotation token values are injected unchanged alongside the regular text tokens to provide a
form of additional indexed overlay useful in positional searches and highlighting.
See example use in https://gist.github.com/markharwood/13dfd386d1c3ae76e116dd3d42413f58
Closes #29467