Skip to content

New Annotated_text field type #30364

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Sep 18, 2018
Merged

Conversation

markharwood
Copy link
Contributor

Largely a copy of text field type but adds ability to include markdown-like syntax in the text.
The “AnnotatedText” class parses text+markup and converts into plain text and AnnotationTokens.
The annotation token values are injected unchanged alongside the regular text tokens to provide a
form of additional indexed overlay useful in positional searches and highlighting.

See example use in https://gist.github.com/markharwood/13dfd386d1c3ae76e116dd3d42413f58

Closes #29467

@markharwood markharwood added >feature :Search Relevance/Highlighting How a query matched a document :Search Foundations/Mapping Index mappings, including merging and defining field types labels May 3, 2018
@markharwood markharwood requested review from jpountz and jimczi May 3, 2018 15:03
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search-aggs

@markharwood
Copy link
Contributor Author

run gradle build tests

@colings86 colings86 removed the :Search Relevance/Highlighting How a query matched a document label May 8, 2018
Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the approach, the format for the annotation is simple and can be extended later. I am not totally convinced by the removal of the annotations for highlighting though. There are valuable informations in the annotations that can be used for highlighting that are lost if we just remove them. Would it be possible to keep the annotations in the text and mark the entire annotations when it needs to be highlighted:

So <b>[Madonna](type=person&value=Madonna)<b>

?

return fielddata;
}

public void setFielddata(boolean fielddata) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be allowed to use fielddata ? This field can be used in the significant text aggregation and the annotated text should be indexed in a doc values field so maybe we can disallow fielddata entirely ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can disallow fielddata entirely ?

I wouldn't be opposed to that - we're looking to phase it out so this makes sense.

private static String readToString(Reader reader) throws IOException{
//TODO FIXME I think this loses any end-of-text markers that are used
// internally in Lucene to turn arrays of strings into single strings.
// I think the positionIncrementGap logic looks for these special markers
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic for the position increment gap is not based on markers, multi-valued fields are indexed in multiple fields that share the same field name, the index writer registers the last position for each field and applies the posInc gap to the last position seen for the field being processed. Is this comment related to highlighting ? We don't apply the posInc gap during highlighting because positions are not important (highlighting is based only on offsets).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm currently working on an issue with highlighting on fields that have arrays of text rather than a single value. There's all sorts of "magic" going on in unified highighter to assemble and disassemble strings to and from arrays using special marker characters.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unified highlighter uses U+0000 to separate the values but it retrieves the original text first so this should not clash with the annotations? What is the issue with multi-values ? Can you explain the problem you're seeing ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally invalid offsets. I need to give the highlighter both marked-up and non-marked-up versions of the text. Marked-up text is required for any re-analysis to get the annotations but plain text is used for BreakIterator and PassageFormatter. I think the confusion comes when content.length() is used to denote end of a passage and the highlighter is using the marked-up version so is too long. I had code to reset the last passage's endOffset properly but that doesn't work properly in a multi-value because there's multiple passages with the wrong offsets. Maybe the better approach is to smuggle the marked-up version to the choice of analyzer and keep everything else in the highlighter (PassageFormatters, BreakIterators etc) working with just the plain-text.

* This code is largely a copy of TextFieldMapper which is less than ideal -
* my attempts to subclass TextFieldMapper faild but we can revisit this.
**/
public class AnnotatedTextFieldMapper extends FieldMapper {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To simplify the usage of this field type I think we should automatically (or add an option to activate it) index the annotated text in a doc_values field. This field would be available for aggregations and the terms bucket can be used to generate queries for highlighting or matching. I know it's possible to just index the annotations in a different field but I think it would be beneficial to have to link these two fields internally.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a number of questions around that:

  1. What if users want the source JSON to include a structured "person" array?
  2. How do we define the "copy_to" fields in the annotation markup?
  3. Are there any complexities if the annotated text field appears under a nested object?

Could we save that feature for a phase 2 and keep this initial PR simpler for the moment?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have additional fields for other use cases (parent/child, prefix text, ...) so it can work here too. The name of the field should be derived from the main field name, it can be hardcoded or the suffix can be added as an option in the field.
We can add this feature to phase 2 but I thought that it would be easier to have a clear link between the two fields in the first iteration. Let's leave this for a follow up.

* This code is largely a copy of TextFieldMapper which is less than ideal -
* my attempts to subclass TextFieldMapper faild but we can revisit this.
**/
public class AnnotatedTextFieldMapper extends FieldMapper {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to extend TextFieldMapper ? If not can we change TextFieldMapper to allow it ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did try subclassing initially and tweaking base class visibility where required but still was hitting issues so opted to copy code while I was proving the concept. I can cycle back around this as an internal optimization once we're happy with the functionality

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect the option of subclassing just got harder now that I've removed the FieldData support in AnnotatedTextFieldMapper and the TextFieldMapper base class still has support.

@markharwood
Copy link
Contributor Author

Thanks for the review, @jimczi

Would it be possible to keep the annotations in the text and mark the entire annotations when it needs to be highlighted:

The simple case with "Madonna" looks feasible but gets tricky if the query is for the phrase "paid stormy" and the annotated text is

he paid [Stormy Daniels](type=person&value=Stephanie+Clifford) $130,000

The highlight begin/end tags would then overlap with the annotation begin/end markers.

@jimczi
Copy link
Contributor

jimczi commented May 9, 2018

The highlight begin/end tags would then overlap with the annotation begin/end markers.

Since we highlight terms individually in the unified highlighter, the tags would be around stormy alone so I guess that it can be handled in the client side ? Though I don't know how the sentence break iterator would behave with the annotations format (it could split passage in the middle of an annotation) so removing the annotations makes sense for the moment. We could also add the ability in the custom passage formatter to transform an annotation into an html highlighting that extracts the type and the value to present them nicely ? I think you had some ideas regarding this.

@markharwood
Copy link
Contributor Author

run gradle build tests

@markharwood
Copy link
Contributor Author

run gradle build tests

@markharwood
Copy link
Contributor Author

TL;DR: I think it's going to be too messy/complex for the annotated_text field to support phrase and prefix sub-fields like the text field currently does.

I'd ideally reuse classes like the PhraseFieldType from TextFieldMapper but the way Analyzers are constructed means it is not possible to strip the annotations markup from text before feeding to the shingle filter used in the phrase indexing.
The reason for this is that the annotated_text fields rely on the special markup-stripping analyzer being the outermost component in the analysis pipeline. The standard Lucene approach for wrapping analysis components prevents this special markup-stripping logic from being invoked.
The Phrase-indexing logic in PhraseWrappedAnalyzer.wrapComponents() is where the annotated_text field's existing TokenStreamComponents is replaced with a new one, taking the Tokenizer and components from the old TSC and adding another TokenFilter to the TokenStream. This is unfortunate because annotated_text uses a custom TokenStreamComponents impl with a custom setReader() which is thrown away as part of this substitution. The Phrase indexer's replacement TokenStreamComponents has the default setReader() behaviour which calls tokenizer.setReader() which is final so there is really no opportunity for the annotation field to intercept setReader when wrapped in this way.

@markharwood
Copy link
Contributor Author

markharwood commented Jun 11, 2018

There is a workaround for the above unwrappabale analyzer problem but it involves the use of ThreadLocal to pass annotations from Reader to the annotations TokenFilter rather than relying on a custom TokenStreamComponents.
(Note I opened a Lucene issue - should TokenStreamComponents be made final if any subclasses are removed by the typical component-wrapping process?)

annotationsInjector.setAnnotations(at);
super.setReader(new StringReader(at.textMinusMarkup));
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you do this instead in AnnotatedHighlighterAnalyzer.wrapReader()? That would solve your issues with further wrapping, I think.

Copy link
Contributor Author

@markharwood markharwood Jun 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I use wrapReader in the main AnnotationAnalyzerWrapper which is now a wrappable Analyzer (courtesy of some ThreadLocal ickyness).
The line you've highlighted is from a special Analyzer designed specifically for use with highlighting annotated fields. The challenge there is a different one - the UnifiedHighlighter calls an Analyzer with fragments of the original plain text (eg sentences) and the AnnotatedHighlighterAnalyzer is responsible for mapping the plain-text string sentence presented by the highlighter to it's marked-up counterpart so that it can produce the Annotation tokens.

if (fieldType.typeName().equals (AnnotatedTextFieldMapper.CONTENT_TYPE)){
AnnotatedHighlighterAnalyzer annotatedHighlighter = new AnnotatedHighlighterAnalyzer(fieldValues, analyzer);
fieldValues = annotatedHighlighter.getPlainTextValuesForHighlighter();
// use the wrapped analyzer that holds marked-up text
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@romseygeek Not sure what you've got cooking with new highlighters but I made need to avoid reusing TextFieldMapper's "phrase" field type and introduce a fork "annotations_phrase" because that's a similar special-case scenario like the plain "annotations" type. Before we feed the original JSON content into the highlighter's passage fragmentation logic we must strip the text of annotation markup using getPlainTextValuesForHighlighter

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it in general, that looks useful. Some general comments:

  • I see it as an expert feature so I'm wondering whether we want to have it in a plugin.
  • I think we should include the type in the term instead of (or in addition to) the type attribute since the latter is not indexed, eg. person=First Last as a term rather than just First Last? It should help avoid matching these named entities when trying to match regular text?
  • Should we try to detect whether the analyzer splits on punctuation? I suspect behavior could be surprising if you used this field eg. with a keyword analyzer?
  • I'd vote to remove indexing of prefixes and phrases to this new field so that it focuses on the indexing of named entities rather than querying of text?
  • I'm still debating with myself whether highlighting should retain the annotations or not.
  • Can we find another way to propagate annotations to the token stream that doesn't need threadlocals? We should be able to let the annotation filter know about annotations from the reader somehow via TokenStreamComponents.setReader?
  • The key part of this PR is the analysis of annotated text so I'd like to see dedicated tests for this that do not initialize a node or mappings and do call things like BaseTokenStreamTestCase#checkAnalysisConsistency to make sure it behaves well.

@markharwood
Copy link
Contributor Author

markharwood commented Jun 12, 2018

I see it as an expert feature so I'm wondering whether we want to have it in a plugin.

That choice might have impact on the following "core" pieces of software:

  • Unified highlighter currently has some special-case code to use a custom Analyzer when highlighting annotated fields.
  • Significant_text agg - I'd like to give users options to choose between selecting annotations and/or the underlying text. Currently results on Wikipedia content have a mix of annotation tokens e.g. Donald Trump and regular text tokens like donald and trump. This can be a little repetitive/annoying.
  • Kibana will need to render text appropriately

Kibana can always build in support for annotated text even when the Java plugin is not installed but the core Java parts needing special-casing for annotations mean we'd have to drag some fork/subclass of that core functionality into the plugin.

I think we should include the type in the term

I have struggled with this decision. The trade off is readability vs strict matching. I opted for no types for readability - when it comes to strict matching, the majority of named entities are generally mixed case or multi-word so don't over-match. Apple the company doesn't match the fruit apple but I suppose a lower-case company like elastic would match elastic band. One other workaround is to embed type info in the annotation (I did this to all of Wikipedia content using dbpedia's type info) which means you end up with annotations like this:

[elastic](wikipedia_item=elastic&dbpedia_type=Company) is highly elastic

For "strict" matching you could then search for the token elastic and Company at the same position.

I'd vote to remove indexing of prefixes and phrases to this new field so that it focuses on the indexing of named entities rather than querying of text?

I got it working with a ThreadLocal and I do see general search for entities near free text as useful but it's questionable whether phrase/prefix are required too. Highlighting for phrase/prefix I suspect is a bit of a mess currently with mixes of fields but I think @romseygeek is working on something new there.

I'm still debating with myself whether highlighting should retain the annotations or not.

I have an impl working for basic highlighting and it's certainly useful - just need to check how it behaves with various highlighter options.

Can we find another way to propagate annotations to the token stream

If we don't support phrase/prefix etc sub-fields and the related "analyzer wrapping" we can use a subclass of TokenStreamComponents (but I have that Lucene issue on this being a generally bad approach).

@jpountz
Copy link
Contributor

jpountz commented Jun 13, 2018

Unified highlighter currently has some special-case code to use a custom Analyzer when highlighting annotated fields.

We should avoid making highlighters aware of specificities of some field types. Of course there will always be exceptions, but this feature doesn't sound like a common need enough to trade maintenability.

Significant_text agg - I'd like to give users options to choose between selecting annotations and/or the underlying text. Currently results on Wikipedia content have a mix of annotation tokens e.g. Donald Trump and regular text tokens like donald and trump. This can be a little repetitive/annoying.

Can this be addressed with an include regex that only matches terms that start with person: or an exclude regex that matches everything that contains a :?

The trade off is readability vs strict matching

Strict matching feels important to me: if some terms are both common terms and annotations, how do you filter on the annotation specifically? Your idea to require tokens at the same position would work but I'm a bit unhappy about corner-cases such as the fact that foo[type=bar] and bar[type=foo] would still be hard to distinguish? Also, the fact that some entities may match full-text queries is problematic for relevance since the term frequency will be artificially bumped up.

If we don't support phrase/prefix etc sub-fields and the related "analyzer wrapping" we can use a subclass of TokenStreamComponents

+1 to this trade-off

@markharwood
Copy link
Contributor Author

his feature doesn't sound like a common need enough to trade maintenability.

OK I'll look into bundling a highlighter subclass in a plugin

Can this be addressed with an include regex

I'm not sure that's enough. The annotations represent the bits of the text where we are certain about what is in there so would generally prefer to take the annotation, not the underlying text string but for the rest of the text we still want significant words. So the policy options I thought might be required for discovery were:

  • text only
  • text and annotations
  • annotations only
  • annotations and only text outside of the annotations

Strict matching feels important to me

I guess users who want that could also always choose their own choice of uniquifying prefix and put that in their values? That would provide a workaround for those that want strictness but there would be no workaround for users who prefer readability if we always use token type as a prefix for values.

@@ -56,6 +57,7 @@
}
ALLOWED_QUERY_MAPPER_TYPES.add("scaled_float");
ALLOWED_QUERY_MAPPER_TYPES.add(TextFieldMapper.CONTENT_TYPE);
ALLOWED_QUERY_MAPPER_TYPES.add(AnnotatedTextFieldMapper.CONTENT_TYPE);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jpountz this is currently a bit of a blocker for moving annotated_text to a plugin.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this one is annoying. I started a discussion at #31655.

@@ -90,7 +91,7 @@
public static final boolean DEFAULT_FAIL_ON_UNSUPPORTED_FIELDS = true;

private static final Set<Class<? extends MappedFieldType>> SUPPORTED_FIELD_TYPES = new HashSet<>(
Arrays.asList(TextFieldType.class, KeywordFieldType.class));
Arrays.asList(TextFieldType.class, KeywordFieldType.class, AnnotatedTextFieldType.class));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another example of core hard-coding text types

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree supporting more-like-this would be nice ultimately, but let's just skip it for now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to workaround the lack of MLT using the TermVectorsService but encountered a similar hard-coded road-block:

if (fieldType instanceof KeywordFieldMapper.KeywordFieldType == false

@markharwood markharwood force-pushed the fix/29467 branch 2 times, most recently from 04f4287 to 5ac2935 Compare July 5, 2018 10:43
@markharwood
Copy link
Contributor Author

I've removed type support now and build is back to green.

Good to go @jpountz @jimczi ?

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I like the guidance that you provided in the docs around avoiding clashes with regular tokens.

Largely a copy of `text` field type but adds ability to include markdown-like syntax in the text.
The “AnnotatedText” class parses text+markup and converts into plain text and AnnotationTokens.
The annotation token values are injected unchanged alongside the regular text tokens to provide a
form of additional indexed overlay useful in positional searches and highlighting.
Annotated_text fields do not support fielddata as we want to phase this out.
Also includes a new "annotated" highlighter type that retains annotations and merges in search
hits as additional annotation markup.

Closes elastic#29467
…ng (originally required for index_phrase/prefix which we no longer support)
…ed fields, remove `typed` annotations from doc examples
…sticsearchParseException if a document uses these
Copy link
Contributor

@romseygeek romseygeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. A couple of formatting nits, and one question that can be done in a follow-up if you prefer.

public AnnotatedTextFieldType clone() {
return new AnnotatedTextFieldType(this);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra newline?

return mpqb.build();
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and again

import static org.apache.lucene.search.uhighlight.CustomUnifiedHighlighter.MULTIVAL_SEP_CHAR;

public class AnnotatedTextHighlighter extends UnifiedHighlighter {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra newline

}
return null;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of this is just a copy of the parent method; is it worth trying to break it up a bit more and use inheritance to avoid copying so much code? The only differences are in the Analyzer and PassageFormatter types, I think?

…otatedTextHighlighter duplicate code by opening up UnifiedHighlighter for necessary overrides of Analyzer, field loading and PassageFormatter
protected Analyzer wrap(Analyzer analyzer) {
return analyzer;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we merge wrap() into getAnalyzer()? It feels wrong to have both methods

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getAnalyzer is static so can't be overriden?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I mean change it so that it isn't static. Or have a different, non-static getAnalyzer() method that calls out to the static version in the base class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Static method prob needs to stay as already called from PlainHighlighter and can't have instance method with same name so least disruptive change is a new getThisAnalyzer() (not crazy about that name) that calls existing static getAnalyzer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about moving the static method to HighlightUtils? That seems to me to be a better place for it anyway, if it's being called from multiple highlighters.

@romseygeek
Copy link
Contributor

Thanks, that looks much better. I left one comment, but LGTM otherwise.

@romseygeek
Copy link
Contributor

+1

@markharwood markharwood merged commit 2fa09f0 into elastic:master Sep 18, 2018
@jimczi jimczi added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>feature :Search Foundations/Mapping Index mappings, including merging and defining field types v6.5.0 v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants