
Commit 2fa09f0

New plugin - Annotated_text field type (#30364)
New plugin for annotated_text field type. Largely a copy of `text` field type but adds ability to include markdown-like syntax in the text. The “AnnotatedText” class parses text+markup and converts into plain text and AnnotationTokens. The annotation token values are injected unchanged alongside the regular text tokens to provide a form of additional indexed overlay useful in positional searches and highlighting. Annotated_text fields do not support fielddata as we want to phase this out. Also includes a new "annotated" highlighter type that retains annotations and merges in search hits as additional annotation markup. Closes #29467
1 parent ab9c28a commit 2fa09f0

File tree

18 files changed

+2523
-30
lines changed

Lines changed: 328 additions & 0 deletions
@@ -0,0 +1,328 @@
[[mapper-annotated-text]]
=== Mapper Annotated Text Plugin

experimental[]

The mapper-annotated-text plugin provides the ability to index text that is a
combination of free-text and special markup that is typically used to identify
items of interest such as people or organisations (see NER or Named Entity Recognition
tools).


The Elasticsearch markup allows one or more additional tokens to be injected, unchanged, into the token
stream at the same position as the underlying text it annotates.

:plugin_name: mapper-annotated-text
include::install_remove.asciidoc[]

[[mapper-annotated-text-usage]]
==== Using the `annotated_text` field

The `annotated_text` field tokenizes text content as per the more common `text` field (see
"limitations" below) but also injects any marked-up annotation tokens directly into
the search index:

[source,js]
--------------------------
PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_field": {
          "type": "annotated_text"
        }
      }
    }
  }
}
--------------------------
// CONSOLE

Such a mapping would allow marked-up text, e.g. Wikipedia articles, to be indexed as both text
and structured tokens. The annotations use a markdown-like syntax: the annotated text goes in
square brackets, followed in parentheses by one or more URL-encoded values separated by the `&` symbol.
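
The markup can be illustrated with a minimal Python sketch. This is not the plugin's
`AnnotatedText` class; the regular expression and the `parse_annotated_text` helper are
assumptions made for illustration, based only on the syntax described above.

```python
import re
from urllib.parse import unquote_plus

# Hypothetical pattern for the [text](value1&value2) markup described above.
ANNOTATION = re.compile(r"\[([^\]\[]*)\]\(([^\)\(]*)\)")


def parse_annotated_text(markup):
    """Return (plain_text, annotations).

    Each annotation is (start, end, values), where start/end are character
    offsets into the plain text and values are the URL-decoded annotation values.
    """
    plain = []
    annotations = []
    pos = 0      # read position in the marked-up input
    length = 0   # length of the plain text emitted so far
    for match in ANNOTATION.finditer(markup):
        before = markup[pos:match.start()]
        plain.append(before)
        length += len(before)
        anchor = match.group(1)
        values = [unquote_plus(v) for v in match.group(2).split("&") if v]
        annotations.append((length, length + len(anchor), values))
        plain.append(anchor)
        length += len(anchor)
        pos = match.end()
    plain.append(markup[pos:])
    return "".join(plain), annotations


text, found = parse_annotated_text("Investors in [Apple](Apple+Inc.) rejoiced.")
print(text)   # plain text with the markup removed
print(found)  # offsets and decoded values of each annotation
```

The offsets recorded for each annotation are those of the text it wraps, which is why an
annotation token can share a position with the ordinary text token it annotates.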


We can use the `_analyze` API to test how an example annotation would be stored as tokens
in the search index:


[source,js]
--------------------------
GET my_index/_analyze
{
  "field": "my_field",
  "text":"Investors in [Apple](Apple+Inc.) rejoiced."
}
--------------------------
// NOTCONSOLE

Response:

[source,js]
--------------------------------------------------
{
  "tokens": [
    {
      "token": "investors",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "in",
      "start_offset": 10,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "Apple Inc.", <1>
      "start_offset": 13,
      "end_offset": 18,
      "type": "annotation",
      "position": 2
    },
    {
      "token": "apple",
      "start_offset": 13,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "rejoiced",
      "start_offset": 19,
      "end_offset": 27,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
--------------------------------------------------
// NOTCONSOLE

<1> Note the whole annotation token `Apple Inc.` is placed, unchanged, as a single token in
the token stream and at the same position (position 2) as the text token (`apple`) it annotates.


We can now perform searches for annotations using regular `term` queries that don't tokenize
the provided search values. Annotations are a more precise way of matching, as can be seen
in this example where a search for `Beck` will not match `Jeff Beck`:

[source,js]
--------------------------
# Example documents
PUT my_index/_doc/1
{
  "my_field": "[Beck](Beck) announced a new tour"<1>
}

PUT my_index/_doc/2
{
  "my_field": "[Jeff Beck](Jeff+Beck&Guitarist) plays a strat"<2>
}

# Example search
GET my_index/_search
{
  "query": {
    "term": {
        "my_field": "Beck" <3>
    }
  }
}
--------------------------
// CONSOLE

<1> As well as tokenising the plain text into single words e.g. `beck`, here we
inject the single token value `Beck` at the same position as `beck` in the token stream.
<2> Note annotations can inject multiple tokens at the same position - here we inject both
the very specific value `Jeff Beck` and the broader term `Guitarist`. This enables
broader positional queries e.g. finding mentions of a `Guitarist` near to `strat`.
<3> A benefit of searching with these carefully defined annotation tokens is that a query for
`Beck` will not match document 2, which contains the tokens `jeff`, `beck` and `Jeff Beck`.
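
Why the `term` query is more precise can be seen in a rough Python simulation. The
`index_tokens` and `term_query` helpers are hypothetical, positions are ignored, and
lower-casing whole words is only a crude stand-in for the default analysis of the text.

```python
import re


def index_tokens(plain_text, annotation_values):
    """Approximate the indexed tokens: lower-cased words plus verbatim annotations."""
    words = {word.lower() for word in re.findall(r"\w+", plain_text)}
    return words | set(annotation_values)


# The two example documents, with the markup already separated from the text.
docs = {
    1: index_tokens("Beck announced a new tour", ["Beck"]),
    2: index_tokens("Jeff Beck plays a strat", ["Jeff Beck", "Guitarist"]),
}


def term_query(value):
    """A term query compares the value exactly as given, with no analysis."""
    return sorted(doc_id for doc_id, tokens in docs.items() if value in tokens)


print(term_query("Beck"))       # only the document annotated with `Beck`
print(term_query("beck"))       # the plain text token appears in both documents
print(term_query("Guitarist"))  # the broader annotation on document 2
```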

WARNING: Any use of `=` signs in annotation values e.g. `[Prince](person=Prince)` will
cause the document to be rejected with a parse failure. In future we hope to have a use for
the equals signs so will actively reject documents that contain this today.


[[mapper-annotated-text-tips]]
==== Data modelling tips
===== Use structured and unstructured fields

Annotations are normally a way of weaving structured information into unstructured text for
higher-precision search.

`Entity resolution` is a form of document enrichment undertaken by specialist software or people
where references to entities in a document are disambiguated by attaching a canonical ID.
The ID is used to resolve any number of aliases or distinguish between people with the
same name. The hyperlinks connecting Wikipedia's articles are a good example of resolved
entity IDs woven into text.

These IDs can be embedded as annotations in an annotated_text field but it often makes
sense to include them in dedicated structured fields to support discovery via aggregations:

[source,js]
--------------------------
PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_unstructured_text_field": {
          "type": "annotated_text"
        },
        "my_twitter_handles": {
          "type": "text",
          "fields": {
            "keyword" :{
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}
--------------------------
// CONSOLE

Applications would then typically provide content and discover it as follows:

[source,js]
--------------------------
# Example documents
PUT my_index/_doc/1
{
  "my_unstructured_text_field": "[Shay](%40kimchy) created elasticsearch",
  "my_twitter_handles": ["@kimchy"] <1>
}

GET my_index/_search
{
  "query": {
    "query_string": {
      "query": "elasticsearch OR logstash OR kibana",<2>
      "default_field": "my_unstructured_text_field"
    }
  },
  "aggregations": {
    "top_people" :{
      "significant_terms" : { <3>
        "field" : "my_twitter_handles.keyword"
      }
    }
  }
}
--------------------------
// CONSOLE

<1> Note the `my_twitter_handles` field contains a list of the annotation values
also used in the unstructured text. (Note the annotated_text syntax requires escaping, so
`@kimchy` is written as `%40kimchy`.)
By repeating the annotation values in a structured field this application has ensured that
the tokens discovered in the structured field can be used for search and highlighting
in the unstructured field.
<2> In this example we search for documents that talk about components of the elastic stack.
<3> We use the `my_twitter_handles` field here to discover people who are significantly
associated with the elastic stack.
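
An application might assemble such a document from already-resolved entities along the
following lines. This is only a sketch: the `annotate` and `build_document` helpers are
hypothetical, and refusing any value that contains `=` is a conservative reading of the
warning about equals signs.

```python
from urllib.parse import quote_plus


def annotate(anchor_text, values):
    """Render [anchor](v1&v2) with each annotation value URL-encoded."""
    for value in values:
        if "=" in value:
            # Assumption: stay clear of the parse failure described for '=' signs.
            raise ValueError(f"'=' is not allowed in annotation value: {value!r}")
    return f"[{anchor_text}]({'&'.join(quote_plus(v) for v in values)})"


def build_document(parts):
    """Build the document body from plain strings and (anchor_text, [values]) tuples."""
    text = []
    handles = []
    for part in parts:
        if isinstance(part, str):
            text.append(part)
        else:
            anchor, values = part
            text.append(annotate(anchor, values))
            # Repeat each annotation value, unescaped, in the structured field.
            handles.extend(v for v in values if v not in handles)
    return {
        "my_unstructured_text_field": "".join(text),
        "my_twitter_handles": handles,
    }


doc = build_document([("Shay", ["@kimchy"]), " created elasticsearch"])
print(doc)
```

Keeping both fields in step in one place makes it harder for the structured values and
the annotations in the text to drift apart.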

===== Avoiding over-matching annotations
By design, the regular text tokens and the annotation tokens co-exist in the same indexed
field but in rare cases this can lead to some over-matching.

The value of an annotation often denotes a _named entity_ (a person, place or company).
The tokens for these named entities are inserted untokenized, and differ from typical text
tokens because they are normally:

* Mixed case e.g. `Madonna`
* Multiple words e.g. `Jeff Beck`
* Containing punctuation or numbers e.g. `Apple Inc.` or `@kimchy`

This means, for the most part, a search for a named entity in the annotated text field will
not have any false positives e.g. when selecting `Apple Inc.` from an aggregation result
you can drill down to highlight uses in the text without "over matching" on any text tokens
like the word `apple` in this context:

    the apple was very juicy

However, a problem arises if your named entity happens to be a single term and lower-case e.g. the
company `elastic`. In this case, a search on the annotated text field for the token `elastic`
may match a text document such as this:

    he fired an elastic band

To avoid such false matches users should consider prefixing annotation values to ensure
they don't clash with text tokens e.g.

    [elastic](Company_elastic) released version 7.0 of the elastic stack today
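
The clash and the effect of a prefix can be checked with a small Python sketch. The
helper names are hypothetical and lower-casing whole words only approximates how the
text is analyzed.

```python
import re


def text_tokens(text):
    """Approximate the ordinary text tokens: lower-cased words."""
    return {word.lower() for word in re.findall(r"\w+", text)}


def clashes_with_text(annotation_value, text):
    """True if the untokenized annotation value equals one of the text tokens."""
    return annotation_value in text_tokens(text)


def with_prefix(entity_type, name):
    """Prefix the value with its entity type, as in `Company_elastic`."""
    return f"{entity_type}_{name}"


sentence = "he fired an elastic band"
print(clashes_with_text("elastic", sentence))
print(clashes_with_text(with_prefix("Company", "elastic"), sentence))
```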


[[mapper-annotated-text-highlighter]]
==== Using the `annotated` highlighter

The `mapper-annotated-text` plugin includes a custom highlighter designed to mark up search hits
in a way which is respectful of the original markup:

[source,js]
--------------------------
# Example documents
PUT my_index/_doc/1
{
  "my_field": "The cat sat on the [mat](sku3578)"
}

GET my_index/_search
{
  "query": {
    "query_string": {
        "query": "cat"
    }
  },
  "highlight": {
    "fields": {
      "my_field": {
        "type": "annotated", <1>
        "require_field_match": false
      }
    }
  }
}
--------------------------
// CONSOLE
<1> The `annotated` highlighter type is designed for use with annotated_text fields.

The annotated highlighter is based on the `unified` highlighter and supports the same
settings but does not use the `pre_tags` or `post_tags` parameters. Rather than using
html-like markup such as `<em>cat</em>` the annotated highlighter uses the same
markdown-like syntax used for annotations and injects a key=value annotation where `_hit_term`
is the key and the matched search term is the value e.g.

    The [cat](_hit_term=cat) sat on the [mat](sku3578)

The annotated highlighter tries to be respectful of any existing markup in the original
text:

* If the search term matches exactly the location of an existing annotation then the
`_hit_term` key is merged into the url-like syntax used in the `(...)` part of the
existing annotation.
* However, if the search term overlaps the span of an existing annotation it would break
the markup formatting so the original annotation is removed in favour of a new annotation
with just the search hit information in the results.
* Any non-overlapping annotations in the original text are preserved in highlighter
selections.
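
The three rules above can be sketched in Python as follows. This is not the plugin's
highlighter code: the `highlight` function, its tuple-based inputs, the URL-encoding of
the hit term and the order in which values are merged inside `(...)` are all assumptions
made for illustration, and hits are assumed not to overlap each other.

```python
from urllib.parse import quote_plus


def highlight(plain_text, annotations, hits):
    """Merge search hits into annotation markup.

    annotations: list of (start, end, encoded_values) from the original markup
    hits:        list of (start, end, term) for the matched search terms
    """
    hit_spans = {(s, e): "_hit_term=" + quote_plus(t) for s, e, t in hits}
    merged = dict(hit_spans)
    for start, end, values in annotations:
        if (start, end) in hit_spans:
            # Rule 1: exact match, merge the hit into the existing annotation.
            merged[(start, end)] = values + "&" + hit_spans[(start, end)]
        elif any(start < h_end and h_start < end for h_start, h_end in hit_spans):
            # Rule 2: partial overlap, drop the annotation and keep only the hit.
            continue
        else:
            # Rule 3: no overlap, preserve the original annotation.
            merged[(start, end)] = values
    out = []
    pos = 0
    for (start, end), values in sorted(merged.items()):
        out.append(plain_text[pos:start])
        out.append(f"[{plain_text[start:end]}]({values})")
        pos = end
    out.append(plain_text[pos:])
    return "".join(out)


text = "The cat sat on the mat"
print(highlight(text, [(19, 22, "sku3578")], [(4, 7, "cat")]))
```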


[[mapper-annotated-text-limitations]]
==== Limitations

The annotated_text field type supports the same mapping settings as the `text` field type
but with the following exceptions:

* No support for `fielddata` or `fielddata_frequency_filter`
* No support for `index_prefixes` or `index_phrases` indexing

docs/plugins/mapper.asciidoc

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -19,5 +19,13 @@ indexes the size in bytes of the original
1919
The mapper-murmur3 plugin allows hashes to be computed at index-time and stored
2020
in the index for later use with the `cardinality` aggregation.
2121

22+
<<mapper-annotated-text>>::
23+
24+
The annotated text plugin provides the ability to index text that is a
25+
combination of free-text and special markup that is typically used to identify
26+
items of interest such as people or organisations (see NER or Named Entity Recognition
27+
tools).
28+
2229
include::mapper-size.asciidoc[]
2330
include::mapper-murmur3.asciidoc[]
31+
include::mapper-annotated-text.asciidoc[]

docs/reference/cat/plugins.asciidoc

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -28,6 +28,7 @@ U7321H6 discovery-gce {version} The Google Compute Engine (GCE) Discov
2828
U7321H6 ingest-attachment {version} Ingest processor that uses Apache Tika to extract contents
2929
U7321H6 ingest-geoip {version} Ingest processor that uses looksup geo data based on ip adresses using the Maxmind geo database
3030
U7321H6 ingest-user-agent {version} Ingest processor that extracts information from a user agent
31+
U7321H6 mapper-annotated-text {version} The Mapper Annotated_text plugin adds support for text fields with markup used to inject annotation tokens into the index.
3132
U7321H6 mapper-murmur3 {version} The Mapper Murmur3 plugin allows to compute hashes of a field's values at index-time and to store them in the index.
3233
U7321H6 mapper-size {version} The Mapper Size plugin allows document to record their uncompressed size at index time.
3334
U7321H6 store-smb {version} The Store SMB plugin adds support for SMB stores.

docs/reference/mapping/types.asciidoc

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -35,6 +35,7 @@ string:: <<text,`text`>> and <<keyword,`keyword`>>
3535
`completion` to provide auto-complete suggestions
3636
<<token-count>>:: `token_count` to count the number of tokens in a string
3737
{plugins}/mapper-murmur3.html[`mapper-murmur3`]:: `murmur3` to compute hashes of values at index-time and store them in the index
38+
{plugins}/mapper-annotated-text.html[`mapper-annotated-text`]:: `annotated_text` to index text containing special markup (typically used for identifying named entities)
3839

3940
<<percolator>>:: Accepts queries from the query-dsl
4041

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
/*
 * Licensed to Elasticsearch under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied.  See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */

esplugin {
    description 'The Mapper Annotated_text plugin adds support for text fields with markup used to inject annotation tokens into the index.'
    classname 'org.elasticsearch.plugin.mapper.AnnotatedTextPlugin'
}
