Commit 2e0ed04

[DOCS] Rewrite analysis intro (#51184)
* [DOCS] Rewrite analysis intro. Move index/search analysis content.
* Rewrites 'Text analysis' page intro as high-level definition. Adds guidance
  on when users should configure text analysis
* Rewrites and splits index/search analysis content:
** Conceptual content -> 'Index and search analysis' under 'Concepts'
** Task-based content -> 'Specify an analyzer' under 'Configure...'
* Adds detailed examples for when to use the same index/search analyzer and
  when not.
* Adds new example snippets for specifying search analyzers
* clarifications
* Add toc. Decrement headings.
* Reword 'When to configure' section
* Remove sentence from tip
1 parent 3ba58c9 commit 2e0ed04

File tree

5 files changed: +411 additions, -131 deletions


docs/reference/analysis.asciidoc

Lines changed: 27 additions & 129 deletions
@@ -4,141 +4,40 @@
 [partintro]
 --
 
-_Text analysis_ is the process of converting text, like the body of any email,
-into _tokens_ or _terms_ which are added to the inverted index for searching.
-Analysis is performed by an <<analysis-analyzers,_analyzer_>> which can be
-either a built-in analyzer or a <<analysis-custom-analyzer,`custom`>> analyzer
-defined per index.
+_Text analysis_ is the process of converting unstructured text, like
+the body of an email or a product description, into a structured format that's
+optimized for search.
 
 [float]
-== Index time analysis
+[[when-to-configure-analysis]]
+=== When to configure text analysis
 
-For instance, at index time the built-in <<english-analyzer,`english`>> _analyzer_
-will first convert the sentence:
+{es} performs text analysis when indexing or searching <<text,`text`>> fields.
 
-[source,text]
-------
-"The QUICK brown foxes jumped over the lazy dog!"
-------
+If your index doesn't contain `text` fields, no further setup is needed; you can
+skip the pages in this section.
 
-into distinct tokens. It will then lowercase each token, remove frequent
-stopwords ("the") and reduce the terms to their word stems (foxes -> fox,
-jumped -> jump, lazy -> lazi). In the end, the following terms will be added
-to the inverted index:
+However, if you use `text` fields or your text searches aren't returning results
+as expected, configuring text analysis can often help. You should also look into
+analysis configuration if you're using {es} to:
 
-[source,text]
-------
-[ quick, brown, fox, jump, over, lazi, dog ]
-------
+* Build a search engine
+* Mine unstructured data
+* Fine-tune search for a specific language
+* Perform lexicographic or linguistic research
 
 [float]
-[[specify-index-time-analyzer]]
-=== Specifying an index time analyzer
-
-{es} determines which index-time analyzer to use by
-checking the following parameters in order:
-
-. The <<analyzer,`analyzer`>> mapping parameter of the field
-. The `default` analyzer parameter in the index settings
-
-If none of these parameters are specified, the
-<<analysis-standard-analyzer,`standard` analyzer>> is used.
-
-[discrete]
-[[specify-index-time-field-analyzer]]
-==== Specify the index-time analyzer for a field
-
-Each <<text,`text`>> field in a mapping can specify its own
-<<analyzer,`analyzer`>>:
-
-[source,console]
--------------------------
-PUT my_index
-{
-  "mappings": {
-    "properties": {
-      "title": {
-        "type": "text",
-        "analyzer": "standard"
-      }
-    }
-  }
-}
--------------------------
-
-[discrete]
-[[specify-index-time-default-analyzer]]
-==== Specify a default index-time analyzer
-
-When <<indices-create-index,creating an index>>, you can set a default
-index-time analyzer using the `default` analyzer setting:
-
-[source,console]
-----
-PUT my_index
-{
-  "settings": {
-    "analysis": {
-      "analyzer": {
-        "default": {
-          "type": "whitespace"
-        }
-      }
-    }
-  }
-}
-----
-
-A default index-time analyzer is useful when mapping multiple `text` fields that
-use the same analyzer. It's also used as a general fallback analyzer for both
-index-time and search-time analysis.
-
-[float]
-== Search time analysis
-
-This same analysis process is applied to the query string at search time in
-<<full-text-queries,full text queries>> like the
-<<query-dsl-match-query,`match` query>>
-to convert the text in the query string into terms of the same form as those
-that are stored in the inverted index.
-
-For instance, a user might search for:
-
-[source,text]
-------
-"a quick fox"
-------
-
-which would be analysed by the same `english` analyzer into the following terms:
-
-[source,text]
-------
-[ quick, fox ]
-------
-
-Even though the exact words used in the query string don't appear in the
-original text (`quick` vs `QUICK`, `fox` vs `foxes`), because we have applied
-the same analyzer to both the text and the query string, the terms from the
-query string exactly match the terms from the text in the inverted index,
-which means that this query would match our example document.
-
-[float]
-=== Specifying a search time analyzer
-
-Usually the same analyzer should be used both at
-index time and at search time, and <<full-text-queries,full text queries>>
-like the <<query-dsl-match-query,`match` query>> will use the mapping to look
-up the analyzer to use for each field.
-
-The analyzer to use to search a particular field is determined by
-looking for:
-
-* An `analyzer` specified in the query itself.
-* The <<search-analyzer,`search_analyzer`>> mapping parameter.
-* The <<analyzer,`analyzer`>> mapping parameter.
-* An analyzer in the index settings called `default_search`.
-* An analyzer in the index settings called `default`.
-* The `standard` analyzer.
+[[analysis-toc]]
+=== In this section
+
+* <<analysis-overview>>
+* <<analysis-concepts>>
+* <<configure-text-analysis>>
+* <<analysis-analyzers>>
+* <<analysis-tokenizers>>
+* <<analysis-tokenfilters>>
+* <<analysis-charfilters>>
+* <<analysis-normalizers>>
 
 --
 
@@ -156,5 +55,4 @@ include::analysis/tokenfilters.asciidoc[]
 
 include::analysis/charfilters.asciidoc[]
 
-include::analysis/normalizers.asciidoc[]
-
+include::analysis/normalizers.asciidoc[]

docs/reference/analysis/concepts.asciidoc

Lines changed: 3 additions & 1 deletion
@@ -7,5 +7,7 @@
 This section explains the fundamental concepts of text analysis in {es}.
 
 * <<analyzer-anatomy>>
+* <<analysis-index-search-time>>
 
-include::anatomy.asciidoc[]
+include::anatomy.asciidoc[]
+include::index-search-time.asciidoc[]

docs/reference/analysis/configure-text-analysis.asciidoc

Lines changed: 4 additions & 1 deletion
@@ -20,10 +20,13 @@ the process.
 * <<test-analyzer>>
 * <<configuring-analyzers>>
 * <<analysis-custom-analyzer>>
+* <<specify-analyzer>>
 
 
 include::testing.asciidoc[]
 
 include::analyzers/configuring.asciidoc[]
 
-include::analyzers/custom-analyzer.asciidoc[]
+include::analyzers/custom-analyzer.asciidoc[]
+
+include::specify-analyzer.asciidoc[]
Lines changed: 175 additions & 0 deletions
@@ -0,0 +1,175 @@
+[[analysis-index-search-time]]
+=== Index and search analysis
+
+Text analysis occurs at two times:
+
+Index time::
+When a document is indexed, any <<text,`text`>> field values are analyzed.
+
+Search time::
+When running a <<full-text-queries,full-text search>> on a `text` field,
+the query string (the text the user is searching for) is analyzed.
++
+Search time is also called _query time_.
+
+The analyzer, or set of analysis rules, used at each time is called the _index
+analyzer_ or _search analyzer_ respectively.
+
+[[analysis-same-index-search-analyzer]]
+==== How the index and search analyzer work together
+
+In most cases, the same analyzer should be used at index and search time. This
+ensures the values and query strings for a field are changed into the same form
+of tokens. In turn, this ensures the tokens match as expected during a search.
+
+.**Example**
+[%collapsible]
+====
+
+A document is indexed with the following value in a `text` field:
+
+[source,text]
+------
+The QUICK brown foxes jumped over the dog!
+------
+
+The index analyzer for the field converts the value into tokens and normalizes
+them. In this case, each of the tokens represents a word:
+
+[source,text]
+------
+[ quick, brown, fox, jump, over, dog ]
+------
+
+These tokens are then indexed.
+
+Later, a user searches the same `text` field for:
+
+[source,text]
+------
+"Quick fox"
+------
+
+The user expects this search to match the sentence indexed earlier,
+`The QUICK brown foxes jumped over the dog!`.
+
+However, the query string does not contain the exact words used in the
+document's original text:
+
+* `quick` vs `QUICK`
+* `fox` vs `foxes`
+
+To account for this, the query string is analyzed using the same analyzer. This
+analyzer produces the following tokens:
+
+[source,text]
+------
+[ quick, fox ]
+------
+
+To execute the search, {es} compares these query string tokens to the tokens
+indexed in the `text` field.
+
+[options="header"]
+|===
+|Token | Query string | `text` field
+|`quick` | X | X
+|`brown` | | X
+|`fox` | X | X
+|`jump` | | X
+|`over` | | X
+|`dog` | | X
+|===
+
+Because the field value and query string were analyzed in the same way, they
+created similar tokens. The tokens `quick` and `fox` are exact matches. This
+means the search matches the document containing `"The QUICK brown foxes jumped
+over the dog!"`, just as the user expects.
+====
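The matching walk-through above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not {es} internals: `analyze()` is a toy stand-in for a real analyzer, and the `STEMS` lookup table is an assumed substitute for an actual stemming algorithm.

```python
# Toy sketch of index-time vs. search-time analysis (assumed simplifications,
# not Elasticsearch internals): lowercase, strip punctuation, drop the
# stopword "the", and stem via a hypothetical lookup table.
import re

STOPWORDS = {"the"}
STEMS = {"foxes": "fox", "jumped": "jump"}  # assumed stem mappings

def analyze(text: str) -> list[str]:
    """Convert text into normalized tokens, as an analyzer would."""
    words = re.findall(r"[a-z]+", text.lower())
    return [STEMS.get(w, w) for w in words if w not in STOPWORDS]

# Index time: analyze the field value and index its tokens.
indexed = set(analyze("The QUICK brown foxes jumped over the dog!"))

# Search time: analyze the query string with the SAME analyzer.
query_tokens = analyze("Quick fox")

# The document matches: every query token appears among the indexed tokens.
matches = all(tok in indexed for tok in query_tokens)
```

Because both strings pass through the same pipeline, `QUICK` and `Quick` both become `quick`, and `foxes` and `fox` both become `fox`, so the query tokens line up with the indexed tokens exactly as in the table above.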
+
+[[different-analyzers]]
+==== When to use a different search analyzer
+
+While less common, it sometimes makes sense to use different analyzers at index
+and search time. To enable this, {es} allows you to
+<<specify-search-analyzer,specify a separate search analyzer>>.
+
+Generally, a separate search analyzer should only be specified when using the
+same form of tokens for field values and query strings would create unexpected
+or irrelevant search matches.
+
+[[different-analyzer-ex]]
+.*Example*
+[%collapsible]
+====
+{es} is used to create a search engine that matches only words that start with
+a provided prefix. For instance, a search for `tr` should return `tram` or
+`trope`, but never `taxi` or `bat`.
+
+A document is added to the search engine's index; this document contains one
+such word in a `text` field:
+
+[source,text]
+------
+"Apple"
+------
+
+The index analyzer for the field converts the value into tokens and normalizes
+them. In this case, each of the tokens represents a potential prefix for
+the word:
+
+[source,text]
+------
+[ a, ap, app, appl, apple ]
+------
+
+These tokens are then indexed.
+
+Later, a user searches the same `text` field for:
+
+[source,text]
+------
+"appli"
+------
+
+The user expects this search to match only words that start with `appli`,
+such as `appliance` or `application`. The search should not match `apple`.
+
+However, if the index analyzer is used to analyze this query string, it would
+produce the following tokens:
+
+[source,text]
+------
+[ a, ap, app, appl, appli ]
+------
+
+When {es} compares these query string tokens to the ones indexed for `apple`,
+it finds several matches.
+
+[options="header"]
+|===
+|Token | `appli` | `apple`
+|`a` | X | X
+|`ap` | X | X
+|`app` | X | X
+|`appl` | X | X
+|`appli` | X |
+|`apple` | | X
+|===
+
+This means the search would erroneously match `apple`. Not only that, it would
+match any word starting with `a`.
+
+To fix this, you can specify a different search analyzer for query strings used
+on the `text` field.
+
+In this case, you could specify a search analyzer that produces a single token
+rather than a set of prefixes:
+
+[source,text]
+------
+[ appli ]
+------
+
+This query string token would only match tokens for words that start with
+`appli`, which better aligns with the user's search expectations.
+====
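The prefix-search scenario above can also be sketched in Python. This is a minimal illustration under stated assumptions, not {es} internals: `edge_ngrams()` stands in for an edge n-gram index analyzer, and `search_analyze()` is a hypothetical search analyzer that emits one lowercased token instead of a set of prefixes.

```python
# Toy sketch of using DIFFERENT analyzers at index and search time for
# prefix search. Both functions are simplified stand-ins, not real
# Elasticsearch analyzers.

def edge_ngrams(word: str) -> list[str]:
    """Index analyzer: emit every prefix of the word as a token."""
    w = word.lower()
    return [w[:i] for i in range(1, len(w) + 1)]

def search_analyze(query: str) -> str:
    """Search analyzer: a single lowercased token, no n-gramming."""
    return query.lower()

indexed = set(edge_ngrams("Apple"))  # {'a', 'ap', 'app', 'appl', 'apple'}

# With the single-token search analyzer, "appli" no longer matches "apple",
# because the whole query token is compared against the indexed prefixes...
bad_match = search_analyze("appli") in indexed

# ...but it still matches words that truly start with "appli".
prefix_match = search_analyze("appli") in set(edge_ngrams("application"))
```

Had the query string been run through `edge_ngrams()` instead, its prefixes `a`, `ap`, `app`, and `appl` would all have matched the tokens indexed for `apple`, reproducing the erroneous matches shown in the table.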
