Commit 9d1567b

[DOCS] Add overview page to analysis topic (#50515)
Adds a 'text analysis overview' page to the analysis topic docs. The goals of this page are:

* Concisely summarize the analysis process while avoiding in-depth concepts, tutorials, or API examples
* Explain why analysis is important, largely through highlighting problems with full-text searches missing analysis
* Highlight how analysis can be used to improve search results
1 parent 90e66a7 commit 9d1567b

2 files changed: +83 -3 lines changed
docs/reference/analysis.asciidoc

+5 -3
@@ -1,11 +1,11 @@
 [[analysis]]
-= Analysis
+= Text analysis
 
 [partintro]
 --
 
-_Analysis_ is the process of converting text, like the body of any email, into
-_tokens_ or _terms_ which are added to the inverted index for searching.
+_Text analysis_ is the process of converting text, like the body of any email,
+into _tokens_ or _terms_ which are added to the inverted index for searching.
 Analysis is performed by an <<analysis-analyzers,_analyzer_>> which can be
 either a built-in analyzer or a <<analysis-custom-analyzer,`custom`>> analyzer
 defined per index.
@@ -142,6 +142,8 @@ looking for:
 
 --
 
+include::analysis/overview.asciidoc[]
+
 include::analysis/anatomy.asciidoc[]
 
 include::analysis/testing.asciidoc[]
+78
@@ -0,0 +1,78 @@
+
+== Text analysis overview
+++++
+<titleabbrev>Overview</titleabbrev>
+++++
+
+Text analysis enables {es} to perform full-text search, where the search returns
+all _relevant_ results rather than just exact matches.
+
+If you search for `Quick fox jumps`, you probably want the document that
+contains `A quick brown fox jumps over the lazy dog`, and you might also want
+documents that contain related words like `fast fox` or `foxes leap`.
+
+[discrete]
+[[tokenization]]
+=== Tokenization
+
+Analysis makes full-text search possible through _tokenization_: breaking a text
+down into smaller chunks, called _tokens_. In most cases, these tokens are
+individual words.
+
+If you index the phrase `the quick brown fox jumps` as a single string and the
+user searches for `quick fox`, it isn't considered a match. However, if you
+tokenize the phrase and index each word separately, the terms in the query
+string can be looked up individually. This means they can be matched by searches
+for `quick fox`, `fox brown`, or other variations.
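The tokenization passage above can be illustrated with a small Python sketch. It is not part of the commit and does not reflect how {es} is implemented: the helper names `tokenize`, `build_index`, and `search` are hypothetical, the word-splitting regex is a simplification, and the choice to require every query term to be present is an assumption made for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into word tokens (hypothetical helper, no normalization)."""
    return re.findall(r"\w+", text)


def build_index(docs):
    """Build a toy inverted index: token -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every token of the query.

    Requiring all terms is an assumption made for this sketch.
    """
    postings = [index.get(token, set()) for token in tokenize(query)]
    return set.intersection(*postings) if postings else set()


docs = {1: "the quick brown fox jumps"}
index = build_index(docs)

# Comparing the query against the phrase as a single string finds nothing.
exact_match = "quick fox" in docs.values()

# Looking up each query term individually finds the document.
token_match = search(index, "quick fox")
```

Because each token is still compared literally, `search(index, "Quick fox")` returns an empty set in this sketch, which is the gap that normalization addresses.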
+
+[discrete]
+[[normalization]]
+=== Normalization
+
+Tokenization enables matching on individual terms, but each token is still
+matched literally. This means:
+
+* A search for `Quick` would not match `quick`, even though you likely want
+either term to match the other
+
+* Although `fox` and `foxes` share the same root word, a search for `foxes`
+would not match `fox` or vice versa.
+
+* A search for `jumps` would not match `leaps`. While they don't share a root
+word, they are synonyms and have a similar meaning.
+
+To solve these problems, text analysis can _normalize_ these tokens into a
+standard format. This allows you to match tokens that are not exactly the same
+as the search terms, but similar enough to still be relevant. For example:
+
+* `Quick` can be lowercased: `quick`.
+
+* `foxes` can be _stemmed_, or reduced to its root word: `fox`.
+
+* `jump` and `leap` are synonyms and can be indexed as a single word: `jump`.
+
+To ensure search terms match these words as intended, you can apply the same
+tokenization and normalization rules to the query string. For example, a search
+for `Foxes leap` can be normalized to a search for `fox jump`.
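As a rough sketch of the normalization steps described above (again, not part of the commit), the following Python applies lowercasing, stemming, and synonym mapping in sequence. The `STEMS` and `SYNONYMS` tables are tiny lookup tables invented for this example; real stemmers and synonym filters are rule-based or dictionary-based and far more complete.

```python
import re

# Toy lookup tables, made up for this illustration only.
STEMS = {"foxes": "fox", "jumps": "jump", "leaps": "leap"}
SYNONYMS = {"leap": "jump"}


def analyze(text):
    """Tokenize, then normalize each token (hypothetical helper)."""
    tokens = re.findall(r"\w+", text)               # tokenization
    tokens = [t.lower() for t in tokens]            # lowercasing
    tokens = [STEMS.get(t, t) for t in tokens]      # stemming via lookup
    tokens = [SYNONYMS.get(t, t) for t in tokens]   # map synonyms to one word
    return tokens


query_terms = analyze("Foxes leap")
```

Running the same `analyze` function over both the indexed text and the query string is what lets `Foxes leap` and `fox jump` end up as the same terms.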
+
+[discrete]
+[[analysis-customization]]
+=== Customize text analysis
+
+Text analysis is performed by an <<analyzer-anatomy,_analyzer_>>, a set of rules
+that govern the entire process.
+
+{es} includes a default analyzer, called the
+<<analysis-standard-analyzer,standard analyzer>>, which works well for most use
+cases right out of the box.
+
+If you want to tailor your search experience, you can choose a different
+<<analysis-analyzers,built-in analyzer>> or even
+<<analysis-custom-analyzer,configure a custom one>>. A custom analyzer gives you
+control over each step of the analysis process, including:
+
+* Changes to the text _before_ tokenization
+
+* How text is converted to tokens
+
+* Normalization changes made to tokens before indexing or search
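The three customizable steps listed above can be pictured as a pipeline. The sketch below is a conceptual Python model only, not the {es} analyzer API: the `Analyzer` class, its field names, and the helper functions `strip_html`, `split_words`, and `lowercase` are all hypothetical and exist just to show the order in which the stages run.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Analyzer:
    """Hypothetical three-stage pipeline; not the Elasticsearch API."""
    tokenizer: Callable[[str], List[str]]
    char_filters: List[Callable[[str], str]] = field(default_factory=list)
    token_filters: List[Callable[[List[str]], List[str]]] = field(default_factory=list)

    def analyze(self, text: str) -> List[str]:
        # 1. Changes to the text before tokenization.
        for char_filter in self.char_filters:
            text = char_filter(text)
        # 2. Conversion of text to tokens.
        tokens = self.tokenizer(text)
        # 3. Normalization changes made to the tokens.
        for token_filter in self.token_filters:
            tokens = token_filter(tokens)
        return tokens


def strip_html(text: str) -> str:
    """Replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def split_words(text: str) -> List[str]:
    """Split text into word tokens."""
    return re.findall(r"\w+", text)


def lowercase(tokens: List[str]) -> List[str]:
    """Lowercase every token."""
    return [t.lower() for t in tokens]


custom = Analyzer(
    tokenizer=split_words,
    char_filters=[strip_html],
    token_filters=[lowercase],
)
tokens = custom.analyze("<b>Quick</b> brown FOX")
```

Swapping any one stage, for example a different tokenizer function, changes the output without touching the other two, which is the kind of control the custom analyzer passage describes.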
