Commit 9881bfa

Docs: Document how to rebuild analyzers (#30498)
Adds documentation for how to rebuild all the built-in analyzers, and tests for that documentation using the mechanism added in #29535. Closes #29499
1 parent 7f47ff9 commit 9881bfa

7 files changed: +284 -75 lines changed
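Each rebuilt-analyzer snippet in this commit ends with a `// TEST[...]` substitution that wraps the example in the `compare_analyzers` assertion added in #29535, verifying that the built-in analyzer and its rebuilt copy emit the same tokens. A rough manual equivalent, shown here only as an illustration (the index comes from the fingerprint example below and the sample text is arbitrary), is to run both analyzers through the `_analyze` API and compare the results:

POST /fingerprint_example/_analyze
{
  "analyzer": "fingerprint", <1>
  "text": "Yes yes, Gödel said this sentence is consistent and."
}

POST /fingerprint_example/_analyze
{
  "analyzer": "rebuilt_fingerprint", <2>
  "text": "Yes yes, Gödel said this sentence is consistent and."
}

<1> The built-in analyzer.
<2> The copy defined in `fingerprint_example` below; both requests should return the same single token.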

docs/reference/analysis/analyzers/fingerprint-analyzer.asciidoc

+43 -14

@@ -9,20 +9,6 @@ Input text is lowercased, normalized to remove extended characters, sorted,
 deduplicated and concatenated into a single token. If a stopword list is
 configured, stop words will also be removed.

-[float]
-=== Definition
-
-It consists of:
-
-Tokenizer::
-* <<analysis-standard-tokenizer,Standard Tokenizer>>
-
-Token Filters (in order)::
-1. <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
-2. <<analysis-asciifolding-tokenfilter>>
-3. <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
-4. <<analysis-fingerprint-tokenfilter>>
-
 [float]
 === Example output

@@ -149,3 +135,46 @@ The above example produces the following term:
 ---------------------------
 [ consistent godel said sentence yes ]
 ---------------------------
+
+[float]
+=== Definition
+
+The `fingerprint` analyzer consists of:
+
+Tokenizer::
+* <<analysis-standard-tokenizer,Standard Tokenizer>>
+
+Token Filters (in order)::
+* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
+* <<analysis-asciifolding-tokenfilter>>
+* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
+* <<analysis-fingerprint-tokenfilter>>
+
+If you need to customize the `fingerprint` analyzer beyond the configuration
+parameters then you need to recreate it as a `custom` analyzer and modify
+it, usually by adding token filters. This would recreate the built-in
+`fingerprint` analyzer and you can use it as a starting point for further
+customization:
+
+[source,js]
+----------------------------------------------------
+PUT /fingerprint_example
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "rebuilt_fingerprint": {
+          "tokenizer": "standard",
+          "filter": [
+            "lowercase",
+            "asciifolding",
+            "fingerprint"
+          ]
+        }
+      }
+    }
+  }
+}
+----------------------------------------------------
+// CONSOLE
+// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: fingerprint_example, first: fingerprint, second: rebuilt_fingerprint}\nendyaml\n/]
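The definition above lists the Stop Token Filter as disabled by default. As a hypothetical extension of the rebuilt analyzer, not part of this commit, you could enable stop-word removal by inserting the pre-built `stop` filter (which defaults to `_english_`) before `fingerprint`:

PUT /fingerprint_stop_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "fingerprint_with_stop": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding",
            "stop", <1>
            "fingerprint"
          ]
        }
      }
    }
  }
}

<1> The index and analyzer names here are placeholders; only the extra `stop` entry differs from `rebuilt_fingerprint` above.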

docs/reference/analysis/analyzers/keyword-analyzer.asciidoc

+37 -8

@@ -4,14 +4,6 @@
 The `keyword` analyzer is a ``noop'' analyzer which returns the entire input
 string as a single token.

-[float]
-=== Definition
-
-It consists of:
-
-Tokenizer::
-* <<analysis-keyword-tokenizer,Keyword Tokenizer>>
-
 [float]
 === Example output

@@ -57,3 +49,40 @@ The above sentence would produce the following single term:
 === Configuration

 The `keyword` analyzer is not configurable.
+
+[float]
+=== Definition
+
+The `keyword` analyzer consists of:
+
+Tokenizer::
+* <<analysis-keyword-tokenizer,Keyword Tokenizer>>
+
+If you need to customize the `keyword` analyzer then you need to
+recreate it as a `custom` analyzer and modify it, usually by adding
+token filters. Usually, you should prefer the
+<<keyword, Keyword type>> when you want strings that are not split
+into tokens, but just in case you need it, this would recreate the
+built-in `keyword` analyzer and you can use it as a starting point
+for further customization:
+
+[source,js]
+----------------------------------------------------
+PUT /keyword_example
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "rebuilt_keyword": {
+          "tokenizer": "keyword",
+          "filter": [ <1>
+          ]
+        }
+      }
+    }
+  }
+}
+----------------------------------------------------
+// CONSOLE
+// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: keyword_example, first: keyword, second: rebuilt_keyword}\nendyaml\n/]
+<1> You'd add any token filters here.
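For the <<keyword, Keyword type>> alternative mentioned in the new paragraph, a minimal sketch, not part of this commit and assuming the 6.x single-type `_doc` mapping syntax, would map the field directly as `keyword` so no analyzer runs at all:

PUT /my_keyword_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "tags": {
          "type": "keyword" <1>
        }
      }
    }
  }
}

<1> `my_keyword_index` and `tags` are placeholder names; a `keyword` field indexes the whole value as a single term.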

docs/reference/analysis/analyzers/pattern-analyzer.asciidoc

+48 -13

@@ -19,19 +19,6 @@ Read more about http://www.regular-expressions.info/catastrophic.html[pathologic

 ========================================

-
-[float]
-=== Definition
-
-It consists of:
-
-Tokenizer::
-* <<analysis-pattern-tokenizer,Pattern Tokenizer>>
-
-Token Filters::
-* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
-* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
-
 [float]
 === Example output

@@ -378,3 +365,51 @@ The regex above is easier to understand as:
 [\p{L}&&[^\p{Lu}]] # then lower case
 )
 --------------------------------------------------
+
+[float]
+=== Definition
+
+The `pattern` analyzer consists of:
+
+Tokenizer::
+* <<analysis-pattern-tokenizer,Pattern Tokenizer>>
+
+Token Filters::
+* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
+* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
+
+If you need to customize the `pattern` analyzer beyond the configuration
+parameters then you need to recreate it as a `custom` analyzer and modify
+it, usually by adding token filters. This would recreate the built-in
+`pattern` analyzer and you can use it as a starting point for further
+customization:
+
+[source,js]
+----------------------------------------------------
+PUT /pattern_example
+{
+  "settings": {
+    "analysis": {
+      "tokenizer": {
+        "split_on_non_word": {
+          "type": "pattern",
+          "pattern": "\\W+" <1>
+        }
+      },
+      "analyzer": {
+        "rebuilt_pattern": {
+          "tokenizer": "split_on_non_word",
+          "filter": [
+            "lowercase" <2>
+          ]
+        }
+      }
+    }
+  }
+}
+----------------------------------------------------
+// CONSOLE
+// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: pattern_example, first: pattern, second: rebuilt_pattern}\nendyaml\n/]
+<1> The default pattern is `\W+` which splits on non-word characters
+and this is where you'd change it.
+<2> You'd add other token filters after `lowercase`.
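Callout <1> above marks where the pattern would change. As a hypothetical variation, not part of this commit, splitting on literal commas instead of the default `\W+` only requires swapping the regex in the custom tokenizer:

PUT /comma_pattern_example
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "split_on_comma": {
          "type": "pattern",
          "pattern": "," <1>
        }
      },
      "analyzer": {
        "comma_pattern": {
          "tokenizer": "split_on_comma",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}

<1> The index, tokenizer, and analyzer names are illustrative; only the `pattern` value differs from `rebuilt_pattern` above.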

docs/reference/analysis/analyzers/simple-analyzer.asciidoc

+34 -8

@@ -4,14 +4,6 @@
 The `simple` analyzer breaks text into terms whenever it encounters a
 character which is not a letter. All terms are lower cased.

-[float]
-=== Definition
-
-It consists of:
-
-Tokenizer::
-* <<analysis-lowercase-tokenizer,Lower Case Tokenizer>>
-
 [float]
 === Example output

@@ -127,3 +119,37 @@ The above sentence would produce the following terms:
 === Configuration

 The `simple` analyzer is not configurable.
+
+[float]
+=== Definition
+
+The `simple` analyzer consists of:
+
+Tokenizer::
+* <<analysis-lowercase-tokenizer,Lower Case Tokenizer>>
+
+If you need to customize the `simple` analyzer then you need to recreate
+it as a `custom` analyzer and modify it, usually by adding token filters.
+This would recreate the built-in `simple` analyzer and you can use it as
+a starting point for further customization:
+
+[source,js]
+----------------------------------------------------
+PUT /simple_example
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "rebuilt_simple": {
+          "tokenizer": "lowercase",
+          "filter": [ <1>
+          ]
+        }
+      }
+    }
+  }
+}
+----------------------------------------------------
+// CONSOLE
+// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: simple_example, first: simple, second: rebuilt_simple}\nendyaml\n/]
+<1> You'd add any token filters here.
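As one hypothetical way to fill in callout <1>, not part of this commit, you could fold accented characters by adding the pre-built `asciifolding` filter to the rebuilt chain:

PUT /simple_folding_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "simple_with_folding": {
          "tokenizer": "lowercase",
          "filter": [
            "asciifolding" <1>
          ]
        }
      }
    }
  }
}

<1> The index and analyzer names are placeholders; `asciifolding` is a pre-built filter, so no custom filter definition is needed.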

docs/reference/analysis/analyzers/standard-analyzer.asciidoc

+41 -13

@@ -7,19 +7,6 @@ Segmentation algorithm, as specified in
 http://unicode.org/reports/tr29/[Unicode Standard Annex #29]) and works well
 for most languages.

-[float]
-=== Definition
-
-It consists of:
-
-Tokenizer::
-* <<analysis-standard-tokenizer,Standard Tokenizer>>
-
-Token Filters::
-* <<analysis-standard-tokenfilter,Standard Token Filter>>
-* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
-* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
-
 [float]
 === Example output

@@ -276,3 +263,44 @@ The above example produces the following terms:
 ---------------------------
 [ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ]
 ---------------------------
+
+[float]
+=== Definition
+
+The `standard` analyzer consists of:
+
+Tokenizer::
+* <<analysis-standard-tokenizer,Standard Tokenizer>>
+
+Token Filters::
+* <<analysis-standard-tokenfilter,Standard Token Filter>>
+* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
+* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)
+
+If you need to customize the `standard` analyzer beyond the configuration
+parameters then you need to recreate it as a `custom` analyzer and modify
+it, usually by adding token filters. This would recreate the built-in
+`standard` analyzer and you can use it as a starting point:
+
+[source,js]
+----------------------------------------------------
+PUT /standard_example
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "rebuilt_standard": {
+          "tokenizer": "standard",
+          "filter": [
+            "standard",
+            "lowercase" <1>
+          ]
+        }
+      }
+    }
+  }
+}
+----------------------------------------------------
+// CONSOLE
+// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: standard_example, first: standard, second: rebuilt_standard}\nendyaml\n/]
+<1> You'd add any token filters after `lowercase`.
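As a hypothetical example of callout <1>, not part of this commit, you could append a stemmer after `lowercase`, here a custom `stemmer` filter configured for English:

PUT /standard_stemmed_example
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stemmer": {
          "type": "stemmer",
          "language": "english" <1>
        }
      },
      "analyzer": {
        "standard_with_stemmer": {
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase",
            "english_stemmer"
          ]
        }
      }
    }
  }
}

<1> The index, filter, and analyzer names are made up for this sketch; any other token filter could be appended in the same way.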

docs/reference/analysis/analyzers/stop-analyzer.asciidoc

+47 -11

@@ -5,17 +5,6 @@ The `stop` analyzer is the same as the <<analysis-simple-analyzer,`simple` analy
 but adds support for removing stop words. It defaults to using the
 `_english_` stop words.

-[float]
-=== Definition
-
-It consists of:
-
-Tokenizer::
-* <<analysis-lowercase-tokenizer,Lower Case Tokenizer>>
-
-Token filters::
-* <<analysis-stop-tokenfilter,Stop Token Filter>>
-
 [float]
 === Example output

@@ -239,3 +228,50 @@ The above example produces the following terms:
 ---------------------------
 [ quick, brown, foxes, jumped, lazy, dog, s, bone ]
 ---------------------------
+
+[float]
+=== Definition
+
+The `stop` analyzer consists of:
+
+Tokenizer::
+* <<analysis-lowercase-tokenizer,Lower Case Tokenizer>>
+
+Token filters::
+* <<analysis-stop-tokenfilter,Stop Token Filter>>
+
+If you need to customize the `stop` analyzer beyond the configuration
+parameters then you need to recreate it as a `custom` analyzer and modify
+it, usually by adding token filters. This would recreate the built-in
+`stop` analyzer and you can use it as a starting point for further
+customization:
+
+[source,js]
+----------------------------------------------------
+PUT /stop_example
+{
+  "settings": {
+    "analysis": {
+      "filter": {
+        "english_stop": {
+          "type": "stop",
+          "stopwords": "_english_" <1>
+        }
+      },
+      "analyzer": {
+        "rebuilt_stop": {
+          "tokenizer": "lowercase",
+          "filter": [
+            "english_stop" <2>
+          ]
+        }
+      }
+    }
+  }
+}
+----------------------------------------------------
+// CONSOLE
+// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: stop_example, first: stop, second: rebuilt_stop}\nendyaml\n/]
+<1> The default stopwords can be overridden with the `stopwords`
+or `stopwords_path` parameters.
+<2> You'd add any token filters after `english_stop`.
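Expanding on callout <1>, as a hypothetical sketch that is not part of this commit, the `_english_` list can be replaced by an explicit word list (or a `stopwords_path` file) in the custom filter definition:

PUT /stop_custom_words_example
{
  "settings": {
    "analysis": {
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": [ "and", "is", "the" ] <1>
        }
      },
      "analyzer": {
        "custom_stop": {
          "tokenizer": "lowercase",
          "filter": [
            "my_stop"
          ]
        }
      }
    }
  }
}

<1> All names and the word list here are placeholders; `"stopwords_path": "stopwords/my_list.txt"` would load the list from a file under the config directory instead.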
