
Commit 902ab4f

jrodewig and debadair committed
[DOCS] Reformat compound word token filters (#49006)
* Separates the compound token filters doc pages into separate token filter pages:
  * Dictionary decompounder token filter
  * Hyphenation decompounder token filter
* Adds analyze API examples for each compound token filter
* Adds a redirect for the removed compound token filters page

Co-Authored-By: debadair <[email protected]>
1 parent bafcc18 commit 902ab4f

5 files changed: +337 -117 lines

docs/reference/analysis/tokenfilters.asciidoc

+4 -2
@@ -22,14 +22,14 @@ include::tokenfilters/classic-tokenfilter.asciidoc[]

include::tokenfilters/common-grams-tokenfilter.asciidoc[]

-include::tokenfilters/compound-word-tokenfilter.asciidoc[]
-
include::tokenfilters/condition-tokenfilter.asciidoc[]

include::tokenfilters/decimal-digit-tokenfilter.asciidoc[]

include::tokenfilters/delimited-payload-tokenfilter.asciidoc[]

+include::tokenfilters/dictionary-decompounder-tokenfilter.asciidoc[]
+
include::tokenfilters/edgengram-tokenfilter.asciidoc[]

include::tokenfilters/elision-tokenfilter.asciidoc[]
@@ -40,6 +40,8 @@ include::tokenfilters/flatten-graph-tokenfilter.asciidoc[]

include::tokenfilters/hunspell-tokenfilter.asciidoc[]

+include::tokenfilters/hyphenation-decompounder-tokenfilter.asciidoc[]
+
include::tokenfilters/keep-types-tokenfilter.asciidoc[]

include::tokenfilters/keep-words-tokenfilter.asciidoc[]

docs/reference/analysis/tokenfilters/compound-word-tokenfilter.asciidoc

-115
This file was deleted.

docs/reference/analysis/tokenfilters/dictionary-decompounder-tokenfilter.asciidoc

+173

@@ -0,0 +1,173 @@
[[analysis-dict-decomp-tokenfilter]]
=== Dictionary decompounder token filter
++++
<titleabbrev>Dictionary decompounder</titleabbrev>
++++

[NOTE]
====
In most cases, we recommend using the faster
<<analysis-hyp-decomp-tokenfilter,`hyphenation_decompounder`>> token filter
in place of this filter. However, you can use the
`dictionary_decompounder` filter to check the quality of a word list before
implementing it in the `hyphenation_decompounder` filter.
====

Uses a specified list of words and a brute force approach to find subwords in
compound words. If found, these subwords are included in the token output.

This filter uses Lucene's
https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilter.html[DictionaryCompoundWordTokenFilter],
which was built for Germanic languages.

[[analysis-dict-decomp-tokenfilter-analyze-ex]]
==== Example

The following <<indices-analyze,analyze API>> request uses the
`dictionary_decompounder` filter to find subwords in `Donaudampfschiff`. The
filter then checks these subwords against the specified list of words: `Donau`,
`dampf`, `meer`, and `schiff`.

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "dictionary_decompounder",
      "word_list": ["Donau", "dampf", "meer", "schiff"]
    }
  ],
  "text": "Donaudampfschiff"
}
--------------------------------------------------

The filter produces the following tokens:

[source,text]
--------------------------------------------------
[ Donaudampfschiff, Donau, dampf, schiff ]
--------------------------------------------------

/////////////////////
[source,console-result]
--------------------------------------------------
{
  "tokens" : [
    {
      "token" : "Donaudampfschiff",
      "start_offset" : 0,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "Donau",
      "start_offset" : 0,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "dampf",
      "start_offset" : 0,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "schiff",
      "start_offset" : 0,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}
--------------------------------------------------
/////////////////////

[[analysis-dict-decomp-tokenfilter-configure-parms]]
==== Configurable parameters

`word_list`::
+
--
(Required+++*+++, array of strings)
A list of subwords to look for in the token stream. If found, the subword is
included in the token output.

Either this parameter or `word_list_path` must be specified.
--

`word_list_path`::
+
--
(Required+++*+++, string)
Path to a file that contains a list of subwords to find in the token stream. If
found, the subword is included in the token output.

This path must be absolute or relative to the `config` location, and the file
must be UTF-8 encoded. Each token in the file must be separated by a line break.

Either this parameter or `word_list` must be specified.
--

`max_subword_size`::
(Optional, integer)
Maximum subword character length. Longer subword tokens are excluded from the
output. Defaults to `15`.

`min_subword_size`::
(Optional, integer)
Minimum subword character length. Shorter subword tokens are excluded from the
output. Defaults to `2`.

`min_word_size`::
(Optional, integer)
Minimum word character length. Shorter word tokens are excluded from the
output. Defaults to `5`.

`only_longest_match`::
(Optional, boolean)
If `true`, only include the longest matching subword. Defaults to `false`.
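
The following <<indices-analyze,analyze API>> request is a minimal sketch of how
several of these parameters can be combined; the values shown are illustrative
rather than recommended settings.

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "dictionary_decompounder",
      "word_list": ["Donau", "dampf", "meer", "schiff"],
      "min_subword_size": 4,
      "only_longest_match": true
    }
  ],
  "text": "Donaudampfschiff"
}
--------------------------------------------------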

[[analysis-dict-decomp-tokenfilter-customize]]
==== Customize and add to an analyzer

To customize the `dictionary_decompounder` filter, duplicate it to create the
basis for a new custom token filter. You can modify the filter using its
configurable parameters.

For example, the following <<indices-create-index,create index API>> request
uses a custom `dictionary_decompounder` filter to configure a new
<<analysis-custom-analyzer,custom analyzer>>.

The custom `dictionary_decompounder` filter finds subwords in the
`analysis/example_word_list.txt` file. Subwords longer than 22 characters are
excluded from the token output.

[source,console]
--------------------------------------------------
PUT dictionary_decompound_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_dictionary_decompound": {
          "tokenizer": "standard",
          "filter": [ "22_char_dictionary_decompound" ]
        }
      },
      "filter": {
        "22_char_dictionary_decompound": {
          "type": "dictionary_decompounder",
          "word_list_path": "analysis/example_word_list.txt",
          "max_subword_size": 22
        }
      }
    }
  }
}
--------------------------------------------------
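
To test the new analyzer, you could run an <<indices-analyze,analyze API>>
request such as the following sketch. It assumes that
`analysis/example_word_list.txt` exists in the `config` directory of each node
and contains one subword per line; the tokens returned depend entirely on that
file's contents.

[source,console]
--------------------------------------------------
GET dictionary_decompound_example/_analyze
{
  "analyzer": "standard_dictionary_decompound",
  "text": "Donaudampfschiff"
}
--------------------------------------------------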
