[[analysis-dict-decomp-tokenfilter]]
=== Dictionary decompounder token filter
++++
<titleabbrev>Dictionary decompounder</titleabbrev>
++++

[NOTE]
====
In most cases, we recommend using the faster
<<analysis-hyp-decomp-tokenfilter,`hyphenation_decompounder`>> token filter
in place of this filter. However, you can use the
`dictionary_decompounder` filter to check the quality of a word list before
implementing it in the `hyphenation_decompounder` filter.
====

Uses a specified list of words and a brute force approach to find subwords in
compound words. If found, these subwords are included in the token output.

This filter uses Lucene's
https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilter.html[DictionaryCompoundWordTokenFilter],
which was built for Germanic languages.

[[analysis-dict-decomp-tokenfilter-analyze-ex]]
==== Example

The following <<indices-analyze,analyze API>> request uses the
`dictionary_decompounder` filter to find subwords in `Donaudampfschiff`. The
filter then checks these subwords against the specified list of words: `Donau`,
`dampf`, `meer`, and `schiff`.

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "dictionary_decompounder",
      "word_list": ["Donau", "dampf", "meer", "schiff"]
    }
  ],
  "text": "Donaudampfschiff"
}
--------------------------------------------------

The filter produces the following tokens:

[source,text]
--------------------------------------------------
[ Donaudampfschiff, Donau, dampf, schiff ]
--------------------------------------------------

Note that `meer` is not included in the output because it does not occur as a
subword of `Donaudampfschiff`.

/////////////////////
[source,console-result]
--------------------------------------------------
{
  "tokens" : [
    {
      "token" : "Donaudampfschiff",
      "start_offset" : 0,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "Donau",
      "start_offset" : 0,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "dampf",
      "start_offset" : 0,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "schiff",
      "start_offset" : 0,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}
--------------------------------------------------
/////////////////////

[[analysis-dict-decomp-tokenfilter-configure-parms]]
==== Configurable parameters

`word_list`::
+
--
(Required+++*+++, array of strings)
A list of subwords to look for in the token stream. If found, the subword is
included in the token output.

Either this parameter or `word_list_path` must be specified.
--

`word_list_path`::
+
--
(Required+++*+++, string)
Path to a file that contains a list of subwords to find in the token stream. If
found, the subword is included in the token output.

This path must be absolute or relative to the `config` location, and the file
must be UTF-8 encoded. Each token in the file must be separated by a line break.

Either this parameter or `word_list` must be specified.
--

`max_subword_size`::
(Optional, integer)
Maximum subword character length. Longer subword tokens are excluded from the
output. Defaults to `15`.

`min_subword_size`::
(Optional, integer)
Minimum subword character length. Shorter subword tokens are excluded from the
output. Defaults to `2`. See the example following this parameter list.

`min_word_size`::
(Optional, integer)
Minimum word character length. Shorter word tokens are excluded from the
output. Defaults to `5`.

`only_longest_match`::
(Optional, boolean)
If `true`, only include the longest matching subword. Defaults to `false`.

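For example, the following analyze API request repeats the earlier example but
raises `min_subword_size` to `6`. `Donau` and `dampf` are only five characters
long, so they should no longer be returned as subwords, while the six-character
`schiff` still qualifies:

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "dictionary_decompounder",
      "word_list": ["Donau", "dampf", "meer", "schiff"],
      "min_subword_size": 6
    }
  ],
  "text": "Donaudampfschiff"
}
--------------------------------------------------

The filter should then produce:

[source,text]
--------------------------------------------------
[ Donaudampfschiff, schiff ]
--------------------------------------------------
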
[[analysis-dict-decomp-tokenfilter-customize]]
==== Customize and add to an analyzer

To customize the `dictionary_decompounder` filter, duplicate it to create the
basis for a new custom token filter. You can modify the filter using its
configurable parameters.

For example, the following <<indices-create-index,create index API>> request
uses a custom `dictionary_decompounder` filter to configure a new
<<analysis-custom-analyzer,custom analyzer>>.

The custom `dictionary_decompounder` filter finds subwords using the word list
in the `analysis/example_word_list.txt` file. Subwords longer than 22
characters are excluded from the token output.

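As described for the `word_list_path` parameter, this file must be UTF-8
encoded with one subword per line. A minimal sketch of what
`example_word_list.txt` might contain (these entries are placeholders; a real
list depends on your data):

[source,txt]
--------------------------------------------------
Donau
dampf
meer
schiff
--------------------------------------------------

With the word list in place, create the index:
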
[source,console]
--------------------------------------------------
PUT dictionary_decompound_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_dictionary_decompound": {
          "tokenizer": "standard",
          "filter": [ "22_char_dictionary_decompound" ]
        }
      },
      "filter": {
        "22_char_dictionary_decompound": {
          "type": "dictionary_decompounder",
          "word_list_path": "analysis/example_word_list.txt",
          "max_subword_size": 22
        }
      }
    }
  }
}
--------------------------------------------------
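
To verify the new analyzer, run it against sample text with the
<<indices-analyze,analyze API>>. The tokens it returns depend on the contents
of `example_word_list.txt`:

[source,console]
--------------------------------------------------
GET dictionary_decompound_example/_analyze
{
  "analyzer": "standard_dictionary_decompound",
  "text": "Donaudampfschiff"
}
--------------------------------------------------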