[[analysis-compound-word-tokenfilter]]
=== Compound Word Token Filter

- Token filters that allow to decompose compound words. There are two
- types available: `dictionary_decompounder` and
- `hyphenation_decompounder`.
+ The `hyphenation_decompounder` and `dictionary_decompounder` token filters can
+ decompose compound words found in many Germanic languages into word parts.

- The following are settings that can be set for a compound word token
- filter type :
+ Both token filters require a dictionary of word parts, which can be provided
+ as:

- [cols="<,<",options="header",]
- |=======================================================================
- |Setting |Description
- |`word_list` |A list of words to use.
+ [horizontal]
+ `word_list`::

- |`word_list_path` |A path (either relative to `config` location, or
- absolute) to a list of words.
+ An array of words, specified inline in the token filter configuration, or

- |`hyphenation_patterns_path` |A path (either relative to `config` location, or
- absolute) to a FOP XML hyphenation pattern file. (See http://offo.sourceforge.net/hyphenation/)
- Required for `hyphenation_decompounder`.
+ `word_list_path`::

- |`min_word_size` |Minimum word size(Integer). Defaults to 5.
+ The path (either absolute or relative to the `config` directory) to a UTF-8
+ encoded file containing one word per line.

- |`min_subword_size` |Minimum subword size(Integer). Defaults to 2.
+ [float]
+ === Hyphenation decompounder

- |`max_subword_size` |Maximum subword size(Integer). Defaults to 15.
+ The `hyphenation_decompounder` uses hyphenation grammars to find potential
+ subwords that are then checked against the word dictionary. The quality of the
+ output tokens is directly connected to the quality of the grammar file you
+ use. For languages like German, they are quite good.
+
+ XML-based hyphenation grammar files can be found in the
+ http://offo.sourceforge.net/hyphenation/#FOP+XML+Hyphenation+Patterns[Objects For Formatting Objects]
+ (OFFO) SourceForge project. You can download http://downloads.sourceforge.net/offo/offo-hyphenation.zip[offo-hyphenation.zip]
+ directly and look in the `offo-hyphenation/hyph/` directory.
+ Credits for the hyphenation code go to the Apache FOP project.
+
+ [float]
+ === Dictionary decompounder
+
+ The `dictionary_decompounder` uses a brute force approach in conjunction with
+ only the word dictionary to find subwords in a compound word. It is much
+ slower than the hyphenation decompounder but can be used as a starting point to
+ check the quality of your dictionary.
+
+ [float]
+ === Compound token filter parameters
+
+ The following parameters can be used to configure a compound word token
+ filter:
+
+ [horizontal]
+ `type`::
+
+ Either `dictionary_decompounder` or `hyphenation_decompounder`.
+
+ `word_list`::
+
+ An array containing a list of words to use for the word dictionary.
+
+ `word_list_path`::
+
+ The path (either absolute or relative to the `config` directory) to the word dictionary.
+
+ `hyphenation_patterns_path`::
+
+ The path (either absolute or relative to the `config` directory) to a FOP XML hyphenation pattern file. Required for `hyphenation_decompounder`.
+
+ `min_word_size`::
+
+ Minimum word size. Defaults to 5.
+
+ `min_subword_size`::
+
+ Minimum subword size. Defaults to 2.
+
+ `max_subword_size`::
+
+ Maximum subword size. Defaults to 15.
+
+ `only_longest_match`::
+
+ Whether to include only the longest matching subword or not. Defaults to `false`.

- |`only_longest_match` |Only matching the longest(Boolean). Defaults to
- `false`
- |=======================================================================

Here is an example:

@@ -44,9 +93,10 @@ index :
          filter :
              myTokenFilter1 :
                  type : dictionary_decompounder
-                 word_list: [one, two, three]
+                 word_list: [one, two, three]
              myTokenFilter2 :
                  type : hyphenation_decompounder
                  word_list_path: path/to/words.txt
+                 hyphenation_patterns_path: path/to/fop.xml
                  max_subword_size : 22
--------------------------------------------------
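
To make the brute force approach of the `dictionary_decompounder` and the role of the size parameters more concrete, here is a minimal Python sketch. It is not the Lucene or Elasticsearch implementation. The function name `decompound`, the sample dictionary, and the reading of `only_longest_match` as "longest match per start offset" are all assumptions made for illustration.

```python
def decompound(token, dictionary, min_word_size=5, min_subword_size=2,
               max_subword_size=15, only_longest_match=False):
    """Return the original token followed by any dictionary subwords found in it.

    Hypothetical helper: it mirrors the documented parameters, not the real filter code.
    """
    tokens = [token]
    # Tokens shorter than min_word_size are passed through untouched.
    if len(token) < min_word_size:
        return tokens
    # Try every start offset that still leaves room for a minimal subword.
    for start in range(len(token) - min_subword_size + 1):
        longest = None
        # Try every allowed subword length at this offset.
        for size in range(min_subword_size, max_subword_size + 1):
            if start + size > len(token):
                break
            candidate = token[start:start + size]
            if candidate in dictionary:
                if only_longest_match:
                    # Sizes increase, so the last hit at this offset is the longest.
                    longest = candidate
                else:
                    tokens.append(candidate)
        if longest is not None:
            tokens.append(longest)
    return tokens


# Sample dictionary (assumed for the example).
words = {"donau", "dampf", "schiff", "dampfschiff"}
print(decompound("donaudampfschiff", words))
print(decompound("donaudampfschiff", words, only_longest_match=True))
```

The nested loops show why this approach is slow on long tokens: every offset is combined with every allowed length, and each candidate costs one dictionary lookup. Lowering `max_subword_size` or raising `min_subword_size` shrinks that search space.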