Commit 1f76f49

Update compound-word-tokenfilter.asciidoc
Improved the docs for compound word token filter. Closes #13670 Closes #13595
1 parent 3f94b5a commit 1f76f49

1 file changed: 71 additions & 21 deletions
@@ -1,34 +1,83 @@
 [[analysis-compound-word-tokenfilter]]
 === Compound Word Token Filter
 
-Token filters that allow to decompose compound words. There are two
-types available: `dictionary_decompounder` and
-`hyphenation_decompounder`.
+The `hyphenation_decompounder` and `dictionary_decompounder` token filters can
+decompose compound words found in many Germanic languages into word parts.
 
-The following are settings that can be set for a compound word token
-filter type:
+Both token filters require a dictionary of word parts, which can be provided
+as:
 
-[cols="<,<",options="header",]
-|=======================================================================
-|Setting |Description
-|`word_list` |A list of words to use.
+[horizontal]
+`word_list`::
 
-|`word_list_path` |A path (either relative to `config` location, or
-absolute) to a list of words.
+An array of words, specified inline in the token filter configuration, or
 
-|`hyphenation_patterns_path` |A path (either relative to `config` location, or
-absolute) to a FOP XML hyphenation pattern file. (See http://offo.sourceforge.net/hyphenation/)
-Required for `hyphenation_decompounder`.
+`word_list_path`::
 
-|`min_word_size` |Minimum word size(Integer). Defaults to 5.
+The path (either absolute or relative to the `config` directory) to a UTF-8
+encoded file containing one word per line.
 
-|`min_subword_size` |Minimum subword size(Integer). Defaults to 2.
+[float]
+=== Hyphenation decompounder
 
-|`max_subword_size` |Maximum subword size(Integer). Defaults to 15.
+The `hyphenation_decompounder` uses hyphenation grammars to find potential
+subwords that are then checked against the word dictionary. The quality of the
+output tokens is directly connected to the quality of the grammar file you
+use. For languages like German they are quite good.
+
+XML based hyphenation grammar files can be found in the
+http://offo.sourceforge.net/hyphenation/#FOP+XML+Hyphenation+Patterns[Objects For Formatting Objects]
+(OFFO) Sourceforge project. You can download http://downloads.sourceforge.net/offo/offo-hyphenation.zip[offo-hyphenation.zip]
+directly and look in the `offo-hyphenation/hyph/` directory.
+Credits for the hyphenation code go to the Apache FOP project.
+
+[float]
+=== Dictionary decompounder
+
+The `dictionary_decompounder` uses a brute force approach in conjunction with
+only the word dictionary to find subwords in a compound word. It is much
+slower than the hyphenation decompounder but can be used as a first start to
+check the quality of your dictionary.
+
+[float]
+=== Compound token filter parameters
+
+The following parameters can be used to configure a compound word token
+filter:
+
+[horizontal]
+`type`::
+
+Either `dictionary_decompounder` or `hyphenation_decompounder`.
+
+`word_list`::
+
+An array containing a list of words to use for the word dictionary.
+
+`word_list_path`::
+
+The path (either absolute or relative to the `config` directory) to the word dictionary.
+
+`hyphenation_patterns_path`::
+
+The path (either absolute or relative to the `config` directory) to a FOP XML hyphenation pattern file. (required for hyphenation)
+
+`min_word_size`::
+
+Minimum word size. Defaults to 5.
+
+`min_subword_size`::
+
+Minimum subword size. Defaults to 2.
+
+`max_subword_size`::
+
+Maximum subword size. Defaults to 15.
+
+`only_longest_match`::
+
+Whether to include only the longest matching subword or not. Defaults to `false`.
 
-|`only_longest_match` |Only matching the longest(Boolean). Defaults to
-`false`
-|=======================================================================
 
 Here is an example:
 
@@ -44,9 +93,10 @@ index :
         filter :
             myTokenFilter1 :
                 type : dictionary_decompounder
-                word_list: [one, two, three]
+                word_list: [one, two, three]
             myTokenFilter2 :
                 type : hyphenation_decompounder
                 word_list_path: path/to/words.txt
+                hyphenation_patterns_path: path/to/fop.xml
                 max_subword_size : 22
 --------------------------------------------------
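
The brute force lookup described for the `dictionary_decompounder` can be illustrated with a small Python sketch. This is not the code used by the token filter. It is a simplified model that assumes every substring between `min_subword_size` and `max_subword_size` characters is looked up in the word dictionary, that words shorter than `min_word_size` are passed through untouched, that matching is case-insensitive, and that `only_longest_match` keeps the longest hit per start position. The function name `decompound` and the sample dictionary are hypothetical.

```python
def decompound(word, dictionary, min_word_size=5, min_subword_size=2,
               max_subword_size=15, only_longest_match=False):
    """Return the original word followed by the dictionary subwords found in it.

    Simplified illustration only; parameter names mirror the documented
    settings, but the matching rules here are assumptions.
    """
    tokens = [word]  # the original token is always kept
    if len(word) < min_word_size:
        return tokens  # too short to be treated as a compound

    lowered = word.lower()  # assumption: dictionary entries are lowercase
    for start in range(len(lowered) - min_subword_size + 1):
        longest = None
        for size in range(min_subword_size, max_subword_size + 1):
            if start + size > len(lowered):
                break  # candidate would run past the end of the word
            candidate = lowered[start:start + size]
            if candidate in dictionary:
                if only_longest_match:
                    # later sizes are always longer, so the last hit wins
                    longest = candidate
                else:
                    tokens.append(candidate)
        if longest is not None:
            tokens.append(longest)
    return tokens


words = {"donau", "dampf", "schiff", "dampfschiff"}
print(decompound("Donaudampfschiff", words))
print(decompound("Donaudampfschiff", words, only_longest_match=True))
```

Every start position is combined with every allowed subword length, which is why a dictionary-only approach does more lookups than one that first narrows the candidates with hyphenation patterns.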
