[[analysis-cjk-bigram-tokenfilter]]
- === CJK Bigram Token Filter
+ === CJK bigram token filter
+ ++++
+ <titleabbrev>CJK bigram</titleabbrev>
+ ++++

- The `cjk_bigram` token filter forms bigrams out of the CJK
- terms that are generated by the <<analysis-standard-tokenizer,`standard` tokenizer>>
- or the `icu_tokenizer` (see {plugins}/analysis-icu-tokenizer.html[`analysis-icu` plugin]).
+ Forms https://en.wikipedia.org/wiki/Bigram[bigrams] out of CJK (Chinese,
+ Japanese, and Korean) tokens.

- By default, when a CJK character has no adjacent characters to form a bigram,
- it is output in unigram form. If you always want to output both unigrams and
- bigrams, set the `output_unigrams` flag to `true`. This can be used for a
- combined unigram+bigram approach.
+ This filter is included in {es}'s built-in <<cjk-analyzer,CJK language
+ analyzer>>. It uses Lucene's
+ https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/cjk/CJKBigramFilter.html[CJKBigramFilter].

- Bigrams are generated for characters in `han`, `hiragana`, `katakana` and
- `hangul`, but bigrams can be disabled for particular scripts with the
- `ignored_scripts` parameter. All non-CJK input is passed through unmodified.
+
+ [[analysis-cjk-bigram-tokenfilter-analyze-ex]]
+ ==== Example
+
+ The following <<indices-analyze,analyze API>> request demonstrates how the
+ CJK bigram token filter works.
+
+ [source,console]
+ --------------------------------------------------
+ GET /_analyze
+ {
+   "tokenizer" : "standard",
+   "filter" : ["cjk_bigram"],
+   "text" : "東京都は、日本の首都であり"
+ }
+ --------------------------------------------------
+
+ The filter produces the following tokens:
+
+ [source,text]
+ --------------------------------------------------
+ [ 東京, 京都, 都は, 日本, 本の, の首, 首都, 都で, であ, あり ]
+ --------------------------------------------------
+
+ /////////////////////
+ [source,console-result]
+ --------------------------------------------------
+ {
+   "tokens" : [
+     {
+       "token" : "東京",
+       "start_offset" : 0,
+       "end_offset" : 2,
+       "type" : "<DOUBLE>",
+       "position" : 0
+     },
+     {
+       "token" : "京都",
+       "start_offset" : 1,
+       "end_offset" : 3,
+       "type" : "<DOUBLE>",
+       "position" : 1
+     },
+     {
+       "token" : "都は",
+       "start_offset" : 2,
+       "end_offset" : 4,
+       "type" : "<DOUBLE>",
+       "position" : 2
+     },
+     {
+       "token" : "日本",
+       "start_offset" : 5,
+       "end_offset" : 7,
+       "type" : "<DOUBLE>",
+       "position" : 3
+     },
+     {
+       "token" : "本の",
+       "start_offset" : 6,
+       "end_offset" : 8,
+       "type" : "<DOUBLE>",
+       "position" : 4
+     },
+     {
+       "token" : "の首",
+       "start_offset" : 7,
+       "end_offset" : 9,
+       "type" : "<DOUBLE>",
+       "position" : 5
+     },
+     {
+       "token" : "首都",
+       "start_offset" : 8,
+       "end_offset" : 10,
+       "type" : "<DOUBLE>",
+       "position" : 6
+     },
+     {
+       "token" : "都で",
+       "start_offset" : 9,
+       "end_offset" : 11,
+       "type" : "<DOUBLE>",
+       "position" : 7
+     },
+     {
+       "token" : "であ",
+       "start_offset" : 10,
+       "end_offset" : 12,
+       "type" : "<DOUBLE>",
+       "position" : 8
+     },
+     {
+       "token" : "あり",
+       "start_offset" : 11,
+       "end_offset" : 13,
+       "type" : "<DOUBLE>",
+       "position" : 9
+     }
+   ]
+ }
+ --------------------------------------------------
+ /////////////////////
+
+ [[analysis-cjk-bigram-tokenfilter-analyzer-ex]]
+ ==== Add to an analyzer
+
+ The following <<indices-create-index,create index API>> request uses the
+ CJK bigram token filter to configure a new
+ <<analysis-custom-analyzer,custom analyzer>>.
+
+ [source,console]
+ --------------------------------------------------
+ PUT /cjk_bigram_example
+ {
+   "settings" : {
+     "analysis" : {
+       "analyzer" : {
+         "standard_cjk_bigram" : {
+           "tokenizer" : "standard",
+           "filter" : ["cjk_bigram"]
+         }
+       }
+     }
+   }
+ }
+ --------------------------------------------------
+
+
+ [[analysis-cjk-bigram-tokenfilter-configure-parms]]
+ ==== Configurable parameters
+
+ `ignored_scripts`::
+ +
+ --
+ (Optional, array of character scripts)
+ Array of character scripts for which to disable bigrams.
+ Possible values:
+
+ * `han`
+ * `hangul`
+ * `hiragana`
+ * `katakana`
+
+ All non-CJK input is passed through unmodified.
+ --
+
+ `output_unigrams`::
+ (Optional, boolean)
+ If `true`, emit tokens in both bigram and
+ https://en.wikipedia.org/wiki/N-gram[unigram] form. If `false`, a CJK character
+ is output in unigram form when it has no adjacent characters. Defaults to
+ `false`.
+
+ [[analysis-cjk-bigram-tokenfilter-customize]]
+ ==== Customize
+
+ To customize the CJK bigram token filter, duplicate it to create the basis
+ for a new custom token filter. You can modify the filter using its configurable
+ parameters.

[source,js]
--------------------------------------------------
@@ -30,9 +188,9 @@ PUT /cjk_bigram_example
"han_bigrams_filter" : {
  "type" : "cjk_bigram",
  "ignored_scripts": [
+     "hangul",
    "hiragana",
-     "katakana",
-     "hangul"
+     "katakana"
  ],
  "output_unigrams" : true
}
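
As a rough illustration of the behavior the new documentation describes, the following Python sketch forms overlapping bigrams from runs of adjacent CJK characters. It is not Lucene's `CJKBigramFilter`: the helper names (`is_cjk`, `cjk_runs`, `cjk_bigrams`) are hypothetical, detecting scripts by Unicode character name is an assumption made for brevity, and non-CJK text is simply dropped here rather than passed through as the real filter does. The unigram-before-bigram ordering when `output_unigrams` is set is likewise a choice of this sketch.

```python
import unicodedata

# Unicode-name prefixes treated as CJK in this sketch (an assumption, not Lucene's rule).
CJK_PREFIXES = ("CJK UNIFIED IDEOGRAPH", "HIRAGANA", "KATAKANA", "HANGUL")


def is_cjk(ch):
    """Return True if the character's Unicode name marks it as Han, kana, or Hangul."""
    return unicodedata.name(ch, "").startswith(CJK_PREFIXES)


def cjk_runs(text):
    """Split text into maximal runs of adjacent CJK characters."""
    runs, current = [], []
    for ch in text:
        if is_cjk(ch):
            current.append(ch)
        elif current:
            # A non-CJK character (for example the comma "、") ends the run.
            runs.append("".join(current))
            current = []
    if current:
        runs.append("".join(current))
    return runs


def cjk_bigrams(text, output_unigrams=False):
    """Emit overlapping bigrams per run; a lone character falls back to a unigram."""
    tokens = []
    for run in cjk_runs(text):
        if len(run) == 1 and not output_unigrams:
            tokens.append(run)  # no neighbor to pair with
            continue
        for i, ch in enumerate(run):
            if output_unigrams:
                tokens.append(ch)
            if i + 1 < len(run):
                tokens.append(run[i:i + 2])
    return tokens


print(cjk_bigrams("東京都は、日本の首都であり"))
```

With the default `output_unigrams=False`, the sample text from the analyze API example yields the same ten bigrams listed in the documentation, because the comma splits the input into two runs and no bigram spans it.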