Commit b257ec2

[DOCS] Reformat CJK bigram and CJK width token filter docs (elastic#48210)
1 parent fc3df16 commit b257ec2

2 files changed

+249
-20
lines changed

docs/reference/analysis/tokenfilters/cjk-bigram-tokenfilter.asciidoc

+171-13
@@ -1,18 +1,176 @@
11
[[analysis-cjk-bigram-tokenfilter]]
2-
=== CJK Bigram Token Filter
2+
=== CJK bigram token filter
3+
++++
4+
<titleabbrev>CJK bigram</titleabbrev>
5+
++++
36

4-
The `cjk_bigram` token filter forms bigrams out of the CJK
5-
terms that are generated by the <<analysis-standard-tokenizer,`standard` tokenizer>>
6-
or the `icu_tokenizer` (see {plugins}/analysis-icu-tokenizer.html[`analysis-icu` plugin]).
7+
Forms https://en.wikipedia.org/wiki/Bigram[bigrams] out of CJK (Chinese,
8+
Japanese, and Korean) tokens.
79

8-
By default, when a CJK character has no adjacent characters to form a bigram,
9-
it is output in unigram form. If you always want to output both unigrams and
10-
bigrams, set the `output_unigrams` flag to `true`. This can be used for a
11-
combined unigram+bigram approach.
10+
This filter is included in {es}'s built-in <<cjk-analyzer,CJK language
11+
analyzer>>. It uses Lucene's
12+
https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/cjk/CJKBigramFilter.html[CJKBigramFilter].
1213

13-
Bigrams are generated for characters in `han`, `hiragana`, `katakana` and
14-
`hangul`, but bigrams can be disabled for particular scripts with the
15-
`ignored_scripts` parameter. All non-CJK input is passed through unmodified.
14+
15+
[[analysis-cjk-bigram-tokenfilter-analyze-ex]]
16+
==== Example
17+
18+
The following <<indices-analyze,analyze API>> request demonstrates how the
19+
CJK bigram token filter works.
20+
21+
[source,console]
22+
--------------------------------------------------
23+
GET /_analyze
24+
{
25+
"tokenizer" : "standard",
26+
"filter" : ["cjk_bigram"],
27+
"text" : "東京都は、日本の首都であり"
28+
}
29+
--------------------------------------------------
30+
31+
The filter produces the following tokens:
32+
33+
[source,text]
34+
--------------------------------------------------
35+
[ 東京, 京都, 都は, 日本, 本の, の首, 首都, 都で, であ, あり ]
36+
--------------------------------------------------
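As a rough illustration of the token list above, the following Python sketch forms overlapping bigrams within each unbroken run of CJK characters. It is a simplified model, not Lucene's `CJKBigramFilter`: the `cjk_bigrams` helper and the `CJK_RUN` character ranges are assumptions made for this example, and the sketch skips the tokenizer step entirely.

```python
import re

# Assumption: approximate ranges for Han, Hiragana, Katakana, and Hangul
# syllables. Lucene relies on the token types set by the tokenizer instead.
CJK_RUN = re.compile(
    r"[\u4e00-\u9fff\u3041-\u3096\u30a1-\u30fa\u30fc\uac00-\ud7a3]+"
)


def cjk_bigrams(text):
    """Return overlapping bigrams for each run of CJK characters in text."""
    tokens = []
    for run in CJK_RUN.findall(text):
        if len(run) == 1:
            # A character with no CJK neighbour is kept as a unigram.
            tokens.append(run)
            continue
        # Slide a two-character window across the run.
        for i in range(len(run) - 1):
            tokens.append(run[i:i + 2])
    return tokens


print(cjk_bigrams("東京都は、日本の首都であり"))
```

The ideographic comma `、` falls outside the ranges, so it splits the text into two runs and no bigram spans it, which matches the gap between `都は` and `日本` in the list above.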
37+
38+
/////////////////////
39+
[source,console-result]
40+
--------------------------------------------------
41+
{
42+
"tokens" : [
43+
{
44+
"token" : "東京",
45+
"start_offset" : 0,
46+
"end_offset" : 2,
47+
"type" : "<DOUBLE>",
48+
"position" : 0
49+
},
50+
{
51+
"token" : "京都",
52+
"start_offset" : 1,
53+
"end_offset" : 3,
54+
"type" : "<DOUBLE>",
55+
"position" : 1
56+
},
57+
{
58+
"token" : "都は",
59+
"start_offset" : 2,
60+
"end_offset" : 4,
61+
"type" : "<DOUBLE>",
62+
"position" : 2
63+
},
64+
{
65+
"token" : "日本",
66+
"start_offset" : 5,
67+
"end_offset" : 7,
68+
"type" : "<DOUBLE>",
69+
"position" : 3
70+
},
71+
{
72+
"token" : "本の",
73+
"start_offset" : 6,
74+
"end_offset" : 8,
75+
"type" : "<DOUBLE>",
76+
"position" : 4
77+
},
78+
{
79+
"token" : "の首",
80+
"start_offset" : 7,
81+
"end_offset" : 9,
82+
"type" : "<DOUBLE>",
83+
"position" : 5
84+
},
85+
{
86+
"token" : "首都",
87+
"start_offset" : 8,
88+
"end_offset" : 10,
89+
"type" : "<DOUBLE>",
90+
"position" : 6
91+
},
92+
{
93+
"token" : "都で",
94+
"start_offset" : 9,
95+
"end_offset" : 11,
96+
"type" : "<DOUBLE>",
97+
"position" : 7
98+
},
99+
{
100+
"token" : "であ",
101+
"start_offset" : 10,
102+
"end_offset" : 12,
103+
"type" : "<DOUBLE>",
104+
"position" : 8
105+
},
106+
{
107+
"token" : "あり",
108+
"start_offset" : 11,
109+
"end_offset" : 13,
110+
"type" : "<DOUBLE>",
111+
"position" : 9
112+
}
113+
]
114+
}
115+
--------------------------------------------------
116+
/////////////////////
117+
118+
[[analysis-cjk-bigram-tokenfilter-analyzer-ex]]
119+
==== Add to an analyzer
120+
121+
The following <<indices-create-index,create index API>> request uses the
122+
CJK bigram token filter to configure a new
123+
<<analysis-custom-analyzer,custom analyzer>>.
124+
125+
[source,console]
126+
--------------------------------------------------
127+
PUT /cjk_bigram_example
128+
{
129+
"settings" : {
130+
"analysis" : {
131+
"analyzer" : {
132+
"standard_cjk_bigram" : {
133+
"tokenizer" : "standard",
134+
"filter" : ["cjk_bigram"]
135+
}
136+
}
137+
}
138+
}
139+
}
140+
--------------------------------------------------
141+
142+
143+
[[analysis-cjk-bigram-tokenfilter-configure-parms]]
144+
==== Configurable parameters
145+
146+
`ignored_scripts`::
147+
+
148+
--
149+
(Optional, array of character scripts)
150+
Array of character scripts for which to disable bigrams.
151+
Possible values:
152+
153+
* `han`
154+
* `hangul`
155+
* `hiragana`
156+
* `katakana`
157+
158+
All non-CJK input is passed through unmodified.
159+
--
160+
161+
`output_unigrams`::
162+
(Optional, boolean)
163+
If `true`, emit tokens in both bigram and
164+
https://en.wikipedia.org/wiki/N-gram[unigram] form. If `false`, a CJK character
165+
is output in unigram form when it has no adjacent characters. Defaults to
166+
`false`.
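The effect of `output_unigrams` on a single run of CJK characters can be sketched as follows. This is a simplified model under stated assumptions: `bigrams_for_run` is a hypothetical helper, not part of {es} or Lucene, and the order in which unigrams and bigrams are interleaved here is an assumption rather than documented behavior.

```python
def bigrams_for_run(run, output_unigrams=False):
    """Tokens for one unbroken run of CJK characters (simplified model)."""
    if len(run) == 1:
        # A lone character is emitted as a unigram regardless of the flag.
        return [run]
    tokens = []
    for i, ch in enumerate(run):
        if output_unigrams:
            # Assumption: each unigram precedes the bigram that starts with it.
            tokens.append(ch)
        if i + 1 < len(run):
            tokens.append(run[i:i + 2])
    return tokens


print(bigrams_for_run("東京都"))
print(bigrams_for_run("東京都", output_unigrams=True))
```

With the flag off, the three-character run yields two bigrams. With it on, every character also appears on its own, so the same run yields five tokens.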
167+
168+
[[analysis-cjk-bigram-tokenfilter-customize]]
169+
==== Customize
170+
171+
To customize the CJK bigram token filter, duplicate it to create the basis
172+
for a new custom token filter. You can modify the filter using its configurable
173+
parameters.
16174

17175
[source,js]
18176
--------------------------------------------------
@@ -30,9 +188,9 @@ PUT /cjk_bigram_example
30188
"han_bigrams_filter" : {
31189
"type" : "cjk_bigram",
32190
"ignored_scripts": [
191+
"hangul",
33192
"hiragana",
34-
"katakana",
35-
"hangul"
193+
"katakana"
36194
],
37195
"output_unigrams" : true
38196
}
@@ -1,12 +1,83 @@
11
[[analysis-cjk-width-tokenfilter]]
2-
=== CJK Width Token Filter
2+
=== CJK width token filter
3+
++++
4+
<titleabbrev>CJK width</titleabbrev>
5+
++++
36

4-
The `cjk_width` token filter normalizes CJK width differences:
7+
Normalizes width differences in CJK (Chinese, Japanese, and Korean) characters
8+
as follows:
59

6-
* Folds fullwidth ASCII variants into the equivalent basic Latin
7-
* Folds halfwidth Katakana variants into the equivalent Kana
10+
* Folds full-width ASCII character variants into the equivalent basic Latin
11+
characters
12+
* Folds half-width Katakana character variants into the equivalent Kana
13+
characters
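The two folding rules above can be approximated in a few lines of Python. This is a sketch, not Lucene's `CJKWidthFilter`: the `fold_width` helper is hypothetical, and using NFKC on half-width Katakana runs is an assumption chosen to mimic the filter's behavior.

```python
import re
import unicodedata

# Half-width Katakana block, including the half-width voiced sound marks.
HALFWIDTH_KATAKANA_RUN = re.compile(r"[\uff65-\uff9f]+")


def fold_width(text):
    """Fold full-width ASCII and half-width Katakana variants (sketch)."""
    # Full-width ASCII variants (U+FF01..U+FF5E) sit exactly 0xFEE0 above
    # their basic Latin counterparts (U+0021..U+007E).
    folded = "".join(
        chr(ord(ch) - 0xFEE0) if "\uff01" <= ch <= "\uff5e" else ch
        for ch in text
    )
    # Normalize each whole run so that a kana followed by a half-width
    # voiced sound mark composes into a single full-width character.
    return HALFWIDTH_KATAKANA_RUN.sub(
        lambda m: unicodedata.normalize("NFKC", m.group()), folded
    )


# Half-width spelling of "seaside liner", written with escapes for clarity.
half = "\uff7c\uff70\uff7b\uff72\uff84\uff9e\uff97\uff72\uff85\uff70"
print(fold_width(half))          # シーサイドライナー
print(fold_width("\uff25\uff33"))  # ES
```

The half-width input has 10 code points and the folded result has 9, because the kana `ﾄ` and the voiced mark `ﾞ` compose into the single character `ド`.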
814

9-
NOTE: This token filter can be viewed as a subset of NFKC/NFKD
10-
Unicode normalization. See the {plugins}/analysis-icu-normalization-charfilter.html[`analysis-icu` plugin]
11-
for full normalization support.
15+
This filter is included in {es}'s built-in <<cjk-analyzer,CJK language
16+
analyzer>>. It uses Lucene's
17+
https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html[CJKWidthFilter].
1218

19+
NOTE: This token filter can be viewed as a subset of NFKC/NFKD Unicode
20+
normalization. See the
21+
{plugins}/analysis-icu-normalization-charfilter.html[`analysis-icu` plugin] for
22+
full normalization support.
23+
24+
[[analysis-cjk-width-tokenfilter-analyze-ex]]
25+
==== Example
26+
27+
[source,console]
28+
--------------------------------------------------
29+
GET /_analyze
30+
{
31+
"tokenizer" : "standard",
32+
"filter" : ["cjk_width"],
33+
"text" : "ｼｰｻｲﾄﾞﾗｲﾅｰ"
34+
}
35+
--------------------------------------------------
36+
37+
The filter produces the following token:
38+
39+
[source,text]
40+
--------------------------------------------------
41+
シーサイドライナー
42+
--------------------------------------------------
43+
44+
/////////////////////
45+
[source,console-result]
46+
--------------------------------------------------
47+
{
48+
"tokens" : [
49+
{
50+
"token" : "シーサイドライナー",
51+
"start_offset" : 0,
52+
"end_offset" : 10,
53+
"type" : "<KATAKANA>",
54+
"position" : 0
55+
}
56+
]
57+
}
58+
--------------------------------------------------
59+
/////////////////////
60+
61+
[[analysis-cjk-width-tokenfilter-analyzer-ex]]
62+
==== Add to an analyzer
63+
64+
The following <<indices-create-index,create index API>> request uses the
65+
CJK width token filter to configure a new
66+
<<analysis-custom-analyzer,custom analyzer>>.
67+
68+
[source,console]
69+
--------------------------------------------------
70+
PUT /cjk_width_example
71+
{
72+
"settings" : {
73+
"analysis" : {
74+
"analyzer" : {
75+
"standard_cjk_width" : {
76+
"tokenizer" : "standard",
77+
"filter" : ["cjk_width"]
78+
}
79+
}
80+
}
81+
}
82+
}
83+
--------------------------------------------------