[[analysis-common-grams-tokenfilter]]
- === Common Grams Token Filter
+ === Common grams token filter
+ ++++
+ <titleabbrev>Common grams</titleabbrev>
+ ++++

- Token filter that generates bigrams for frequently occurring terms.
- Single terms are still indexed. It can be used as an alternative to the
- <<analysis-stop-tokenfilter,Stop
- Token Filter>> when we don't want to completely ignore common terms.
+ Generates https://en.wikipedia.org/wiki/Bigram[bigrams] for a specified set of
+ common words.

- For example, the text "the quick brown is a fox" will be tokenized as
- "the", "the_quick", "quick", "brown", "brown_is", "is", "is_a", "a",
- "a_fox", "fox". Assuming "the", "is" and "a" are common words.
+ For example, you can specify `is` and `the` as common words. This filter then
+ converts the tokens `[the, quick, fox, is, brown]` to `[the, the_quick, quick,
+ fox, fox_is, is, is_brown, brown]`.

- When `query_mode` is enabled, the token filter removes common words and
- single terms followed by a common word. This parameter should be enabled
- in the search analyzer.
+ You can use the `common_grams` filter in place of the
+ <<analysis-stop-tokenfilter,stop token filter>> when you don't want to
+ completely ignore common words.

- For example, the query "the quick brown is a fox" will be tokenized as
- "the_quick", "quick", "brown_is", "is_a", "a_fox", "fox".
+ This filter uses Lucene's
+ https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/commongrams/CommonGramsFilter.html[CommonGramsFilter].

- The following are settings that can be set:
+ [[analysis-common-grams-analyze-ex]]
+ ==== Example

- [cols="<,<",options="header",]
- |=======================================================================
- |Setting |Description
- |`common_words` |A list of common words to use.
+ The following <<indices-analyze,analyze API>> request creates bigrams for `is`
+ and `the`:

- |`common_words_path` |A path (either relative to `config` location, or
- absolute) to a list of common words. Each word should be in its own
- "line" (separated by a line break). The file must be UTF-8 encoded.
-
- |`ignore_case` |If true, common words matching will be case insensitive
- (defaults to `false`).
-
- |`query_mode` |Generates bigrams then removes common words and single
- terms followed by a common word (defaults to `false`).
- |=======================================================================
-
- Note, `common_words` or `common_words_path` field is required.
-
- Here is an example:
-
- [source,js]
27
+ [source,console]
43
28
--------------------------------------------------
44
- PUT /common_grams_example
29
+ GET /_analyze
45
30
{
46
- "settings": {
47
- "analysis": {
48
- "analyzer": {
49
- "index_grams": {
50
- "tokenizer": "whitespace",
51
- "filter": ["common_grams"]
52
- },
53
- "search_grams": {
54
- "tokenizer": "whitespace",
55
- "filter": ["common_grams_query"]
56
- }
57
- },
58
- "filter": {
59
- "common_grams": {
60
- "type": "common_grams",
61
- "common_words": ["the", "is", "a"]
62
- },
63
- "common_grams_query": {
64
- "type": "common_grams",
65
- "query_mode": true,
66
- "common_words": ["the", "is", "a"]
67
- }
68
- }
69
- }
31
+ "tokenizer" : "whitespace",
32
+ "filter" : [
33
+ "common_grams", {
34
+ "type": "common_grams",
35
+ "common_words": ["is", "the"]
70
36
}
37
+ ],
38
+ "text" : "the quick fox is brown"
71
39
}
72
40
--------------------------------------------------
- // CONSOLE

- You can see the output by using e.g. the `_analyze` endpoint:
+ The filter produces the following tokens:

- [source,js]
+ [source,text]
--------------------------------------------------
- POST /common_grams_example/_analyze
- {
-   "analyzer" : "index_grams",
-   "text" : "the quick brown is a fox"
- }
+ [ the, the_quick, quick, fox, fox_is, is, is_brown, brown ]
--------------------------------------------------
- // CONSOLE
- // TEST[continued]

- And the response will be:
-
- [source,js]
+ /////////////////////
+ [source,console-result]
--------------------------------------------------
{
  "tokens" : [
@@ -114,58 +75,155 @@ And the response will be:
      "position" : 1
    },
    {
-     "token" : "brown",
+     "token" : "fox",
      "start_offset" : 10,
-     "end_offset" : 15,
+     "end_offset" : 13,
      "type" : "word",
      "position" : 2
    },
    {
-     "token" : "brown_is",
+     "token" : "fox_is",
      "start_offset" : 10,
-     "end_offset" : 18,
+     "end_offset" : 16,
      "type" : "gram",
      "position" : 2,
      "positionLength" : 2
    },
    {
      "token" : "is",
-     "start_offset" : 16,
-     "end_offset" : 18,
+     "start_offset" : 14,
+     "end_offset" : 16,
      "type" : "word",
      "position" : 3
    },
    {
-     "token" : "is_a",
-     "start_offset" : 16,
-     "end_offset" : 20,
+     "token" : "is_brown",
+     "start_offset" : 14,
+     "end_offset" : 22,
      "type" : "gram",
      "position" : 3,
      "positionLength" : 2
    },
    {
-     "token" : "a",
-     "start_offset" : 19,
-     "end_offset" : 20,
+     "token" : "brown",
+     "start_offset" : 17,
+     "end_offset" : 22,
      "type" : "word",
      "position" : 4
-   },
-   {
-     "token" : "a_fox",
-     "start_offset" : 19,
-     "end_offset" : 24,
-     "type" : "gram",
-     "position" : 4,
-     "positionLength" : 2
-   },
-   {
-     "token" : "fox",
-     "start_offset" : 21,
-     "end_offset" : 24,
-     "type" : "word",
-     "position" : 5
    }
  ]
}
--------------------------------------------------
- // TESTRESPONSE
+ /////////////////////
+
+ [[analysis-common-grams-tokenfilter-analyzer-ex]]
+ ==== Add to an analyzer
+
+ The following <<indices-create-index,create index API>> request uses the
+ `common_grams` filter to configure a new
+ <<analysis-custom-analyzer,custom analyzer>>:
+
+ [source,console]
+ --------------------------------------------------
+ PUT /common_grams_example
+ {
+   "settings": {
+     "analysis": {
+       "analyzer": {
+         "index_grams": {
+           "tokenizer": "whitespace",
+           "filter": ["common_grams"]
+         }
+       },
+       "filter": {
+         "common_grams": {
+           "type": "common_grams",
+           "common_words": ["a", "is", "the"]
+         }
+       }
+     }
+   }
+ }
+ --------------------------------------------------
+
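+ To check the analyzer, you can run it against sample text with the
+ <<indices-analyze,analyze API>>. The following request is a sketch that assumes
+ the `common_grams_example` index above was just created; it should return the
+ same tokens as the earlier example:
+
+ [source,console]
+ --------------------------------------------------
+ GET /common_grams_example/_analyze
+ {
+   "analyzer" : "index_grams",
+   "text" : "the quick fox is brown"
+ }
+ --------------------------------------------------
+ // TEST[continued]
+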
+ [[analysis-common-grams-tokenfilter-configure-parms]]
+ ==== Configurable parameters
+
+ `common_words`::
+ +
+ --
+ (Required+++*+++, array of strings)
+ A list of tokens. The filter generates bigrams for these tokens.
+
+ Either this or the `common_words_path` parameter is required.
+ --
+
+ `common_words_path`::
+ +
+ --
+ (Required+++*+++, string)
+ Path to a file containing a list of tokens. The filter generates bigrams for
+ these tokens.
+
+ This path must be absolute or relative to the `config` location. The file must
+ be UTF-8 encoded. Each token in the file must be separated by a line break.
+
+ Either this or the `common_words` parameter is required.
+ --
+
+ `ignore_case`::
+ (Optional, boolean)
+ If `true`, matches for common words are case-insensitive. Defaults to `false`.
+
+ `query_mode`::
+ +
+ --
+ (Optional, boolean)
+ If `true`, the filter excludes the following tokens from the output:
+
+ * Unigrams for common words
+ * Unigrams for terms followed by common words
+
+ Defaults to `false`. We recommend enabling this parameter for
+ <<search-analyzer,search analyzers>>.
+
+ For example, you can enable this parameter and specify `is` and `the` as
+ common words. The filter then converts the tokens `[the, quick, fox, is,
+ brown]` to `[the_quick, quick, fox_is, is_brown, brown]`, as shown in the
+ example after this list.
+ --
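+
+ To see `query_mode` in action, you can enable it on an inline filter in the
+ <<indices-analyze,analyze API>>. The following request is a sketch along those
+ lines and should return `[the_quick, quick, fox_is, is_brown, brown]`:
+
+ [source,console]
+ --------------------------------------------------
+ GET /_analyze
+ {
+   "tokenizer" : "whitespace",
+   "filter" : [
+     {
+       "type": "common_grams",
+       "common_words": ["is", "the"],
+       "query_mode": true
+     }
+   ],
+   "text" : "the quick fox is brown"
+ }
+ --------------------------------------------------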
+
+ [[analysis-common-grams-tokenfilter-customize]]
+ ==== Customize
+
+ To customize the `common_grams` filter, duplicate it to create the basis
+ for a new custom token filter. You can modify the filter using its configurable
+ parameters.
+
+ For example, the following request creates a custom `common_grams` filter with
+ `ignore_case` and `query_mode` set to `true`:
+
+ [source,console]
+ --------------------------------------------------
+ PUT /common_grams_example
+ {
+   "settings": {
+     "analysis": {
+       "analyzer": {
+         "index_grams": {
+           "tokenizer": "whitespace",
+           "filter": ["common_grams_query"]
+         }
+       },
+       "filter": {
+         "common_grams_query": {
+           "type": "common_grams",
+           "common_words": ["a", "is", "the"],
+           "ignore_case": true,
+           "query_mode": true
+         }
+       }
+     }
+   }
+ }
+ --------------------------------------------------
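+
+ Since this filter enables `query_mode`, an analyzer like this is typically
+ mapped as a <<search-analyzer,search analyzer>> rather than an index-time
+ analyzer, in line with the recommendation above.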