Commit 569fb14

[DOCS] Reformat common grams token filter (#48426)

1 parent 4ce5c2d commit 569fb14

1 file changed: +157 -99 lines
@@ -1,93 +1,54 @@
 [[analysis-common-grams-tokenfilter]]
-=== Common Grams Token Filter
+=== Common grams token filter
+++++
+<titleabbrev>Common grams</titleabbrev>
+++++
 
-Token filter that generates bigrams for frequently occurring terms.
-Single terms are still indexed. It can be used as an alternative to the
-<<analysis-stop-tokenfilter,Stop
-Token Filter>> when we don't want to completely ignore common terms.
+Generates https://en.wikipedia.org/wiki/Bigram[bigrams] for a specified set of
+common words.
 
-For example, the text "the quick brown is a fox" will be tokenized as
-"the", "the_quick", "quick", "brown", "brown_is", "is", "is_a", "a",
-"a_fox", "fox". Assuming "the", "is" and "a" are common words.
+For example, you can specify `is` and `the` as common words. This filter then
+converts the tokens `[the, quick, fox, is, brown]` to `[the, the_quick, quick,
+fox, fox_is, is, is_brown, brown]`.
 
-When `query_mode` is enabled, the token filter removes common words and
-single terms followed by a common word. This parameter should be enabled
-in the search analyzer.
+You can use the `common_grams` filter in place of the
+<<analysis-stop-tokenfilter,stop token filter>> when you don't want to
+completely ignore common words.
 
-For example, the query "the quick brown is a fox" will be tokenized as
-"the_quick", "quick", "brown_is", "is_a", "a_fox", "fox".
+This filter uses Lucene's
+https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/commongrams/CommonGramsFilter.html[CommonGramsFilter].
 
-The following are settings that can be set:
+[[analysis-common-grams-analyze-ex]]
+==== Example
 
-[cols="<,<",options="header",]
-|=======================================================================
-|Setting |Description
-|`common_words` |A list of common words to use.
+The following <<indices-analyze,analyze API>> request creates bigrams for `is`
+and `the`:
 
-|`common_words_path` |A path (either relative to `config` location, or
-absolute) to a list of common words. Each word should be in its own
-"line" (separated by a line break). The file must be UTF-8 encoded.
-
-|`ignore_case` |If true, common words matching will be case insensitive
-(defaults to `false`).
-
-|`query_mode` |Generates bigrams then removes common words and single
-terms followed by a common word (defaults to `false`).
-|=======================================================================
-
-Note, `common_words` or `common_words_path` field is required.
-
-Here is an example:
-
-[source,js]
+[source,console]
 --------------------------------------------------
-PUT /common_grams_example
+GET /_analyze
 {
-  "settings": {
-    "analysis": {
-      "analyzer": {
-        "index_grams": {
-          "tokenizer": "whitespace",
-          "filter": ["common_grams"]
-        },
-        "search_grams": {
-          "tokenizer": "whitespace",
-          "filter": ["common_grams_query"]
-        }
-      },
-      "filter": {
-        "common_grams": {
-          "type": "common_grams",
-          "common_words": ["the", "is", "a"]
-        },
-        "common_grams_query": {
-          "type": "common_grams",
-          "query_mode": true,
-          "common_words": ["the", "is", "a"]
-        }
-      }
-    }
+  "tokenizer" : "whitespace",
+  "filter" : [
+    "common_grams", {
+      "type": "common_grams",
+      "common_words": ["is", "the"]
     }
+  ],
+  "text" : "the quick fox is brown"
 }
 --------------------------------------------------
 // CONSOLE
 
-You can see the output by using e.g. the `_analyze` endpoint:
+The filter produces the following tokens:
 
-[source,js]
+[source,text]
 --------------------------------------------------
-POST /common_grams_example/_analyze
-{
-  "analyzer" : "index_grams",
-  "text" : "the quick brown is a fox"
-}
+[ the, the_quick, quick, fox, fox_is, is, is_brown, brown ]
 --------------------------------------------------
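The token stream in the reformatted example is easy to reproduce: every unigram is kept, and a bigram is emitted for each adjacent pair that contains a common word. Below is a minimal Python sketch of that logic (the function name is ours, standing in for Lucene's CommonGramsFilter; it is illustrative, not the actual implementation):

```python
def common_grams(tokens, common_words):
    """Emit each unigram, plus a bigram for every adjacent pair of
    tokens in which at least one member is a common word."""
    common = set(common_words)
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)  # unigrams are always kept
        if i + 1 < len(tokens) and (tok in common or tokens[i + 1] in common):
            out.append(f"{tok}_{tokens[i + 1]}")  # bigram right after its first term
    return out
```

Run on the example input, it yields the same stream the docs show: `the, the_quick, quick, fox, fox_is, is, is_brown, brown`.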
-// CONSOLE
-// TEST[continued]
 
-And the response will be:
-
-[source,js]
+/////////////////////
+[source,console-result]
 --------------------------------------------------
 {
   "tokens" : [
@@ -114,58 +75,155 @@ And the response will be:
       "position" : 1
     },
     {
-      "token" : "brown",
+      "token" : "fox",
       "start_offset" : 10,
-      "end_offset" : 15,
+      "end_offset" : 13,
       "type" : "word",
       "position" : 2
     },
     {
-      "token" : "brown_is",
+      "token" : "fox_is",
       "start_offset" : 10,
-      "end_offset" : 18,
+      "end_offset" : 16,
       "type" : "gram",
       "position" : 2,
       "positionLength" : 2
     },
     {
       "token" : "is",
-      "start_offset" : 16,
-      "end_offset" : 18,
+      "start_offset" : 14,
+      "end_offset" : 16,
       "type" : "word",
       "position" : 3
     },
     {
-      "token" : "is_a",
-      "start_offset" : 16,
-      "end_offset" : 20,
+      "token" : "is_brown",
+      "start_offset" : 14,
+      "end_offset" : 22,
       "type" : "gram",
       "position" : 3,
       "positionLength" : 2
     },
     {
-      "token" : "a",
-      "start_offset" : 19,
-      "end_offset" : 20,
+      "token" : "brown",
+      "start_offset" : 17,
+      "end_offset" : 22,
       "type" : "word",
       "position" : 4
-    },
-    {
-      "token" : "a_fox",
-      "start_offset" : 19,
-      "end_offset" : 24,
-      "type" : "gram",
-      "position" : 4,
-      "positionLength" : 2
-    },
-    {
-      "token" : "fox",
-      "start_offset" : 21,
-      "end_offset" : 24,
-      "type" : "word",
-      "position" : 5
     }
   ]
 }
 --------------------------------------------------
-// TESTRESPONSE
+/////////////////////
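The `start_offset` and `end_offset` values in the response above are character positions in the analyzed text, and a `gram` token spans from the start of its first term to the end of its second (which is why `fox_is` runs from 10 to 16). A short Python sketch, illustrative only, reproduces the offsets for the example text:

```python
text = "the quick fox is brown"

# Character offsets of each whitespace token in the original text.
offsets = {}
pos = 0
for tok in text.split():
    start = text.index(tok, pos)  # find the token at or after the previous one
    offsets[tok] = (start, start + len(tok))
    pos = start + len(tok)

# A bigram spans from the start of its first term to the end of its
# second, e.g. fox_is covers fox (start 10) through is (end 16).
fox_is = (offsets["fox"][0], offsets["is"][1])
```

This matches the response: `fox` at (10, 13), `is` at (14, 16), `fox_is` at (10, 16), `is_brown` at (14, 22).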
+
+[[analysis-common-grams-tokenfilter-analyzer-ex]]
+==== Add to an analyzer
+
+The following <<indices-create-index,create index API>> request uses the
+`common_grams` filter to configure a new
+<<analysis-custom-analyzer,custom analyzer>>:
+
+[source,console]
+--------------------------------------------------
+PUT /common_grams_example
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "index_grams": {
+          "tokenizer": "whitespace",
+          "filter": ["common_grams"]
+        }
+      },
+      "filter": {
+        "common_grams": {
+          "type": "common_grams",
+          "common_words": ["a", "is", "the"]
+        }
+      }
+    }
+  }
+}
+--------------------------------------------------
+
+[[analysis-common-grams-tokenfilter-configure-parms]]
+==== Configurable parameters
+
+`common_words`::
++
+--
+(Required+++*+++, array of strings)
+A list of tokens. The filter generates bigrams for these tokens.
+
+Either this or the `common_words_path` parameter is required.
+--
+
+`common_words_path`::
++
+--
+(Required+++*+++, string)
+Path to a file containing a list of tokens. The filter generates bigrams for
+these tokens.
+
+This path must be absolute or relative to the `config` location. The file must
+be UTF-8 encoded. Each token in the file must be separated by a line break.
+
+Either this or the `common_words` parameter is required.
+--
+
+`ignore_case`::
+(Optional, boolean)
+If `true`, matching for common words is case-insensitive.
+Defaults to `false`.
+
+`query_mode`::
++
+--
+(Optional, boolean)
+If `true`, the filter excludes the following tokens from the output:
+
+* Unigrams for common words
+* Unigrams for terms followed by common words
+
+Defaults to `false`. We recommend enabling this parameter for
+<<search-analyzer,search analyzers>>.
+
+For example, you can enable this parameter and specify `is` and `the` as
+common words. This filter converts the tokens `[the, quick, fox, is, brown]` to
+`[the_quick, quick, fox_is, is_brown, brown]`.
+--
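The `query_mode` exclusion rules can be sketched the same way as the bigram generation: keep every bigram, but drop a unigram when the token is itself a common word or is immediately followed by one. Again this is plain illustrative Python, not Elasticsearch code, and `common_grams_query` is our own name for the helper:

```python
def common_grams_query(tokens, common_words):
    """Bigram generation with query_mode semantics: unigrams for common
    words, or for terms followed by a common word, are excluded."""
    common = set(common_words)
    out = []
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        # Keep the unigram only if the token is not common and is not
        # immediately followed by a common word.
        if tok not in common and (nxt is None or nxt not in common):
            out.append(tok)
        # Bigrams are always emitted for pairs involving a common word.
        if nxt is not None and (tok in common or nxt in common):
            out.append(f"{tok}_{nxt}")
    return out
```

For `[the, quick, fox, is, brown]` with common words `is` and `the`, this drops the unigrams `the`, `fox`, and `is`, matching the example output above.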
+
+[[analysis-common-grams-tokenfilter-customize]]
+==== Customize
+
+To customize the `common_grams` filter, duplicate it to create the basis
+for a new custom token filter. You can modify the filter using its configurable
+parameters.
+
+For example, the following request creates a custom `common_grams` filter with
+`ignore_case` and `query_mode` set to `true`:
+
+[source,console]
+--------------------------------------------------
+PUT /common_grams_example
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "index_grams": {
+          "tokenizer": "whitespace",
+          "filter": ["common_grams_query"]
+        }
+      },
+      "filter": {
+        "common_grams_query": {
+          "type": "common_grams",
+          "common_words": ["a", "is", "the"],
+          "ignore_case": true,
+          "query_mode": true
+        }
+      }
+    }
+  }
+}
+--------------------------------------------------
