Commit 2a5cd9f

[DOCS] Reformat elision token filter docs (#49262)

1 parent d855970 commit 2a5cd9f

1 file changed: +163 -13 lines changed

docs/reference/analysis/tokenfilters/elision-tokenfilter.asciidoc (163 additions & 13 deletions)
@@ -1,15 +1,96 @@
 [[analysis-elision-tokenfilter]]
-=== Elision Token Filter
+=== Elision token filter
+++++
+<titleabbrev>Elision</titleabbrev>
+++++

-A token filter which removes elisions. For example, "l'avion" (the
-plane) will tokenized as "avion" (plane).
+Removes specified https://en.wikipedia.org/wiki/Elision[elisions] from
+the beginning of tokens. For example, you can use this filter to change
+`l'avion` to `avion`.

-Requires either an `articles` parameter which is a set of stop word articles, or
-`articles_path` which points to a text file containing the stop set. Also optionally
-accepts `articles_case`, which indicates whether the filter treats those articles as
-case sensitive.
+When not customized, the filter removes the following French elisions by default:

-For example:
+`l'`, `m'`, `t'`, `qu'`, `n'`, `s'`, `j'`, `d'`, `c'`, `jusqu'`, `quoiqu'`,
+`lorsqu'`, `puisqu'`
+
+Customized versions of this filter are included in several of {es}'s built-in
+<<analysis-lang-analyzer,language analyzers>>:
+
+* <<catalan-analyzer, Catalan analyzer>>
+* <<french-analyzer, French analyzer>>
+* <<irish-analyzer, Irish analyzer>>
+* <<italian-analyzer, Italian analyzer>>
+
+This filter uses Lucene's
+https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/util/ElisionFilter.html[ElisionFilter].
+
+[[analysis-elision-tokenfilter-analyze-ex]]
+==== Example
+
+The following <<indices-analyze,analyze API>> request uses the `elision`
+filter to remove `j'` from `j’examine près du wharf`:
+
+[source,console]
+--------------------------------------------------
+GET _analyze
+{
+  "tokenizer" : "standard",
+  "filter" : ["elision"],
+  "text" : "j’examine près du wharf"
+}
+--------------------------------------------------
+
+The filter produces the following tokens:
+
+[source,text]
+--------------------------------------------------
+[ examine, près, du, wharf ]
+--------------------------------------------------
+
+/////////////////////
+[source,console-result]
+--------------------------------------------------
+{
+  "tokens" : [
+    {
+      "token" : "examine",
+      "start_offset" : 0,
+      "end_offset" : 9,
+      "type" : "<ALPHANUM>",
+      "position" : 0
+    },
+    {
+      "token" : "près",
+      "start_offset" : 10,
+      "end_offset" : 14,
+      "type" : "<ALPHANUM>",
+      "position" : 1
+    },
+    {
+      "token" : "du",
+      "start_offset" : 15,
+      "end_offset" : 17,
+      "type" : "<ALPHANUM>",
+      "position" : 2
+    },
+    {
+      "token" : "wharf",
+      "start_offset" : 18,
+      "end_offset" : 23,
+      "type" : "<ALPHANUM>",
+      "position" : 3
+    }
+  ]
+}
+--------------------------------------------------
+/////////////////////
+
+[[analysis-elision-tokenfilter-analyzer-ex]]
+==== Add to an analyzer
+
+The following <<indices-create-index,create index API>> request uses the
+`elision` filter to configure a new
+<<analysis-custom-analyzer,custom analyzer>>.

 [source,console]
 --------------------------------------------------
@@ -18,16 +99,85 @@ PUT /elision_example
   "settings" : {
     "analysis" : {
       "analyzer" : {
-        "default" : {
-          "tokenizer" : "standard",
+        "whitespace_elision" : {
+          "tokenizer" : "whitespace",
           "filter" : ["elision"]
         }
+      }
+    }
+  }
+}
+--------------------------------------------------
+
+[[analysis-elision-tokenfilter-configure-parms]]
+==== Configurable parameters
+
+[[analysis-elision-tokenfilter-articles]]
+`articles`::
++
+--
+(Required+++*+++, array of string)
+List of elisions to remove.
+
+To be removed, the elision must be at the beginning of a token and be
+immediately followed by an apostrophe. Both the elision and apostrophe are
+removed.
+
+For custom `elision` filters, either this parameter or `articles_path` must be
+specified.
+--
+
+`articles_path`::
++
+--
+(Required+++*+++, string)
+Path to a file that contains a list of elisions to remove.
+
+This path must be absolute or relative to the `config` location, and the file
+must be UTF-8 encoded. Each elision in the file must be separated by a line
+break.
+
+To be removed, the elision must be at the beginning of a token and be
+immediately followed by an apostrophe. Both the elision and apostrophe are
+removed.
+
+For custom `elision` filters, either this parameter or `articles` must be
+specified.
+--
+
+`articles_case`::
+(Optional, boolean)
+If `true`, the filter treats any provided elisions as case sensitive.
+Defaults to `false`.
+
+[[analysis-elision-tokenfilter-customize]]
+==== Customize
+
+To customize the `elision` filter, duplicate it to create the basis
+for a new custom token filter. You can modify the filter using its configurable
+parameters.
+
+For example, the following request creates a custom case-sensitive `elision`
+filter that removes the `l'`, `m'`, `t'`, `qu'`, `n'`, `s'`,
+and `j'` elisions:
+
+[source,console]
+--------------------------------------------------
+PUT /elision_case_sensitive_example
+{
+  "settings" : {
+    "analysis" : {
+      "analyzer" : {
+        "default" : {
+          "tokenizer" : "whitespace",
+          "filter" : ["elision_case_sensitive"]
+        }
       },
       "filter" : {
-        "elision" : {
+        "elision_case_sensitive" : {
           "type" : "elision",
-          "articles_case": true,
-          "articles" : ["l", "m", "t", "qu", "n", "s", "j"]
+          "articles" : ["l", "m", "t", "qu", "n", "s", "j"],
+          "articles_case": true
         }
       }
     }
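The reformatted "Add to an analyzer" example above creates the index but stops before verifying it. A minimal follow-up sketch, reusing the `elision_example` index and `whitespace_elision` analyzer from the diff (this request is an illustration, not part of the commit):

[source,console]
--------------------------------------------------
GET /elision_example/_analyze
{
  "analyzer" : "whitespace_elision",
  "text" : "l'avion"
}
--------------------------------------------------

With the default `articles` list, this should return the single token `avion`.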

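The `articles_path` parameter is documented in the diff but never exercised in an example. A sketch of such a request, assuming a hypothetical `analysis/example_elisions.txt` file under the `config` directory, UTF-8 encoded with one article per line (for example `l`, `m`, `t`), mirroring the `articles` array:

[source,console]
--------------------------------------------------
PUT /elision_from_file_example
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "default" : {
          "tokenizer" : "whitespace",
          "filter" : ["elision_from_file"]
        }
      },
      "filter" : {
        "elision_from_file" : {
          "type" : "elision",
          "articles_path" : "analysis/example_elisions.txt"
        }
      }
    }
  }
}
--------------------------------------------------

The index name, filter name, and file path here are placeholders; per the parameter descriptions, one of `articles` or `articles_path` must be supplied.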
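Likewise, the case-sensitive customize example never shows `articles_case` in action. Because the `elision_case_sensitive` filter lists only lowercase articles and sets `articles_case` to `true`, analyzing mixed-case text should strip `l'` but leave `L'` intact (a sketch; the expected output is inferred, not taken from the commit):

[source,console]
--------------------------------------------------
GET /elision_case_sensitive_example/_analyze
{
  "analyzer" : "default",
  "text" : "L'avion l'avion"
}
--------------------------------------------------

Expected tokens under that assumption: `L'avion` and `avion`.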