Commit c36df27

[DOCS] Reformat pattern_replace token filter (#57699) (#57995)
Changes:

* Rewrites description and adds Lucene link
* Adds analyze example
* Adds parameter definitions
* Adds custom analyzer example
1 parent 85b0b54 commit c36df27

1 file changed

+148
-14
lines changed

docs/reference/analysis/tokenfilters/pattern_replace-tokenfilter.asciidoc

Lines changed: 148 additions & 14 deletions
@@ -4,23 +4,157 @@
44
<titleabbrev>Pattern replace</titleabbrev>
55
++++
66

7-
The `pattern_replace` token filter allows to easily handle string
8-
replacements based on a regular expression. The regular expression is
9-
defined using the `pattern` parameter, and the replacement string can be
10-
provided using the `replacement` parameter (supporting referencing the
11-
original text, as explained
12-
http://docs.oracle.com/javase/6/docs/api/java/util/regex/Matcher.html#appendReplacement(java.lang.StringBuffer,%20java.lang.String)[here]).
7+
Uses a regular expression to match and replace token substrings.
8+
9+
The `pattern_replace` filter uses
10+
http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html[Java's
11+
regular expression syntax]. By default, the filter replaces matching
12+
substrings with an empty substring (`""`).
13+
14+
Regular expressions cannot be anchored to the
15+
beginning or end of a token. Replacement substrings can use Java's
16+
https://docs.oracle.com/javase/8/docs/api/java/util/regex/Matcher.html#appendReplacement-java.lang.StringBuffer-java.lang.String-[`$g` syntax] to reference capture groups
17+
from the original token text.
1318

1419
[WARNING]
15-
.Beware of Pathological Regular Expressions
16-
========================================
20+
====
21+
A poorly-written regular expression may run slowly or return a
22+
StackOverflowError, causing the node running the expression to exit suddenly.
23+
24+
Read more about
25+
http://www.regular-expressions.info/catastrophic.html[pathological regular
26+
expressions and how to avoid them].
27+
====
28+
29+
This filter uses Lucene's
30+
{lucene-analysis-docs}/pattern/PatternReplaceFilter.html[PatternReplaceFilter].
31+
32+
[[analysis-pattern-replace-tokenfilter-analyze-ex]]
33+
==== Example
34+
35+
The following <<indices-analyze,analyze API>> request uses the `pattern_replace`
36+
filter to prepend `watch` to the substring `dog` in `foxes jump lazy dogs`.
37+
38+
[source,console]
39+
----
40+
GET /_analyze
41+
{
42+
"tokenizer": "whitespace",
43+
"filter": [
44+
{
45+
"type": "pattern_replace",
46+
"pattern": "(dog)",
47+
"replacement": "watch$1"
48+
}
49+
],
50+
"text": "foxes jump lazy dogs"
51+
}
52+
----
53+
54+
The filter produces the following tokens.
55+
56+
[source,text]
57+
----
58+
[ foxes, jump, lazy, watchdogs ]
59+
----
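
The effect of the analyze request above can be approximated outside Elasticsearch with plain `java.util.regex`. The following is a minimal sketch, not the filter's actual implementation: the class name `PatternReplaceSketch` and the `analyze` helper are hypothetical, and splitting on whitespace only roughly stands in for the `whitespace` tokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class PatternReplaceSketch {

    // Hypothetical helper: split the text on whitespace, then replace every
    // match of the regular expression inside each token. "$1" in the
    // replacement refers to the first capture group of the match.
    static List<String> analyze(String text, String regex, String replacement) {
        Pattern pattern = Pattern.compile(regex);
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            tokens.add(pattern.matcher(token).replaceAll(replacement));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [foxes, jump, lazy, watchdogs]
        System.out.println(analyze("foxes jump lazy dogs", "(dog)", "watch$1"));
    }
}
```

Only the token text changes in this sketch. The hidden response in the diff shows the same thing on the Elasticsearch side: `watchdogs` keeps the offsets 16 to 20 of the original `dogs` token.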
60+
61+
////
62+
[source,console-result]
63+
----
64+
{
65+
"tokens": [
66+
{
67+
"token": "foxes",
68+
"start_offset": 0,
69+
"end_offset": 5,
70+
"type": "word",
71+
"position": 0
72+
},
73+
{
74+
"token": "jump",
75+
"start_offset": 6,
76+
"end_offset": 10,
77+
"type": "word",
78+
"position": 1
79+
},
80+
{
81+
"token": "lazy",
82+
"start_offset": 11,
83+
"end_offset": 15,
84+
"type": "word",
85+
"position": 2
86+
},
87+
{
88+
"token": "watchdogs",
89+
"start_offset": 16,
90+
"end_offset": 20,
91+
"type": "word",
92+
"position": 3
93+
}
94+
]
95+
}
96+
----
97+
////
98+
99+
[[analysis-pattern-replace-tokenfilter-configure-parms]]
100+
==== Configurable parameters
101+
102+
`all`::
103+
(Optional, boolean)
104+
If `true`, all substrings matching the `pattern` parameter's regular expression
105+
are replaced. If `false`, the filter replaces only the first matching substring
106+
in each token. Defaults to `true`.
107+
108+
`pattern`::
109+
(Required, string)
110+
Regular expression, written in
111+
http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html[Java's
112+
regular expression syntax]. The filter replaces token substrings matching this
113+
pattern with the substring in the `replacement` parameter.
114+
115+
`replacement`::
116+
(Optional, string)
117+
Replacement substring. Defaults to an empty substring (`""`).
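
The difference between `all: true` and `all: false` maps naturally onto `Matcher.replaceAll` and `Matcher.replaceFirst` in `java.util.regex`. The sketch below illustrates that behavior on a single token. It assumes the filter behaves like these two calls, which matches the parameter descriptions above but is not taken from the Lucene source; the class `AllParameterSketch` and its `replace` method are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AllParameterSketch {

    // Hypothetical helper mirroring the three documented parameters:
    // `pattern` (regex), `replacement`, and `all`.
    static String replace(String token, String regex, String replacement, boolean all) {
        Matcher matcher = Pattern.compile(regex).matcher(token);
        // all = true  -> replace every matching substring
        // all = false -> replace only the first matching substring
        return all ? matcher.replaceAll(replacement) : matcher.replaceFirst(replacement);
    }

    public static void main(String[] args) {
        // An empty replacement corresponds to the documented default ("").
        System.out.println(replace("a-b-c", "-", "", true));   // abc
        System.out.println(replace("a-b-c", "-", "", false));  // ab-c
    }
}
```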
118+
119+
[[analysis-pattern-replace-tokenfilter-customize]]
120+
==== Customize and add to an analyzer
17121

18-
The pattern replace token filter uses
19-
http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html[Java Regular Expressions].
122+
To customize the `pattern_replace` filter, duplicate it to create the basis
123+
for a new custom token filter. You can modify the filter using its configurable
124+
parameters.
20125

21-
A badly written regular expression could run very slowly or even throw a
22-
StackOverflowError and cause the node it is running on to exit suddenly.
126+
The following <<indices-create-index,create index API>> request
127+
configures a new <<analysis-custom-analyzer,custom analyzer>> using a custom
128+
`pattern_replace` filter, `my_pattern_replace_filter`.
23129

24-
Read more about http://www.regular-expressions.info/catastrophic.html[pathological regular expressions and how to avoid them].
130+
The `my_pattern_replace_filter` filter uses the regular expression `[£|€]` to
131+
match and remove the currency symbols `£` and `€`. The filter's `all`
132+
parameter is `false`, meaning only the first matching symbol in each token is
133+
removed.
25134

26-
========================================
135+
[source,console]
136+
----
137+
PUT /my_index
138+
{
139+
"settings": {
140+
"analysis": {
141+
"analyzer": {
142+
"my_analyzer": {
143+
"tokenizer": "keyword",
144+
"filter": [
145+
"my_pattern_replace_filter"
146+
]
147+
}
148+
},
149+
"filter": {
150+
"my_pattern_replace_filter": {
151+
"type": "pattern_replace",
152+
"pattern": "[£|€]",
153+
"replacement": "",
154+
"all": false
155+
}
156+
}
157+
}
158+
}
159+
}
160+
----
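
To see what the pattern in `my_pattern_replace_filter` matches, the same regular expression can be tried directly with `java.util.regex`. This is a sketch under assumptions: the class `CurrencyFilterSketch` and the `stripFirst` helper are hypothetical, and `replaceFirst` stands in for `all: false`. The currency symbols are written as Unicode escapes so the source compiles regardless of file encoding.

```java
import java.util.regex.Pattern;

public class CurrencyFilterSketch {

    // Hypothetical helper: remove only the first match, as with "all": false
    // and an empty replacement.
    static String stripFirst(String token, String regex) {
        return Pattern.compile(regex).matcher(token).replaceFirst("");
    }

    public static void main(String[] args) {
        String pound = "\u00A3";  // pound sign
        String euro = "\u20AC";   // euro sign
        String committed = "[" + pound + "|" + euro + "]";  // pattern from the request
        String symbolsOnly = "[" + pound + euro + "]";      // same class without the pipe

        // The keyword tokenizer emits the whole input as one token, so only
        // the first symbol in the entire string is removed.
        System.out.println(stripFirst(pound + euro + "100", committed).equals(euro + "100")); // true

        // Inside a character class, the pipe is a literal character, so the
        // committed pattern also matches a pipe.
        System.out.println(stripFirst("a|b", committed));    // ab
        System.out.println(stripFirst("a|b", symbolsOnly));  // a|b
    }
}
```

In Java's regular expression syntax, `|` has no alternation meaning inside square brackets, so `[£|€]` matches `£`, `€`, or a literal `|`. For input that never contains `|`, the two patterns in the sketch behave the same.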
