[DOCS] Reformat pattern_replace token filter #57699

Merged
merged 5 commits into from
Jun 11, 2020
@@ -4,23 +4,157 @@
<titleabbrev>Pattern replace</titleabbrev>
++++

The `pattern_replace` token filter allows to easily handle string
replacements based on a regular expression. The regular expression is
defined using the `pattern` parameter, and the replacement string can be
provided using the `replacement` parameter (supporting referencing the
original text, as explained
http://docs.oracle.com/javase/6/docs/api/java/util/regex/Matcher.html#appendReplacement(java.lang.StringBuffer,%20java.lang.String)[here]).
Uses a regular expression to match and replace token substrings.

The `pattern_replace` filter uses
http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html[Java's
regular expression syntax]. By default, the filter replaces matching
substrings with an empty substring (`""`).

Regular expressions cannot be anchored to the
beginning or end of a token. Replacement substrings can use Java's
https://docs.oracle.com/javase/8/docs/api/java/util/regex/Matcher.html#appendReplacement-java.lang.StringBuffer-java.lang.String-[`$g` syntax] to reference capture groups
from the original token text.

[WARNING]
.Beware of Pathological Regular Expressions
========================================
====
A poorly-written regular expression may run slowly or return a
StackOverflowError, causing the node running the expression to exit suddenly.

Read more about
http://www.regular-expressions.info/catastrophic.html[pathological regular
expressions and how to avoid them].
====
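
As an illustration of the warning above, nested quantifiers such as `(a+)+b` are a classic shape that can backtrack heavily on long runs of `a` with no trailing `b`. The sketch below compares that shape with the flatter `a+b`, which accepts the same strings. It uses plain `java.util.regex` rather than the filter itself, the class and method names are hypothetical, and it deliberately runs only on short inputs.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: not part of Elasticsearch or Lucene.
class NestedQuantifierSketch {
    // Nested quantifier: a shape that can backtrack heavily on long non-matching input.
    static final Pattern NESTED = Pattern.compile("(a+)+b");
    // Flat rewrite that matches the same substrings.
    static final Pattern LINEAR = Pattern.compile("a+b");

    // Removes every match of the pattern from the token.
    static String strip(Pattern pattern, String token) {
        return pattern.matcher(token).replaceAll("");
    }

    public static void main(String[] args) {
        // Short inputs only, so both patterns finish immediately.
        System.out.println(strip(NESTED, "aaab-x")); // prints -x
        System.out.println(strip(LINEAR, "aaab-x")); // prints -x
        System.out.println(strip(NESTED, "aaaa"));   // prints aaaa (no match)
        System.out.println(strip(LINEAR, "aaaa"));   // prints aaaa (no match)
    }
}
```

Because both patterns produce the same replacements, the flat form is usually the safer choice for a filter that runs on every token.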

This filter uses Lucene's
{lucene-analysis-docs}/pattern/PatternReplaceFilter.html[PatternReplaceFilter].

[[analysis-pattern-replace-tokenfilter-analyze-ex]]
==== Example

The following <<indices-analyze,analyze API>> request uses the `pattern_replace`
filter to prepend `watch` to the substring `dog` in `foxes jump lazy dogs`.

[source,console]
----
GET /_analyze
{
"tokenizer": "whitespace",
"filter": [
{
"type": "pattern_replace",
"pattern": "(dog)",
"replacement": "watch$1"
}
],
"text": "foxes jump lazy dogs"
}
----

The filter produces the following tokens.

[source,text]
----
[ foxes, jump, lazy, watchdogs ]
----
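
The same replacement can be approximated outside Elasticsearch with plain `java.util.regex`. This is a minimal sketch, not the filter's implementation: the `analyze` helper is hypothetical, and splitting on whitespace only stands in for the `whitespace` tokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: not part of Elasticsearch or Lucene.
class AnalyzeExampleSketch {
    // Splits on whitespace (a stand-in for the `whitespace` tokenizer) and
    // replaces every match of the pattern in each token.
    static List<String> analyze(String text, String pattern, String replacement) {
        Pattern compiled = Pattern.compile(pattern);
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            // "$1" in the replacement refers to the first capture group.
            tokens.add(compiled.matcher(token).replaceAll(replacement));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("foxes jump lazy dogs", "(dog)", "watch$1"));
        // prints [foxes, jump, lazy, watchdogs]
    }
}
```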

////
[source,console-result]
----
{
"tokens": [
{
"token": "foxes",
"start_offset": 0,
"end_offset": 5,
"type": "word",
"position": 0
},
{
"token": "jump",
"start_offset": 6,
"end_offset": 10,
"type": "word",
"position": 1
},
{
"token": "lazy",
"start_offset": 11,
"end_offset": 15,
"type": "word",
"position": 2
},
{
"token": "watchdogs",
"start_offset": 16,
"end_offset": 20,
"type": "word",
"position": 3
}
]
}
----
////

[[analysis-pattern-replace-tokenfilter-configure-parms]]
==== Configurable parameters

`all`::
(Optional, boolean)
If `true`, all substrings matching the `pattern` parameter's regular expression
are replaced. If `false`, the filter replaces only the first matching substring
in each token. Defaults to `true`.

`pattern`::
(Required, string)
Regular expression, written in
http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html[Java's
regular expression syntax]. The filter replaces token substrings matching this
pattern with the substring in the `replacement` parameter.

`replacement`::
(Optional, string)
Replacement substring. Defaults to an empty substring (`""`).
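
The parameters above can be sketched with `java.util.regex`, assuming `all: true` behaves like `Matcher.replaceAll` and `all: false` like `Matcher.replaceFirst`. The `replace` helper and class name are hypothetical and only illustrate how the three parameters interact.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: not part of Elasticsearch or Lucene.
class PatternReplaceParamsSketch {
    // Assumption: all=true maps to replaceAll, all=false to replaceFirst.
    static String replace(String token, String pattern, String replacement, boolean all) {
        Matcher matcher = Pattern.compile(pattern).matcher(token);
        return all ? matcher.replaceAll(replacement) : matcher.replaceFirst(replacement);
    }

    public static void main(String[] args) {
        // Empty replacement (the documented default) with all=true.
        System.out.println(replace("a-b-c", "-", "", true));  // prints abc
        // all=false touches only the first match in the token.
        System.out.println(replace("a-b-c", "-", "", false)); // prints ab-c
        // A non-empty replacement substring.
        System.out.println(replace("a-b-c", "-", "_", true)); // prints a_b_c
    }
}
```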

[[analysis-pattern-replace-tokenfilter-customize]]
==== Customize and add to an analyzer

The pattern replace token filter uses
http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html[Java Regular Expressions].
To customize the `pattern_replace` filter, duplicate it to create the basis
for a new custom token filter. You can modify the filter using its configurable
parameters.

A badly written regular expression could run very slowly or even throw a
StackOverflowError and cause the node it is running on to exit suddenly.
The following <<indices-create-index,create index API>> request
configures a new <<analysis-custom-analyzer,custom analyzer>> using a custom
`pattern_replace` filter, `my_pattern_replace_filter`.

Read more about http://www.regular-expressions.info/catastrophic.html[pathological regular expressions and how to avoid them].
The `my_pattern_replace_filter` filter uses the regular expression `[£|€]` to
match and remove the currency symbols `£` and `€`. The filter's `all`
parameter is `false`, meaning only the first matching symbol in each token is
removed.
Contributor

I did not know \\p{Sc} was a thing so thanks for educating me :)
While these more esoteric functions work in token filters they are not supported in RegExp query syntax.

I wonder if it's worth doing one of two things here:
A) Add a footnote to the effect of "* note this particular expression wouldn't work with RegExp queries which use a different regex parser" or
B) Use a different example that works in both "pattern_replace" token filters and RegExp queries.

Contributor Author

Thanks @markharwood. With 992ab1c, I updated the example to use a pattern that is compatible with the RegExp query.


========================================
[source,console]
----
PUT /my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "keyword",
"filter": [
"my_pattern_replace_filter"
]
}
},
"filter": {
"my_pattern_replace_filter": {
"type": "pattern_replace",
"pattern": "[£|€]",
"replacement": "",
"all": false
}
}
}
}
}
----
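
To see what this configuration would do to a single token, the sketch below applies the same pattern with plain `java.util.regex`, assuming `all: false` corresponds to `Matcher.replaceFirst`. The helper and class names are hypothetical, and the sample token is invented for illustration. One detail worth noting: inside a Java character class, `|` is a literal character, so `[£|€]` also matches a pipe, while `[£€]` matches only the two currency symbols. The code uses Unicode escapes for £ (`\u00A3`) and € (`\u20AC`) so it compiles regardless of source encoding.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: not part of Elasticsearch or Lucene.
class CurrencyFilterSketch {
    static final String POUND = "\u00A3"; // pound sign
    static final String EURO = "\u20AC";  // euro sign

    // Assumption: all=false maps to replaceFirst, all=true to replaceAll.
    static String replace(String token, String pattern, String replacement, boolean all) {
        Matcher matcher = Pattern.compile(pattern).matcher(token);
        return all ? matcher.replaceAll(replacement) : matcher.replaceFirst(replacement);
    }

    public static void main(String[] args) {
        // The `keyword` tokenizer emits the whole input as one token.
        String token = POUND + "1|000" + EURO;
        String fromRequest = "[" + POUND + "|" + EURO + "]"; // pattern from the request
        String withoutPipe = "[" + POUND + EURO + "]";       // same class without the pipe

        // all=false: only the leading pound sign is removed.
        System.out.println(replace(token, fromRequest, "", false)); // prints 1|000 plus the euro sign
        // all=true: both symbols and the literal pipe are removed.
        System.out.println(replace(token, fromRequest, "", true));  // prints 1000
        // Without the pipe in the class, the pipe in the token survives.
        System.out.println(replace(token, withoutPipe, "", true));  // prints 1|000
    }
}
```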