|
4 | 4 | <titleabbrev>Pattern replace</titleabbrev>
|
5 | 5 | ++++
|
6 | 6 |
|
7 |
| -The `pattern_replace` token filter allows to easily handle string |
8 |
| -replacements based on a regular expression. The regular expression is |
9 |
| -defined using the `pattern` parameter, and the replacement string can be |
10 |
| -provided using the `replacement` parameter (supporting referencing the |
11 |
| -original text, as explained |
12 |
| -http://docs.oracle.com/javase/6/docs/api/java/util/regex/Matcher.html#appendReplacement(java.lang.StringBuffer,%20java.lang.String)[here]). |
| 7 | +Uses a regular expression to match and replace token substrings. |
| 8 | + |
| 9 | +The `pattern_replace` filter uses |
| 10 | +http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html[Java's |
| 11 | +regular expression syntax]. By default, the filter replaces matching |
| 12 | +substrings with an empty substring (`""`). |
| 13 | + |
| 14 | +Regular expressions cannot be anchored to the |
| 15 | +beginning or end of a token. Replacement substrings can use Java's |
| 16 | +https://docs.oracle.com/javase/8/docs/api/java/util/regex/Matcher.html#appendReplacement-java.lang.StringBuffer-java.lang.String-[`$g` syntax] to reference capture groups |
| 17 | +from the original token text. |
13 | 18 |
|
14 | 19 | [WARNING]
|
15 |
| -.Beware of Pathological Regular Expressions |
16 |
| -======================================== |
| 20 | +==== |
| 21 | +A poorly written regular expression may run slowly or throw a |
| 22 | +StackOverflowError, causing the node running the expression to exit suddenly. |
| 23 | +
|
| 24 | +Read more about |
| 25 | +http://www.regular-expressions.info/catastrophic.html[pathological regular |
| 26 | +expressions and how to avoid them]. |
| 27 | +==== |
| 28 | + |
| 29 | +This filter uses Lucene's |
| 30 | +{lucene-analysis-docs}/pattern/PatternReplaceFilter.html[PatternReplaceFilter]. |
| 31 | + |
| 32 | +[[analysis-pattern-replace-tokenfilter-analyze-ex]] |
| 33 | +==== Example |
| 34 | + |
| 35 | +The following <<indices-analyze,analyze API>> request uses the `pattern_replace` |
| 36 | +filter to prepend `watch` to the substring `dog` in `foxes jump lazy dogs`. |
| 37 | + |
| 38 | +[source,console] |
| 39 | +---- |
| 40 | +GET /_analyze |
| 41 | +{ |
| 42 | + "tokenizer": "whitespace", |
| 43 | + "filter": [ |
| 44 | + { |
| 45 | + "type": "pattern_replace", |
| 46 | + "pattern": "(dog)", |
| 47 | + "replacement": "watch$1" |
| 48 | + } |
| 49 | + ], |
| 50 | + "text": "foxes jump lazy dogs" |
| 51 | +} |
| 52 | +---- |
| 53 | + |
| 54 | +The filter produces the following tokens. |
| 55 | + |
| 56 | +[source,text] |
| 57 | +---- |
| 58 | +[ foxes, jump, lazy, watchdogs ] |
| 59 | +---- |
| 60 | + |
| 61 | +//// |
| 62 | +[source,console-result] |
| 63 | +---- |
| 64 | +{ |
| 65 | + "tokens": [ |
| 66 | + { |
| 67 | + "token": "foxes", |
| 68 | + "start_offset": 0, |
| 69 | + "end_offset": 5, |
| 70 | + "type": "word", |
| 71 | + "position": 0 |
| 72 | + }, |
| 73 | + { |
| 74 | + "token": "jump", |
| 75 | + "start_offset": 6, |
| 76 | + "end_offset": 10, |
| 77 | + "type": "word", |
| 78 | + "position": 1 |
| 79 | + }, |
| 80 | + { |
| 81 | + "token": "lazy", |
| 82 | + "start_offset": 11, |
| 83 | + "end_offset": 15, |
| 84 | + "type": "word", |
| 85 | + "position": 2 |
| 86 | + }, |
| 87 | + { |
| 88 | + "token": "watchdogs", |
| 89 | + "start_offset": 16, |
| 90 | + "end_offset": 20, |
| 91 | + "type": "word", |
| 92 | + "position": 3 |
| 93 | + } |
| 94 | + ] |
| 95 | +} |
| 96 | +---- |
| 97 | +//// |
| 98 | + |
| 99 | +[[analysis-pattern-replace-tokenfilter-configure-parms]] |
| 100 | +==== Configurable parameters |
| 101 | + |
| 102 | +`all`:: |
| 103 | +(Optional, boolean) |
| 104 | +If `true`, all substrings matching the `pattern` parameter's regular expression |
| 105 | +are replaced. If `false`, the filter replaces only the first matching substring |
| 106 | +in each token. Defaults to `true`. |
| 107 | + |
| 108 | +`pattern`:: |
| 109 | +(Required, string) |
| 110 | +Regular expression, written in |
| 111 | +http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html[Java's |
| 112 | +regular expression syntax]. The filter replaces token substrings matching this |
| 113 | +pattern with the substring in the `replacement` parameter. |
| 114 | + |
| 115 | +`replacement`:: |
| 116 | +(Optional, string) |
| 117 | +Replacement substring. Defaults to an empty substring (`""`). |
| 118 | + |
| 119 | +[[analysis-pattern-replace-tokenfilter-customize]] |
| 120 | +==== Customize and add to an analyzer |
17 | 121 |
|
18 |
| -The pattern replace token filter uses |
19 |
| -http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html[Java Regular Expressions]. |
| 122 | +To customize the `pattern_replace` filter, duplicate it to create the basis |
| 123 | +for a new custom token filter. You can modify the filter using its configurable |
| 124 | +parameters. |
20 | 125 |
|
21 |
| -A badly written regular expression could run very slowly or even throw a |
22 |
| -StackOverflowError and cause the node it is running on to exit suddenly. |
| 126 | +The following <<indices-create-index,create index API>> request |
| 127 | +configures a new <<analysis-custom-analyzer,custom analyzer>> using a custom |
| 128 | +`pattern_replace` filter, `my_pattern_replace_filter`. |
23 | 129 |
|
24 |
| -Read more about http://www.regular-expressions.info/catastrophic.html[pathological regular expressions and how to avoid them]. |
| 130 | +The `my_pattern_replace_filter` filter uses the regular expression `[£|€]` to |
| 131 | +match and remove the currency symbols `£` and `€`. Because `|` is a literal |
| 132 | +inside a character class, the pattern also matches `|`. The filter's `all` parameter |
| 133 | +is `false`, meaning only the first matching character in each token is removed. |
25 | 134 |
|
26 |
| -======================================== |
| 135 | +[source,console] |
| 136 | +---- |
| 137 | +PUT /my_index |
| 138 | +{ |
| 139 | + "settings": { |
| 140 | + "analysis": { |
| 141 | + "analyzer": { |
| 142 | + "my_analyzer": { |
| 143 | + "tokenizer": "keyword", |
| 144 | + "filter": [ |
| 145 | + "my_pattern_replace_filter" |
| 146 | + ] |
| 147 | + } |
| 148 | + }, |
| 149 | + "filter": { |
| 150 | + "my_pattern_replace_filter": { |
| 151 | + "type": "pattern_replace", |
| 152 | + "pattern": "[£|€]", |
| 153 | + "replacement": "", |
| 154 | + "all": false |
| 155 | + } |
| 156 | + } |
| 157 | + } |
| 158 | + } |
| 159 | +} |
| 160 | +---- |