[[analysis-delimited-payload-tokenfilter]]
=== Delimited payload token filter
++++
<titleabbrev>Delimited payload</titleabbrev>
++++

[WARNING]
====
The older name `delimited_payload_filter` is deprecated and should not be used
with new indices. Use `delimited_payload` instead.
====

Separates a token stream into tokens and payloads based on a specified
delimiter.

For example, you can use the `delimited_payload` filter with a `|` delimiter to
split `the|1 quick|2 fox|3` into the tokens `the`, `quick`, and `fox`
with respective payloads of `1`, `2`, and `3`.
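The splitting step can be sketched in a few lines of Python. This is a rough illustration of the behavior, not Lucene's implementation, and the `split_payloads` helper is hypothetical:

[source,python]
--------------------------------------------------
# Rough sketch of the filter's splitting behavior; not Lucene's actual code.
# The whitespace tokenizer has already produced one token per word here.
def split_payloads(text, delimiter="|"):
    tokens, payloads = [], []
    for word in text.split():
        token, _, payload = word.partition(delimiter)
        tokens.append(token)
        payloads.append(payload or None)  # None when a token has no payload
    return tokens, payloads

print(split_payloads("the|1 quick|2 fox|3"))
# (['the', 'quick', 'fox'], ['1', '2', '3'])
--------------------------------------------------

Tokens without a delimiter pass through unchanged, with no payload attached.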

This filter uses Lucene's
https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/payloads/DelimitedPayloadTokenFilter.html[DelimitedPayloadTokenFilter].

[NOTE]
.Payloads
====
A payload is user-defined binary data associated with a token position and
stored as base64-encoded bytes.

{es} does not store token payloads by default. To store payloads, you must:

* Set the <<term-vector,`term_vector`>> mapping parameter to
  `with_positions_payloads` or `with_positions_offsets_payloads` for any field
  storing payloads.
* Use an index analyzer that includes the `delimited_payload` filter.

You can view stored payloads using the <<docs-termvectors,term vectors API>>.
====

[[analysis-delimited-payload-tokenfilter-analyze-ex]]
==== Example

The following <<indices-analyze,analyze API>> request uses the
`delimited_payload` filter with the default `|` delimiter to split
`the|0 brown|10 fox|5 is|0 quick|10` into tokens and payloads.

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["delimited_payload"],
  "text": "the|0 brown|10 fox|5 is|0 quick|10"
}
--------------------------------------------------

The filter produces the following tokens:

[source,text]
--------------------------------------------------
[ the, brown, fox, is, quick ]
--------------------------------------------------

Note that the analyze API does not return stored payloads. For an example that
includes returned payloads, see
<<analysis-delimited-payload-tokenfilter-return-stored-payloads>>.

/////////////////////
[source,console-result]
--------------------------------------------------
{
  "tokens": [
    {
      "token": "the",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "brown",
      "start_offset": 6,
      "end_offset": 14,
      "type": "word",
      "position": 1
    },
    {
      "token": "fox",
      "start_offset": 15,
      "end_offset": 20,
      "type": "word",
      "position": 2
    },
    {
      "token": "is",
      "start_offset": 21,
      "end_offset": 25,
      "type": "word",
      "position": 3
    },
    {
      "token": "quick",
      "start_offset": 26,
      "end_offset": 34,
      "type": "word",
      "position": 4
    }
  ]
}
--------------------------------------------------
/////////////////////

[[analysis-delimited-payload-tokenfilter-analyzer-ex]]
==== Add to an analyzer

The following <<indices-create-index,create index API>> request uses the
`delimited_payload` filter to configure a new <<analysis-custom-analyzer,custom
analyzer>>.

[source,console]
--------------------------------------------------
PUT delimited_payload
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_delimited_payload": {
          "tokenizer": "whitespace",
          "filter": [ "delimited_payload" ]
        }
      }
    }
  }
}
--------------------------------------------------

[[analysis-delimited-payload-tokenfilter-configure-parms]]
==== Configurable parameters

`delimiter`::
(Optional, string)
Character used to separate tokens from payloads. Defaults to `|`.

`encoding`::
+
--
(Optional, string)
Data type for the stored payload. Valid values are:

`float`:::
(Default) Float

`identity`:::
Characters

`int`:::
Integer
--
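The three encodings can be illustrated in Python. This sketch assumes Lucene's `PayloadHelper` byte layout, which writes numeric payloads as big-endian bytes; the `encode_payload` helper below is illustrative only, not an {es} or Lucene API:

[source,python]
--------------------------------------------------
import base64
import struct

# Sketch of how each `encoding` value would serialize the payload "10".
# Numeric encodings assume Lucene's big-endian byte layout; this helper
# is illustrative, not part of Elasticsearch.
def encode_payload(payload, encoding="float"):
    if encoding == "float":
        return struct.pack(">f", float(payload))     # 4-byte IEEE 754 float
    if encoding == "int":
        return struct.pack(">i", int(payload))       # 4-byte signed integer
    if encoding == "identity":
        return payload.encode("utf-8")               # raw characters
    raise ValueError(f"unknown encoding: {encoding}")

print(base64.b64encode(encode_payload("10", "float")).decode())     # QSAAAA==
print(base64.b64encode(encode_payload("10", "int")).decode())       # AAAACg==
print(base64.b64encode(encode_payload("10", "identity")).decode())  # MTA=
--------------------------------------------------

The first output, `QSAAAA==`, is the same base64 string the term vectors API returns for `quick|10` later on this page, since `float` is the default encoding.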

[[analysis-delimited-payload-tokenfilter-customize]]
==== Customize and add to an analyzer

To customize the `delimited_payload` filter, duplicate it to create the basis
for a new custom token filter. You can modify the filter using its configurable
parameters.

For example, the following <<indices-create-index,create index API>> request
uses a custom `delimited_payload` filter to configure a new
<<analysis-custom-analyzer,custom analyzer>>. The custom `delimited_payload`
filter uses the `+` delimiter to separate tokens from payloads. Payloads are
encoded as integers.

[source,console]
--------------------------------------------------
PUT delimited_payload_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_plus_delimited": {
          "tokenizer": "whitespace",
          "filter": [ "plus_delimited" ]
        }
      },
      "filter": {
        "plus_delimited": {
          "type": "delimited_payload",
          "delimiter": "+",
          "encoding": "int"
        }
      }
    }
  }
}
--------------------------------------------------
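Under this configuration, text such as `the+1 quick+2` is split on `+` and each payload stored as a 4-byte integer. A minimal sketch of the combined behavior, assuming Lucene's big-endian integer layout (the `plus_delimited` function below is illustrative, not generated by {es}):

[source,python]
--------------------------------------------------
import struct

# Illustrative sketch: split on "+" and encode payloads as big-endian ints,
# mirroring the custom filter's `delimiter` and `encoding` settings above.
def plus_delimited(text):
    out = []
    for word in text.split():
        token, _, payload = word.partition("+")
        out.append((token, struct.pack(">i", int(payload)) if payload else None))
    return out

print(plus_delimited("the+1 quick+2"))
# [('the', b'\x00\x00\x00\x01'), ('quick', b'\x00\x00\x00\x02')]
--------------------------------------------------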

[[analysis-delimited-payload-tokenfilter-return-stored-payloads]]
==== Return stored payloads

Use the <<indices-create-index,create index API>> to create an index that:

* Includes a field that stores term vectors with payloads.
* Uses a <<analysis-custom-analyzer,custom index analyzer>> with the
  `delimited_payload` filter.

[source,console]
--------------------------------------------------
PUT text_payloads
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "term_vector": "with_positions_payloads",
        "analyzer": "payload_delimiter"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "payload_delimiter": {
          "tokenizer": "whitespace",
          "filter": [ "delimited_payload" ]
        }
      }
    }
  }
}
--------------------------------------------------

Add a document containing payloads to the index.

[source,console]
--------------------------------------------------
POST text_payloads/_doc/1
{
  "text": "the|0 brown|3 fox|4 is|0 quick|10"
}
--------------------------------------------------
// TEST[continued]

Use the <<docs-termvectors,term vectors API>> to return the document's tokens
and base64-encoded payloads.

[source,console]
--------------------------------------------------
GET text_payloads/_termvectors/1
{
  "fields": [ "text" ],
  "payloads": true
}
--------------------------------------------------
// TEST[continued]

The API returns the following response:

[source,console-result]
--------------------------------------------------
{
  "_index": "text_payloads",
  "_id": "1",
  "_version": 1,
  "found": true,
  "took": 8,
  "term_vectors": {
    "text": {
      "field_statistics": {
        "sum_doc_freq": 5,
        "doc_count": 1,
        "sum_ttf": 5
      },
      "terms": {
        "brown": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 1,
              "payload": "QEAAAA=="
            }
          ]
        },
        "fox": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 2,
              "payload": "QIAAAA=="
            }
          ]
        },
        "is": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 3,
              "payload": "AAAAAA=="
            }
          ]
        },
        "quick": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 4,
              "payload": "QSAAAA=="
            }
          ]
        },
        "the": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "payload": "AAAAAA=="
            }
          ]
        }
      }
    }
  }
}
--------------------------------------------------
// TESTRESPONSE[s/"took": 8/"took": "$body.took"/]
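Because this index uses the default `float` encoding, each payload in the response is a base64-encoded big-endian 32-bit float. You can decode one in a few lines of Python; here, `brown`'s payload `QEAAAA==`, which came from `brown|3` in the indexed text:

[source,python]
--------------------------------------------------
import base64
import struct

# Decode a base64 payload from the term vectors response.
# The default `float` encoding stores a big-endian 32-bit float.
raw = base64.b64decode("QEAAAA==")    # payload of "brown", from "brown|3"
value = struct.unpack(">f", raw)[0]
print(value)  # 3.0
--------------------------------------------------

Decoding `fox`'s payload `QIAAAA==` the same way yields `4.0`, matching `fox|4` in the document.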