Commit 1471f34 (parent fc33ee4)

[DOCS] Reformat delimited payload token filter docs (#49380)

* Adds a title abbreviation
* Relocates the older name deprecation warning
* Updates the description and adds a Lucene link
* Adds a note to explain payloads and how to store them
* Adds analyze and custom analyzer snippets
* Adds a 'Return stored payloads' example

1 file changed: +314, -12 lines changed
[[analysis-delimited-payload-tokenfilter]]
=== Delimited payload token filter
++++
<titleabbrev>Delimited payload</titleabbrev>
++++

[WARNING]
====
The older name `delimited_payload_filter` is deprecated and should not be used
with new indices. Use `delimited_payload` instead.
====

Separates a token stream into tokens and payloads based on a specified
delimiter.

For example, you can use the `delimited_payload` filter with a `|` delimiter to
split `the|1 quick|2 fox|3` into the tokens `the`, `quick`, and `fox`
with respective payloads of `1`, `2`, and `3`.

This filter uses Lucene's
https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/payloads/DelimitedPayloadTokenFilter.html[DelimitedPayloadTokenFilter].
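The splitting behavior can be sketched in a few lines of Python. This is a hedged illustration only, not the actual implementation, which is Lucene's `DelimitedPayloadTokenFilter`; the helper name `split_payloads` is invented for this sketch.

```python
# Minimal sketch (not the Lucene implementation): split whitespace-separated
# tokens into (token, payload) pairs on the last occurrence of a delimiter.
def split_payloads(text, delimiter="|"):
    pairs = []
    for token in text.split():
        if delimiter in token:
            term, payload = token.rsplit(delimiter, 1)
            pairs.append((term, payload))
        else:
            # Tokens without a delimiter carry no payload.
            pairs.append((token, None))
    return pairs

print(split_payloads("the|1 quick|2 fox|3"))
# [('the', '1'), ('quick', '2'), ('fox', '3')]
```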
[NOTE]
.Payloads
====
A payload is user-defined binary data associated with a token position and
stored as base64-encoded bytes.

{es} does not store token payloads by default. To store payloads, you must:

* Set the <<term-vector,`term_vector`>> mapping parameter to
`with_positions_payloads` or `with_positions_offsets_payloads` for any field
storing payloads.
* Use an index analyzer that includes the `delimited_payload` filter.

You can view stored payloads using the <<docs-termvectors,term vectors API>>.
====
[[analysis-delimited-payload-tokenfilter-analyze-ex]]
==== Example

The following <<indices-analyze,analyze API>> request uses the
`delimited_payload` filter with the default `|` delimiter to split
`the|0 brown|10 fox|5 is|0 quick|10` into tokens and payloads.

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["delimited_payload"],
  "text": "the|0 brown|10 fox|5 is|0 quick|10"
}
--------------------------------------------------

The filter produces the following tokens:

[source,text]
--------------------------------------------------
[ the, brown, fox, is, quick ]
--------------------------------------------------

Note that the analyze API does not return stored payloads. For an example that
includes returned payloads, see
<<analysis-delimited-payload-tokenfilter-return-stored-payloads>>.
/////////////////////
[source,console-result]
--------------------------------------------------
{
  "tokens": [
    {
      "token": "the",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "brown",
      "start_offset": 6,
      "end_offset": 14,
      "type": "word",
      "position": 1
    },
    {
      "token": "fox",
      "start_offset": 15,
      "end_offset": 20,
      "type": "word",
      "position": 2
    },
    {
      "token": "is",
      "start_offset": 21,
      "end_offset": 25,
      "type": "word",
      "position": 3
    },
    {
      "token": "quick",
      "start_offset": 26,
      "end_offset": 34,
      "type": "word",
      "position": 4
    }
  ]
}
--------------------------------------------------
/////////////////////
[[analysis-delimited-payload-tokenfilter-analyzer-ex]]
==== Add to an analyzer

The following <<indices-create-index,create index API>> request uses the
`delimited_payload` filter to configure a new <<analysis-custom-analyzer,custom
analyzer>>.

[source,console]
--------------------------------------------------
PUT delimited_payload
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_delimited_payload": {
          "tokenizer": "whitespace",
          "filter": [ "delimited_payload" ]
        }
      }
    }
  }
}
--------------------------------------------------
[[analysis-delimited-payload-tokenfilter-configure-parms]]
==== Configurable parameters

`delimiter`::
(Optional, string)
Character used to separate tokens from payloads. Defaults to `|`.

`encoding`::
+
--
(Optional, string)
Datatype for the stored payload. Valid values are:

`float`:::
(Default) Float

`identity`:::
Characters

`int`:::
Integer
--
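As a rough illustration of what these encodings produce, the following Python sketch packs a payload value into the bytes that end up base64-encoded in term vectors. The four-byte big-endian packing for `float` and `int` is an assumption inferred from the base64 payload values shown in the term vectors example on this page; the function name `encode_payload` is invented for this sketch.

```python
import base64
import struct

# Hedged sketch: map a payload string to its stored bytes for each
# `encoding` value. float/int assumed packed as 4 big-endian bytes;
# identity keeps the characters as-is.
def encode_payload(value, encoding="float"):
    if encoding == "float":
        raw = struct.pack(">f", float(value))
    elif encoding == "int":
        raw = struct.pack(">i", int(value))
    elif encoding == "identity":
        raw = value.encode("utf-8")
    else:
        raise ValueError(f"unknown encoding: {encoding}")
    return base64.b64encode(raw).decode("ascii")

print(encode_payload("3", "float"))   # QEAAAA==
print(encode_payload("10", "float"))  # QSAAAA==
```

These outputs match the payloads returned for `brown|3` and `quick|10` in the term vectors response shown later on this page.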
[[analysis-delimited-payload-tokenfilter-customize]]
==== Customize and add to an analyzer

To customize the `delimited_payload` filter, duplicate it to create the basis
for a new custom token filter. You can modify the filter using its configurable
parameters.

For example, the following <<indices-create-index,create index API>> request
uses a custom `delimited_payload` filter to configure a new
<<analysis-custom-analyzer,custom analyzer>>. The custom `delimited_payload`
filter uses the `+` delimiter to separate tokens from payloads. Payloads are
encoded as integers.

[source,console]
--------------------------------------------------
PUT delimited_payload_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_plus_delimited": {
          "tokenizer": "whitespace",
          "filter": [ "plus_delimited" ]
        }
      },
      "filter": {
        "plus_delimited": {
          "type": "delimited_payload",
          "delimiter": "+",
          "encoding": "int"
        }
      }
    }
  }
}
--------------------------------------------------
[[analysis-delimited-payload-tokenfilter-return-stored-payloads]]
==== Return stored payloads

Use the <<indices-create-index,create index API>> to create an index that:

* Includes a field that stores term vectors with payloads.
* Uses a <<analysis-custom-analyzer,custom index analyzer>> with the
`delimited_payload` filter.

[source,console]
--------------------------------------------------
PUT text_payloads
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "term_vector": "with_positions_payloads",
        "analyzer": "payload_delimiter"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "payload_delimiter": {
          "tokenizer": "whitespace",
          "filter": [ "delimited_payload" ]
        }
      }
    }
  }
}
--------------------------------------------------

Add a document containing payloads to the index.

[source,console]
--------------------------------------------------
POST text_payloads/_doc/1
{
  "text": "the|0 brown|3 fox|4 is|0 quick|10"
}
--------------------------------------------------
// TEST[continued]

Use the <<docs-termvectors,term vectors API>> to return the document's tokens
and base64-encoded payloads.

[source,console]
--------------------------------------------------
GET text_payloads/_termvectors/1
{
  "fields": [ "text" ],
  "payloads": true
}
--------------------------------------------------
// TEST[continued]

The API returns the following response:

[source,console-result]
--------------------------------------------------
{
  "_index": "text_payloads",
  "_id": "1",
  "_version": 1,
  "found": true,
  "took": 8,
  "term_vectors": {
    "text": {
      "field_statistics": {
        "sum_doc_freq": 5,
        "doc_count": 1,
        "sum_ttf": 5
      },
      "terms": {
        "brown": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 1,
              "payload": "QEAAAA=="
            }
          ]
        },
        "fox": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 2,
              "payload": "QIAAAA=="
            }
          ]
        },
        "is": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 3,
              "payload": "AAAAAA=="
            }
          ]
        },
        "quick": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 4,
              "payload": "QSAAAA=="
            }
          ]
        },
        "the": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "payload": "AAAAAA=="
            }
          ]
        }
      }
    }
  }
}
--------------------------------------------------
// TESTRESPONSE[s/"took": 8/"took": "$body.took"/]
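To confirm what these base64 strings hold, the payloads in the response above can be decoded back to the indexed numbers with a short Python sketch, assuming the default `float` encoding stores four big-endian bytes (an inference from the values above, not documented API behavior):

```python
import base64
import struct

# Base64 payloads copied from the term vectors response above.
payloads = {
    "the": "AAAAAA==",
    "brown": "QEAAAA==",
    "fox": "QIAAAA==",
    "is": "AAAAAA==",
    "quick": "QSAAAA==",
}

# Decode each payload as a 4-byte big-endian float.
decoded = {
    term: struct.unpack(">f", base64.b64decode(b64))[0]
    for term, b64 in payloads.items()
}

print(decoded)
# {'the': 0.0, 'brown': 3.0, 'fox': 4.0, 'is': 0.0, 'quick': 10.0}
```

The decoded values match the indexed text `the|0 brown|3 fox|4 is|0 quick|10`.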
