Commit 534e734

szabosteve and lcawl committed
[DOCS] Adds painless transform examples (#53274)
Co-authored-by: Lisa Cawley <[email protected]>
1 parent 5b86fc4 commit 534e734

2 files changed

+331
-0
lines changed

docs/reference/transform/index.asciidoc

+2
@@ -16,6 +16,7 @@ your data.
1616
* <<transform-api-quickref>>
1717
* <<ecommerce-transforms>>
1818
* <<transform-examples>>
19+
* <<transform-painless-examples>>
1920
* <<transform-troubleshooting>>
2021
* <<transform-limitations>>
2122

@@ -26,5 +27,6 @@ include::checkpoints.asciidoc[]
2627
include::api-quickref.asciidoc[]
2728
include::ecommerce-tutorial.asciidoc[]
2829
include::examples.asciidoc[]
30+
include::painless-examples.asciidoc[]
2931
include::troubleshooting.asciidoc[]
3032
include::limitations.asciidoc[]
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,329 @@
1+
[role="xpack"]
2+
[testenv="basic"]
3+
[[transform-painless-examples]]
4+
=== Painless examples for {transforms}
5+
++++
6+
<titleabbrev>Painless examples for {transforms}</titleabbrev>
7+
++++
8+
9+
These examples demonstrate how to use Painless in {transforms}. You can learn
10+
more about the Painless scripting language in the
11+
{painless}/painless-guide.html[Painless guide].
12+
13+
* <<painless-top-hits>>
14+
* <<painless-time-features>>
15+
* <<painless-group-by>>
16+
* <<painless-bucket-script>>
17+
18+
19+
[discrete]
20+
[[painless-top-hits]]
21+
==== Getting top hits by using scripted metric
22+
23+
This snippet shows how to find the latest document, in other words the document
24+
with the most recent timestamp. From a technical perspective, it helps to achieve
25+
the function of a <<search-aggregations-metrics-top-hits-aggregation>> by using
26+
a scripted metric aggregation, which provides a metric output.
27+
28+
[source,js]
29+
--------------------------------------------------
30+
"latest_doc": {
31+
"scripted_metric": {
32+
"init_script": "state.timestamp_latest = 0L; state.last_doc = ''", <1>
33+
"map_script": """ <2>
34+
def current_date = doc['@timestamp'].getValue().toInstant().toEpochMilli();
35+
if (current_date > state.timestamp_latest)
36+
{state.timestamp_latest = current_date;
37+
state.last_doc = new HashMap(params['_source']);}
38+
""",
39+
"combine_script": "return state", <3>
40+
"reduce_script": """ <4>
41+
def last_doc = '';
42+
def timestamp_latest = 0L;
43+
for (s in states) {if (s.timestamp_latest > (timestamp_latest))
44+
{timestamp_latest = s.timestamp_latest; last_doc = s.last_doc;}}
45+
return last_doc
46+
"""
47+
}
48+
}
49+
--------------------------------------------------
50+
// NOTCONSOLE
51+
52+
<1> The `init_script` creates a long type `timestamp_latest` and a string type
53+
`last_doc` in the `state` object.
54+
<2> The `map_script` defines `current_date` based on the timestamp of the
55+
document, then compares `current_date` with `state.timestamp_latest` and, if the
56+
document is newer, updates `state`. By using `new HashMap(...)` we copy the
57+
source document. This is important whenever you want to pass the full source
58+
object from one phase to the next.
59+
<3> The `combine_script` returns `state` from each shard.
60+
<4> The `reduce_script` iterates through the value of `s.timestamp_latest`
61+
returned by each shard and returns the document with the latest timestamp
62+
(`last_doc`). In the response, the top hit (in other words, the `latest_doc`) is
63+
nested below the `latest_doc` field.
64+
65+
Check the
66+
<<scripted-metric-aggregation-scope,scope of scripts>>
67+
for a detailed explanation of the respective scripts.
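
To make the four phases easier to follow, here is a minimal Python sketch that mimics them outside of Elasticsearch. It only illustrates the control flow and is not how the scripts are executed: the shard contents, the `value` fields, and the helper names (`init_state`, `map_doc`, `combine`, `reduce_states`) are assumptions invented for this example.

```python
# Illustrative simulation of the scripted metric phases; not Elasticsearch code.
# Timestamps are epoch milliseconds; the shard contents are made-up sample data.

def init_state():
    # Mirrors init_script: one fresh state object per shard.
    return {"timestamp_latest": 0, "last_doc": ""}

def map_doc(state, doc):
    # Mirrors map_script: remember the newest document seen on this shard.
    if doc["@timestamp"] > state["timestamp_latest"]:
        state["timestamp_latest"] = doc["@timestamp"]
        # Copy the document, like new HashMap(params['_source']).
        state["last_doc"] = dict(doc)

def combine(state):
    # Mirrors combine_script: return the shard-level state unchanged.
    return state

def reduce_states(states):
    # Mirrors reduce_script: pick the newest document across all shards.
    last_doc = ""
    timestamp_latest = 0
    for s in states:
        if s["timestamp_latest"] > timestamp_latest:
            timestamp_latest = s["timestamp_latest"]
            last_doc = s["last_doc"]
    return last_doc

# Two hypothetical shards holding three documents in total.
shards = [
    [{"@timestamp": 1000, "value": "a"}, {"@timestamp": 3000, "value": "b"}],
    [{"@timestamp": 2000, "value": "c"}],
]

states = []
for shard in shards:
    state = init_state()
    for doc in shard:
        map_doc(state, doc)
    states.append(combine(state))

latest_doc = reduce_states(states)
print(latest_doc)
```

Each shard keeps only its own newest document, so the reduce phase compares one candidate per shard rather than every document.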
68+
69+
You can retrieve the last value in a similar way:
70+
71+
[source,js]
72+
--------------------------------------------------
73+
"latest_value": {
74+
"scripted_metric": {
75+
"init_script": "state.timestamp_latest = 0L; state.last_value = ''",
76+
"map_script": """
77+
def current_date = doc['date'].getValue().toInstant().toEpochMilli();
78+
if (current_date > state.timestamp_latest)
79+
{state.timestamp_latest = current_date;
80+
state.last_value = params['_source']['value'];}
81+
""",
82+
"combine_script": "return state",
83+
"reduce_script": """
84+
def last_value = '';
85+
def timestamp_latest = 0L;
86+
for (s in states) {if (s.timestamp_latest > (timestamp_latest))
87+
{timestamp_latest = s.timestamp_latest; last_value = s.last_value;}}
88+
return last_value
89+
"""
90+
}
91+
}
92+
--------------------------------------------------
93+
// NOTCONSOLE
94+
95+
96+
[discrete]
97+
[[painless-time-features]]
98+
==== Getting time features as scripted fields
99+
100+
This snippet shows how to extract time-based features by using Painless. The
101+
snippet uses an index where `@timestamp` is defined as a `date` type field.
102+
103+
[source,js]
104+
--------------------------------------------------
105+
"script_fields": {
106+
"hour_of_day": { <1>
107+
"script": {
108+
"lang": "painless",
109+
"source": """
110+
ZonedDateTime date = doc['@timestamp'].value; <2>
111+
return date.getHour(); <3>
112+
"""
113+
}
114+
},
115+
"month_of_year": { <4>
116+
"script": {
117+
"lang": "painless",
118+
"source": """
119+
ZonedDateTime date = doc['@timestamp'].value; <5>
120+
return date.getMonthValue(); <6>
121+
"""
122+
}
123+
}
124+
}
125+
--------------------------------------------------
126+
// NOTCONSOLE
127+
128+
<1> Contains the Painless script that returns the hour of the day.
129+
<2> Sets `date` based on the timestamp of the document.
130+
<3> Returns the hour value from `date`.
131+
<4> Contains the Painless script that returns the month of the year.
132+
<5> Sets `date` based on the timestamp of the document.
133+
<6> Returns the month value from `date`.
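
For readers who want to check the expected values outside of Elasticsearch, the following Python sketch computes the same two features. It is only an analogy for the Painless calls `getHour()` and `getMonthValue()`: the `time_features` helper and the sample timestamps are assumptions made for this example, and it assumes the date is interpreted in UTC.

```python
from datetime import datetime, timezone

def time_features(timestamp):
    # Parse an ISO 8601 timestamp with an offset and normalize it to UTC
    # (assumption: the script sees the date in UTC).
    date = datetime.fromisoformat(timestamp).astimezone(timezone.utc)
    return {
        "hour_of_day": date.hour,      # 0-23, like getHour()
        "month_of_year": date.month,   # 1-12, like getMonthValue()
    }

# Hypothetical sample timestamp.
features = time_features("2020-03-09T14:30:00+00:00")
print(features)
```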
134+
135+
136+
[discrete]
137+
[[painless-group-by]]
138+
==== Using Painless in `group_by`
139+
140+
It is possible to base the `group_by` property of a {transform} on the output of
141+
a script. The following example uses the {kib} sample web logs dataset. The goal
142+
here is to make the {transform} output easier to understand through normalizing
143+
the value of the fields that the data is grouped by.
144+
145+
[source,console]
146+
--------------------------------------------------
147+
POST _transform/_preview
148+
{
149+
"source": {
150+
"index": [ <1>
151+
"kibana_sample_data_logs"
152+
]
153+
},
154+
"pivot": {
155+
"group_by": {
156+
"agent": {
157+
"terms": {
158+
"script": { <2>
159+
"source": """String agent = doc['agent.keyword'].value;
160+
if (agent.contains("MSIE")) {
161+
return "internet explorer";
162+
} else if (agent.contains("AppleWebKit")) {
163+
return "safari";
164+
} else if (agent.contains('Firefox')) {
165+
return "firefox";
166+
} else { return agent }""",
167+
"lang": "painless"
168+
}
169+
}
170+
}
171+
},
172+
"aggregations": { <3>
173+
"200": {
174+
"filter": {
175+
"term": {
176+
"response": "200"
177+
}
178+
}
179+
},
180+
"404": {
181+
"filter": {
182+
"term": {
183+
"response": "404"
184+
}
185+
}
186+
},
187+
"503": {
188+
"filter": {
189+
"term": {
190+
"response": "503"
191+
}
192+
}
193+
}
194+
}
195+
},
196+
"dest": { <4>
197+
"index": "pivot_logs"
198+
}
199+
}
200+
--------------------------------------------------
201+
// TEST[skip:setup kibana sample data]
202+
203+
<1> Specifies the source index or indices.
204+
<2> The script defines an `agent` string based on the `agent` field of the
205+
documents, then checks its value. If an `agent` field contains
206+
"MSIE", then the script returns "internet explorer". If it contains
207+
`AppleWebKit`, it returns "safari". It returns "firefox" if the field value
208+
contains "Firefox". Finally, in every other case, the value of the field is
209+
returned.
210+
<3> The aggregations object contains filters that narrow down the results to
211+
documents that contain `200`, `404`, or `503` values in the `response` field.
212+
<4> Specifies the destination index of the {transform}.
213+
214+
The API returns the following result:
215+
216+
[source,js]
217+
--------------------------------------------------
218+
{
219+
"preview" : [
220+
{
221+
"agent" : "firefox",
222+
"200" : 4931,
223+
"404" : 259,
224+
"503" : 172
225+
},
226+
{
227+
"agent" : "internet explorer",
228+
"200" : 3674,
229+
"404" : 210,
230+
"503" : 126
231+
},
232+
{
233+
"agent" : "safari",
234+
"200" : 4227,
235+
"404" : 332,
236+
"503" : 143
237+
}
238+
],
239+
"mappings" : {
240+
"properties" : {
241+
"200" : {
242+
"type" : "long"
243+
},
244+
"agent" : {
245+
"type" : "keyword"
246+
},
247+
"404" : {
248+
"type" : "long"
249+
},
250+
"503" : {
251+
"type" : "long"
252+
}
253+
}
254+
}
255+
}
256+
--------------------------------------------------
257+
// NOTCONSOLE
258+
259+
You can see that the `agent` values are simplified so it is easier to interpret
260+
them. The table below shows how normalization modifies the output of the
261+
{transform} in our example compared to the non-normalized values.
262+
263+
[width="50%"]
264+
265+
|===
266+
| Non-normalized `agent` value | Normalized `agent` value
267+
268+
| "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)" | "internet explorer"
269+
| "Mozilla/5.0 (X11; Linux i686) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.50 Safari/534.24" | "safari"
270+
| "Mozilla/5.0 (X11; Linux x86_64; rv:6.0a1) Gecko/20110421 Firefox/6.0a1" | "firefox"
271+
|===
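
The normalization logic can be tried outside of a {transform} with a small Python sketch that mirrors the branches of the Painless script. The `normalize_agent` function name and the `curl/7.68.0` fall-through value are assumptions added for illustration; the three user agent strings are the ones from the table above.

```python
def normalize_agent(agent):
    # Same order of checks as the Painless script in the group_by example.
    if "MSIE" in agent:
        return "internet explorer"
    elif "AppleWebKit" in agent:
        return "safari"
    elif "Firefox" in agent:
        return "firefox"
    # Any other value is returned unchanged.
    return agent

agents = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)",
    "Mozilla/5.0 (X11; Linux i686) AppleWebKit/534.24 (KHTML, like Gecko) "
    "Chrome/11.0.696.50 Safari/534.24",
    "Mozilla/5.0 (X11; Linux x86_64; rv:6.0a1) Gecko/20110421 Firefox/6.0a1",
    "curl/7.68.0",  # hypothetical value that matches none of the branches
]

normalized = [normalize_agent(a) for a in agents]
print(normalized)
```

Note that the second user agent identifies itself as Chrome, but it is grouped under "safari" because the script only checks for the `AppleWebKit` substring.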
272+
273+
274+
[discrete]
275+
[[painless-bucket-script]]
276+
==== Getting duration by using bucket script
277+
278+
This example shows you how to get the duration of a session by client IP from a
279+
data log by using
280+
{ref}/search-aggregations-pipeline-bucket-script-aggregation.html[bucket script].
281+
The example uses the {kib} sample web logs dataset.
282+
283+
[source,console]
284+
--------------------------------------------------
285+
PUT _transform/data_log
286+
{
287+
"source": {
288+
"index": "kibana_sample_data_logs"
289+
},
290+
"dest": {
291+
"index": "data-logs-by-client"
292+
},
293+
"pivot": {
294+
"group_by": {
295+
"machine.os": {"terms": {"field": "machine.os.keyword"}},
296+
"machine.ip": {"terms": {"field": "clientip"}}
297+
},
298+
"aggregations": {
299+
"time_frame.lte": {
300+
"max": {
301+
"field": "timestamp"
302+
}
303+
},
304+
"time_frame.gte": {
305+
"min": {
306+
"field": "timestamp"
307+
}
308+
},
309+
"time_length": { <1>
310+
"bucket_script": {
311+
"buckets_path": { <2>
312+
"min": "time_frame.gte.value",
313+
"max": "time_frame.lte.value"
314+
},
315+
"script": "params.max - params.min" <3>
316+
}
317+
}
318+
}
319+
}
320+
}
321+
--------------------------------------------------
322+
// TEST[skip:setup kibana sample data]
323+
324+
<1> To define the length of the sessions, we use a bucket script.
325+
<2> The bucket path is a map of script variables and their associated path to
326+
the buckets you want to use for the variable. In this particular case, `min` and
327+
`max` are variables mapped to `time_frame.gte.value` and `time_frame.lte.value`.
328+
<3> Finally, the script subtracts the start date of the session from the end
329+
date which results in the duration of the session.
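
As a rough illustration of the arithmetic, the Python sketch below reproduces `params.max - params.min` for a single bucket. It assumes that the `min` and `max` aggregations on a date field yield epoch milliseconds, so the result is a duration in milliseconds. The `epoch_millis` helper and the three sample timestamps are made up for this example.

```python
from datetime import datetime

def epoch_millis(timestamp):
    # Convert an ISO 8601 timestamp with an offset to epoch milliseconds.
    return int(datetime.fromisoformat(timestamp).timestamp() * 1000)

# Hypothetical requests from one client IP, i.e. one bucket.
session = [
    "2020-03-09T10:00:00+00:00",
    "2020-03-09T10:00:45+00:00",
    "2020-03-09T10:02:30+00:00",
]
millis = [epoch_millis(t) for t in session]

# Equivalent of buckets_path: map script variables to the min and max values.
params = {"min": min(millis), "max": max(millis)}

# Equivalent of the bucket script: params.max - params.min.
time_length = params["max"] - params["min"]
print(time_length)
```

Here the first and last requests are 2 minutes and 30 seconds apart, so the duration is 150000 milliseconds.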
