
Commit 2334952

Adam Locke authored
[DOCS] [7.x] Create a new page for dissect content in scripting docs #73437 (#73507) (#73508)
* [DOCS] Create a new page for dissect content in scripting docs (#73437)
* [DOCS] Create a new page for dissect in scripting docs
* Expanding a bit more
* Adding a section for using dissect patterns
* Adding tests
* Fix test cases and other edits
* Add doc type to response
1 parent f37bce1 commit 2334952

3 files changed

+312
-6
lines changed

docs/reference/scripting/common-script-uses.asciidoc

Lines changed: 1 addition & 6 deletions
@@ -17,15 +17,10 @@ There are two options at your disposal:
1717
* <<grok,Grok>> is a regular expression dialect that supports aliased
1818
expressions that you can reuse. Because Grok sits on top of regular expressions
1919
(regex), any regular expressions are valid in grok as well.
20-
* <<dissect-processor,Dissect>> extracts structured fields out of text, using
20+
* <<dissect,Dissect>> extracts structured fields out of text, using
2121
delimiters to define the matching pattern. Unlike grok, dissect doesn't use regular
2222
expressions.
2323

24-
Regex is incredibly powerful but can be complicated. If you don't need the
25-
power of regular expressions, use dissect patterns, which are simple and
26-
often faster than grok patterns. Paying special attention to the parts of the string
27-
you want to discard will help build successful dissect patterns.
28-
2924
Let's start with a simple example by adding the `@timestamp` and `message`
3025
fields to the `my-index` mapping as indexed fields. To remain flexible, use
3126
`wildcard` as the field type for `message`:
Lines changed: 310 additions & 0 deletions
@@ -0,0 +1,310 @@
1+
[[dissect]]
2+
=== Dissecting data
3+
Dissect matches a single text field against a defined pattern. A dissect
4+
pattern is defined by the parts of the string you want to discard. Paying
5+
special attention to each part of a string helps to build successful dissect
6+
patterns.
7+
8+
If you don't need the power of regular expressions, use dissect patterns instead
9+
of grok. Dissect uses a much simpler syntax than grok and is typically faster
10+
overall. The syntax for dissect is transparent: tell dissect what you want and
11+
it will return those results to you.
12+
13+
[[dissect-syntax]]
14+
==== Dissect patterns
15+
Dissect patterns consist of _variables_ and _separators_. Anything
16+
defined by a percent sign and curly braces `%{}` is considered a variable,
17+
such as `%{clientip}`. You can assign variables to any part of data in a field,
18+
and then return only the parts that you want. Separators are any values between
19+
variables, which could be spaces, dashes, or other delimiters.
20+
21+
For example, let's say you have log data with a `message` field that looks like
22+
this:
23+
24+
[source,js]
25+
----
26+
"message" : "247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] \"GET /images/hm_nbg.jpg HTTP/1.0\" 304 0"
27+
----
28+
// NOTCONSOLE
29+
30+
You assign variables to each part of the data to construct a successful
31+
dissect pattern. Remember, tell dissect _exactly_ what you want to
32+
match on.
33+
34+
41+
42+
The first part of the data looks like an IP address, so you
43+
can assign a variable like `%{clientip}`. The next two characters are dashes
44+
with a space on either side. You can assign a variable for each dash, or a
45+
single variable to represent the dashes and spaces. Next is a set of brackets
46+
containing a timestamp. The brackets are a separator, so you include those in
47+
the dissect pattern. Thus far, the data and matching dissect pattern look like
48+
this:
49+
50+
[source,js]
51+
----
52+
247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] <1>
53+
54+
%{clientip} %{ident} %{auth} [%{@timestamp}] <2>
55+
----
56+
// NOTCONSOLE
57+
<1> The first chunks of data from the `message` field
58+
<2> Dissect pattern to match on the selected data chunks
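To make the split between variables and separators concrete, here is a minimal Python sketch. It is an illustration only and not how {es} implements dissect. The `split_pattern` helper and its regular expression are assumptions made for this example; it takes the partial pattern shown above and lists the variable names and the literal separators between them.

```python
import re

# One %{...} variable; the name between the braces is captured.
VARIABLE = re.compile(r"%\{([^}]*)\}")


def split_pattern(pattern):
    """Split a dissect-style pattern into (variable names, separators)."""
    # With a capturing group, re.split alternates literal text and names.
    parts = VARIABLE.split(pattern)
    variables = parts[1::2]
    # Drop empty strings, which appear when a variable starts or ends the pattern.
    separators = [text for text in parts[0::2] if text]
    return variables, separators


variables, separators = split_pattern("%{clientip} %{ident} %{auth} [%{@timestamp}]")
print(variables)   # ['clientip', 'ident', 'auth', '@timestamp']
print(separators)  # [' ', ' ', ' [', ']']
```

The opening bracket shows up as part of the separator `' ['` and the closing bracket as its own separator, which is why both brackets must appear literally in the pattern.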
59+
60+
Using that same logic, you can create variables for the remaining chunks of
61+
data. Double quotation marks are separators, so include those in your dissect
62+
pattern. The pattern replaces `GET` with a `%{verb}` variable, but keeps `HTTP`
63+
as part of the pattern.
64+
65+
[source,js]
66+
----
67+
\"GET /images/hm_nbg.jpg HTTP/1.0\" 304 0
68+
69+
"%{verb} %{request} HTTP/%{httpversion}" %{response} %{size}
70+
----
71+
// NOTCONSOLE
72+
73+
Combining the two patterns results in a dissect pattern that looks like this:
74+
75+
[source,js]
76+
----
77+
%{clientip} %{ident} %{auth} [%{@timestamp}] \"%{verb} %{request} HTTP/%{httpversion}\" %{response} %{size}
78+
----
79+
// NOTCONSOLE
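As a rough way to reason about what the combined pattern captures, the following Python sketch walks the sample `message` from left to right and cuts it at each separator. This is a simplified stand-in under stated assumptions, not the {es} dissect implementation: the `dissect` function here is hypothetical, supports only plain `%{name}` variables, and uses the variable names from the two partial patterns.

```python
import re

VARIABLE = re.compile(r"%\{([^}]*)\}")

PATTERN = ('%{clientip} %{ident} %{auth} [%{@timestamp}] '
           '"%{verb} %{request} HTTP/%{httpversion}" %{response} %{size}')
MESSAGE = ('247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] '
           '"GET /images/hm_nbg.jpg HTTP/1.0" 304 0')


def dissect(pattern, text):
    """Return {variable: value} if text fits the pattern, else None."""
    parts = VARIABLE.split(pattern)
    names, literals = parts[1::2], parts[0::2]
    # Any literal text before the first variable must match exactly.
    if not text.startswith(literals[0]):
        return None
    pos = len(literals[0])
    fields = {}
    for name, separator in zip(names, literals[1:]):
        if separator:
            # The value runs up to the next occurrence of the separator.
            end = text.find(separator, pos)
            if end == -1:
                return None
        else:
            # No separator follows, so the value runs to the end of the text.
            end = len(text)
        fields[name] = text[pos:end]
        pos = end + len(separator)
    # Leftover text after the last separator means the pattern did not fit.
    return fields if pos == len(text) else None


fields = dissect(PATTERN, MESSAGE)
print(fields["clientip"])    # 247.37.0.0
print(fields["@timestamp"])  # 30/Apr/2020:14:31:22 -0500
print(fields["response"])    # 304
```

Every captured value is a string, including `response`, so a script that needs a number has to convert it.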
80+
81+
Now that you have a dissect pattern, how do you test and use it?
82+
83+
[[dissect-patterns-test]]
84+
==== Test dissect patterns with Painless
85+
You can incorporate dissect patterns into Painless scripts to extract
86+
data. To test your script, use either the {painless}/painless-execute-api.html#painless-execute-runtime-field-context[field contexts] of the Painless
87+
execute API or create a runtime field that includes the script. Runtime fields
88+
offer greater flexibility and accept multiple documents, but the Painless execute
89+
API is a great option if you don't have write access on a cluster where you're
90+
testing a script.
91+
92+
For example, test your dissect pattern with the Painless execute API by
93+
including your Painless script and a single document that matches your data.
94+
Start by indexing the `message` field as a `wildcard` data type:
95+
96+
[source,console]
97+
----
98+
PUT my-index
99+
{
100+
"mappings": {
101+
"properties": {
102+
"message": {
103+
"type": "wildcard"
104+
}
105+
}
106+
}
107+
}
108+
----
109+
110+
If you want to retrieve the HTTP response code, add your dissect pattern to a
111+
Painless script that extracts the `response` value. To extract values from a
112+
field, use this function:
113+
114+
[source,painless]
115+
----
116+
.extract(doc["<field_name>"].value)?.<field_value>
117+
----
118+
119+
In this example, `message` is the `<field_name>` and `response` is the
120+
`<field_value>`:
121+
122+
[source,console]
123+
----
124+
POST /_scripts/painless/_execute
125+
{
126+
"script": {
127+
"source": """
128+
String response=dissect('%{clientip} %{ident} %{auth} [%{@timestamp}] "%{verb} %{request} HTTP/%{httpversion}" %{response} %{size}').extract(doc["message"].value)?.response;
129+
if (response != null) emit(Integer.parseInt(response)); <1>
130+
"""
131+
},
132+
"context": "long_field", <2>
133+
"context_setup": {
134+
"index": "my-index",
135+
"document": { <3>
136+
"message": """247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] "GET /images/hm_nbg.jpg HTTP/1.0" 304 0"""
137+
}
138+
}
139+
}
140+
----
141+
// TEST[continued]
142+
<1> Runtime fields require the `emit` method to return values.
143+
<2> Because the response code is an integer, use the `long_field` context.
144+
<3> Include a sample document that matches your data.
145+
146+
The result includes the HTTP response code:
147+
148+
[source,console-result]
149+
----
150+
{
151+
"result" : [
152+
304
153+
]
154+
}
155+
----
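The two-line script combines a null-safe lookup (`?.response`) with a conditional `emit`. The Python sketch below mirrors only that control flow. It assumes, as the `?.` operator in the script suggests, that extraction can yield no result for text that doesn't match the pattern. The `emit_response` function is hypothetical, and the dictionary stands in for the extracted fields.

```python
def emit_response(fields):
    """Return the values the script would emit for one document."""
    # Mirrors `?.response`: no extracted fields means no response value.
    response = fields.get("response") if fields is not None else None
    # Mirrors `if (response != null) emit(Integer.parseInt(response))`.
    if response is None:
        return []
    return [int(response)]


print(emit_response({"response": "304", "size": "0"}))  # [304]
print(emit_response(None))                              # []
```

A document that emits nothing simply has no value for the field, which is why the null check matters when some log lines don't fit the pattern.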
156+
157+
[[dissect-patterns-runtime]]
158+
==== Use dissect patterns and scripts in runtime fields
159+
If you have a functional dissect pattern, you can add it to a runtime field to
160+
manipulate data. Because runtime fields don't require you to index fields, you
161+
have incredible flexibility to modify your script and how it functions. If you
162+
already <<dissect-patterns-test,tested your dissect pattern>> using the Painless
163+
execute API, you can use that _exact_ Painless script in your runtime field.
164+
165+
To start, add the `message` field as a `wildcard` type like in the previous
166+
section, but also add `@timestamp` as a `date` in case you want to operate on
167+
that field for <<common-script-uses,other use cases>>:
168+
169+
[source,console]
170+
----
171+
PUT /my-index/
172+
{
173+
"mappings": {
174+
"properties": {
175+
"@timestamp": {
176+
"format": "strict_date_optional_time||epoch_second",
177+
"type": "date"
178+
},
179+
"message": {
180+
"type": "wildcard"
181+
}
182+
}
183+
}
184+
}
185+
----
186+
187+
If you want to extract the HTTP response code using your dissect pattern, you
188+
can create a runtime field like `http.response`:
189+
190+
[source,console]
191+
----
192+
PUT my-index/_mappings
193+
{
194+
"runtime": {
195+
"http.response": {
196+
"type": "long",
197+
"script": """
198+
String response=dissect('%{clientip} %{ident} %{auth} [%{@timestamp}] "%{verb} %{request} HTTP/%{httpversion}" %{response} %{size}').extract(doc["message"].value)?.response;
199+
if (response != null) emit(Integer.parseInt(response));
200+
"""
201+
}
202+
}
203+
}
204+
----
205+
// TEST[continued]
206+
207+
After mapping the fields you want to retrieve, index a few records from
208+
your log data into {es}. The following request uses the <<docs-bulk,bulk API>>
209+
to index raw log data into `my-index`:
210+
211+
[source,console]
212+
----
213+
POST /my-index/_bulk?refresh=true
214+
{"index":{}}
215+
{"timestamp":"2020-04-30T14:30:17-05:00","message":"40.135.0.0 - - [30/Apr/2020:14:30:17 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
216+
{"index":{}}
217+
{"timestamp":"2020-04-30T14:30:53-05:00","message":"232.0.0.0 - - [30/Apr/2020:14:30:53 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
218+
{"index":{}}
219+
{"timestamp":"2020-04-30T14:31:12-05:00","message":"26.1.0.0 - - [30/Apr/2020:14:31:12 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
220+
{"index":{}}
221+
{"timestamp":"2020-04-30T14:31:19-05:00","message":"247.37.0.0 - - [30/Apr/2020:14:31:19 -0500] \"GET /french/splash_inet.html HTTP/1.0\" 200 3781"}
222+
{"index":{}}
223+
{"timestamp":"2020-04-30T14:31:22-05:00","message":"247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] \"GET /images/hm_nbg.jpg HTTP/1.0\" 304 0"}
224+
{"index":{}}
225+
{"timestamp":"2020-04-30T14:31:27-05:00","message":"252.0.0.0 - - [30/Apr/2020:14:31:27 -0500] \"GET /images/hm_bg.jpg HTTP/1.0\" 200 24736"}
226+
{"index":{}}
227+
{"timestamp":"2020-04-30T14:31:28-05:00","message":"not a valid apache log"}
228+
----
229+
// TEST[continued]
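Each `timestamp` value in the bulk request is the ISO 8601 form of the bracketed time inside the same document's `message`. The Python sketch below shows that correspondence; it is an illustration outside of {es}, and the `apache_to_iso` name and format string are assumptions chosen for this example.

```python
from datetime import datetime

# Day/abbreviated month/year:hour:minute:second, then a UTC offset such as -0500.
APACHE_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def apache_to_iso(raw):
    """Convert the bracketed log time to an ISO 8601 string with offset."""
    return datetime.strptime(raw, APACHE_FORMAT).isoformat()


print(apache_to_iso("30/Apr/2020:14:31:22 -0500"))  # 2020-04-30T14:31:22-05:00
```

The dissected `@timestamp` chunk is a plain string in the log's own format, so it needs a conversion like this before it can be treated as a date.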
230+
231+
You can define a simple query to run a search for a specific HTTP response and
232+
return all related fields. Use the `fields` parameter of the search API to
233+
retrieve the `http.response` runtime field.
234+
235+
[source,console]
236+
----
237+
GET my-index/_search
238+
{
239+
"query": {
240+
"match": {
241+
"http.response": "304"
242+
}
243+
},
244+
"fields" : ["http.response"]
245+
}
246+
----
247+
// TEST[continued]
248+
249+
Alternatively, you can define the same runtime field but in the context of a
250+
search request. The runtime definition and the script are exactly the same as
251+
the one defined previously in the index mapping. Just copy that definition into
252+
the search request under the `runtime_mappings` section and include a query
253+
that matches on the runtime field. This query returns the same results as the
254+
search query previously defined for the `http.response` runtime field in your
255+
index mappings, but only in the context of this specific search:
256+
257+
[source,console]
258+
----
259+
GET my-index/_search
260+
{
261+
"runtime_mappings": {
262+
"http.response": {
263+
"type": "long",
264+
"script": """
265+
String response=dissect('%{clientip} %{ident} %{auth} [%{@timestamp}] "%{verb} %{request} HTTP/%{httpversion}" %{response} %{size}').extract(doc["message"].value)?.response;
266+
if (response != null) emit(Integer.parseInt(response));
267+
"""
268+
}
269+
},
270+
"query": {
271+
"match": {
272+
"http.response": "304"
273+
}
274+
},
275+
"fields" : ["http.response"]
276+
}
277+
----
278+
// TEST[continued]
279+
// TEST[s/_search/_search\?filter_path=hits/]
280+
281+
[source,console-result]
282+
----
283+
{
284+
"hits" : {
285+
"total" : {
286+
"value" : 1,
287+
"relation" : "eq"
288+
},
289+
"max_score" : 1.0,
290+
"hits" : [
291+
{
292+
"_index" : "my-index",
293+
"_type" : "_doc",
294+
"_id" : "D47UqXkBByC8cgZrkbOm",
295+
"_score" : 1.0,
296+
"_source" : {
297+
"timestamp" : "2020-04-30T14:31:22-05:00",
298+
"message" : "247.37.0.0 - - [30/Apr/2020:14:31:22 -0500] \"GET /images/hm_nbg.jpg HTTP/1.0\" 304 0"
299+
},
300+
"fields" : {
301+
"http.response" : [
302+
304
303+
]
304+
}
305+
}
306+
]
307+
}
308+
}
309+
----
310+
// TESTRESPONSE[s/"_id" : "D47UqXkBByC8cgZrkbOm"/"_id": $body.hits.hits.0._id/]

docs/reference/scripting/using.asciidoc

Lines changed: 1 addition & 0 deletions
@@ -566,4 +566,5 @@ DELETE /_ingest/pipeline/my_test_scores_pipeline
566566
567567
////
568568

569+
include::dissect-syntax.asciidoc[]
569570
include::grok-syntax.asciidoc[]
