@@ -27,11 +27,12 @@ such that we get a new index that contains a sales summary for each customer.

 In particular, create a {transform} that calculates the sum of the products
 (`products.quantity`) and the sum of prices (`products.taxful_price`) in all of
-the orders, grouped by customer (`customer_full_name`). Also include a value
+the orders, grouped by customer (`customer_full_name.keyword`). Also include a value
 count aggregation, so that we know how many orders (`order_id`) exist for each
 customer.

-You can preview the {transform} before you create it in {kib}:
+You can preview the {transform} before you create it in *{stack-manage-app}*
+> *Transforms*:

 [role="screenshot"]
 image::images/ecommerce-transform-preview.png["Creating a {transform} in {kib}"]
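
Reviewer note on the hunk above: the pivot it describes can be sketched as a request body. This is a minimal illustration, not the request used in the documentation. The source index name `kibana_sample_data_ecommerce`, the destination index name `ecommerce-customer-sales`, and the helper `build_transform_body` are assumptions made for this example. The aggregation names mirror the feature names that appear in the results later in this diff.

```python
import json


def build_transform_body(source_index, dest_index):
    """Build a pivot transform body (hypothetical helper, for illustration only).

    Groups by the keyword sub-field of the customer name, as in the updated
    text, and computes the two sums and the value count it describes.
    """
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
        "pivot": {
            "group_by": {
                "customer_full_name.keyword": {
                    "terms": {"field": "customer_full_name.keyword"}
                }
            },
            "aggregations": {
                "products.quantity.sum": {"sum": {"field": "products.quantity"}},
                "products.taxful_price.sum": {
                    "sum": {"field": "products.taxful_price"}
                },
                "order_id.value_count": {"value_count": {"field": "order_id"}},
            },
        },
    }


# Both index names are assumptions, not values taken from this diff.
body = build_transform_body("kibana_sample_data_ecommerce", "ecommerce-customer-sales")
print(json.dumps(body, indent=2))
```

Grouping on the `.keyword` sub-field matters because a terms grouping needs an exact-value field rather than analyzed text, which is presumably why the hunk changes the field name.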
@@ -152,12 +153,26 @@ POST _data_frame/transforms/ecommerce-customer-sales/_start
 . Create a {dfanalytics-job} to detect outliers in the new entity-centric index.
 +
 --
-There is a wizard for creating {dfanalytics-jobs} on the
-*Machine Learning* > *Data Frame Analytics* page in {kib}:
+In the wizard on the *Machine Learning* > *Data Frame Analytics* page in {kib},
+select your new index pattern then use the default values for {oldetection}. For
+example:

 [role="screenshot"]
 image::images/ecommerce-outlier-job-1.png["Create a {dfanalytics-job} in {kib}"]

+The wizard includes a scatterplot matrix, which enables you to explore the
+relationships between the fields. You can use that information to help you
+decide which fields to include or exclude from the analysis.
+
+[role="screenshot"]
+image::images/ecommerce-outlier-scatterplot.png["A scatterplot matrix for three fields in {kib}"]
+
+If you want these charts to represent data from a larger sample size or from a
+randomized selection of documents, you can change the default behavior. However,
+a larger sample size might slow down the performance of the matrix and a
+randomized selection might put more load on the cluster due to the more
+intensive query.
+
 Alternatively, you can use the
 {ref}/put-dfanalytics.html[create {dfanalytics-jobs} API].

@@ -191,8 +206,8 @@ PUT _ml/data_frame/analytics/ecommerce
 +
 --
 You can start, stop, and manage {dfanalytics-jobs} on the
-*Machine Learning* > *Data Frame Analytics* page in {kib} . Alternatively, you
-can use the {ref}/start-dfanalytics.html[start {dfanalytics-jobs}] and
+*Machine Learning* > *Data Frame Analytics* page. Alternatively, you can use the
+{ref}/start-dfanalytics.html[start {dfanalytics-jobs}] and
 {ref}/stop-dfanalytics.html[stop {dfanalytics-jobs}] APIs.

 .API example
@@ -248,16 +263,40 @@ The search results include the following {oldetection} scores:
 [source,js]
 --------------------------------------------------
 ...
-"ml" : {
-  "outlier_score" : 0.9653657078742981,
-  "feature_influence.products.quantity.sum" : 0.00592468399554491,
-  "feature_influence.order_id.value_count" : 0.01975759118795395,
-  "feature_influence.products.taxful_price.sum" : 0.974317729473114
+"ml" : {
+  "outlier_score" : 0.9706582427024841,
+  "feature_influence" : [
+    {
+      "feature_name" : "order_id.value_count",
+      "influence" : 0.015179949812591076
+    },
+    {
+      "feature_name" : "products.quantity.sum",
+      "influence" : 0.003752298653125763
+    },
+    {
+      "feature_name" : "products.taxful_price.sum",
+      "influence" : 0.9810677766799927
+    }
+  ]
 }
 ...
 --------------------------------------------------
 // NOTCONSOLE
 ====
+
+{kib} also provides a scatterplot matrix in the results. Outliers with a score
+that exceeds the threshold are highlighted in each chart:
+
+[role="screenshot"]
+image::images/outliers-scatterplot.png["View scatterplot in {oldetection} results"]
+
+In addition to the sample size and random scoring options, there is a
+*Dynamic size* option. If you enable this option, the size of each point is
+affected by its {olscore}; that is to say, the largest points have the
+highest {olscores}. The goal of these charts and options is to help you
+visualize and explore the outliers within your data.
+
 --

 Now that you've found unusual behavior in the sample data set, consider how you
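
Reviewer note on the hunk above: the example output changes `feature_influence` from flattened `feature_influence.<name>` keys to an array of `feature_name`/`influence` objects, so code that reads these results may need to handle both shapes. The following is a minimal sketch under that assumption. The helper names `feature_influences` and `top_feature` are hypothetical, and the sample values are copied from the removed and added lines of this diff.

```python
def feature_influences(ml):
    """Return {feature_name: influence} for either result shape in this diff."""
    nested = ml.get("feature_influence")
    if isinstance(nested, list):
        # New shape: a list of {"feature_name": ..., "influence": ...} objects.
        return {entry["feature_name"]: entry["influence"] for entry in nested}
    # Old shape: flattened keys that start with "feature_influence.".
    prefix = "feature_influence."
    return {
        key[len(prefix):]: value
        for key, value in ml.items()
        if key.startswith(prefix)
    }


def top_feature(ml):
    """Return the (feature_name, influence) pair with the largest influence."""
    influences = feature_influences(ml)
    return max(influences.items(), key=lambda item: item[1])


# Values copied from the added lines of the hunk.
new_ml = {
    "outlier_score": 0.9706582427024841,
    "feature_influence": [
        {"feature_name": "order_id.value_count", "influence": 0.015179949812591076},
        {"feature_name": "products.quantity.sum", "influence": 0.003752298653125763},
        {"feature_name": "products.taxful_price.sum", "influence": 0.9810677766799927},
    ],
}

# Values copied from the removed lines of the hunk.
old_ml = {
    "outlier_score": 0.9653657078742981,
    "feature_influence.products.quantity.sum": 0.00592468399554491,
    "feature_influence.order_id.value_count": 0.01975759118795395,
    "feature_influence.products.taxful_price.sum": 0.974317729473114,
}

print(top_feature(new_ml))
print(top_feature(old_ml))
```

In both samples the total taxful price is the dominant influence on the outlier score, which matches the values shown in the hunk.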
@@ -269,9 +308,9 @@ algorithms perform by using the evaluate {dfanalytics} API. See
 TIP: If you do not want to keep the {transform} and the {dfanalytics-job}, you
 can delete them in {kib} or use the
 {ref}/delete-data-frame-transform.html[delete {transform} API] and
-{ref}/delete-dfanalytics.html[delete {dfanalytics-job} API]. When
-you delete {transforms} and {dfanalytics-jobs}, the destination indices and
-{kib} index patterns remain .
+{ref}/delete-dfanalytics.html[delete {dfanalytics-job} API]. When you delete
+{transforms} and {dfanalytics-jobs} in {kib}, you have the option to also remove
+the destination indices and index patterns.

 If you want to see another example of {oldetection} in a Jupyter notebook,
 https://github.com/elastic/examples/tree/master/Machine%20Learning/Outlier%20Detection/Introduction[click here].