Commit bea4e3f

[DOCS] Add scatterplot matrix to outlier detection example (#1507) (#1536)
1 parent 0d5efc0 commit bea4e3f

6 files changed: +53 -14 lines changed

docs/en/stack/ml/df-analytics/ecommerce-outliers.asciidoc

Lines changed: 53 additions & 14 deletions
@@ -27,11 +27,12 @@ such that we get a new index that contains a sales summary for each customer.
 
 In particular, create a {transform} that calculates the sum of the products
 (`products.quantity`) and the sum of prices (`products.taxful_price`) in all of
-the orders, grouped by customer (`customer_full_name`). Also include a value
+the orders, grouped by customer (`customer_full_name.keyword`). Also include a value
 count aggregation, so that we know how many orders (`order_id`) exist for each
 customer.
 
-You can preview the {transform} before you create it in {kib}:
+You can preview the {transform} before you create it in *{stack-manage-app}*
+> *Transforms*:
 
 [role="screenshot"]
 image::images/ecommerce-transform-preview.png["Creating a {transform} in {kib}"]
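
Reviewer note: the pivot described in the hunk above (group by `customer_full_name.keyword`, sum `products.quantity` and `products.taxful_price`, count `order_id`) can be sketched as a request body. This is a minimal illustration, not the body used in the documented example. The helper name and both index names passed to it are assumptions; the aggregation names simply mirror the feature names that appear in the results hunk of this commit.

```python
import json


def build_customer_sales_pivot(source_index, dest_index):
    """Return a transform request body that summarizes orders per customer.

    Hypothetical helper: the function name and the index names supplied by
    the caller are illustrative and do not come from the commit.
    """
    return {
        "source": {"index": source_index},
        "pivot": {
            # One output document per distinct customer name.
            "group_by": {
                "customer_full_name.keyword": {
                    "terms": {"field": "customer_full_name.keyword"}
                }
            },
            "aggregations": {
                # Total number of products across all of a customer's orders.
                "products.quantity.sum": {"sum": {"field": "products.quantity"}},
                # Total price (including tax) across all of a customer's orders.
                "products.taxful_price.sum": {
                    "sum": {"field": "products.taxful_price"}
                },
                # Number of order_id values, that is, how many orders exist.
                "order_id.value_count": {"value_count": {"field": "order_id"}},
            },
        },
        "dest": {"index": dest_index},
    }


# Assumed index names, for illustration only.
body = build_customer_sales_pivot(
    "kibana_sample_data_ecommerce", "ecommerce-customer-sales"
)
print(json.dumps(body, indent=2))
```

Grouping on the `.keyword` sub-field matters because a terms grouping needs an exact-value field rather than an analyzed text field, which is presumably why the commit changes the field name in the prose.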
@@ -152,12 +153,26 @@ POST _data_frame/transforms/ecommerce-customer-sales/_start
 . Create a {dfanalytics-job} to detect outliers in the new entity-centric index.
 +
 --
-There is a wizard for creating {dfanalytics-jobs} on the
-*Machine Learning* > *Data Frame Analytics* page in {kib}:
+In the wizard on the *Machine Learning* > *Data Frame Analytics* page in {kib},
+select your new index pattern then use the default values for {oldetection}. For
+example:
 
 [role="screenshot"]
 image::images/ecommerce-outlier-job-1.png["Create a {dfanalytics-job} in {kib}"]
 
+The wizard includes a scatterplot matrix, which enables you to explore the
+relationships between the fields. You can use that information to help you
+decide which fields to include or exclude from the analysis.
+
+[role="screenshot"]
+image::images/ecommerce-outlier-scatterplot.png["A scatterplot matrix for three fields in {kib}"]
+
+If you want these charts to represent data from a larger sample size or from a
+randomized selection of documents, you can change the default behavior. However,
+a larger sample size might slow down the performance of the matrix and a
+randomized selection might put more load on the cluster due to the more
+intensive query.
+
 Alternatively, you can use the
 {ref}/put-dfanalytics.html[create {dfanalytics-jobs} API].
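
Reviewer note: the trade-off described in the hunk above (a larger sample is slower to render, a randomized selection is a heavier query) can be illustrated with a search body. The sketch below is an assumption about what such a query could look like, using the standard `function_score` query with `random_score`; the commit does not show the query that {kib} actually sends, and the helper name is hypothetical.

```python
def build_sample_query(sample_size, randomize=False):
    """Return a search body that fetches `sample_size` documents.

    Hypothetical helper for illustration only. With randomize=False the
    search returns the first matching documents; with randomize=True every
    matching document is given a random score, so the top `sample_size`
    hits form a random selection at the cost of scoring all documents.
    """
    query = {"match_all": {}}
    if randomize:
        query = {"function_score": {"query": query, "random_score": {}}}
    return {"size": sample_size, "query": query}


default_body = build_sample_query(1000)
random_body = build_sample_query(1000, randomize=True)
```

The randomized variant has to score every matching document before it can pick the top hits, which is consistent with the added text's warning about extra cluster load.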

@@ -191,8 +206,8 @@ PUT _ml/data_frame/analytics/ecommerce
 +
 --
 You can start, stop, and manage {dfanalytics-jobs} on the
-*Machine Learning* > *Data Frame Analytics* page in {kib}. Alternatively, you
-can use the {ref}/start-dfanalytics.html[start {dfanalytics-jobs}] and
+*Machine Learning* > *Data Frame Analytics* page. Alternatively, you can use the
+{ref}/start-dfanalytics.html[start {dfanalytics-jobs}] and
 {ref}/stop-dfanalytics.html[stop {dfanalytics-jobs}] APIs.
 
 .API example
@@ -248,16 +263,40 @@ The search results include the following {oldetection} scores:
 [source,js]
 --------------------------------------------------
 ...
-"ml" : {
-  "outlier_score" : 0.9653657078742981,
-  "feature_influence.products.quantity.sum" : 0.00592468399554491,
-  "feature_influence.order_id.value_count" : 0.01975759118795395,
-  "feature_influence.products.taxful_price.sum" : 0.974317729473114
+"ml" : {
+  "outlier_score" : 0.9706582427024841,
+  "feature_influence" : [
+    {
+      "feature_name" : "order_id.value_count",
+      "influence" : 0.015179949812591076
+    },
+    {
+      "feature_name" : "products.quantity.sum",
+      "influence" : 0.003752298653125763
+    },
+    {
+      "feature_name" : "products.taxful_price.sum",
+      "influence" : 0.9810677766799927
+    }
+  ]
 }
 ...
 --------------------------------------------------
 // NOTCONSOLE
 ====
+
+{kib} also provides a scatterplot matrix in the results. Outliers with a score
+that exceeds the threshold are highlighted in each chart:
+
+[role="screenshot"]
+image::images/outliers-scatterplot.png["View scatterplot in {oldetection} results"]
+
+In addition to the sample size and random scoring options, there is a
+*Dynamic size* option. If you enable this option, the size of each point is
+affected by its {olscore}; that is to say, the largest points have the
+highest {olscores}. The goal of these charts and options is to help you
+visualize and explore the outliers within your data.
+
 --
 
 Now that you've found unusual behavior in the sample data set, consider how you
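
Reviewer note: the hunk above changes the documented results layout from flattened `feature_influence.<name>` keys to a `feature_influence` array of `feature_name`/`influence` objects. Code that reads these results may need to handle both shapes. The sketch below is one way to do that; the helper names are hypothetical, and the sample values are copied from the removed and added lines of the hunk.

```python
def feature_influence_map(ml):
    """Return {feature_name: influence} for either results layout.

    Hypothetical helper. Handles the array layout shown in the added lines
    and the flattened-key layout shown in the removed lines.
    """
    nested = ml.get("feature_influence")
    if isinstance(nested, list):
        return {item["feature_name"]: item["influence"] for item in nested}
    prefix = "feature_influence."
    return {
        key[len(prefix):]: value
        for key, value in ml.items()
        if key.startswith(prefix)
    }


def top_influencer(ml):
    """Return the name of the feature with the highest influence."""
    influences = feature_influence_map(ml)
    return max(influences, key=influences.get)


# Layout from the removed lines of the hunk.
old_ml = {
    "outlier_score": 0.9653657078742981,
    "feature_influence.products.quantity.sum": 0.00592468399554491,
    "feature_influence.order_id.value_count": 0.01975759118795395,
    "feature_influence.products.taxful_price.sum": 0.974317729473114,
}

# Layout from the added lines of the hunk.
new_ml = {
    "outlier_score": 0.9706582427024841,
    "feature_influence": [
        {"feature_name": "order_id.value_count", "influence": 0.015179949812591076},
        {"feature_name": "products.quantity.sum", "influence": 0.003752298653125763},
        {"feature_name": "products.taxful_price.sum", "influence": 0.9810677766799927},
    ],
}

print(top_influencer(old_ml))
print(top_influencer(new_ml))
```

In both layouts the total taxful price dominates the influence values, so both calls name the same feature.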
@@ -269,9 +308,9 @@ algorithms perform by using the evaluate {dfanalytics} API. See
 TIP: If you do not want to keep the {transform} and the {dfanalytics-job}, you
 can delete them in {kib} or use the
 {ref}/delete-data-frame-transform.html[delete {transform} API] and
-{ref}/delete-dfanalytics.html[delete {dfanalytics-job} API]. When
-you delete {transforms} and {dfanalytics-jobs}, the destination indices and
-{kib} index patterns remain.
+{ref}/delete-dfanalytics.html[delete {dfanalytics-job} API]. When you delete
+{transforms} and {dfanalytics-jobs} in {kib}, you have the option to also remove
+the destination indices and index patterns.
 
 If you want to see another example of {oldetection} in a Jupyter notebook,
 https://github.com/elastic/examples/tree/master/Machine%20Learning/Outlier%20Detection/Introduction[click here].