[DOCS] Adds data frame analytics API and evaluate API resource documentation #43972

Merged
merged 14 commits into from
Jul 11, 2019
Changes from 5 commits
94 changes: 94 additions & 0 deletions docs/reference/ml/apis/dfanalyticsresources.asciidoc
@@ -0,0 +1,94 @@
[role="xpack"]
[testenv="platinum"]
[[ml-dfanalytics-resources]]
=== {dfanalytics-cap} resources

A {dfanalytics} configuration object has the following properties:

`analysis`::
(object) The type of analysis that is performed on the `source`. For example:
`outlier_detection`. For more information, see <<dfanalytics-types>>.

`analyzed_fields`::
(object) You can specify `includes` patterns, `excludes` patterns, or both. If
`analyzed_fields` is not set, only the relevant fields are included. For
example, all the numeric fields for {oldetection}.

`dest`::
(object) The destination configuration of the analysis. For more information,
see <<dfanalytics-dest-resources>>.

`id`::
(string) The unique identifier for the {dfanalytics-job}. This identifier can
contain lowercase alphanumeric characters (a-z and 0-9), hyphens, and
underscores. It must start and end with alphanumeric characters. This property
is informational; you cannot change the identifier for existing jobs.

`model_memory_limit`::
(string) The approximate maximum amount of memory resources that are
required for analytical processing. The default value for {dfanalytics-jobs}
is `1gb`. If your `elasticsearch.yml` file contains an
`xpack.ml.max_model_memory_limit` setting, an error occurs when you try to
create {dfanalytics-jobs} that have `model_memory_limit` values greater than
that setting. For more information, see <<ml-settings>>.

`source`::
(object) The source configuration, consisting of `index` and optionally a
`query`. For more information, see <<dfanalytics-source-resources>>.
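
The properties above can be sketched as a plain Python dictionary. This is only an illustration: the job and index names are hypothetical, the nesting of `outlier_detection` under `analysis` and the list shape of `includes`/`excludes` are assumptions, and the regular expression is one possible reading of the `id` rules rather than the check that {es} itself applies.

```python
import re

# Assumed translation of the `id` rules: lowercase alphanumerics, hyphens and
# underscores, starting and ending with an alphanumeric character.
ID_PATTERN = re.compile(r"[a-z0-9](?:[a-z0-9_-]*[a-z0-9])?")


def is_valid_id(job_id: str) -> bool:
    """Return True if job_id satisfies the assumed identifier pattern."""
    return ID_PATTERN.fullmatch(job_id) is not None


# Hypothetical configuration object; every name below is made up.
config = {
    "id": "weblog-outliers",
    "source": {"index": ["weblogs"]},
    "dest": {"index": "weblogs-outliers"},
    "analysis": {"outlier_detection": {}},
    "analyzed_fields": {"includes": ["bytes", "response_time"], "excludes": []},
    "model_memory_limit": "1gb",
}
```

Checking an identifier on the client side, for example with `is_valid_id(config["id"])`, can catch an obviously malformed name early, but the server remains the authority on what is accepted.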

[float]
[[dfanalytics-types]]
==== Analysis types

[float]
[[oldetection-resources]]
===== {oldetection-cap} configuration objects

An {oldetection} configuration object has the following properties:

`n_neighbors` (Optional)::
(integer) Defines the value for how many nearest neighbors each method of
{oldetection} will use to calculate its {olscore}. When the value is
not set, the system will dynamically detect an appropriate value.

`method` (Optional)::
(string) Sets the method that {oldetection} uses. If the method is not set,
{oldetection} uses an ensemble of different methods and normalizes and
combines their individual {olscores} to obtain the overall {olscore}.
Available methods are `lof`, `ldof`, `distance_kth_nn`, and `distance_knn`.

`feature_influence_threshold` (Optional)::
(double) The minimum {olscore} that a document must have for its {fiscore} to
be calculated.
Value range: 0-1 (`0.1` by default).
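
As a rough sketch, the three optional properties can be assembled with a small helper that leaves out anything the caller does not set, so that the documented defaults apply. The helper name `outlier_detection` and the client-side checks are assumptions made for illustration; they are not part of any {es} client library.

```python
# Methods listed in the `method` property description.
ALLOWED_METHODS = {"lof", "ldof", "distance_kth_nn", "distance_knn"}


def outlier_detection(n_neighbors=None, method=None,
                      feature_influence_threshold=None):
    """Build a hypothetical `analysis` object for outlier detection.

    Unset options are omitted so that the server-side defaults are used.
    """
    params = {}
    if n_neighbors is not None:
        params["n_neighbors"] = int(n_neighbors)
    if method is not None:
        if method not in ALLOWED_METHODS:
            raise ValueError(f"unknown method: {method!r}")
        params["method"] = method
    if feature_influence_threshold is not None:
        # Documented value range is 0-1.
        if not 0.0 <= feature_influence_threshold <= 1.0:
            raise ValueError("feature_influence_threshold must be in [0, 1]")
        params["feature_influence_threshold"] = feature_influence_threshold
    return {"outlier_detection": params}
```

Calling `outlier_detection()` with no arguments produces an empty configuration, which corresponds to the ensemble behavior described for an unset `method`.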

[float]
[[dfanalytics-dest-resources]]
==== Dest configuration objects

The `dest` configuration object has the following properties:

`index` (Required)::
(string) The name of the index in which to store the results of the
{dfanalytics-job}.

`results_field` (Optional)::
(string) The name of the field in which to store the results of the analysis.
The default value is `ml`.

[float]
[[dfanalytics-source-resources]]
==== Source configuration objects

The `source` configuration object has the following properties:

`index` (Required)::
(array) An array of index names on which to perform the analysis. It can be a
single index or index pattern as well as an array of indices or patterns.

`query`::
(object) The {es} query domain-specific language (DSL). This value
corresponds to the query object in an {es} search POST body. All the
options that are supported by {es} can be used, as this object is
passed verbatim to {es}. By default, this property has the following
value: `{"match_all": {"boost": 1}}`.
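
The `dest` and `source` objects can be sketched in the same way. The helpers below are hypothetical and only make the documented defaults explicit on the client side: `results_field` falls back to `ml`, a single index name is wrapped in a list, and a missing `query` is replaced by the default `match_all` query.

```python
import copy

# Default query quoted in the `query` property description.
DEFAULT_QUERY = {"match_all": {"boost": 1}}


def make_dest(index, results_field="ml"):
    """Build a hypothetical `dest` object; `ml` is the documented default."""
    return {"index": index, "results_field": results_field}


def make_source(index, query=None):
    """Build a hypothetical `source` object.

    `index` may be a single name or pattern, or a sequence of them.
    """
    indices = [index] if isinstance(index, str) else list(index)
    if not indices:
        raise ValueError("at least one index is required")
    chosen_query = copy.deepcopy(DEFAULT_QUERY) if query is None else query
    return {"index": indices, "query": chosen_query}
```

Spelling out the defaults is optional; omitting `query` and `results_field` from the request should have the same effect according to the descriptions above.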
63 changes: 63 additions & 0 deletions docs/reference/ml/apis/evaluateresources.asciidoc
@@ -0,0 +1,63 @@
[role="xpack"]
[testenv="platinum"]
[[ml-evaluate-dfanalytics-resources]]
=== Evaluate {dfanalytics} resources

Contributor

This page is different than most of the other pages in the "Definitions" section, since it seems to be defining the input (request body) properties for the evaluate DF analytics API, rather than the output (response body) properties. In many other cases, the input and output is similar (i.e. input to create job matches output from get jobs so the "job resources" applies to both). That doesn't seem to be the case here, though.

I think we should either (a) extend the evaluation resources page to also describe the response objects, or (b) move the configuration objects into the API reference page and only cover the response objects in the resources page.

If I've explained this poorly or misunderstood the goal of this page, just let me know!

Contributor Author

The resources page contains both the request body and the response body parameters.

Some params overlap: for example, auc_roc, precision, recall, and confusion_matrix can be part of the request body as well as the response body, while tp, fp, tn, and fn can only be part of the response body. As far as I can see, all the response objects are covered here.

An evaluation configuration object has the following properties:

`evaluation`::
(object) Defines the type of evaluation you want to perform. The value of this
object can be different depending on the type of evaluation you want to
perform. For more information, see <<ml-evaluation-types>>.


[float]
[[ml-evaluation-types]]
==== Evaluation types


[float]
[[binary-sc-resources]]
===== Binary soft classification configuration object

Binary soft classification evaluates the results of an analysis which outputs
the probability that each {dataframe} row belongs to a certain class. For
example, in the context of outlier detection, the analysis outputs the
probability that each row is an outlier.

A binary soft classification object has the following properties:

`actual_field` (Required)::
Contributor

Typically this information about which fields are required or optional appears in the API reference page. It's unusual to see it here.

Contributor Author

I removed the notes in fc3ad12.

(string) The field of the `index` that contains the ground truth. The data
type of this field can be boolean or integer. If the data type is integer, the
value has to be either `0` (false) or `1` (true).

`predicted_probability_field` (Required)::
(string) The field of the `index` that defines the probability of whether the
item belongs to the class in question or not. It's the field that contains the
results of the analysis.

`metrics` (Optional)::
(object) Specifies the metrics that are used for the evaluation. Available
metrics:

`auc_roc` (Optional)::
(object) The AUC ROC (area under the curve of the receiver operating
characteristic) score and optionally the curve.
Default value is `{"includes_curve": false}`.

`precision` (Optional)::
(object) Sets the thresholds of the {olscore} at which the metric is
calculated.
Default value is `{"at": [0.25, 0.50, 0.75]}`.

`recall` (Optional)::
(object) Sets the thresholds of the {olscore} at which the metric is
calculated.
Default value is `{"at": [0.25, 0.50, 0.75]}`.

`confusion_matrix` (Optional)::
(object) Sets the thresholds of the {olscore} at which the metrics (TP - true
positive, FP - false positive, TN - true negative, FN - false negative) are
calculated.
Default value is `{"at": [0.25, 0.50, 0.75]}`.
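
To make the threshold-based metrics concrete, the sketch below computes a confusion matrix, precision, and recall for a small sample. It is a client-side illustration, not the {es} implementation: the sample data and field names are hypothetical, nesting the options under `binary_soft_classification` is an assumed key, treating a score equal to the threshold as positive is an assumption, and returning `0.0` for an empty denominator is a choice made here for simplicity.

```python
# Hypothetical evaluation object; field names are made up.
evaluation = {
    "binary_soft_classification": {
        "actual_field": "is_outlier",
        "predicted_probability_field": "ml.outlier_score",
        "metrics": {
            "auc_roc": {"includes_curve": False},
            "precision": {"at": [0.25, 0.5, 0.75]},
            "recall": {"at": [0.25, 0.5, 0.75]},
            "confusion_matrix": {"at": [0.25, 0.5, 0.75]},
        },
    }
}


def confusion_matrix(actual, predicted, threshold):
    """Count TP, FP, TN and FN for one threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, probability in zip(actual, predicted):
        flagged = probability >= threshold  # assumption: inclusive comparison
        if flagged:
            counts["tp" if truth else "fp"] += 1
        else:
            counts["fn" if truth else "tn"] += 1
    return counts


def precision(cm):
    """TP / (TP + FP); 0.0 when nothing was flagged (a choice made here)."""
    flagged = cm["tp"] + cm["fp"]
    return cm["tp"] / flagged if flagged else 0.0


def recall(cm):
    """TP / (TP + FN); 0.0 when there are no actual positives."""
    positives = cm["tp"] + cm["fn"]
    return cm["tp"] / positives if positives else 0.0


# Hypothetical ground truth (1 = outlier) and predicted probabilities.
actual = [1, 0, 1, 1, 0]
predicted = [0.9, 0.6, 0.4, 0.8, 0.1]
```

With this sample, `confusion_matrix(actual, predicted, 0.5)` counts two true positives, one false positive, one false negative, and one true negative, so precision and recall are both 2/3 at that threshold.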
4 changes: 4 additions & 0 deletions docs/reference/rest-api/defs.asciidoc
@@ -8,6 +8,8 @@ These resource definitions are used in APIs related to {ml-features} and
* <<ml-calendar-resource,Calendars>>
* <<ml-datafeed-resource,{dfeeds-cap}>>
* <<ml-datafeed-counts,{dfeed-cap} counts>>
* <<ml-dfanalytics-resources,{dfanalytics-cap}>>
* <<ml-evaluate-dfanalytics-resources,Evaluate {dfanalytics}>>
* <<ml-filter-resource,Filters>>
* <<ml-job-resource,Jobs>>
* <<ml-jobstats,Job statistics>>
@@ -19,6 +21,8 @@ These resource definitions are used in APIs related to {ml-features} and

include::{es-repo-dir}/ml/apis/calendarresource.asciidoc[]
include::{es-repo-dir}/ml/apis/datafeedresource.asciidoc[]
include::{es-repo-dir}/ml/apis/dfanalyticsresources.asciidoc[]
include::{es-repo-dir}/ml/apis/evaluateresources.asciidoc[]
include::{es-repo-dir}/ml/apis/filterresource.asciidoc[]
include::{es-repo-dir}/ml/apis/jobresource.asciidoc[]
include::{es-repo-dir}/ml/apis/jobcounts.asciidoc[]