Make it possible to configure missing values. #11042
While the API proposal here is different from the one proposed on #5324, I think it could address most use-cases and even be more generic. For instance, in some cases you might want to have a dedicated bucket for documents that miss a value and all that you would have to do would be to pass a value which doesn't exist in the index (eg. Also I like that we would have a consistent behaviour in all aggregations that support this parameter (ie. all aggregations that work on top of a field or script but |

==== Missing value

The `missing` parameter defines how documents that miss a value should be treated.
"that are missing a value"
Nice work!
Thanks @clintongormley for helping fix the docs, I pushed a new commit.
@@ -123,3 +123,26 @@ settings and filter the returned buckets based on a `min_doc_count` setting (by
bucket that matches documents and the last one are returned). This histogram also supports the `extended_bounds`
setting, which enables extending the bounds of the histogram beyond the data itself (to read more on why you'd want to
do that please refer to the explanation <<search-aggregations-bucket-histogram-aggregation-extended-bounds,here>>).

==== Missing value
would this section not fit better in the general aggregations section since it affects (almost) every aggregation and is the same syntax for them all?
I made it similar to other features like script support. While this duplicates the documentation effort, it also has the benefit of showing an example in context (also note that examples try to be meaningful to the aggregation whenever possible)
Ok, that makes sense
@jpountz left a couple of minor comments
LGTM
Most aggregations (terms, histogram, stats, percentiles, geohash-grid) now support a new `missing` option which defines the value to use when a field does not have a value. This can be handy if, for example, you want a terms aggregation to treat documents that have "N/A" for a `tag` field the same way as documents that have no value for it. This works in a very similar way to the `missing` option on the `sort` element. One known issue is that this option sometimes cannot make the right decision in the unmapped case: it needs to replace all values with the `missing` value but might not know what kind of values source should be produced (numerics, strings, geo points?). For this reason, we might want to add an `unmapped_type` option in the future like we did for sorting. Related to elastic#5324
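As a rough illustration of the description above, the following Python sketch builds a search request body in which a terms aggregation on the `tag` field uses `missing` so that documents without a value land in the "N/A" bucket. The helper name `terms_with_missing` and the aggregation name `by_tag` are made up for this example, and the body shape is an assumption based on the description, not a verified request.

```python
import json


def terms_with_missing(field: str, missing: str) -> dict:
    """Build a request body with a terms aggregation that sets `missing`.

    The aggregation name ("by_<field>") is hypothetical; only `field` and
    `missing` come from the option described in this pull request.
    """
    return {
        "size": 0,  # only aggregation results are of interest here
        "aggs": {
            "by_" + field: {
                "terms": {
                    "field": field,
                    # documents with no value for `field` are treated as
                    # if they had this value
                    "missing": missing,
                }
            }
        },
    }


body = terms_with_missing("tag", "N/A")
print(json.dumps(body, indent=2))
```

Passing a value that does not otherwise occur in the index would give documents without a value a bucket of their own.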
8698a26 to 32e23b9
Aggs: Make it possible to configure missing values.
Can't get this working for the life of me. Is this in 1.6? I can't find any documentation on this feature at https://www.elastic.co/
{
  "size": 0,
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "query": "_type:Subscription",
          "analyze_wildcard": true
        }
      }
    }
  },
  "aggs": {
    "2": {
      "date_histogram": {
        "field": "date",
        "interval": "1M",
        "pre_zone_adjust_large_interval": true,
        "min_doc_count": 1
      },
      "aggs": {
        "campaign_term": {
          "terms": {
            "field": "context.campaign.term",
            "size": 0,
            "missing": "hr-openers"
          }
        }
      }
    }
  }
}
@mrfelton this will be available from 2.0 onwards. The documentation for it is available on the master branch of the docs. There is a new section for each agg called 'Missing Values'. For example: https://www.elastic.co/guide/en/elasticsearch/reference/master/search-aggregations-metrics-avg-aggregation.html#_missing_value
In the meantime, before 2.0, and with apologies if this has already been covered: can you specify a script in the Kibana "JSON input" field that dynamically replaces a missing field value with zero? (And can someone point me to detailed documentation of what can be specified in that field? My Google-fu has failed there, too.)
Suppose I have an Elasticsearch document with no "grade" field; the "grade" field is missing. Suppose I have another document with a "grade" field explicitly specified as
Will the new-for-2.0 |
Missing and Regarding scripting, you can indeed do that in 1.x by running the aggregation on a script (likely with a bit of a performance/memory usage hit) that would check whether the list of values is empty.
Thanks, @jpountz . Re:
Could you please either spoonfeed me (cringe, sorry) the appropriate contents of the Kibana JSON Input field, or point me to detailed documentation for specifying the contents of that field? I can write "if x is null, then set x to 0" in a few programming languages, but I lack the experience and detailed documentation I need to do that in this context (such as the surrounding JSON, the specific syntax and variable names).
Not sure about the Kibana side, but here's an example (with groovy dynamic scripting) which will replace missing values with -1:
You could use the
Thanks again, @jpountz . I need the average ( For example, suppose I have the following five Elasticsearch documents, where T_n_ is a timestamp value, and grade is the name of a field on which I want to perform an average calculation:
Currently, when I use an average aggregation in a visualization, a bucket that includes T1 - T5 shows the average grade as 10: (10 + 10) / 2 = 10 (that is, it skips the documents with null or missing grade), whereas I want it to show 4 (to include the documents with null or missing grade, and treat grade as zero): (0 + 10 + 0 + 10 + 0) / 5 = 4. However, I have so far been unable to trap null field values via the Kibana JSON Input field. I suspect (I could be wrong) that what is happening is that Kibana (more specifically, Elasticsearch; but I'm doing all of this through the Kibana user interface) skips the documents with null or missing field values, and so those documents never "reach" the JSON Input field value. I can use the following JSON Input field value to override the values of fields that are present (say, replace 10 with 20):
but the following has no effect:
Similarly, neither does this, possibly unfaithfully transcribed from your suggestion (much appreciated, thank you):
I'd appreciate some more advice here. I'd like to have a workaround (before 2.0 arrives) for these skewed averages that doesn't involve re-loading the (currently, deliberately "sparse") data with explicit zero field values, even if that workaround involves a performance hit on large data sets (as I imagine this script-based approach would; so far, I've only tested it on very small indices).
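For reference, the two averages discussed in the comment above can be reproduced with a small Python sketch. It does not talk to Elasticsearch or Kibana. The helper `average` and the use of `None` for a null or absent `grade` are illustrative assumptions, and the order of the five documents is inferred from the arithmetic quoted in the comment.

```python
from typing import Optional, Sequence


def average(grades: Sequence[Optional[float]],
            missing: Optional[float] = None) -> float:
    """Average a list of grades where None marks a document without a grade.

    With missing=None such documents are skipped (the behaviour reported in
    the comment); otherwise they are counted with the value `missing`.
    """
    if missing is None:
        values = [g for g in grades if g is not None]
    else:
        values = [missing if g is None else g for g in grades]
    # Raises ZeroDivisionError if nothing is left to average; a real
    # aggregation would report an empty result instead.
    return sum(values) / len(values)


# T1..T5: assumed layout, three documents without a grade and two with grade 10
grades = [None, 10, None, 10, None]

skipped = average(grades)                 # (10 + 10) / 2
zero_filled = average(grades, missing=0)  # (0 + 10 + 0 + 10 + 0) / 5
print(skipped, zero_filled)
```

The second call corresponds to what is being asked for here: documents without a grade are counted as zero instead of being left out.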
Unfortunately, this can't be done today because Kibana requires you to configure a field and then merges the agg definition with the value in the JSON input, which makes Elasticsearch run the script on every value instead of every document.
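The distinction between running a script on every value and on every document can be illustrated with a toy Python model. This is only a sketch of the idea, not how Elasticsearch executes scripts. The functions `run_per_value` and `run_per_document` and the sample data are hypothetical.

```python
from typing import Callable, List

# Each inner list holds the values of a hypothetical `grade` field for one
# document; an empty list stands for a document where the field is absent.
docs: List[List[float]] = [[], [10], [], [10], []]


def run_per_value(docs: List[List[float]],
                  script: Callable[[float], float]) -> List[float]:
    """Call `script` once per existing value; empty documents yield nothing."""
    return [script(value) for values in docs for value in values]


def run_per_document(docs: List[List[float]],
                     script: Callable[[List[float]], float]) -> List[float]:
    """Call `script` once per document, passing all of its values."""
    return [script(values) for values in docs]


# A per-value script never sees the documents without a value, so a null
# check inside it has nothing to act on.
per_value = run_per_value(docs, lambda v: 0 if v is None else v)

# A per-document script can check for an empty list and substitute 0.
per_doc = run_per_document(docs, lambda values: values[0] if values else 0)

print(per_value, sum(per_value) / len(per_value))
print(per_doc, sum(per_doc) / len(per_doc))
```

In this model the per-value run produces only the two existing grades, while the per-document run produces one number per document, which is why a replacement for missing values only works in the second case.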
We came across this feature of configuring missing values while looking at the Terms Aggregation docs and were excited to use it with rollup search, but it doesn't seem like this feature is available yet for rollup search. @polyfractal we were wondering if you might know whether configuring missing values is available for rollup search, or if there is some other way to search for missing values?