null buckets missing from terms aggregation #6273

j0hnsmith · 2014-05-22T10:08:24Z

The terms aggregation creates a bucket for each value but doesn't include a null bucket. When there is a doc with a key set to null the cardinality is currently incorrect.

Users may or may not want to include a null bucket so it should be configurable (default should be false in line with current behaviour), I propose

{
    "aggs" : {
        "genders" : {
            "terms" : { 
                "field" : "gender",
                "null": true
            }
        }
    }
}

And the response

{
    ...

    "aggregations" : {
        "genders" : {
            "buckets" : [
                {
                    "key" : "male",
                    "doc_count" : 10
                },
                {
                    "key" : "female",
                    "doc_count" : 10
                },
                {
                    "key" : null,
                    "doc_count" : 1  // indicates doc without gender field or field set to null
                },
            ]
        }
    }
}

The text was updated successfully, but these errors were encountered:

clintongormley · 2014-05-22T12:26:03Z

You're looking for the missing aggregation.

Note that there is an open issue (#5324) to add support for the missing bucket to all aggs

j0hnsmith · 2014-05-22T13:45:43Z

I'm not looking for the missing aggregation (although that's what I'm using as a fallback), I want to be able to do a single aggregation that contains 'male', 'female' and null keys. Perhaps there should also be an additional _missing_key term that includes docs that don't have a gender key. As @roytmana says in #5324, it shouldn't be difficult to get the sum of the doc counts for each term to add up to the total number of docs being considered.

What you're suggesting (I think) is to do two aggregations in the same search (my fallback solution), one term aggregation and one missing aggregation. This is sub optimal and nothing more than a workaround, null is a valid term if the key exists so should be included in the terms aggregation.

The reason why this is so important is when applying a filter to an aggregation, it has to be applied twice, then all the sub aggregations have to be be done twice etc etc which ends up in a confusing mess. Then the results have to be juggled to get the dataset that contains the complete picture.

ElasticSearch is an excellent tool and aggregations are very powerful however in this case jumping through hoops shouldn't be necessary.

roytmana · 2014-05-22T13:53:11Z

and I would love to see not just missing bucket but also _other bucket if requested. It is very important for many use cases to retain entire data set as stats get rolled up. With facets I did it buy using stats facet and subtracting sum of all buckets including _missing. It was not bad since facets do not allow sub-aggs and I had a nice API for it which hid all the complexities. but with sub-aggs it will be very ugly

clintongormley · 2014-05-22T13:53:35Z

Actually, null is not a valid term. There is no way of storing the term null in Lucene. Instead, that field has no value. Similarly, if you store: ["foo","bar",null] then the field will have two values: foo and bar, not 3.

If you want to be able to distinguish a missing value from an explicit null, then you should use the null_value mapping on the field to substitute some real term when ES encounters a null.
See Dealing with null values for more.

But yes, for convenience's sake and efficiency, the missing count should probably be included in the agg itself, as requested in #5324, so I'll close this issue in favour of that one.

roytmana · 2014-05-22T14:02:32Z

null_value mapping won't help much as null can be introduced at object level (say you agg on person's country but his/her entire address is missing. I do not believe ES can handle that. I tried to use null_value but it only works when the actual scalar value is null not one of its owning objects

Please not to consider it to be a convenience. lack of _missing and _other buckets is a major limitation when implementing dynamic system with user ad-hoc defined analytic, pivot tables etc

j0hnsmith · 2014-05-22T14:34:14Z

@clintongormley "There is no way of storing the term null in Lucene."

Aha, now many things make sense to me, thanks for the explanation.

Given field values of ["foo","bar",null], even if Lucene only sees 2 values I'd like to see a configurable (but not default) way of ElasticSearch doing the work in the background to return the expected buckets then processing them accordingly. I understand that this may be far less performant that what currently exists however I believe it's a common enough case to be supported (and the performance difference is moot since if it's what a user wants they'll have to run the same queries anyway).

clintongormley · 2014-05-22T14:37:40Z

@j0hnsmith The only way to make ["foo","bar",null] store three values in the inverted index is to use null_value to map null to a concrete term, eg _null_value_.

Then it'll work just like all other terms and have its own bucket, and the missing agg will be just for those docs which have no value set for that field.

roytmana · 2014-05-22T14:46:30Z

but as I said not if field is missing because its entire containing object is missing!
that solution will not work.

clintongormley · 2014-05-22T14:47:59Z

@roytmana yes, but then you couldn't aggregate on the top level object anyway, as the "container" doesn't have an associated inverted index. But it's a reasonably easy workaround to have a "null" object be represented by an object with a concrete field which contains a null value.

roytmana · 2014-05-22T15:34:52Z

@clintongormley for a fairly deep graph (say three levels contract/party/address to compensate for party being null and still be able to aggregate on any party field including address fields (say party country) will require producing super ugly json for _source replacing blank party with party object with null substitution and every object type within party with its own null object with its own null values etc. And then people using _source will have to deal with it all in their application. I do it now in few cases but I would not want to do it all the time

jpountz · 2015-05-15T14:41:19Z

Closing: with #11042 you can now configure arbitrary keys for documents that are missing a value.

clintongormley closed this as completed May 22, 2014

clintongormley reopened this May 22, 2014

roytmana mentioned this issue Jun 13, 2014

Docs: Facets deprecated #6485

Closed

clintongormley added the discuss label Jul 10, 2014

clintongormley assigned jpountz Oct 24, 2014

clintongormley removed the discuss label Oct 24, 2014

tylerprete mentioned this issue Apr 17, 2015

With min_doc_count: 0, terms agg now creating null bucket (new in 1.5.1?) #10640

Closed

jpountz closed this as completed May 15, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

null buckets missing from terms aggregation #6273

null buckets missing from terms aggregation #6273

j0hnsmith commented May 22, 2014

clintongormley commented May 22, 2014

j0hnsmith commented May 22, 2014

roytmana commented May 22, 2014

clintongormley commented May 22, 2014

roytmana commented May 22, 2014

j0hnsmith commented May 22, 2014

clintongormley commented May 22, 2014

roytmana commented May 22, 2014

clintongormley commented May 22, 2014

roytmana commented May 22, 2014

jpountz commented May 15, 2015

null buckets missing from terms aggregation #6273

null buckets missing from terms aggregation #6273

Comments

j0hnsmith commented May 22, 2014

clintongormley commented May 22, 2014

j0hnsmith commented May 22, 2014

roytmana commented May 22, 2014

clintongormley commented May 22, 2014

roytmana commented May 22, 2014

j0hnsmith commented May 22, 2014

clintongormley commented May 22, 2014

roytmana commented May 22, 2014

clintongormley commented May 22, 2014

roytmana commented May 22, 2014

jpountz commented May 15, 2015