Skip to content

null buckets missing from terms aggregation #6273

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
j0hnsmith opened this issue May 22, 2014 · 11 comments
Closed

null buckets missing from terms aggregation #6273

j0hnsmith opened this issue May 22, 2014 · 11 comments
Assignees

Comments

@j0hnsmith
Copy link

The terms aggregation creates a bucket for each value but doesn't include a null bucket. When there is a doc with a key set to null the cardinality is currently incorrect.

Users may or may not want to include a null bucket so it should be configurable (default should be false in line with current behaviour), I propose

{
    "aggs" : {
        "genders" : {
            "terms" : { 
                "field" : "gender",
                "null": true
            }
        }
    }
}

And the response

{
    ...

    "aggregations" : {
        "genders" : {
            "buckets" : [
                {
                    "key" : "male",
                    "doc_count" : 10
                },
                {
                    "key" : "female",
                    "doc_count" : 10
                },
                {
                    "key" : null,
                    "doc_count" : 1  // indicates doc without gender field or field set to null
                },
            ]
        }
    }
}
@clintongormley
Copy link
Contributor

You're looking for the missing aggregation.

Note that there is an open issue (#5324) to add support for the missing bucket to all aggs

@j0hnsmith
Copy link
Author

I'm not looking for the missing aggregation (although that's what I'm using as a fallback), I want to be able to do a single aggregation that contains 'male', 'female' and null keys. Perhaps there should also be an additional _missing_key term that includes docs that don't have a gender key. As @roytmana says in #5324, it shouldn't be difficult to get the sum of the doc counts for each term to add up to the total number of docs being considered.

What you're suggesting (I think) is to do two aggregations in the same search (my fallback solution), one term aggregation and one missing aggregation. This is sub optimal and nothing more than a workaround, null is a valid term if the key exists so should be included in the terms aggregation.

The reason why this is so important is when applying a filter to an aggregation, it has to be applied twice, then all the sub aggregations have to be be done twice etc etc which ends up in a confusing mess. Then the results have to be juggled to get the dataset that contains the complete picture.

ElasticSearch is an excellent tool and aggregations are very powerful however in this case jumping through hoops shouldn't be necessary.

@roytmana
Copy link

and I would love to see not just missing bucket but also _other bucket if requested. It is very important for many use cases to retain entire data set as stats get rolled up. With facets I did it buy using stats facet and subtracting sum of all buckets including _missing. It was not bad since facets do not allow sub-aggs and I had a nice API for it which hid all the complexities. but with sub-aggs it will be very ugly

@clintongormley
Copy link
Contributor

Actually, null is not a valid term. There is no way of storing the term null in Lucene. Instead, that field has no value. Similarly, if you store: ["foo","bar",null] then the field will have two values: foo and bar, not 3.

If you want to be able to distinguish a missing value from an explicit null, then you should use the null_value mapping on the field to substitute some real term when ES encounters a null.
See Dealing with null values for more.

But yes, for convenience's sake and efficiency, the missing count should probably be included in the agg itself, as requested in #5324, so I'll close this issue in favour of that one.

@roytmana
Copy link

null_value mapping won't help much as null can be introduced at object level (say you agg on person's country but his/her entire address is missing. I do not believe ES can handle that. I tried to use null_value but it only works when the actual scalar value is null not one of its owning objects

Please not to consider it to be a convenience. lack of _missing and _other buckets is a major limitation when implementing dynamic system with user ad-hoc defined analytic, pivot tables etc

@j0hnsmith
Copy link
Author

@clintongormley "There is no way of storing the term null in Lucene."

Aha, now many things make sense to me, thanks for the explanation.

Given field values of ["foo","bar",null], even if Lucene only sees 2 values I'd like to see a configurable (but not default) way of ElasticSearch doing the work in the background to return the expected buckets then processing them accordingly. I understand that this may be far less performant that what currently exists however I believe it's a common enough case to be supported (and the performance difference is moot since if it's what a user wants they'll have to run the same queries anyway).

@clintongormley
Copy link
Contributor

@j0hnsmith The only way to make ["foo","bar",null] store three values in the inverted index is to use null_value to map null to a concrete term, eg _null_value_.

Then it'll work just like all other terms and have its own bucket, and the missing agg will be just for those docs which have no value set for that field.

@roytmana
Copy link

but as I said not if field is missing because its entire containing object is missing!
that solution will not work.

@clintongormley
Copy link
Contributor

@roytmana yes, but then you couldn't aggregate on the top level object anyway, as the "container" doesn't have an associated inverted index. But it's a reasonably easy workaround to have a "null" object be represented by an object with a concrete field which contains a null value.

@roytmana
Copy link

@clintongormley for a fairly deep graph (say three levels contract/party/address to compensate for party being null and still be able to aggregate on any party field including address fields (say party country) will require producing super ugly json for _source replacing blank party with party object with null substitution and every object type within party with its own null object with its own null values etc. And then people using _source will have to deal with it all in their application. I do it now in few cases but I would not want to do it all the time

@jpountz
Copy link
Contributor

jpountz commented May 15, 2015

Closing: with #11042 you can now configure arbitrary keys for documents that are missing a value.

@jpountz jpountz closed this as completed May 15, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants