Add an option to create "other" bucket for Terms aggregation #6804

Closed
kostiklv opened this issue Jul 9, 2014 · 5 comments
@kostiklv

kostiklv commented Jul 9, 2014

When using the "terms" aggregation, it's often useful to get the top X terms (achieved with the size parameter) and also a separate bucket that groups all other terms together (possibly constrained by a minimum doc count).

The query syntax might be:

{
    "aggs" : {
        "tags" : {
            "terms" : { 
                "field" : "tag",
                "size": 3,
                "min_doc_count": 10,
                "other": "_other_terms"
            }
        }
    }
}

And the response might look like:

{
    ...
    "aggregations" : {
        "tags" : {
            "buckets" : [
                {
                    "key" : "soccer",
                    "doc_count" : 500
                },
                {
                    "key" : "hockey",
                    "doc_count" : 400
                },
                {
                    "key" : "basketball",
                    "doc_count" : 300
                },
                {
                    "key" : "_other_terms",
                    "doc_count" : 150
                }
            ]
        }
    }
}

The _other_terms bucket would be based on all tags that satisfy min_doc_count (doc_count >= 10 per tag), excluding those already listed (the top 3).
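
To make the intended semantics concrete, here is a minimal Python sketch of the proposal applied to a plain map of per-term doc counts. It is not Elasticsearch code: the function name `terms_with_other` and the counts for "tennis", "golf", and "curling" are made up for illustration, and it assumes a single-valued field so that summing per-term counts equals the number of documents.

```python
from collections import Counter


def terms_with_other(doc_counts, size, min_doc_count=1, other_key="_other_terms"):
    """Return the top `size` buckets plus one bucket summing the remaining terms.

    Hypothetical helper illustrating the proposed "other" option; assumes a
    single-valued field, so per-term doc counts can simply be added up.
    """
    # Drop terms below min_doc_count (the threshold is inclusive).
    eligible = {term: count for term, count in doc_counts.items() if count >= min_doc_count}
    # Rank the remaining terms by doc count, highest first.
    ranked = Counter(eligible).most_common()
    top, rest = ranked[:size], ranked[size:]
    buckets = [{"key": term, "doc_count": count} for term, count in top]
    if rest:
        # Everything outside the top `size` terms collapses into one bucket.
        buckets.append({"key": other_key, "doc_count": sum(count for _, count in rest)})
    return buckets


# Illustrative counts only: the three top tags come from the example response,
# the rest are invented so that the "other" bucket adds up to 150.
counts = {"soccer": 500, "hockey": 400, "basketball": 300,
          "tennis": 90, "golf": 60, "curling": 5}
buckets = terms_with_other(counts, size=3, min_doc_count=10)
```

With these numbers, "curling" is dropped by min_doc_count and "tennis" plus "golf" form the _other_terms bucket.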

Related to #5324

@jpountz
Contributor

jpountz commented Jul 18, 2014

One question related to this change is whether the "other" bucket should track only doc counts (cheap, and could be done by default) or also sub aggregations (potentially costly, so it would require an option).

@clintongormley
Contributor

I'd say just the doc counts, at least by default.

@kostiklv
Author

The suggested syntax is already opt-in, so a developer who uses it should understand the cost. Based on that, I think the default should include all sub aggregations.
Consider the following query:

"aggs": {
    "top_selling": {
       "terms": {
          "field": "make",
          "size": 5,
          "other": "_other_terms"
       },
       "aggs": {
          "avg_price": {
             "avg": { "field": "price" }
          }
       }
    }
}

What's the point of using _other_terms if we don't get the average price for them? I also doubt whether an option to disable sub aggregations is needed at all. What's the use case where you want sub aggregations on specific terms but not on the other bucket?

Anyway, the syntax can be made future-proof, so instead of "other": "_other_terms" it can be:

...
"other": {
   "bucket_key": "_other_terms",
   "some_future_option": "option_value"
}
...

The parser could also check whether the value of the other option is a string and, if so, use it as other.bucket_key, as syntactic sugar.
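
The string-or-object shorthand could be handled by a small normalization step. The sketch below is only an illustration in Python, not how Elasticsearch parses requests; `normalize_other` is a hypothetical name, and treating a missing bucket_key as an error is an assumption.

```python
def normalize_other(option):
    """Normalize the proposed `other` option into its object form.

    Hypothetical helper: a bare string is treated as shorthand for
    {"bucket_key": <string>}; an object is passed through unchanged.
    """
    if option is None:
        # Option not set: no "other" bucket requested.
        return None
    if isinstance(option, str):
        # Syntactic sugar: "other": "_other_terms"
        return {"bucket_key": option}
    if isinstance(option, dict):
        if "bucket_key" not in option:
            # Assumption: the key name is mandatory in the object form.
            raise ValueError('"other" object must contain "bucket_key"')
        # Copy so the caller's request body is not mutated later.
        return dict(option)
    raise TypeError('"other" must be a string or an object')


short_form = normalize_other("_other_terms")
long_form = normalize_other({"bucket_key": "_other_terms", "some_future_option": "option_value"})
```

Both forms end up with the same bucket_key, so later options can be added to the object form without breaking requests that use the string form.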

@jpountz
Contributor

jpountz commented Jul 24, 2014

I have thought more about this issue, and computing the document count for the other bucket is not possible in the general case without doing another pass over the data (think about multi-valued fields, where one document can contribute to several terms).

The only thing it could do is return the number of other values (as opposed to documents). But we already have the value_count aggregation for that.
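
The difference between counting values and counting documents can be seen on a tiny in-memory example. This is a Python sketch with four invented documents, not Elasticsearch code; it pretends the terms aggregation was run with size 1, so "soccer" is the only returned term.

```python
from collections import Counter

# Four made-up documents with a multi-valued "tag" field.
docs = [
    {"tag": ["soccer", "tennis"]},
    {"tag": ["soccer", "golf"]},
    {"tag": ["tennis", "golf"]},
    {"tag": ["hockey"]},
]
top_terms = {"soccer"}  # pretend the terms aggregation returned only this term

# Per-term doc counts, as a terms aggregation would report them.
per_term = Counter(term for doc in docs for term in set(doc["tag"]))

# Summing the counts of the non-top terms counts values, not documents:
# a document holding two such terms is counted twice.
sum_of_other_term_counts = sum(c for t, c in per_term.items() if t not in top_terms)

# Documents that carry at least one non-top term (needs a pass over the docs).
docs_with_any_other_term = sum(1 for doc in docs if set(doc["tag"]) - top_terms)

# Documents that carry none of the top terms (also needs a pass over the docs).
docs_with_no_top_term = sum(1 for doc in docs if not set(doc["tag"]) & top_terms)
```

Here the sum of per-term counts is 5 even though only 4 documents exist, while the two document-based definitions of "other" give 4 and 2, so neither can be derived from the per-term counts alone.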

If a bucket or count for other docs is really needed, the right way to build it would be to run a first query with the terms aggregation, and a second query that would have a filter aggregation that would exclude the returned terms.
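
A minimal sketch of that second step, written in Python as a function that builds the follow-up request body from the first response. The helper name `build_other_request` and the aggregation name "other" are hypothetical, and the body assumes an Elasticsearch version in which the filter aggregation accepts a bool query with must_not, terms, and exists clauses. Sending the request is left out.

```python
def build_other_request(field, first_response, agg_name, sub_aggs=None):
    """Build a request whose filter aggregation excludes the returned terms.

    Hypothetical helper: `first_response` is the parsed JSON response of the
    first query, `agg_name` the name of its terms aggregation.
    """
    # Keys of the buckets returned by the first (terms) query.
    top_keys = [b["key"] for b in first_response["aggregations"][agg_name]["buckets"]]
    other = {
        "filter": {
            "bool": {
                # Only consider documents that have the field at all.
                "filter": {"exists": {"field": field}},
                # Exclude documents containing any of the returned terms.
                "must_not": {"terms": {field: top_keys}},
            }
        }
    }
    if sub_aggs:
        # Optional sub aggregations computed over the "other" documents.
        other["aggs"] = sub_aggs
    # size 0: only the aggregation result is needed, not the hits.
    return {"size": 0, "aggs": {"other": other}}


# Parsed form of the example response shown earlier in this thread.
first_response = {
    "aggregations": {
        "tags": {
            "buckets": [
                {"key": "soccer", "doc_count": 500},
                {"key": "hockey", "doc_count": 400},
                {"key": "basketball", "doc_count": 300},
            ]
        }
    }
}
second_request = build_other_request("tag", first_response, "tags")
```

Note that this counts documents containing none of the returned terms; on a multi-valued field, a document that has both a returned term and another term is excluded.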

@ebuildy
Contributor

ebuildy commented Jul 9, 2021

This would be really useful with sub-aggregations:

  "aggs" : {
    "country" : {
      "terms" : {
        "field" : "geoip.country_name.keyword"
      },
      "aggs" : {
        "response_time_avg" : {
          "avg" : {
            "field" : "message.upstream.response_time"
          }
        },
        "response_time_p95" : {
          "percentiles" : {
            "field" : "message.upstream.response_time",
            "percents": [ 95 ]
          }
        },
        "http_status" : {
          "terms" : {
            "field" : "message.request.status",
            "size" : 5
          }
        }
      }
    }
  }
