exclude aggregator #7020

miguelnmiranda · 2014-07-24T15:31:09Z

This feature allows excluding parts of a query when defining the result set used by an aggregator.

Based on: https://wiki.apache.org/solr/SimpleFacetParameters#Tagging_and_excluding_Filters

I work at a company that develops e-commerce solutions, and we use this Solr feature a lot to provide a better search experience while keeping the query size manageable. The one excuse people gave me for not using Elasticsearch was that this behaviour would be too verbose, since the whole query has to be written in each aggregator. I love Elasticsearch and want to use it in future projects.

I experimented with a more generic approach that allowed cutting a branch of the filter tree at any point, but there were some inconsistencies in behaviour when excluding in an 'or' filter vs. an 'and' filter, because a null filter is not treated the same way everywhere.

jpountz · 2014-07-24T15:44:31Z

@miguelnmiranda I don't think you need to have the query in each aggregator. Users usually solve this problem by using post_filter:

{
  "query": {
    "filtered": {
      "query": "your query goes here",
      "filter": "filters to take into account for top-hits and aggs"
    }
  },
  "post_filter" : "filters to take into account for top-hits only",
  "aggs": {
    "my_filter": {
      "filter": "filter to take into account for aggs only"
    }
  }
}

Would it work for you?

miguelnmiranda · 2014-07-24T15:58:32Z

@jpountz I believe that with post_filter you can only add more filters, not remove some of the query filters.

In the example I wrote for the documentation, you have a facet, let's say a colour facet, and although the result set is being filtered for a specific colour, I want to show all colours that I would display if no colour was selected. This is not the same as a match_all query, just the same query without the colour part.

jpountz · 2014-07-24T16:26:36Z

> In the example I wrote for the documentation, you have a facet, let's say a colour facet, and although the result set is being filtered for a specific colour, I want to show all colours that I would display if no colour was selected.

Wouldn't it work if the colour filter was a post_filter (but neither a query filter nor an aggregation filter)?

miguelnmiranda · 2014-07-24T16:45:34Z

> Wouldn't it work if the colour filter was a post_filter (but neither a query filter nor an aggregation filter)?

Yes, it would, but if you have two or more facets, let's say colour and gender, you cannot follow that approach.
In one case I remember, we had gender, category, size, colour, features, price range and rating facets (maybe more).

jpountz · 2014-07-24T17:23:39Z

I think I understand the issue now. Using post_filter would work, but would not be easy to use given that if you have N filters, you would need to put them all in the post_filter and have N filter aggregations that would be a combination of N-1 filters (so you would have a filter on colour that would filter on everything but colour, a filter on gender that would filter on everything but gender, etc.)
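A minimal sketch of the request shape described above, written as a client-side Python helper. The function name `build_multi_select_request`, the field names and the values are illustrative assumptions, not part of Elasticsearch; the sketch also assumes every selection is a single-value `term` filter.

```python
import json


def build_multi_select_request(selections):
    """Build a search body for multi-select faceting.

    `selections` maps a facet field to the value selected for it
    (hypothetical input format chosen for this sketch).
    """
    # All N filters go into post_filter, so they restrict the hits only.
    all_filters = [{"term": {f: v}} for f, v in selections.items()]

    aggs = {}
    for field in selections:
        # Each facet gets the other N-1 filters, never its own.
        others = [{"term": {f: v}} for f, v in selections.items() if f != field]
        facet_filter = {"bool": {"must": others}} if others else {"match_all": {}}
        aggs[field] = {
            "filter": facet_filter,
            "aggs": {field: {"terms": {"field": field}}},
        }

    return {"post_filter": {"bool": {"must": all_filters}}, "aggs": aggs}


request = build_multi_select_request(
    {"colour": "red", "gender": "male", "size": "xl"}
)
print(json.dumps(request, indent=2))
```

With three selections this produces three filter aggregations of two filters each, which is the verbosity under discussion.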

miguelnmiranda · 2014-07-24T17:26:46Z

Yes. Sorry for not putting it in those terms to start with.

jpountz · 2014-07-24T17:29:38Z

No worries, I was a bit slow to understand on my end as well. :-)

miguelnmiranda · 2014-07-24T17:31:37Z

I think my example for the documentation is incorrect. I used the cardinality aggregator when I just wanted the number of docs. I will fix that later if this goes forward.

There is also a live example, the website of a brand, but it seems to be down for maintenance at the moment. I will post it here when it comes back up.

jpountz · 2014-07-24T17:42:01Z

I don't much like having the ability to exclude filters from the query, since it breaks the expectation that aggregations apply to the documents that match the query (the only exception being the global aggregation).

However, I think we could have an aggregation that would accept a list of filters and would build a bucket for every combination of (N-1) filters. Maybe this functionality could even be folded into the filters aggregation (not sure, just wondering).

miguelnmiranda · 2014-07-24T18:35:34Z

> I don't much like having the ability to exclude filters from the query, since it breaks the expectation that aggregations apply to the documents that match the query (the only exception being the global aggregation).

Is it because of the semantics of exclude? It actually extends the global aggregator, and could be done the other way around, where you say which filters to include instead.

> However, I think we could have an aggregation that would accept a list of filters and would build a bucket for every combination of (N-1) filters. Maybe this functionality could even be folded into the filters aggregation (not sure, just wondering).

This would work, and is ideal for the case where you have selected at least a value for each available facet. But if you only filter based on M (<N) of the available facets, meaning you don't have a filter for all facets yet, it won't generate buckets for the remaining N-M facets.

roytmana · 2014-07-24T19:37:24Z

If I may add, I would love a feature to control sub-aggs at the bucket level. One of my cases is producing output for a multi-level agg where the lower-level aggs differ slightly from each other, not in the nature of the agg but in the following:

whether to do the sub-agg for a given bucket at all
the max sub-agg size for a given bucket
filters that differ per bucket for the sub-agg

Imagine a drill-down tree UI where the user can start from a top-level agg, drill down into individual buckets, and then change the query and see the changes to the expanded tree in one call to Elasticsearch. It is achievable now, but at the cost of multiple aggs.

Say we have an agg on country, state and city.

I drilled down into United States/Montana and Alaska.

In order to reload the data in one call,

I would have to run two aggs on the same level:

countries, and countries/states with a filter including Alaska and Montana, and then post-process the data to merge the second agg's results into the first.

It is doable, but if you add the need to handle missing and other buckets (not supported by ES) in a similar way, it becomes rather messy and hard to implement in a general way.

What I would like to see is the ability to specify extra options inside the states agg per possible bucket.

For example

Bucket-config: {
Montana:{size:20}
_others:{calculate:false}
}

In short, I would like to be able to exercise some control over sub-aggs at the parent agg's bucket level, rather than having all of them be exactly the same, for example drilling down into individual branches instead of all of them.

dmitry · 2014-07-24T22:23:43Z

If I understand correctly, the current solution is to include in every aggregation all the filters from the main filter, minus the one you don't want applied to that agg?

Something like described here: http://stackoverflow.com/questions/8908325/elasticsearch-excluding-filters-while-faceting-possible-like-in-solr (it's for the facets, but follows the same idea)

jpountz · 2014-07-24T22:27:11Z

Correct. @clintongormley explained it much better than me!

roytmana · 2014-07-24T22:33:34Z

@dmitry
That's for a single-level agg (facets). I want to be able to control what happens at the second level for individual buckets produced by the first level of the agg. In particular, I want to say which first-level buckets should have their sub-aggs calculated, rather than having them all calculated. Say I have an agg by country/state:
I want aggs over all countries, but I only want sub-aggs calculated for USA and UK, not for the others.

So in the countries/states agg definition I want to specify instructions for each country bucket, such as whether to calculate the matching country's sub-aggs, the max size of the sub-agg, etc.

miguelnmiranda · 2014-07-24T22:53:57Z

@dmitry Yes. The example describes exactly the intended behaviour.
With two facets it is still manageable; with N facets we get to the case described by @jpountz.
The behaviour can be seen here. Filtering for Male products does not remove the counts from the other gender options.

dmitry · 2014-07-25T02:52:01Z

@miguelnmiranda and currently it's not possible to get the same behavior without including all the filters in each agg, except the one on the field being aggregated?

In my case I have something like that:

{
    "body": {
        "post_filter": {
            "and": [{
                "terms": {
                    "type": ["apartment"]
                }
            }, {
                "terms": {
                    "location_ids": [386]
                }
            }]
        },
        "aggregations": {
            "types": {
                "filter": {
                    "and": [{
                        "terms": {
                            "location_ids": [386]
                        }
                    }]
                },
                "aggs": {
                    "types": {
                        "terms": {
                            "field": "type",
                            "size": 0
                        }
                    }
                }
            },
            "locations": {
                "filter": {
                    "and": [{
                        "terms": {
                            "type": ["apartment"]
                        }
                    }]
                },
                "aggs": {
                    "locations": {
                        "terms": {
                            "field": "location_ids",
                            "size": 0
                        }
                    }
                }
            }
        }
    },
    "index": "properties",
    "type": ["property"]
}

I thought there should be a better solution for such a common Elasticsearch use case, or am I wrong?

clintongormley · 2014-07-25T05:41:21Z

Hi @miguelnmiranda

Thanks for the PR, but I agree wholeheartedly with @jpountz. We used to have named "scopes" in facets, back in the day, but they were removed. This PR suffers from the same problem that they did.

The DSL allows for complicated nesting of clauses, while scopes refer to individual filters, regardless of their position in the query. You could apply a name to a filter which is a sub-clause of another filter, but in the aggregation, you'd get documents that you are not expecting because the filter is treated as though it were at the top. You include an option to exclude the query - same problem: which query are you referring to? there could be several.

I like @jpountz 's idea of extending the filters agg (see #6974) to allow application of all filters but one, in a loop. It seems like a good general solution to this issue.

> But if you only filter based on M (<N) of the available facets, meaning you don't have a filter for all facets yet, it won't generate buckets for the remaining N-M facets.

But you'd know up front which clauses don't have values, so you'd just specify these as normal aggs.

(@roytmana you're talking about something completely separate, please don't hijack this issue)

clintongormley · 2014-07-25T10:43:59Z

@miguelnmiranda yes, you have a point... The only thing I could come up with looks like this:

GET _search
{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            {
              "term": {
                "gender": "male"
              }
            },
            {
              "term": {
                "size": "xl"
              }
            },
            {
              "range": {
                "price": {
                  "lt": 20
                }
              }
            }
          ]
        }
      }
    }
  },
  "aggs": {
    "global": {
      "global": {},
      "aggs": {
        "colour": {
          "terms": {
            "field": "colour"
          }
        },
        "filtered": {
          "combinatorial_filters": [
            {
              "filter": {
                "term": {
                  "gender": "male"
                }
              },
              "aggs": {
                "gender": {
                  "terms": {
                    "field": "gender"
                  }
                }
              }
            },
            {
              "filter": {
                "term": {
                  "size": "xl"
                }
              },
              "aggs": {
                "size": {
                  "terms": {
                    "field": "size"
                  }
                }
              }
            },
            {
              "filter": {
                "range": {
                  "price": {
                    "lt": 20
                  }
                }
              },
              "aggs": {
                "price_range": {
                  "range": {
                    "field": "price",
                    "ranges": [
                      {
                        "from": 0,
                        "to": 20
                      },
                      {
                        "from": 20,
                        "to": 40
                      },
                      {
                        "from": 40,
                        "to": 60
                      }
                    ]
                  }
                }
              }
            }
          ]
        }
      }
    }
  }
}

In this example, colour is not being filtered, so it is run as just a normal agg (under the global scope). The filtered fields are passed in a combinatorial_filters agg (which doesn't exist yet). Each entry includes: (1) a filter and (2) any aggs.

Execution would iterate through the entries and apply all filters except for the current filter, where it would calculate the aggs instead.

This is completely different from any other aggs as they are today, so not sure how well this API would fit.
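To make the proposed execution concrete, here is a small in-memory Python simulation; it is not Elasticsearch code. `combinatorial_filters` does not exist, as noted above, and the function name `combinatorial_counts`, the predicate-based entries and the sample documents are all hypothetical.

```python
from collections import Counter


def combinatorial_counts(docs, entries):
    """Simulate the proposed semantics on a list of dicts.

    `entries` is a list of (field, predicate) pairs. For each entry,
    every predicate except the entry's own is applied, and the values
    of the entry's field are counted on the surviving documents.
    """
    results = {}
    for i, (field, _own_predicate) in enumerate(entries):
        other_predicates = [p for j, (_, p) in enumerate(entries) if j != i]
        matching = [d for d in docs if all(p(d) for p in other_predicates)]
        results[field] = Counter(d[field] for d in matching)
    return results


docs = [
    {"gender": "male", "size": "xl", "colour": "red"},
    {"gender": "male", "size": "m", "colour": "blue"},
    {"gender": "female", "size": "xl", "colour": "red"},
    {"gender": "female", "size": "m", "colour": "red"},
]

entries = [
    ("gender", lambda d: d["gender"] == "male"),
    ("size", lambda d: d["size"] == "xl"),
]

counts = combinatorial_counts(docs, entries)
print(dict(counts["gender"]))  # gender values among size:xl documents
print(dict(counts["size"]))    # size values among gender:male documents
```

The gender counts ignore the gender filter but respect the size filter, and vice versa, which is the multi-select behaviour being asked for.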

miguelnmiranda · 2014-07-25T11:13:34Z

I deleted my previous comment by mistake!

@clintongormley the behaviour seems odd.. and as you say does not fit well with the API.

> You could apply a name to a filter which is a sub-clause of another filter, but in the aggregation, you'd get documents that you are not expecting because the filter is treated as though it were at the top. You include an option to exclude the query - same problem: which query are you referring to? there could be several.

The "current" exclude filter only looks at the names inside the top level and only excludes those.

I implemented a different approach where you could "cut" the filter tree at any point.

"filter": {
  "and": {
    "_name": "root",
    "filters": [
      {
        "or": {
          "_name": "orBranch",
          "filters": [
            { "filter": { "_name": "f2" } },
            { "filter": { "_name": "f3" } }
          ]
        }
      },
      { "filter": { "_name": "f1" } }
    ]
  }
}

. and (root)
|_. or (orBranch)
|  |_. filter (f2)
|  |_. filter (f3)
|_.filter (f1)

But while writing it I found that the behaviour when a sub-filter is null is not consistent across filters.

markharwood · 2014-08-22T16:43:57Z

One way of getting counts for each dimension independent of that dimension's clauses would be to use a minimum_number_should_match value of 1 less than the number of clauses e.g.

curl -XGET "http://localhost:9200/pr7020/product/_search?pretty=1" -d'
{
  "query": {
    "bool": {
      "minimum_number_should_match": 2,
      "should": [
        {
          "term": {
            "gender": "M"
          }
        },
        {
          "term": {
            "size": "large"
          }
        },
        {
          "term": {
            "colour": "blue"
          }
        }
      ]
    }
  },
  "aggs": {
    "colors": {
      "terms": {
        "field": "colour"
      }
    },
    "genders": {
      "terms": {
        "field": "gender"
      }
    },
    "sizes": {
      "terms": {
        "field": "size"
      }
    }
  }
}'

Each terms agg in the above would then collect all terms where only 2 of the 3 clauses were present.
Obviously there may be some extra work in filtering displayed hits. The ranking algos would ideally show the hits with 3 out of 3 clause matches on top of the 2-out-of-3 ones anyway.

clintongormley · 2014-10-21T10:18:09Z

@markharwood I tried out your solution and it doesn't do quite what we're after. For instance, given the following query:

    "bool": {
      "must": [
          { "term": { "size":  "large" }},
          { "term": { "color": "red"   }},
          { "term": { "type":  "shirt" }}
        ]
     }

... we want to know what count we would get for:

count type where size:large AND color:red
count color where size:large AND type:shirt
count size where color:red AND type:shirt

While your approach actually gives us counts for type, color, and size where:

(size:large AND color:red) OR (size:large AND type:shirt) OR (color:red AND type:shirt)

I don't see any concise way of doing this out of the box.
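The difference between the two conditions can be checked on toy data. The following in-memory Python sketch is not Elasticsearch code; the sample documents and the helper names `facet_counts_exact` and `facet_counts_n_minus_1` are made up for illustration.

```python
from collections import Counter

selection = {"size": "large", "color": "red", "type": "shirt"}

docs = [
    {"size": "large", "color": "red", "type": "shirt"},   # matches 3 clauses
    {"size": "large", "color": "blue", "type": "shirt"},  # size + type
    {"size": "small", "color": "red", "type": "shirt"},   # color + type
    {"size": "large", "color": "red", "type": "pants"},   # size + color
    {"size": "small", "color": "blue", "type": "shirt"},  # type only
]


def facet_counts_exact(field):
    """Count `field` values where all OTHER selected clauses match."""
    others = {f: v for f, v in selection.items() if f != field}
    return Counter(
        d[field] for d in docs if all(d[f] == v for f, v in others.items())
    )


def facet_counts_n_minus_1(field):
    """Count `field` values where at least N-1 of the N clauses match."""
    def matched(d):
        return sum(d[f] == v for f, v in selection.items())

    return Counter(
        d[field] for d in docs if matched(d) >= len(selection) - 1
    )


print(dict(facet_counts_exact("color")))
print(dict(facet_counts_n_minus_1("color")))
```

On this data both approaches list the same colours, but the N-1 approach counts red three times instead of once, because documents that fail the size or type clause still match two of the three clauses.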

That said, this is a common and very specific use case. We could possibly provide a simple (but inflexible) aggregation that does exactly what is needed here. I say inflexible because we want to keep it simple - if you want flexibility you can go the verbose route instead.

What about something like this:

GET /_search
{
  "post_filter": {
    "bool": {
      "must": [
          { "term": { "size":  "large" }},
          { "term": { "color": "red"   }},
          { "term": { "type":  "shirt" }}
        ]
     }
  },
  "aggs": {
    "combos": {
      "filtered_terms": {
        "filters": [
          { "term": { "size": "large"}},
          { "term": { "color": "red" }},
          { "term": { "type": "shirt"}}
        ]
      }
    }
  }
}

Of course, this syntax doesn't reduce the amount of work that has to be performed. 3 terms means 6 filters, 4 terms means 12 filters, 5 terms means 20 filters...
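Those figures are consistent with N aggregations that each apply the other N-1 filters, i.e. N * (N-1) filter clauses in total. A tiny Python check, using a hypothetical helper name:

```python
def filter_clause_count(n):
    """Filter clauses evaluated when each of n facets applies the other n-1 filters."""
    return n * (n - 1)


for n in (3, 4, 5):
    print(n, "terms ->", filter_clause_count(n), "filters")
```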

markharwood · 2014-10-21T10:38:18Z

Not sure I follow.

> we want to know what count we would get for:
> count type where size:large AND color:red
> count color where size:large AND type:shirt
> count size where color:red AND type:shirt

The OP primarily asked for a list not a count: "I want to show all colours that I would display if no colour was selected".
So in my example I produce the following aggs :
a) list all of the available colours for large shirts.
b) list all of the available types of clothing that are large and red.
c) list all of the sizes of the red shirts.

Each of these lists include counts e.g. how many large shirts are available in blue but the count is perhaps not the primary concern - the typical shopper just wants to know the large shirt is also available in blue.

clintongormley · 2014-10-21T10:47:39Z

> The OP primarily asked for a list not a count:

Actually, in a later comment the OP says "...I just wanted the number of docs"

And this is the typical use case - how many red, green, blue products do I have which are type:shirt and size:large, ie what will I see if I remove this particular filter.

markharwood · 2014-10-21T11:10:22Z

Still confused then.

The full quote from the OP re numbers is

"I think my example for the documentation is incorrect. I used cardinality aggregator when I just wanted the number of docs. Will fix that later"

Cardinality aggs is about count distinct and I don't know what example he refers to.

I think the typical requirement is simple - if I have a dimension that supports multiple selections (red OR blue OR green checkboxes) then I don't want the options for blue or green to immediately disappear when I select red. However, if I make a selection in a different dimension that says I'm only interested in Large shirts then I don't want to see colours that are not available in large. That's the behaviour I assumed the user was after and which I think my example provides (along with the related counts).

sicarrots · 2015-01-29T23:44:47Z

What is the status of this pull request? Is it possible to have this feature in the near future?

clintongormley · 2016-03-08T15:18:49Z

It is clear that this PR isn't going to be merged as is. I haven't seen a good suggestion yet for how to implement this with term counts (although @markharwood's suggestion works without term counts). I'd welcome more suggestions in a new issue.

ssetem · 2017-05-09T12:15:03Z

We have this issue in https://github.com/searchkit/searchkit

Solr solves this with tags, which is a bit like this PR's solution:

http://yonik.com/multi-select-faceting/

We need to put the filters applied by aggregators in post_filter
and generate a filtered aggregator per aggregator that excludes just its own filter. This is very verbose. It works, but doing this at the Elasticsearch layer would probably cut the query by at least 50%.

markharwood · 2017-05-09T16:12:36Z

I think my suggestion should give you the terms and counts you need in aggregation results but the end tail of docs in the hits it produces may have false positives (docs that match n-1 dimensions when you want them to match n dimensions).
If the quality of results at the tail-end of the hits is a concern you can always tighten it up using a copy of the agg filters composed in a must expression. This would mean that as a max the query JSON would only need one repetition of the user selections rather than some combinatorial explosion of them based on the numbers of dimensions.
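A sketch of what such a request might look like, built with a hypothetical Python helper `build_n_minus_1_request`. Placing the `must` copy in `post_filter` is an assumption made here so that it narrows the hits without narrowing the aggregations; the comment above does not say where it should go. The sketch also assumes at least two single-value selections and uses the `minimum_should_match` parameter name.

```python
import json


def build_n_minus_1_request(selections):
    """Build a body where aggs see docs matching at least N-1 clauses,
    while hits are restricted to docs matching all N clauses.

    `selections` maps a field to its selected value (illustrative format).
    """
    clauses = [{"term": {f: v}} for f, v in selections.items()]
    return {
        "query": {
            "bool": {
                "should": clauses,
                "minimum_should_match": len(clauses) - 1,
            }
        },
        # Assumption: the single repetition of the selections lives here.
        "post_filter": {"bool": {"must": list(clauses)}},
        "aggs": {f: {"terms": {"field": f}} for f in selections},
    }


body = build_n_minus_1_request(
    {"gender": "M", "size": "large", "colour": "blue"}
)
print(json.dumps(body, indent=2))
```

The user selections appear twice in the body regardless of how many dimensions there are, rather than once per aggregation.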

ssetem · 2017-05-09T17:24:51Z

Thanks @markharwood, I will test out the n-1 approach on the root bool query.

exclude aggregator

66c5e1c

jpountz self-assigned this Jul 24, 2014

clintongormley mentioned this pull request Jul 25, 2014

Added Filters aggregation #6974

Merged

clintongormley added the discuss label Aug 7, 2014

clintongormley added >docs General docs changes and removed discuss labels Oct 10, 2014

clintongormley assigned clintongormley and unassigned jpountz Oct 10, 2014

clintongormley removed the >docs General docs changes label Oct 21, 2014

clintongormley added the discuss label Oct 21, 2014

clintongormley removed their assignment Oct 21, 2014

clintongormley added the :Analytics/Aggregations Aggregations label Nov 11, 2014

drewr force-pushed the master branch from dcc3da0 to 7c20a8a Compare February 20, 2015 16:48

clintongormley closed this Mar 8, 2016
