exclude aggregator #7020

Closed

miguelnmiranda

This feature allows excluding parts of a query when defining the result set used by an aggregator.

Based on: https://wiki.apache.org/solr/SimpleFacetParameters#Tagging_and_excluding_Filters

I work at a company that develops ecommerce solutions, and we use this Solr feature a lot to provide a better search experience while keeping the query size manageable. The one excuse people gave me for not using Elasticsearch was that this behaviour would be too verbose, since the whole query has to be written in each aggregator. I love Elasticsearch and want to use it in future projects.

I experimented with a more generic approach that allowed cutting a branch of the filter tree at any point, but there were some inconsistencies in behaviour when excluding inside an 'or' filter versus an 'and' filter, because a null filter is not treated the same way everywhere.

@jpountz
Contributor

jpountz commented Jul 24, 2014

@miguelnmiranda I don't think you need to have the query in each aggregator. Users usually solve this problem by using post_filter:

{
  "query": {
    "filtered": {
      "query": "your query goes here",
      "filter": "filters to take into account for top-hits and aggs"
    }
  },
  "post_filter" : "filters to take into account for top-hits only",
  "aggs": {
    "my_filter": {
      "filter": "filter to take into account for aggs only"
    }
  }
}

Would it work for you?

@jpountz jpountz self-assigned this Jul 24, 2014
@miguelnmiranda
Author

@jpountz I believe that with post_filter you can only add more filters, not remove some of the query filters.

In the example I wrote for the documentation there is a facet, let's say a colour facet, and although the result set is being filtered for a specific colour, I want to show all the colours that I would display if no colour was selected. That is not the same as a match_all query, just the same query without the colour part.

@jpountz
Contributor

jpountz commented Jul 24, 2014

In the example I wrote for the documentation there is a facet, let's say a colour facet, and although the result set is being filtered for a specific colour, I want to show all the colours that I would display if no colour was selected.

Wouldn't it work if the colour filter was a post_filter (but neither a query filter nor an aggregation filter)?

@miguelnmiranda
Author

Wouldn't it work if the colour filter was a post_filter (but neither a query filter nor an aggregation filter)?

Yes it would, but if you have two or more facets, let's say colour and gender, you cannot follow that approach.
In a case I remember we had gender, category, size, colour, features, price range and rating facets (maybe more).

@jpountz
Contributor

jpountz commented Jul 24, 2014

I think I understand the issue now. Using post_filter would work, but would not be easy to use given that if you have N filters, you would need to put them all in the post_filter and have N filter aggregations that would be a combination of N-1 filters (so you would have a filter on colour that would filter on everything but colour, a filter on gender that would filter on everything but gender, etc.)
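
To make the N-1 pattern concrete, here is a minimal Python sketch that builds such a request body. The helper name `build_faceted_request` and the example fields are hypothetical; the `post_filter`, `filter` aggregation and `bool`/`must` clauses are the ones used elsewhere in this thread. It assumes every facet has exactly one selected value.

```python
import json


def term(field, value):
    """A single term filter clause."""
    return {"term": {field: value}}


def build_faceted_request(selected):
    """Build a request body from {field: selected value}.

    All selections go into post_filter (hits only). Each facet gets a
    filter aggregation that applies every selection except its own.
    """
    body = {
        "post_filter": {
            "bool": {"must": [term(f, v) for f, v in selected.items()]}
        },
        "aggs": {},
    }
    for field in selected:
        others = [term(f, v) for f, v in selected.items() if f != field]
        body["aggs"][field] = {
            # With a single selection there is nothing left to filter on.
            "filter": {"bool": {"must": others}} if others else {"match_all": {}},
            "aggs": {field: {"terms": {"field": field}}},
        }
    return body


request = build_faceted_request({"colour": "red", "gender": "male", "size": "xl"})
print(json.dumps(request, indent=2))
```

The size of the generated body is what makes this approach awkward by hand: every selection is repeated once in `post_filter` and once in each of the other facets' aggregations.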

@miguelnmiranda
Author

Yes. Sorry for not putting it in those terms to start with.

@jpountz
Contributor

jpountz commented Jul 24, 2014

No worries, I was a bit slow to understand on my end as well. :-)

@miguelnmiranda
Author

I think my example for the documentation is incorrect. I used cardinality aggregator when I just wanted the number of docs. Will fix that later if this goes forward.

Also there is a live example.. the website of a brand, but it seems to be down for maintenance at the moment. I will post it as an example here when it goes back up.

@jpountz
Contributor

jpountz commented Jul 24, 2014

I don't much like having the ability to exclude filters from the query, since it breaks the expectation that aggregations apply to the documents that match the query (the only exception being the global aggregation).

However, I think we could have an aggregation that would accept a list of filters and would build a bucket for every combination of (N-1) filters. Maybe this functionality could even be folded into the filters aggregation (not sure, just wondering).

@miguelnmiranda
Author

I don't much like having the ability to exclude filters from the query, since it breaks the expectation that aggregations apply to the documents that match the query (the only exception being the global aggregation).

Is it because of the semantics of exclude? It actually extends the global aggregator, and could be done the other way around, where you say which filters to include instead.

However, I think we could have an aggregation that would accept a list of filters and would build a bucket for every combination of (N-1) filters. Maybe this functionality could even be folded into the filters aggregation (not sure, just wondering).

This would work, and is ideal for the case where you have selected at least a value for each available facet. But if you only filter based on M (<N) of the available facets, meaning you don't have a filter for all facets yet, it won't generate buckets for the remaining N-M facets.

@roytmana

If I may add, I would love a feature to control sub aggs at the bucket level. One of my cases is to produce output for a multilevel agg where the lower-level aggs differ slightly from each other, not in the nature of the agg but in the following:

  • whether to do a sub agg for a given bucket at all
  • max sub agg size for a given bucket
  • different filters for the sub agg per bucket

Imagine a drill-down tree UI where the user can start from the top-level agg, drill down into individual buckets, then change the query and see the changes to the expanded tree in one call to Elasticsearch. It is achievable now, but at the cost of multiple aggs.

Say we have an agg on country, state and city.

I drilled down into United States/Montana and Alaska.

In order to reload the data in one call, I would have to run two aggs on the same level:

countries, and countries/states with a filter including Alaska and Montana, and then post-process the data to merge the second agg's results into the first.

It is doable, but if you also need to handle missing and other buckets not supported by ES in a similar way, it becomes rather messy and hard to implement in a general way.

What I would like to see is the ability to specify extra options inside the states agg per possible bucket.

For example

Bucket-config: {
Montana:{size:20}
_others:{calculate:false}
}

In short, I would like to be able to exercise some control over sub aggs at the parent agg bucket level rather than having all of them be exactly the same, such as drilling down into individual branches instead of all of them.

@dmitry
Contributor

dmitry commented Jul 24, 2014

As I understand it, the current solution is to include in every aggregation the filters from the main filter, minus the one filter you don't want applied to that agg?

Something like described here: http://stackoverflow.com/questions/8908325/elasticsearch-excluding-filters-while-faceting-possible-like-in-solr (it's for the facets, but follows the same idea)

@jpountz
Contributor

jpountz commented Jul 24, 2014

Correct. @clintongormley explained it much better than me!

@roytmana

@dmitry
That's for a single-level agg (facets). I want to be able to control what happens in the second level for individual buckets produced by the first level of agg. In particular, I want to say which first-level buckets should have their sub aggs calculated rather than having them all calculated. Say I have an agg by country/state:
I want aggs over all countries, but I only want sub aggs calculated for USA and UK, not for the others.

So in the countries/states agg definition I want to specify instructions for each country bucket, such as whether to calculate the matching country's sub aggs, the max size of the sub agg, etc.

@miguelnmiranda
Author

@dmitry Yes. The example describes exactly the intended behaviour.
With two facets it is still manageable; with N facets we come to the case described by @jpountz.
The behaviour can be seen here. Filtering for Male products does not remove count from the other gender options.

@dmitry
Contributor

dmitry commented Jul 25, 2014

@miguelnmiranda and currently it's not possible to get the same behaviour without including all the filters in the aggs, except for the filter on the field being aggregated?

In my case I have something like that:

{
    "body": {
        "post_filter": {
            "and": [{
                "terms": {
                    "type": ["apartment"]
                }
            }, {
                "terms": {
                    "location_ids": [386]
                }
            }]
        },
        "aggregations": {
            "types": {
                "filter": {
                    "and": [{
                        "terms": {
                            "location_ids": [386]
                        }
                    }]
                },
                "aggs": {
                    "types": {
                        "terms": {
                            "field": "type",
                            "size": 0
                        }
                    }
                }
            },
            "locations": {
                "filter": {
                    "and": [{
                        "terms": {
                            "type": ["apartment"]
                        }
                    }]
                },
                "aggs": {
                    "locations": {
                        "terms": {
                            "field": "location_ids",
                            "size": 0
                        }
                    }
                }
            }
        }
    },
    "index": "properties",
    "type": ["property"]
}

I thought there should be a better solution for such a common Elasticsearch use case, or am I wrong?

@clintongormley
Contributor

Hi @miguelnmiranda

Thanks for the PR, but I agree wholeheartedly with @jpountz. We used to have named "scopes" in facets, back in the day, but they were removed. This PR suffers from the same problem that they did.

The DSL allows for complicated nesting of clauses, while scopes refer to individual filters, regardless of their position in the query. You could apply a name to a filter which is a sub-clause of another filter, but in the aggregation, you'd get documents that you are not expecting because the filter is treated as though it were at the top. You include an option to exclude the query - same problem: which query are you referring to? there could be several.

I like @jpountz 's idea of extending the filters agg (see #6974) to allow application of all filters but one, in a loop. It seems like a good general solution to this issue.

But if you only filter based on M (<N) of the available facets, meaning you don't have a filter for all facets yet, it won't generate buckets for the remaining N-M facets.

But you'd know up front which clauses don't have values, so you'd just specify these as normal aggs.

(@roytmana you're talking about something completely separate, please don't hijack this issue)

@clintongormley
Contributor

@miguelnmiranda yes, you have a point... The only thing I could come up with looks like this:

GET _search
{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            {
              "term": {
                "gender": "male"
              }
            },
            {
              "term": {
                "size": "xl"
              }
            },
            {
              "range": {
                "price": {
                  "lt": 20
                }
              }
            }
          ]
        }
      }
    }
  },
  "aggs": {
    "global": {
      "global": {},
      "aggs": {
        "colour": {
          "terms": {
            "field": "colour"
          }
        },
        "filtered": {
          "combinatorial_filters": [
            {
              "filter": {
                "term": {
                  "gender": "male"
                }
              },
              "aggs": {
                "gender": {
                  "terms": {
                    "field": "gender"
                  }
                }
              }
            },
            {
              "filter": {
                "term": {
                  "size": "xl"
                }
              },
              "aggs": {
                "size": {
                  "terms": {
                    "field": "size"
                  }
                }
              }
            },
            {
              "filter": {
                "range": {
                  "price": {
                    "lt": 20
                  }
                }
              },
              "aggs": {
                "price_range": {
                  "range": {
                    "field": "price",
                    "ranges": [
                      {
                        "from": 0,
                        "to": 20
                      },
                      {
                        "from": 20,
                        "to": 40
                      },
                      {
                        "from": 40,
                        "to": 60
                      }
                    ]
                  }
                }
              }
            }
          ]
        }
      }
    }
  }
}

In this example, colour is not being filtered, so it is run as just a normal agg (under the global scope). The filtered fields are passed in a combinatorial_filters agg (which doesn't exist yet). Each entry includes: (1) a filter and (2) any aggs.

Execution would iterate through the entries and apply all filters except for the current filter, where it would calculate the aggs instead.
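
Since `combinatorial_filters` does not exist, the execution model described above can only be illustrated outside Elasticsearch. The following is a small in-memory Python simulation under that assumption; the function `combinatorial_counts`, the `price_bucket` helper and the sample documents are all hypothetical.

```python
from collections import Counter


def combinatorial_counts(docs, entries):
    """Simulate the proposed semantics on a list of dicts.

    entries is a list of (name, predicate, bucket_key) tuples. For each
    entry, every *other* entry's predicate is applied, and the surviving
    docs are counted by that entry's bucket key.
    """
    results = {}
    for i, (name, _own_predicate, bucket_key) in enumerate(entries):
        others = [pred for j, (_n, pred, _k) in enumerate(entries) if j != i]
        matching = [d for d in docs if all(pred(d) for pred in others)]
        results[name] = Counter(bucket_key(d) for d in matching)
    return results


def price_bucket(doc):
    """Bucket prices into 20-wide ranges, e.g. 15 -> '0-20'."""
    low = (doc["price"] // 20) * 20
    return f"{low}-{low + 20}"


docs = [
    {"gender": "male", "size": "xl", "price": 15},
    {"gender": "male", "size": "m", "price": 10},
    {"gender": "female", "size": "xl", "price": 18},
    {"gender": "male", "size": "xl", "price": 35},
]
entries = [
    ("gender", lambda d: d["gender"] == "male", lambda d: d["gender"]),
    ("size", lambda d: d["size"] == "xl", lambda d: d["size"]),
    ("price_range", lambda d: d["price"] < 20, price_bucket),
]
counts = combinatorial_counts(docs, entries)
print(counts)
```

Each facet's counts ignore that facet's own selection, so the female XL item under 20 still shows up in the gender facet even though `gender: male` is selected.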

This is completely different from any other aggs as they are today, so not sure how well this API would fit.

@miguelnmiranda
Author

I deleted my previous comment by mistake!

@clintongormley the behaviour seems odd.. and as you say does not fit well with the API.

You could apply a name to a filter which is a sub-clause of another filter, but in the aggregation, you'd get documents that you are not expecting because the filter is treated as though it were at the top. You include an option to exclude the query - same problem: which query are you referring to? there could be several.

The "current" exclude filter only looks at the names inside the top level and only excludes those.

I implemented a different approach where you could "cut" the filter tree at any point.

"filter": {
  "and": {
    "_name": "root",
    "filters": [
      {
        "or": {
          "_name": "orBranch",
          "filters": [
            { "filter": { "_name": "f2" } },
            { "filter": { "_name": "f3" } }
          ]
        }
      },
      { "filter": { "_name": "f1" } }
    ]
  }
}

. and (root)
|_. or (orBranch)
|  |_. filter (f2)
|  |_. filter (f3)
|_. filter (f1)

But while writing it I found that the behaviour when a sub filter is null is not consistent across filters.

@markharwood
Contributor

One way of getting counts for each dimension independent of that dimension's clauses would be to use a minimum_number_should_match value of 1 less than the number of clauses e.g.

curl -XGET "http://localhost:9200/pr7020/product/_search?pretty=1" -d'
{
  "query": {
    "bool": {
      "minimum_number_should_match": 2,
      "should": [
        { "term": { "gender": "M" }},
        { "term": { "size": "large" }},
        { "term": { "colour": "blue" }}
      ]
    }
  },
  "aggs": {
    "colors": { "terms": { "field": "colour" }},
    "genders": { "terms": { "field": "gender" }},
    "sizes": { "terms": { "field": "size" }}
  }
}'

Each terms agg in the above would then collect all terms where only 2 of the 3 clauses were present.
Obviously there may be some extra work in filtering displayed hits. The ranking algos would ideally show the hits with 3 out of 3 clause matches on top of the 2-out-of-3 ones anyway.

@clintongormley clintongormley added >docs General docs changes and removed discuss labels Oct 10, 2014
@clintongormley
Contributor

@markharwood I tried out your solution and it doesn't do quite what we're after. For instance, given the following query:

    "bool": {
      "must": [
          { "term": { "size":  "large" }},
          { "term": { "color": "red"   }},
          { "term": { "type":  "shirt" }}
        ]
     }

... we want to know what count we would get for:

  • count type where size:large AND color:red
  • count color where size:large AND type:shirt
  • count size where color:red AND type:shirt

While your approach actually gives us counts type, color, and size where:

(size:large AND color:red) OR (size:large AND type:shirt) OR (color:red AND type:shirt)
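
The difference between the two result sets can be checked on a handful of documents. This is a plain Python simulation rather than an Elasticsearch query; the sample documents and function names are made up for illustration.

```python
from collections import Counter

selections = {"size": "large", "color": "red", "type": "shirt"}

docs = [
    {"size": "large", "color": "red", "type": "shirt"},
    {"size": "large", "color": "blue", "type": "shirt"},
    {"size": "large", "color": "red", "type": "hat"},
    {"size": "small", "color": "red", "type": "shirt"},
    {"size": "small", "color": "blue", "type": "shirt"},
]


def matches(doc, field):
    return doc[field] == selections[field]


def exclude_own_filter(field):
    """Counts for one facet with every selection applied except its own."""
    others = [f for f in selections if f != field]
    return Counter(d[field] for d in docs if all(matches(d, f) for f in others))


def n_minus_one_should(field):
    """Counts for one facet over docs matching at least N-1 of the N clauses."""
    needed = len(selections) - 1
    return Counter(
        d[field] for d in docs if sum(matches(d, f) for f in selections) >= needed
    )


print("wanted:", dict(exclude_own_filter("color")))
print("n-1 should:", dict(n_minus_one_should("color")))
```

On this data the set of colours listed is the same in both cases, but the count for red is inflated under the N-1 approach because the large red hat and the small red shirt each match two of the three clauses.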

I don't see any concise way of doing this out of the box.

That said, this is a common and very specific use case. We could possibly provide a simple (but inflexible) aggregation that does exactly what is needed here. I say inflexible because we want to keep it simple - if you want flexibility you can go the verbose route instead.

What about something like this:

GET /_search
{
  "post_filter": {
    "bool": {
      "must": [
          { "term": { "size":  "large" }},
          { "term": { "color": "red"   }},
          { "term": { "type":  "shirt" }}
        ]
     }
  },
  "aggs": {
    "combos": {
      "filtered_terms": {
        "filters": [
          { "term": { "size": "large"}},
          { "term": { "color": "red" }},
          { "term": { "type": "shirt"}}
        ]
      }
    }
  }
}

Of course, this syntax doesn't reduce the amount of work that has to be performed. 3 terms means 6 filters, 4 terms means 12 filters, 5 terms means 20 filters...
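
The numbers quoted above follow N × (N − 1): each of the N facets is aggregated under the other N − 1 filters. A short Python check, with a hypothetical helper name, counts those (facet, other filter) pairs directly:

```python
from itertools import permutations


def filter_applications(num_selected):
    """Count ordered (facet, other selection) pairs.

    Each pair is one filter applied inside one facet's aggregation.
    """
    return sum(1 for _ in permutations(range(num_selected), 2))


for n in (3, 4, 5):
    print(n, "terms ->", filter_applications(n), "filters")
```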

@clintongormley clintongormley removed the >docs General docs changes label Oct 21, 2014
@clintongormley clintongormley removed their assignment Oct 21, 2014
@markharwood
Contributor

Not sure I follow.

we want to know what count we would get for:
count type where size:large AND color:red
count color where size:large AND type:shirt
count size where color:red AND type:shirt

The OP primarily asked for a list not a count: "I want to show all colours that I would display if no colour was selected".
So in my example I produce the following aggs :
a) list all of the available colours for large shirts.
b) list all of the available types of clothing that are large and red.
c) list all of the sizes of the red shirts.

Each of these lists include counts e.g. how many large shirts are available in blue but the count is perhaps not the primary concern - the typical shopper just wants to know the large shirt is also available in blue.

@clintongormley
Copy link
Contributor

The OP primarily asked for a list not a count:

Actually, in a later comment the OP says "...I just wanted the number of docs".

And this is the typical use case - how many red, green, blue products do I have which are type:shirt and size:large, ie what will I see if I remove this particular filter.

@markharwood
Contributor

Still confused then.

The full quote from the OP re numbers is

"I think my example for the documentation is incorrect. I used cardinality aggregator when I just wanted the number of docs. Will fix that later"

Cardinality aggs are about count distinct, and I don't know which example he refers to.

I think the typical requirement is simple - if I have a dimension that supports multiple selections (red OR blue OR green checkboxes) then I don't want the options for blue or green to immediately disappear when I select red. However, if I make a selection in a different dimension that says I'm only interested in Large shirts then I don't want to see colours that are not available in large. That's the behaviour I assumed the user was after and which I think my example provides (along with the related counts).

@sicarrots

What is the status of this pull request? Is it possible to have this feature in the near future?

@clintongormley
Contributor

It is clear that this PR isn't going to be merged as is. I haven't seen a good suggestion yet for how to implement this with term counts (although @markharwood's suggestion works without term counts). I'd welcome more suggestions in a new issue.

@ssetem

ssetem commented May 9, 2017

We have this issue in https://github.com/searchkit/searchkit

Solr solves this with tags, which is a bit like this PR's solution:

http://yonik.com/multi-select-faceting/

We need to put the filters applied by aggregators in post_filter and generate a filtered aggregator per aggregator that excludes just its own filter. This works but is very verbose; it would probably cut the query by at least 50% if this was done at the Elasticsearch layer.

@markharwood
Contributor

I think my suggestion should give you the terms and counts you need in the aggregation results, but the tail end of the hits it produces may have false positives (docs that match n-1 dimensions when you want them to match n dimensions).
If the quality of results at the tail end of the hits is a concern, you can always tighten it up using a copy of the agg filters composed in a must expression. This would mean that, at most, the query JSON would only need one repetition of the user selections rather than some combinatorial explosion of them based on the number of dimensions.
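
One possible reading of this suggestion, sketched in Python: the `should` clauses with an N-1 threshold go in the query, and a single `must` copy of the same clauses goes in `post_filter` so that it trims the hits without affecting the aggregations. Placing the `must` copy in `post_filter` is an assumption on my part, the helper name `build_n_minus_one_request` is hypothetical, and the parameter spelling `minimum_number_should_match` is taken from the earlier example in this thread.

```python
import json


def build_n_minus_one_request(selected):
    """Build a request from {field: selected value}; needs >= 2 selections."""
    if len(selected) < 2:
        raise ValueError("the N-1 approach needs at least two selections")
    clauses = [{"term": {f: v}} for f, v in selected.items()]
    return {
        "query": {
            "bool": {
                "should": clauses,
                # Match any doc that satisfies all selections but one.
                "minimum_number_should_match": len(clauses) - 1,
            }
        },
        # Assumed placement: trims the hits to full matches, aggs unaffected.
        "post_filter": {"bool": {"must": list(clauses)}},
        "aggs": {f: {"terms": {"field": f}} for f in selected},
    }


request = build_n_minus_one_request(
    {"gender": "M", "size": "large", "colour": "blue"}
)
print(json.dumps(request, indent=2))
```

The user selections appear twice in the body regardless of how many facets there are, which is the "one repetition" trade-off described above.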

@ssetem

ssetem commented May 9, 2017

Thanks @markharwood, I will test out the n-1 on the root bool query.
