date_histogram of date_range with null end points triggers a circuit breaker #50109
Comments
Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)
/cc @not-napoleon mind taking a look at this when you get a chance?
Yeah, I definitely see why that would happen. It's a little unclear to me what the correct behavior would be, though. In this case, it seems like we want to default the open end date to some sensible value. `missing` doesn't currently let you specify a range, but this seems almost like `missing` behavior. Just thinking out loud: if we let users specify a range value for `missing`, and then partially applied it if the start or end point was null (and used the whole range if the value was actually missing), that might give users some ability to control this. We'd still need to deal with invalid ranges in the aggregator, though. Unfortunately, getting `missing` to support complex types would be a pretty major change. It's something I'd like to do, but I don't have a plan for it right now.

I'm hesitant to add an explicit option (two options really, for start and end) for this case. That creates a situation where we have options on the aggregation that only apply sometimes, or worse, we end up needing a new aggregation entirely. One of our goals with adding range support was to minimize API footprint as much as possible, so in that spirit, I'd like to treat adding a new option as a last resort.

TL;DR: if there's a sensible default value, this is a relatively straightforward fix, but I'm not convinced there's a sensible default value. At an absolute minimum, we should at least skip these docs instead of trying to create infinite buckets.
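To make the idea being floated here concrete: this is a purely hypothetical syntax, invented for discussion, since `missing` only accepts scalar values today. A request might look like:

```json
{
  "size": 0,
  "aggs": {
    "timeseries": {
      "date_histogram": {
        "field": "active_range",
        "calendar_interval": "month",
        "missing": {
          "gte": "2017-10-01",
          "lte": "2019-12-01"
        }
      }
    }
  }
}
```

Under that proposal, the `lte` value would be substituted for a null end point, the `gte` value for a null start point, and the whole object used when the field is absent entirely.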
@jcrapuchettes I ran your test case locally, and it hit a circuit breaker.
@not-napoleon I do get a circuit breaker error as well. Having a default value for the missing end point would work well for our use case; is there any way to achieve that with scripting in the meantime?
Scripting would be a good solution, but we don't currently have support for getting range values from scripts in aggregations. I've opened an issue for this (#50190), if you want to follow the discussion there. Unfortunately, I don't think there's a scripting-based workaround in 7.4.
I've been experiencing a very similar issue. I have ~180M documents spanning the past 4 years or so. Each document has a `date_range` field.

When querying by month it isn't so bad; I just get more data back than I need (all months between the minimum month in the matching documents and the maximum month). A real problem comes up, however, when I aggregate on days. One query where I'm trying to see data for, say, 30 days will actually bucket all days in the range, which could be upward of 1500 buckets, one for each day in the matching documents' ranges (not my query range). Most of the time this results in a timeout in Elasticsearch and I never actually get data back. This issue is preventing me from adopting the `date_range` field type.

Example

An example query I would be running (using the same field name as the OP) would be something like this:

```json
{
  "track_total_hits": true,
  "query": {
    "range": {
      "active_range": {
        "gte": "2016-11-01",
        "lte": "2016-11-30"
      }
    }
  },
  "size": 0,
  "aggs": {
    "timeseries": {
      "date_histogram": {
        "field": "active_range",
        "calendar_interval": "day",
        "min_doc_count": 1,
        "extended_bounds": {
          "min": "2016-11-01",
          "max": "2016-11-30"
        }
      }
    }
  }
}
```

Since documents back in 2016 could be "active" today, the ES query generates all buckets from whatever the first day in the document range is all the way to today. It feels like the aggregation should be limited to the window I queried. A script might be a good solution for this, but a dedicated option, similar to `extended_bounds` but restrictive rather than expansive, feels like a better fit.
I discussed this with @polyfractal today, and we're not opposed to adding a flag that makes the extended bounds act as hard limits (exact option name still TBD), which would return exactly the buckets between the given min and max. This would still include the current behavior of adding empty buckets to fill out the range, but also serve to clip the results that fall outside of that range. This is not entirely useless for non-range cases either, as it will be a cleaner syntax than the current recommendation of adding a filter query. Would something along those lines solve your difficulty?
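The option later referenced in this thread as `hard_bounds` would slot in next to `extended_bounds`; as a sketch of what such a request could look like (syntax inferred from the discussion, not from authoritative docs, and the field name reuses the OP's example):

```json
{
  "size": 0,
  "aggs": {
    "timeseries": {
      "date_histogram": {
        "field": "active_range",
        "calendar_interval": "day",
        "hard_bounds": {
          "min": "2016-11-01",
          "max": "2016-11-30"
        }
      }
    }
  }
}
```

The intent is that only buckets between `min` and `max` are ever returned, regardless of how far the underlying ranges extend.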
@not-napoleon that solution would fit my use case perfectly. Thanks for taking the time to discuss this; it'll simplify what I'm currently doing significantly.
@not-napoleon would it still count values with an open end in all buckets valid for the given start date, or filter them out?
@krlm: @not-napoleon can correct me if I get this wrong, but I believe any unbounded interval on the "edge" will still be included in the final bucket on that side. It will just stop the aggregation from continuing to "fill out" the histogram with buckets that match the unbounded, infinite interval.
Yeah, that's more or less what I was thinking. Currently it just adds buckets until it adds a bucket that contains the end point. If a hard-limit extended bounds were set, it would stop at that limit instead. Basically, you'd set the hard limit and get exactly those buckets, even if your ranges fell outside of them. You'd still have the existing extended_bounds behavior where we add in empty buckets as needed to fit that range, too.
@polyfractal @not-napoleon, thanks for the info. I forgot that by default there are no upper or lower boundaries for the date histogram aggregation (I had defined them in the filter part of my query). This should do the job then.
I've had the same issue and would love to see support for date histograms on open-ended date range fields. The way I'm dealing with this right now is by using scripting (and #50190 will definitely help here, too). It's clunky and it's not flexible, but it works for what I need to do. In my documents, I have a date range whose end point may be null, and the script substitutes the current time for the missing end before emitting the dates to bucket.

As I said, it's clunky, but it does return all the date buckets I'm interested in (either from start to end, or from start to now if end is null). Of course, depending on the interval type, the code needs to be adapted. Just thought I'd share in case it helps anyone until a real fix is available.
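One way such a script-based workaround can be structured is sketched below. This is an illustration, not the commenter's actual script: it assumes the range is also indexed as two plain `date` fields with hypothetical names `active_range_start` and `active_range_end`, since aggregation scripts could not read `date_range` values at the time (#50190), and it emits one date per month from the start to the end (or to now, if the end is missing):

```json
{
  "size": 0,
  "aggs": {
    "timeseries": {
      "date_histogram": {
        "calendar_interval": "month",
        "script": {
          "lang": "painless",
          "source": "ZonedDateTime start = doc['active_range_start'].value; ZonedDateTime end = doc['active_range_end'].size() == 0 ? ZonedDateTime.now(ZoneOffset.UTC) : doc['active_range_end'].value; List dates = new ArrayList(); ZonedDateTime cur = start; while (!cur.isAfter(end)) { dates.add(cur); cur = cur.plusMonths(1); } return dates;"
        }
      }
    }
  }
}
```

As the comment above notes, the stepping logic (`plusMonths`) has to change if you bucket by a different calendar interval.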
Here is an interesting question about hard bounds. With extended bounds, the max bound is inclusive. So, if we have an hourly histogram and max is 10 hours, the 10:00:00-10:59:59 bucket will be included, which makes sense since we are extending the bounds. The hard bounds, on the other hand, are limiting. So, if the max hard bound is 10:00:00, the bucket 10:00:00-10:59:59 is mostly outside of the bounds.
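To make that edge case concrete (illustrative timestamps; which choice the implementation ultimately made is not settled in this comment):

```
extended_bounds: { "min": "T00:00:00", "max": "T10:00:00" }
  -> hourly buckets T00:00:00 through T10:00:00 are all emitted;
     the final bucket covers 10:00:00-10:59:59, extending past max.

hard_bounds:     { "min": "T00:00:00", "max": "T10:00:00" }
  -> open question: should the 10:00:00-10:59:59 bucket be emitted
     at all, given that most of it lies outside the bound?
```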
Adds a hard_bounds parameter to explicitly limit the buckets that a histogram can generate. This is especially useful in case of open ended ranges that can produce a very large number of buckets. Closes elastic#50109
Elasticsearch version: 7.4.2
Plugins installed: [repository-s3, discovery-ec2]
JVM version: Bundled version
OS version: 4.14.138-114.102.amzn2.x86_64 (Amazon Linux 2)
Description of the problem including expected versus actual behavior:
Running a date_histogram on a date_range field with document values that have a null "lte" appears to cause the aggregation to create an effectively infinite number of buckets, and I get an out-of-memory error. The aggregation works well for fully defined date ranges (the first document in my example), but I have a large number of documents that have an undefined end point (the value is null, as in the second document in my example). I have been looking for a way to limit the buckets created in the query, but without luck. I've also tried to use a date_range aggregation on the field to attempt to limit the potential buckets, but that caused cast exceptions.
Basically I'm attempting to find all of the months up to now that these documents cover/touch. I'd like to see buckets from 2017-10-01 to 2019-12-01 (as of the time of this issue being written).
Steps to reproduce:
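A minimal setup consistent with the description above (index name, field names, and dates are illustrative, not the OP's exact data):

```json
PUT my-index
{
  "mappings": {
    "properties": {
      "active_range": { "type": "date_range", "format": "yyyy-MM-dd" }
    }
  }
}

PUT my-index/_doc/1
{ "active_range": { "gte": "2017-10-01", "lte": "2018-03-15" } }

PUT my-index/_doc/2
{ "active_range": { "gte": "2017-10-01", "lte": null } }

POST my-index/_search
{
  "size": 0,
  "aggs": {
    "months": {
      "date_histogram": {
        "field": "active_range",
        "calendar_interval": "month"
      }
    }
  }
}
```

With the fully bounded first document alone the aggregation behaves; once the open-ended second document matches, bucket creation runs away.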