
date_histogram of date_range with null end points triggers a circuit breaker #50109


Closed
jcrapuchettes opened this issue Dec 11, 2019 · 15 comments · Fixed by #59175
Labels: :Analytics/Aggregations, >bug, Team:Analytics


jcrapuchettes commented Dec 11, 2019

Elasticsearch version: 7.4.2

Plugins installed: [repository-s3, discovery-ec2]

JVM version: Bundled version

OS version: 4.14.138-114.102.amzn2.x86_64 (Amazon Linux 2)

Description of the problem including expected versus actual behavior:
Running a date_histogram on a date_range field whose document values have a null "lte" appears to cause the aggregation to create an effectively unbounded number of buckets, and I get an out-of-memory error. The aggregation works well for fully defined date ranges (the first document in my example), but I have a large number of documents with an undefined end point (the value is null, as in the second document in my example). I have been looking for a way to limit the buckets created by the query, but without luck. I've also tried using a date_range aggregation on the field to limit the potential buckets, but that caused cast exceptions.

Basically, I'm attempting to find all of the months up to now that these documents cover/touch. I'd like to see buckets from 2017-10-01 to 2019-12-01 (as of the time of writing).

Steps to reproduce:

  1. Create index:
PUT test
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "active_range": {
        "type": "date_range",
        "format": "yyyy-MM-dd"
      }
    }
  }
}
  2. Add documents:
POST test/_doc/
{
  "active_range": {
    "gte": "2017-10-10",
    "lte": "2018-10-10"
  }
}

POST test/_doc/
{
  "active_range": {
    "gte": "2017-10-10",
    "lte": null
  }
}
  3. Run aggregation:
GET test/_search
{
  "size": 0,
  "aggs": {
    "active": {
      "date_histogram": {
        "field": "active_range",
        "calendar_interval": "month"
      }
    }
  }
}
elasticmachine (Collaborator) commented:

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

polyfractal (Contributor) commented:

/cc @not-napoleon mind taking a look at this when you get a chance?

@not-napoleon self-assigned this Dec 12, 2019
not-napoleon (Member) commented:

Yeah, I definitely see why that would happen. It's a little unclear to me what the correct behavior would be though.

In this case, it seems like we want to default the end date to now(), but I'm not sure that's a good general-case behavior. What happens if the start point is null? Do we default that to now() as well? What happens if the default generates an invalid range? Do we just skip that document? And what about open-ended numeric ranges, or IPs? (Although, if there's a sensible solution for dates that doesn't generalize to other ranges, I have no problem just fixing this for DateHistogram.)

Missing doesn't currently let you specify a range, but this seems almost like missing behavior. Just thinking out loud, if we let users specify a Range value for missing, and then partially applied it if the start or end point was null (and used the whole range if the value was actually missing), that might give users some ability to control this. We'd still need to deal with invalid ranges in the aggregator though. Unfortunately, getting missing to support complex types would be a pretty major change. It's something I'd like to do, but I don't have a plan for it right now.
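
To make the idea concrete, a purely hypothetical sketch (missing does not accept a range value today, and this syntax is not supported):

GET test/_search
{
  "size": 0,
  "aggs": {
    "active": {
      "date_histogram": {
        "field": "active_range",
        "calendar_interval": "month",
        "missing": {
          "gte": "2017-10-01",
          "lte": "now"
        }
      }
    }
  }
}

Here the "lte" value would only be applied to documents whose range end is null (and the whole value would be used if the field were missing entirely).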

I'm hesitant to add an explicit option (two options really, for start and end) for this case. That creates a situation where we have options on the aggregation that only apply sometimes, or worse, we end up needing a new aggregation entirely. One of our goals with adding range support was to minimize the API footprint as much as possible, so in that spirit I'd like to treat adding a new option as a last resort.

TL;DR - if there's a sensible default value, this is a relatively straightforward fix, but I'm not convinced there's a sensible default value. At an absolute minimum, we should at least just skip these docs instead of trying to create infinite buckets.

not-napoleon (Member) commented:

@jcrapuchettes I ran your test case locally, and it hit a CircuitBreakerException, not an actual OutOfMemoryError - can you confirm this is what you're seeing? To be clear, the circuit breaker should kick in before it actually runs out of memory, and is a well-handled error case.

jcrapuchettes (Author) commented:

@not-napoleon I do get a CircuitBreakerException.

Having a default value of now() for missing would be great for my case, but I can understand the difficulty of dealing with all cases. Is there any way to use a script or something else to mimic the default value in current versions? Our queries are really big and ugly since we can't use the date_histogram.

not-napoleon (Member) commented:

Scripting would be a good solution, but we don't currently have support for getting range values from scripts in aggregations. I've opened an issue for this (#50190), if you want to follow the discussion there. Unfortunately, I don't think there's a scripting-based workaround in 7.4.

bloche commented Jan 21, 2020

I've been experiencing a very similar issue.

I have ~180M documents spanning the past 4 years or so. Each document has a date_range field that could span anywhere from a few days to a few years. When I perform a date_histogram on the date range, I always get buckets for every date covered by the matching documents, and there isn't a way to restrict the output to a subset of those buckets.

When querying by month it isn't so bad; I just get more data back than I need (all months between the minimum and maximum month in the matching documents). A real problem comes up, however, when I aggregate on days. A query where I'm trying to see data for, say, 30 days will actually produce buckets for all days in the matching documents' ranges (not my query range), which could be upward of 1,500 buckets. Most of the time this results in a timeout in Elasticsearch and I never actually get data back.

This issue is preventing me from adopting the date_histogram aggregation for date_range fields. If there were something like extended_bounds that could keep the aggregation from creating buckets outside the bounded range, that would solve both my problem and the OP's problem, I believe.

Example

An example query I would be running (using the same field name as the OP) would be something like this:

{
    "track_total_hits": true,
    "query": {
        "range": {
            "active_range": {
                "gte": "2016-11-01",
                "lte": "2016-11-30"
            }
        }
    },
    "size": 0,
    "aggs": {
        "timeseries": {
            "date_histogram": {
                "field": "active_range",
                "calendar_interval": "day",
                "min_doc_count": 1,
                "extended_bounds": {
                    "min": "2016-11-01",
                    "max": "2016-11-30"
                }
            }
        }
    }
}

Since documents back in 2016 could still be "active" today, the ES query generates every bucket from the first day in the documents' ranges all the way to today.

It feels like extended_bounds could tell Elasticsearch not to aggregate buckets outside the specified range, but the documentation says it explicitly does not do that. So either extended_bounds could start doing that (a breaking change), or we could introduce another field, bounds or something like that, which would be a hard enforcement at aggregation time.

A script might be a good solution for this, but it feels like a field similar to extended_bounds would make a bit more sense for this type of aggregation. Alternatively, there could be a way to tell bucket aggregations to ignore certain buckets at aggregation time (like a bucket_selector aggregation, only operating before aggregation instead of after), which could be useful beyond just these kinds of range histogram aggregations.

not-napoleon (Member) commented:

I discussed this with @polyfractal today, and we're not opposed to adding a flag to make the extended_bounds values hard limits on what buckets are computed. Something like:

 "extended_bounds": {
                    "min": "2016-11-01",
                    "max": "2016-11-30",
                    "hard_limit": true
                }

(exact option name still TBD), which would return exactly the buckets between the given min and max. This would keep the current behavior of adding empty buckets to fill out the range, but would also clip results that fall outside that range. It's not entirely useless for non-range cases either, as it would be cleaner syntax than the current recommendation of adding a filter query. Would something along those lines (a full request is sketched below) solve your difficulty?
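
For illustration only, using the placeholder option name from above against the earlier example (not final syntax):

GET test/_search
{
  "size": 0,
  "aggs": {
    "timeseries": {
      "date_histogram": {
        "field": "active_range",
        "calendar_interval": "day",
        "extended_bounds": {
          "min": "2016-11-01",
          "max": "2016-11-30",
          "hard_limit": true
        }
      }
    }
  }
}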

bloche commented Jan 28, 2020

@not-napoleon that solution would fit my use case perfectly. Thanks for taking the time to discuss this, it'll simplify what I'm currently doing significantly.

krlm commented Mar 26, 2020

@not-napoleon would values with an open end still be counted in all buckets valid for their start date, or would they be filtered out?

polyfractal (Contributor) commented Mar 26, 2020

@krlm: @not-napoleon can correct me if I get this wrong, but I believe any unbounded interval on the "edge" will still be included in the final bucket on that side. It will just stop the aggregation from continuing to "fill out" the histogram with buckets that match the unbounded, infinite interval.

not-napoleon (Member) commented:

Yeah, that's more or less what I was thinking. Currently it just adds buckets until it adds a bucket that contains the end point. If a hard-limit extended bounds were set, it would stop at that limit instead. Basically, you'd set the hard limit and get exactly those buckets, even if your ranges fell outside of them. You'd still have the existing extended_bounds behavior where we add in empty buckets as needed to fit that range, too.

krlm commented Mar 27, 2020

@polyfractal @not-napoleon, thanks for the info. I forgot that by default there are no upper and lower boundaries for the date histogram aggregation (I've defined them in the filter part of my query). This should do the job then.

@rjernst added the Team:Analytics label May 4, 2020
consulthys (Contributor) commented:

I've had the same issue and would love to see support for date histograms on open-ended date range fields.

The way I'm dealing with this right now is by using scripting (and #50190 will definitely help here, too). It's clunky and not flexible, but it works for what I need to do.

In my documents, I have a date_range field (say period) but also two date fields for the start and end of the range, and sometimes the latter can be null. So in my date_histogram aggregation I can deal with this open-ended situation by defaulting to now if the end date is missing, i.e. if the period is open-ended towards the future (it would work the same towards the past):

POST my-index/_search
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "histo": {
      "date_histogram": {
        "script": """
        def start = doc['start_date'].value;
        def end = null;
        if (doc['end_date'].size() > 0) {
          end = doc['end_date'].value;
        } else {
          // default to now
          end = Instant.ofEpochMilli(new Date().getTime()).atZone(ZoneId.of("UTC"));
        }

        // build buckets array
        def buckets = [start];        
        def months = ChronoUnit.MONTHS.between(start, end) + 1;
        while (months > 0) {
          start = start.plusMonths(1);
          buckets.add(start);
          months--;
        }
        
        // return the date buckets
        return buckets;
        """,
        "interval": "month"
      }
    }
  }
}

As I said, it's clunky, but it does return all the date buckets I'm interested in (either from start to end, or from start to now if end is null). Of course, depending on the interval type, the code needs to be adapted.
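
For example, for a daily interval the same approach would just swap the month arithmetic for day arithmetic (a sketch only, not tested; it slots into the same date_histogram "script" with "interval": "day"):

def start = doc['start_date'].value;
def end = null;
if (doc['end_date'].size() > 0) {
  end = doc['end_date'].value;
} else {
  // default to now when the end date is missing
  end = Instant.ofEpochMilli(new Date().getTime()).atZone(ZoneId.of("UTC"));
}

// build buckets array, one entry per day, mirroring the month logic above
def buckets = [start];
def days = ChronoUnit.DAYS.between(start, end) + 1;
while (days > 0) {
  start = start.plusDays(1);
  buckets.add(start);
  days--;
}

// return the date buckets
return buckets;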

Just thought I'd share in case it helps anyone until a real fix is available.

@imotov self-assigned this and unassigned not-napoleon Jun 15, 2020
@imotov changed the title from "date_histogram of date_range with null end points runs out of memory" to "date_histogram of date_range with null end points triggers a circuit breaker" Jun 15, 2020
imotov (Contributor) commented Jun 30, 2020

Here is an interesting question about hard bounds. With extended bounds, the max bound is inclusive. So, if we have an hourly histogram and the max is 10:00:00, the 10:00:00-10:59:59 bucket will be included, which might make sense since we are extending the bounds. Hard bounds, on the other hand, are limiting. So, if the max hard bound is 10:00:00, the bucket 10:00:00-10:59:59 is mostly outside of the bounds.

imotov added a commit to imotov/elasticsearch that referenced this issue Jul 7, 2020
Adds a hard_bounds parameter to explicitly limit the buckets that a histogram
can generate. This is especially useful in case of open ended ranges that can
produce a very large number of buckets.

Closes elastic#50109
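
Once that change lands, bounding the original reproduction would presumably look something like this (a sketch only; the exact syntax and availability are defined in #59175):

GET test/_search
{
  "size": 0,
  "aggs": {
    "active": {
      "date_histogram": {
        "field": "active_range",
        "calendar_interval": "month",
        "hard_bounds": {
          "min": "2017-10-01",
          "max": "2019-12-01"
        }
      }
    }
  }
}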