
date_histogram of date_range with null end points triggers a circuit breaker #50109


Closed
jcrapuchettes opened this issue Dec 11, 2019 · 15 comments · Fixed by #59175
Labels: :Analytics/Aggregations, >bug, Team:Analytics


jcrapuchettes commented Dec 11, 2019

Elasticsearch version: 7.4.2

Plugins installed: [repository-s3, discovery-ec2]

JVM version: Bundled version

OS version: 4.14.138-114.102.amzn2.x86_64 (Amazon Linux 2)

Description of the problem including expected versus actual behavior:
Running a date_histogram on a date_range field whose document values have a null "lte" appears to cause the aggregation to create an effectively unbounded number of buckets, and I get an out-of-memory error. The aggregation works well for fully defined date ranges (the first document in my example), but I have a large number of documents with an undefined end point (the value is null, as in the second document in my example). I have been looking for a way to limit the buckets created by the query, but without luck. I've also tried using a date_range aggregation on the field to limit the potential buckets, but that caused cast exceptions.

Basically, I'm attempting to find all of the months up to now that these documents cover/touch. I'd like to see buckets from 2017-10-01 to 2019-12-01 (as of the time of writing).

Steps to reproduce:

  1. Create index:
PUT test
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "active_range": {
        "type": "date_range",
        "format": "yyyy-MM-dd"
      }
    }
  }
}
  2. Add documents:
POST test/_doc/
{
  "active_range": {
    "gte": "2017-10-10",
    "lte": "2018-10-10"
  }
}

POST test/_doc/
{
  "active_range": {
    "gte": "2017-10-10",
    "lte": null
  }
}
  3. Run aggregation:
GET test/_search
{
  "size": 0,
  "aggs": {
    "active": {
      "date_histogram": {
        "field": "active_range",
        "calendar_interval": "month"
      }
    }
  }
}
elasticmachine (Collaborator) commented:

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

polyfractal (Contributor) commented:

/cc @not-napoleon mind taking a look at this when you get a chance?

@not-napoleon self-assigned this Dec 12, 2019
not-napoleon (Member) commented:

Yeah, I definitely see why that would happen. It's a little unclear to me what the correct behavior would be though.

In this case, it seems like we want to default the end date to now(), but I'm not sure that's a good general-case behavior. What happens if the start point is null? Do we default that to now() as well? What happens if the default generates an invalid range? Do we just skip that document? And what about open-ended numeric ranges, or IPs? (Although, if there's a sensible solution for dates that doesn't generalize to other ranges, I have no problem just fixing this for DateHistogram.)

Missing doesn't currently let you specify a range, but this seems almost like missing behavior. Just thinking out loud, if we let users specify a Range value for missing, and then partially applied it if the start or end point was null (and used the whole range if the value was actually missing), that might give users some ability to control this. We'd still need to deal with invalid ranges in the aggregator though. Unfortunately, getting missing to support complex types would be a pretty major change. It's something I'd like to do, but I don't have a plan for it right now.
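
To make the idea concrete, a purely hypothetical sketch (missing does not accept a range value today, and this syntax is not supported):

GET test/_search
{
  "size": 0,
  "aggs": {
    "active": {
      "date_histogram": {
        "field": "active_range",
        "calendar_interval": "month",
        "missing": {
          "gte": "2017-10-01",
          "lte": "now"
        }
      }
    }
  }
}

Here the "lte" value would only be applied to documents whose range end is null (and the whole value would be used if the field were missing entirely).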

I'm hesitant to add an explicit option (two options really, for start and end) for this case. That creates a situation where we have options on the aggregation that only apply sometimes, or worse, we end up needing a new aggregation entirely. One of our goals with adding range support was to minimize the API footprint as much as possible, so in that spirit I'd like to treat adding a new option as a last resort.

TL;DR - if there's a sensible default value, this is a relatively straightforward fix, but I'm not convinced there's a sensible default value. At an absolute minimum, we should at least just skip these docs instead of trying to create infinite buckets.

not-napoleon (Member) commented:

@jcrapuchettes I ran your test case locally, and it hit a CircuitBreakerException, not an actual OutOfMemoryError - can you confirm this is what you're seeing? To be clear, the circuit breaker should kick in before it actually runs out of memory, and is a well-handled error case.

jcrapuchettes (Author) commented:

@not-napoleon I do get a CircuitBreakerException.

Having a default value of now() for missing would be great for my case, but I can understand the difficulty of dealing with all cases. Is there any way to use a script or something else to mimic the default value in current versions? Our queries are really big and ugly since we can't use the date_histogram.

not-napoleon (Member) commented:

Scripting would be a good solution, but we don't currently have support for getting range values from scripts in aggregations. I've opened an issue for this (#50190), if you want to follow the discussion there. Unfortunately, I don't think there's a scripting-based workaround in 7.4.

bloche commented Jan 21, 2020

I've been experiencing a very similar issue.

I have ~180M documents spanning the past 4 years or so. Each document has a date_range field that could span anywhere from a few days to a few years. When I perform a date_histogram on the date range, I always get buckets for every date covered by the matching documents, and there isn't a way to restrict the output to a subset of those buckets.

When querying by month it isn't so bad; I just get more data back than I need (all months between the minimum and maximum month in the matching documents). A real problem comes up, however, when I aggregate on days. A query where I'm trying to see data for, say, 30 days will actually produce buckets for all days in the matching documents' ranges (not my query range), which could be upward of 1,500 buckets. Most of the time this results in a timeout in Elasticsearch and I never actually get data back.

This issue is preventing me from adopting the date_histogram aggregation for date_range fields. If there were something like extended_bounds that could keep the aggregation from creating buckets outside the bounded range, that would solve both my problem and the OP's problem, I believe.

Example

An example query I would be running (using the same field name as the OP) would be something like this:

{
    "track_total_hits": true,
    "query": {
        "range": {
            "active_range": {
                "gte": "2016-11-01",
                "lte": "2016-11-30"
            }
        }
    },
    "size": 0,
    "aggs": {
        "timeseries": {
            "date_histogram": {
                "field": "active_range",
                "calendar_interval": "day",
                "min_doc_count": 1,
                "extended_bounds": {
                    "min": "2016-11-01",
                    "max": "2016-11-30"
                }
            }
        }
    }
}

Since documents back in 2016 could still be "active" today, the ES query generates every bucket from the first day in the documents' ranges all the way to today.

It feels like extended_bounds could tell Elasticsearch not to aggregate buckets outside the specified range, but the documentation says it explicitly does not do that. So either extended_bounds could start doing that (a breaking change), or we could introduce another field, bounds or something like that, which would be a hard enforcement at aggregation time.

A script might be a good solution for this, but it feels like a field similar to extended_bounds would make a bit more sense for this type of aggregation. Alternatively, there could be a way to tell bucket aggregations to ignore certain buckets at aggregation time (like a bucket_selector aggregation, only operating before aggregation instead of after), which could be useful beyond just these kinds of range histogram aggregations.

not-napoleon (Member) commented:

I discussed this with @polyfractal today, and we're not opposed to adding a flag to make the extended_bounds values hard limits on what buckets are computed. Something like:

 "extended_bounds": {
                    "min": "2016-11-01",
                    "max": "2016-11-30",
                    "hard_limit": true
                }

(exact option name still TBD), which would return exactly the buckets between the given min and max. This would keep the current behavior of adding empty buckets to fill out the range, but would also clip results that fall outside that range. It's not entirely useless for non-range cases either, as it would be cleaner syntax than the current recommendation of adding a filter query. Would something along those lines (a full request is sketched below) solve your difficulty?
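
For illustration only, using the placeholder option name from above against the earlier example (not final syntax):

GET test/_search
{
  "size": 0,
  "aggs": {
    "timeseries": {
      "date_histogram": {
        "field": "active_range",
        "calendar_interval": "day",
        "extended_bounds": {
          "min": "2016-11-01",
          "max": "2016-11-30",
          "hard_limit": true
        }
      }
    }
  }
}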

bloche commented Jan 28, 2020

@not-napoleon that solution would fit my use case perfectly. Thanks for taking the time to discuss this, it'll simplify what I'm currently doing significantly.

krlm commented Mar 26, 2020

@not-napoleon would values with an open end still be counted in all buckets valid for their start date, or would they be filtered out?

polyfractal (Contributor) commented Mar 26, 2020

@krlm: @not-napoleon can correct me if I get this wrong, but I believe any unbounded interval on the "edge" will still be included in the final bucket on that side. It will just stop the aggregation from continuing to "fill out" the histogram with buckets that match the unbounded, infinite interval.

not-napoleon (Member) commented:

Yeah, that's more or less what I was thinking. Currently it just adds buckets until it adds a bucket that contains the end point. If a hard-limit extended bounds were set, it would stop at that limit instead. Basically, you'd set the hard limit and get exactly those buckets, even if your ranges fell outside of them. You'd still have the existing extended_bounds behavior where we add in empty buckets as needed to fit that range, too.

krlm commented Mar 27, 2020

@polyfractal @not-napoleon, thanks for the info. I forgot that by default there are no upper and lower boundaries for the date histogram aggregation (I've defined them in the filter part of my query). This should do the job then.

@rjernst added the Team:Analytics label May 4, 2020
consulthys (Contributor) commented:

I've had the same issue and would love to see support for date histograms on open-ended date range fields.

The way I'm dealing with this right now is by using scripting (and #50190 will definitely help here, too). It's clunky and not flexible, but it works for what I need to do.

In my documents, I have a date_range field (say period) but also two date fields for the start and end of the range, and sometimes the latter can be null. So in my date_histogram aggregation I can deal with this open-ended situation by defaulting to now if the end date is missing, i.e. if the period is open-ended towards the future (it would work the same towards the past):

POST my-index/_search
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "histo": {
      "date_histogram": {
        "script": """
        def start = doc['start_date'].value;
        def end = null;
        if (doc['end_date'].size() > 0) {
          end = doc['end_date'].value;
        } else {
          // default to now
          end = Instant.ofEpochMilli(new Date().getTime()).atZone(ZoneId.of("UTC"));
        }

        // build buckets array
        def buckets = [start];        
        def months = ChronoUnit.MONTHS.between(start, end) + 1;
        while (months > 0) {
          start = start.plusMonths(1);
          buckets.add(start);
          months--;
        }
        
        // return the date buckets
        return buckets;
        """,
        "interval": "month"
      }
    }
  }
}

As I said, it's clunky, but it does return all the date buckets I'm interested in (either from start to end, or from start to now if end is null). Of course, depending on the interval type, the code needs to be adapted.
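
For example, for a daily interval the same approach would just swap the month arithmetic for day arithmetic (a sketch only, not tested; it slots into the same date_histogram "script" with "interval": "day"):

def start = doc['start_date'].value;
def end = null;
if (doc['end_date'].size() > 0) {
  end = doc['end_date'].value;
} else {
  // default to now when the end date is missing
  end = Instant.ofEpochMilli(new Date().getTime()).atZone(ZoneId.of("UTC"));
}

// build buckets array, one entry per day, mirroring the month logic above
def buckets = [start];
def days = ChronoUnit.DAYS.between(start, end) + 1;
while (days > 0) {
  start = start.plusDays(1);
  buckets.add(start);
  days--;
}

// return the date buckets
return buckets;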

Just thought I'd share in case it helps anyone until a real fix is available.

@imotov self-assigned this and unassigned not-napoleon Jun 15, 2020
@imotov changed the title from "date_histogram of date_range with null end points runs out of memory" to "date_histogram of date_range with null end points triggers a circuit breaker" Jun 15, 2020
imotov (Contributor) commented Jun 30, 2020

Here is an interesting question about hard bounds. With extended bounds, the max bound is inclusive. So, if we have an hourly histogram and the max is 10:00:00, the 10:00:00-10:59:59 bucket will be included, which might make sense since we are extending the bounds. Hard bounds, on the other hand, are limiting. So, if the max hard bound is 10:00:00, the bucket 10:00:00-10:59:59 is mostly outside of the bounds.

imotov added a commit to imotov/elasticsearch that referenced this issue Jul 7, 2020
Adds a hard_bounds parameter to explicitly limit the buckets that a histogram
can generate. This is especially useful in case of open ended ranges that can
produce a very large number of buckets.

Closes elastic#50109
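
Once that change lands, bounding the original reproduction would presumably look something like this (a sketch only; the exact syntax and availability are defined in #59175):

GET test/_search
{
  "size": 0,
  "aggs": {
    "active": {
      "date_histogram": {
        "field": "active_range",
        "calendar_interval": "month",
        "hard_bounds": {
          "min": "2017-10-01",
          "max": "2019-12-01"
        }
      }
    }
  }
}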