Speed up date_histogram without children (backport of #63643) #64823


Merged: 1 commit merged into elastic:7.x on Nov 9, 2020

Conversation

@nik9000 (Member) commented Nov 9, 2020

This speeds up `date_histogram` aggregations without a parent or
children. This is quite common: it's the aggregation that Kibana's Discover
uses all over the place. We also hope to be able to use the same
mechanism to speed up aggs with children one day, but that day isn't today.

The kind of speedup we're seeing is fairly substantial in many cases:
```
| metric                       | task                                       |  before |   after | unit |
| 90th percentile service time |           date_histogram_calendar_interval | 9266.07 | 1376.13 | ms |
| 90th percentile service time |   date_histogram_calendar_interval_with_tz | 9217.21 | 1372.67 | ms |
| 90th percentile service time |              date_histogram_fixed_interval | 8817.36 | 1312.67 | ms |
| 90th percentile service time |      date_histogram_fixed_interval_with_tz | 8801.71 | 1311.69 | ms | <-- discover's agg
| 90th percentile service time | date_histogram_fixed_interval_with_metrics | 44660.2 | 43789.5 | ms |
```

This uses the work we did in elastic#61467 to precompute the rounding points for
a `date_histogram`. Now, when we know the rounding points, we execute the
`date_histogram` as a `range` aggregation. This is nice for three reasons:
1. We can further rewrite the `range` aggregation (see below).
2. We don't need to allocate a hash to convert rounding points
   to ordinals (a sketch of this follows the list).
3. We can send precise cardinality estimates to sub-aggs.
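
For illustration, here is a minimal sketch, not the actual Elasticsearch code, of how sorted rounding points map directly onto `range` buckets; the class and method names are hypothetical:

```
/** Hypothetical sketch: turn sorted rounding points into [from, to) buckets. */
class RoundingPointsToRanges {
    static long[][] ranges(long[] points) {
        // points are sorted epoch millis, one per date_histogram bucket start
        long[][] ranges = new long[points.length][];
        for (int i = 0; i < points.length; i++) {
            ranges[i] = i + 1 < points.length
                ? new long[] { points[i], points[i + 1] } // from inclusive, to exclusive
                : new long[] { points[i] };               // last bucket is unbounded above
        }
        return ranges;
    }
}
```

Because the points are sorted, a value's bucket ordinal is just the index found by a binary search over `points`, which is why no hash is needed (point 2), and `points.length` gives an exact bucket count to hand to sub-aggs (point 3).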

Points 2 and 3 above are nice, but most of the speed difference comes from
point 1. Specifically, we now look into executing `range` aggregations as
a `filters` aggregation. Normally the `filters` aggregation is quite slow,
but when it doesn't have a parent or any children we can execute it
"filter by filter", which is significantly faster. So fast, in fact, that
it is faster than the original `date_histogram`.
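
To give a feel for why that's fast: with no parent and no sub-aggs, each bucket is just a document count for one query, so collection can boil down to something like this sketch in plain Lucene (a hypothetical class, not the real `FiltersAggregator`):

```
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

/** Hypothetical sketch of "filter by filter" collection: one count per filter. */
class FilterByFilterSketch {
    static Map<String, Long> count(IndexSearcher searcher, Query topLevelQuery,
            Map<String, Query> filters) throws IOException {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (Map.Entry<String, Query> e : filters.entrySet()) {
            // Each bucket is the count of (top-level query AND filter).
            Query combined = new BooleanQuery.Builder()
                .add(topLevelQuery, BooleanClause.Occur.MUST)
                .add(e.getValue(), BooleanClause.Occur.FILTER)
                .build();
            counts.put(e.getKey(), (long) searcher.count(combined));
        }
        return counts;
    }
}
```

Counting a query this way skips per-document bucket bookkeeping entirely, and Lucene can sometimes answer a count from index structures without visiting every hit.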

The `range` aggregation is *fairly* careful in how it rewrites, giving up
on the `filters` aggregation if it won't collect "filter by filter" and
falling back to its original execution mechanism.
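
The shape of that decision, reduced to a sketch with assumed, hypothetical names (the real checks live inside the aggregator builders):

```
/** Hypothetical sketch of the rewrite guard. */
class RangeRewriteGuard {
    static String choose(boolean hasParent, boolean hasSubAggs, boolean canRunFilterByFilter) {
        if (hasParent == false && hasSubAggs == false && canRunFilterByFilter) {
            return "filters aggregation, collected filter by filter";
        }
        return "range aggregation, original collection mechanism";
    }
}
```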

So an aggregation like this:

```
POST _search
{
  "size": 0,
  "query": {
    "range": {
      "dropoff_datetime": {
        "gte": "2015-01-01 00:00:00",
        "lt": "2016-01-01 00:00:00"
      }
    }
  },
  "aggs": {
    "dropoffs_over_time": {
      "date_histogram": {
        "field": "dropoff_datetime",
        "fixed_interval": "60d",
        "time_zone": "America/New_York"
      }
    }
  }
}
```

is executed like:

```
POST _search
{
  "size": 0,
  "query": {
    "range": {
      "dropoff_datetime": {
        "gte": "2015-01-01 00:00:00",
        "lt": "2016-01-01 00:00:00"
      }
    }
  },
  "aggs": {
    "dropoffs_over_time": {
      "range": {
        "field": "dropoff_datetime",
        "ranges": [
          {"from": 1415250000000, "to": 1420434000000},
          {"from": 1420434000000, "to": 1425618000000},
          {"from": 1425618000000, "to": 1430798400000},
          {"from": 1430798400000, "to": 1435982400000},
          {"from": 1435982400000, "to": 1441166400000},
          {"from": 1441166400000, "to": 1446350400000},
          {"from": 1446350400000, "to": 1451538000000},
          {"from": 1451538000000}
        ]
      }
    }
  }
}
```

Which in turn is executed like this:

```
POST _search
{
  "size": 0,
  "query": {
    "range": {
      "dropoff_datetime": {
        "gte": "2015-01-01 00:00:00",
        "lt": "2016-01-01 00:00:00"
      }
    }
  },
  "aggs": {
    "dropoffs_over_time": {
      "filters": {
        "filters": {
          "1": {"range": {"dropoff_datetime": {"gte": "2014-12-30 00:00:00", "lt": "2015-01-05 05:00:00"}}},
          "2": {"range": {"dropoff_datetime": {"gte": "2015-01-05 05:00:00", "lt": "2015-03-06 05:00:00"}}},
          "3": {"range": {"dropoff_datetime": {"gte": "2015-03-06 00:00:00", "lt": "2015-05-05 00:00:00"}}},
          "4": {"range": {"dropoff_datetime": {"gte": "2015-05-05 00:00:00", "lt": "2015-07-04 00:00:00"}}},
          "5": {"range": {"dropoff_datetime": {"gte": "2015-07-04 00:00:00", "lt": "2015-09-02 00:00:00"}}},
          "6": {"range": {"dropoff_datetime": {"gte": "2015-09-02 00:00:00", "lt": "2015-11-01 00:00:00"}}},
          "7": {"range": {"dropoff_datetime": {"gte": "2015-11-01 00:00:00", "lt": "2015-12-31 00:00:00"}}},
          "8": {"range": {"dropoff_datetime": {"gte": "2015-12-31 00:00:00"}}}
        }
      }
    }
  }
}
```

And *that* is faster because we can execute it "filter by filter".

Finally, notice the `range` query filtering the data. That is required for
the data set that I'm using for testing. The "filter by filter" collection
mechanism for the `filters` agg needs special-case handling when the query
and the filter are both `range` queries on the same field: that handling
"merges" the two ranges. Without it, "filter by filter" collection is
substantially slower. It's still quite a bit quicker than the standard
`filters` collection, but not nearly as fast as it could be.
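
The "merge" itself is just an intersection of the two ranges' bounds. A minimal sketch, assuming closed-open ranges over epoch millis (a hypothetical helper, not the actual implementation):

```
/** Hypothetical sketch: intersect the top-level range query with one filter's range. */
class RangeMerge {
    /**
     * Returns {gte, lt} of the intersection, or null when the ranges are disjoint.
     * Unbounded ends can be modeled as Long.MIN_VALUE / Long.MAX_VALUE.
     */
    static long[] merge(long queryGte, long queryLt, long filterGte, long filterLt) {
        long gte = Math.max(queryGte, filterGte);
        long lt = Math.min(queryLt, filterLt);
        return gte < lt ? new long[] { gte, lt } : null;
    }
}
```

For example, filter "1" above (gte 2014-12-30 00:00:00, lt 2015-01-05 05:00:00) merged with the query (gte 2015-01-01 00:00:00, lt 2016-01-01 00:00:00) becomes gte 2015-01-01 00:00:00, lt 2015-01-05 05:00:00, so the merged query alone decides matches and there is no separate per-document check against the top-level query.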

@nik9000 (Member, Author) commented Nov 9, 2020

run elasticsearch-ci/1

@nik9000 nik9000 merged commit b71b0c9 into elastic:7.x Nov 9, 2020