Add a way to downsample metrics #66247

Closed
exekias opened this issue Dec 14, 2020 · 12 comments
Labels
:Analytics/Aggregations Aggregations >enhancement :StorageEngine/Rollup Turn fine-grained time-based data into coarser-grained data Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)

Comments

@exekias

exekias commented Dec 14, 2020

We store metrics coming from Beats or APM using the following convention: Metrics are stored as numeric fields in documents, together with a set of other fields representing the dimensions for these metrics (these are normally of type keyword). For example:

{
    "@timestamp": "2017-10-12T08:05:34.853Z",
    "container": {
        "memory": {
            "free": 8461623296,
            "used": {
                "bytes": 7159164928
            }
        },
        "name": "nginx"
    },
    "host": {
        "name": "node-01"
    }
}

We normally put together all metrics with the same dimensions in the same document, for storage & query efficiency.

Each combination of dimension key-values creates a unique time series. This has some implications for the way we need to aggregate them. For example, for the following data points:

| host.name | container.name | t   | t+10s | t+20s | t+30s | t+40s | t+50s |
|-----------|----------------|-----|-------|-------|-------|-------|-------|
| node01    | nginx          | 100 | 102   | 103   | 103   | 102   | 104   |
| node01    | mysql          | 20  | 22    | 26    | 28    | 25    | 26    |
| node02    | nginx          | 50  | 55    | 55    | 55    | 49    | 45    |
| node03    | apache         | 35  | 38    | 49    | 56    | 57    | 60    |

We have 4 time series, across 3 hosts and 3 different container names. If we want to graph the "total container memory usage per host", we would run a sum aggregation, grouping (terms) by host.

This provides good results, as long as the date_histogram bucket size corresponds to the reporting period (10s). At time t:

node01: 100 + 20
node02: 50
node03: 35
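
For reference, a minimal sketch of such a query, using the field names from the example document above and taking container.memory.used.bytes as the metric being summed (the aggregation names are arbitrary):

{
  "size": 0,
  "aggs": {
    "per_host": {
      "terms": { "field": "host.name" },
      "aggs": {
        "over_time": {
          "date_histogram": { "field": "@timestamp", "fixed_interval": "10s" },
          "aggs": {
            "memory_total": { "sum": { "field": "container.memory.used.bytes" } }
          }
        }
      }
    }
  }
}

With fixed_interval matching the 10s reporting period, every bucket contains exactly one point per time series, so the plain sum is correct.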

Now, when the date_histogram bucket size differs from the reporting period, the query will produce "wrong" results, as it will aggregate multiple points from the same time series. At time t with a 20s bucket:

node01: 100 + 102 + 20 + 22
node02: 50 + 55
node03: 35 + 38

The reason is that we ended up with 2 points in the same bucket for each time series, so we are double counting them.

To get the expected results we need to downsample the time series first, so that we get a single data point per time series in each bucket, and then apply the aggregation. In this case we could use "avg" as the downsampling function:

node01: avg(100, 102) + avg(20, 22)
node02: avg(50, 55)
node03: avg(35, 38)

The downsampling function may depend on the type of metric that is being queried, with avg, last/max or sum as possible options.
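
For illustration, one way to express this downsample-then-aggregate pattern with today's aggregations is to nest a terms aggregation per dimension and recombine with a sum_bucket pipeline aggregation. This is a sketch, not a proposal, assuming host.name plus container.name uniquely identifies a series:

{
  "size": 0,
  "aggs": {
    "per_host": {
      "terms": { "field": "host.name" },
      "aggs": {
        "over_time": {
          "date_histogram": { "field": "@timestamp", "fixed_interval": "20s" },
          "aggs": {
            "per_container": {
              "terms": { "field": "container.name" },
              "aggs": {
                "avg_mem": { "avg": { "field": "container.memory.used.bytes" } }
              }
            },
            "memory_sum": {
              "sum_bucket": { "buckets_path": "per_container>avg_mem" }
            }
          }
        }
      }
    }
  }
}

Here avg is the downsampling function and sum_bucket recombines the per-series values, at the cost of materializing one bucket per series per time bucket.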

It would be nice to have a way to automatically downsample time series based on a given set of dimensions and a downsampling function. It would also be interesting to discuss whether dimensions could be something known to Elasticsearch, so users/Kibana don't need to provide them at query time.

@exekias exekias added >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) needs:triage Requires assignment of a team area label labels Dec 14, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

@polyfractal polyfractal added :Analytics/Aggregations Aggregations and removed needs:triage Requires assignment of a team area label labels Dec 14, 2020
@polyfractal
Contributor

I think we should rename this feature/functionality/ticket, as it is causing confusion in relation to #64777. I've had three conversations in the last week where different folks were confusing the two issues :)

This requested functionality isn't really downsampling or subsampling, since traditionally sampling involves selecting a subset of the original data points to use as a proxy for the overall population. The feature requested here is really about choosing an appropriate aggregate function to represent a bucket of time, rather than choosing a single point to represent the bucket.

Perhaps something like "time series aware aggregate function" or similar?

@jsoriano
Member

> I think we should rename this feature/functionality/ticket, as it is causing confusion in relation to #64777. I've had three conversations in the last week where different folks were confusing the two issues :)

The term "downsampling" is quite extended in time series databases, introducing a new term here for this could be confusing in the observability context (though perhaps this could be solved by documentation).

> This requested functionality isn't really downsampling or subsampling since traditionally sampling involves selecting a subset of the original data points to use as a proxy for the overall population.

I think that actually this feature fits quite well with the definition of Downsampling in this link 🙂: "When the process is performed on a sequence of samples of a signal or other continuous function, it produces an approximation of the sequence that would have been obtained by sampling the signal at a lower rate"

This is exactly what we are looking for: an approximation of the sequence as if its samples had been collected at a lower rate.

> Perhaps something like "time series aware aggregate function" or similar?

Usually "downsampling" needs of an aggregate function, so depending on how this is implemented perhaps two terms are needed, one for the feature itself, and another one for the process that produces each value of the new sequence, the function.
For reference, in opentelemetry this function is called "aggregation", and each instrument (~type of metric), can define a different default function. In the example in the description, avg would be the aggregation.

@not-napoleon
Member

It seems to me that the key ask here, from an aggregations perspective, is the ability to condense a bucket to a single value via a metric aggregation (aggregate function, or downsampling function if you like), and then run another metric aggregation over those values, bucketed by another level. In the example above, for each time-host bucket, you want the sum of the averages over the container name. This idea - being able to condense buckets and then run further aggregations on them - strikes me as something that might be useful outside of the downsampling use case too.

We've talked about a few related ideas (sub-queries, windowing functions) within the aggs team, and I think the generic metric-of-metrics idea is worth considering. I just want to validate whether metric-of-metrics, as described above, would meet your needs here, or whether there's some other piece of the puzzle that I'm not seeing.

@exekias
Author

exekias commented Apr 27, 2021

> It seems to me that the key ask here, from an aggregations perspective, is the ability to condense a bucket to a single value via a metric aggregation (aggregate function, or downsampling function if you like), and then run another metric aggregation over those values, bucketed by another level

This is right: in order to obtain correct results we need to make sure we are not aggregating the same time series twice, hence we need to condense each series first (by all its dimensions).

A way to do metric-of-metrics could work here; my only concern is performance, as this would be widely used across all metric queries. I wonder if there are any optimizations we can do for this specific use case. @imotov mentioned this issue here: #65623 (comment) and how it relates to #60619 (to some extent).

@wchaparro
Member

Related to #74660.

@jasonrhodes
Member

@not-napoleon / @imotov / @ruflin are we comfortable closing this in favor of the wider TSDB effort?

@imotov
Contributor

imotov commented Apr 18, 2022

It is a part of the TSDB effort. We don't have a separate issue for that, so I am ok with keeping this one as a placeholder.

@not-napoleon
Member

We need to be careful not to get tripped up by terminology here. So far, TSDB has used "downsampling" to refer to the ILM action that decreases an index's time resolution, and thus its on-disk size. This ticket is talking about a query-time action that applies the downsampling function to sub-buckets during aggregation, IIRC.

@jasonrhodes
Member

Good point, @not-napoleon — this ticket is particularly focused on something very similar to your summary here: #66247 (comment)

@weizijun
Contributor

This aggregation (#85798) can solve the problem, and it can easily support PromQL.

@wchaparro
Member

Closing as not planned.

@wchaparro closed this as not planned on Jun 30, 2023