Add a way to downsample metrics #66247

Closed
exekias opened this issue Dec 14, 2020 · 12 comments
Labels
:Analytics/Aggregations Aggregations >enhancement :StorageEngine/Rollup Turn fine-grained time-based data into coarser-grained data Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)

Comments

@exekias

exekias commented Dec 14, 2020

We store metrics coming from Beats or APM using the following convention: Metrics are stored as numeric fields in documents, together with a set of other fields representing the dimensions for these metrics (these are normally of type keyword). For example:

{
    "@timestamp": "2017-10-12T08:05:34.853Z",
    "container": {
        "memory": {
            "free": 8461623296,
            "used": {
                "bytes": 7159164928
            }
        },
        "name": "nginx"
    },
    "host": {
        "name": "node-01"
    }
}

We normally put together all metrics with the same dimensions in the same document, for storage & query efficiency.

Each combination of dimension key-values creates a unique time series. This has some implications for the way we need to aggregate them. For example, for the following data points:

| host.name | container.name | t   | t+10s | t+20s | t+30s | t+40s | t+50s |
|-----------|----------------|-----|-------|-------|-------|-------|-------|
| node01    | nginx          | 100 | 102   | 103   | 103   | 102   | 104   |
| node01    | mysql          | 20  | 22    | 26    | 28    | 25    | 26    |
| node02    | nginx          | 50  | 55    | 55    | 55    | 49    | 45    |
| node03    | apache         | 35  | 38    | 49    | 56    | 57    | 60    |

We have 4 time series, across 3 hosts and 3 different container names. If we want to graph the "total container memory usage per host", we would run a sum aggregation, grouping (terms) by host.

This provides good results, as long as the date_histogram bucket size corresponds to the reporting period (10s). At time t:

node01: 100 + 20
node02: 50
node03: 35
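
For reference, a minimal sketch of such a query, using the field names from the example document above and taking container.memory.used.bytes as the metric being summed (the aggregation names are arbitrary):

{
  "size": 0,
  "aggs": {
    "per_host": {
      "terms": { "field": "host.name" },
      "aggs": {
        "over_time": {
          "date_histogram": { "field": "@timestamp", "fixed_interval": "10s" },
          "aggs": {
            "memory_total": { "sum": { "field": "container.memory.used.bytes" } }
          }
        }
      }
    }
  }
}

With fixed_interval matching the 10s reporting period, every bucket contains exactly one point per time series, so the plain sum is correct.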

Now, when the date_histogram bucket size differs from the reporting period, the query will produce "wrong" results, as it will aggregate multiple points from the same time series. At time t with a 20s bucket:

node01: 100 + 102 + 20 + 22
node02: 50 + 55
node03: 35 + 38

The reason is that we ended up with 2 points in the same bucket for each time series, so we are double counting them.

To get the expected results we need to downsample the time series first, so that we get a single data point per time series in each bucket, and then apply the aggregation. In this case we could use "avg" as the downsampling function:

node01: avg(100, 102) + avg(20, 22)
node02: avg(50, 55)
node03: avg(35, 38)

The downsampling function may depend on the type of metric that is being queried, with avg, last/max or sum as possible options.
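
For illustration, one way to express this downsample-then-aggregate pattern with today's aggregations is to nest a terms aggregation per dimension and recombine with a sum_bucket pipeline aggregation. This is a sketch, not a proposal, assuming host.name plus container.name uniquely identifies a series:

{
  "size": 0,
  "aggs": {
    "per_host": {
      "terms": { "field": "host.name" },
      "aggs": {
        "over_time": {
          "date_histogram": { "field": "@timestamp", "fixed_interval": "20s" },
          "aggs": {
            "per_container": {
              "terms": { "field": "container.name" },
              "aggs": {
                "avg_mem": { "avg": { "field": "container.memory.used.bytes" } }
              }
            },
            "memory_sum": {
              "sum_bucket": { "buckets_path": "per_container>avg_mem" }
            }
          }
        }
      }
    }
  }
}

Here avg is the downsampling function and sum_bucket recombines the per-series values, at the cost of materializing one bucket per series per time bucket.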

It would be nice to have a way to automatically downsample time series based on a given set of dimensions and a downsampling function. It would also be interesting to discuss whether dimensions could be something known to Elasticsearch, so users/Kibana don't need to provide them at query time.

@exekias exekias added >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) needs:triage Requires assignment of a team area label labels Dec 14, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

@polyfractal polyfractal added :Analytics/Aggregations Aggregations and removed needs:triage Requires assignment of a team area label labels Dec 14, 2020
@polyfractal
Contributor

I think we should rename this feature/functionality/ticket, as it is causing confusion in relation to #64777. I've had three conversations in the last week where different folks were confusing the two issues :)

This requested functionality isn't really downsampling or subsampling, since traditionally sampling involves selecting a subset of the original data points to use as a proxy for the overall population. The feature requested here is really about choosing an appropriate aggregate function to represent a bucket of time, rather than choosing a single point to represent the bucket.

Perhaps something like "time series aware aggregate function" or similar?

@jsoriano
Member

> I think we should rename this feature/functionality/ticket, as it is causing confusion in relation to #64777. I've had three conversations in the last week where different folks were confusing the two issues :)

The term "downsampling" is quite extended in time series databases, introducing a new term here for this could be confusing in the observability context (though perhaps this could be solved by documentation).

> This requested functionality isn't really downsampling or subsampling since traditionally sampling involves selecting a subset of the original data points to use as a proxy for the overall population.

I think that actually this feature fits quite well with the definition of Downsampling in this link 🙂: "When the process is performed on a sequence of samples of a signal or other continuous function, it produces an approximation of the sequence that would have been obtained by sampling the signal at a lower rate"

This is exactly what we are looking for: an approximation of the sequence as if its samples had been collected at a lower rate.

> Perhaps something like "time series aware aggregate function" or similar?

Usually "downsampling" needs of an aggregate function, so depending on how this is implemented perhaps two terms are needed, one for the feature itself, and another one for the process that produces each value of the new sequence, the function.
For reference, in opentelemetry this function is called "aggregation", and each instrument (~type of metric), can define a different default function. In the example in the description, avg would be the aggregation.

@not-napoleon
Member

It seems to me that the key ask here, from an aggregations perspective, is the ability to condense a bucket to a single value via a metric aggregation (aggregate function, or downsampling function if you like), and then run another metric aggregation over those values, bucketed by another level. In the example above, for each time-host bucket, you want the sum of the averages over the container name. This idea - being able to condense buckets and then run further aggregations on them - strikes me as something that might be useful outside of the downsampling use case too.

We've talked about a few related ideas (sub-queries, windowing functions) within the aggs team, and I think the generic metric-of-metrics idea is worth considering. I just want to validate whether metric-of-metrics, as described above, would meet your needs here, or whether there's some other piece of the puzzle that I'm not seeing.

@exekias
Author

exekias commented Apr 27, 2021

> It seems to me that the key ask here, from an aggregations perspective, is the ability to condense a bucket to a single value via a metric aggregation (aggregate function, or downsampling function if you like), and then run another metric aggregation over those values, bucketed by another level

This is right: in order to obtain correct results we need to make sure we are not aggregating the same time series twice, hence we need to condense each series first (by all its dimensions).

A way to do metric-of-metrics could work here; my only concern is performance, as this would be widely used across all metric queries. I wonder if there are any optimizations we can do for this specific use case. @imotov mentioned this issue here: #65623 (comment) and how it relates to #60619 (to some extent).

@wchaparro
Member

Related to #74660.

@jasonrhodes
Member

@not-napoleon / @imotov / @ruflin are we comfortable closing this in favor of the wider TSDB effort?

@imotov
Contributor

imotov commented Apr 18, 2022

It is a part of the TSDB effort. We don't have a separate issue for that, so I am ok with keeping this one as a placeholder.

@not-napoleon
Member

We need to be careful not to get tripped up by terminology here. So far, TSDB has used "downsampling" to refer to the ILM action that decreases an index's time resolution, and thus its on-disk size. This ticket is talking about a query-time action that applies the downsampling function to sub-buckets during aggregation, IIRC.

@jasonrhodes
Member

Good point, @not-napoleon — this ticket is particularly focused on something very similar to your summary here: #66247 (comment)

@weizijun
Contributor

This aggregation (#85798) can solve the problem, and it can easily support PromQL.

@wchaparro
Member

Closing as not planned.

@wchaparro closed this as not planned on Jun 30, 2023