Query documents before rollup #38837


Open
TheBronx opened this issue Feb 13, 2019 · 10 comments
Labels
>enhancement :StorageEngine/Rollup Turn fine-grained time-based data into coarser-grained data

Comments

@TheBronx

Describe the feature:
When rolling up data, it would be nice to filter documents with a query. That is, instead of rolling up all documents on an index (or index pattern), aggregate only those that match the query.

The reason behind this is that once you roll up data, you cannot query it, and it would probably be too complex to store aggregated data in a way that supports certain queries. But filtering during the rollup job should be "easier" (I hope!), and that would be really useful.
For example, if we are storing HTTP requests on an index, we could create a few rollup jobs:

  • all documents, to see the overall traffic and maybe an average response time
  • documents matching q=status:500 to track errors over time
  • documents matching q=url:checkout to track interesting endpoints, maybe with a sum for the "amount" field too, to see the evolution of sales I don't know
  • ...

(Each of these rollup jobs would go to a different rollup index of course)
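To make the request concrete, a job for the second bullet might look like this. Note that the `query` field here is entirely hypothetical (it is the proposed addition, not part of the Rollup API), and all index, field, and job names are made up for illustration:

```json
PUT _rollup/job/http-errors-rollup
{
  "index_pattern": "logstash-*",
  "rollup_index": "rollup-http-errors",
  "cron": "0 0 * * * ?",
  "page_size": 1000,
  "query": { "term": { "status": 500 } },
  "groups": {
    "date_histogram": { "field": "@timestamp", "fixed_interval": "60m" }
  },
  "metrics": [
    { "field": "response_time", "metrics": ["avg"] }
  ]
}
```

Everything except `query` follows the existing rollup job shape; the idea is that only documents matching the query would be fed into the aggregation.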

This would be so powerful! What do you think?
Thank you!

I have found another issue here that sounds similar to me, but I am not sure, so please feel free to close this one if the idea behind it is the same: #34921
I also posted this on your discourse: https://discuss.elastic.co/t/filtering-documents-for-rollup/167417

@cbuescher cbuescher added >enhancement :StorageEngine/Rollup Turn fine-grained time-based data into coarser-grained data labels Feb 13, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-analytics-geo

@cbuescher
Member

@TheBronx thanks for opening this issue. From what I understand so far, what you are trying to do can already be achieved using Filtered Aliases. You would define different aliases for your subset of documents and then point the rollup job to those. I haven't tried this in practice though, maybe @polyfractal has ideas about this or knows alternative approaches?
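Concretely, that suggestion is something like the following (a sketch; the index pattern, alias name, and field are illustrative):

```json
POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "logstash-*",
        "alias": "http-errors",
        "filter": { "term": { "status": 500 } }
      }
    }
  ]
}
```

The rollup job's `index_pattern` would then point at `http-errors` instead of `logstash-*`, so only documents matching the alias filter are rolled up.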

@TheBronx
Author

Okay, it actually works!
Creating the alias is a bit less "dummy friendly" than filling in an input field in Kibana, haha, but on the other hand it works now 😄
The rollup job seems to be working fine, and the "overhead" of aliases is pretty much none, right?
I didn't know about index aliases, thank you so much @cbuescher

@cbuescher
Member

Great to hear! Maybe there are even simpler ways that @polyfractal knows about, so let's wait a bit for his thoughts, but I think after that we can close this.

@cbuescher cbuescher reopened this Feb 13, 2019
@polyfractal
Contributor

Filtered aliases would be the best (and I think only) way to do it right now. We made a decision to not allow filtering on the rollup job itself, to prevent a "mismatch" between the input data and the output rollup data. E.g. it might be confusing for a user consuming rollup data to see data missing, if they aren't aware that the job itself was filtered.

We may loosen that restriction in the future. But until then, a filtered alias would be the best way to do it.

the "overhead" of aliases is pretty much none right?

That's correct, the alias itself is essentially free, so the only extra cost is adding the filter itself :)

@TheBronx
Author

It is me again, I just found a problem with this approach 😢
The documentation for Elasticsearch aliases says:

In this case, the alias is a point-in-time alias that will group all current indices that match, it will not automatically update as new indices that match this pattern are added/removed.

And that is exactly what I did, because I am using Logstash:

[screenshot of the alias configuration]

The alias matches all the indices that existed when I created it (2019.02.13), but no more data is being aggregated after that. The rollup job runs every hour but it is not finding anything new, of course.

I would have to recreate the alias every day for this approach to work, right? Maybe this is not the best way to do it 😆
So if you are going to use filtered aliases in combination with rollup jobs, be careful: you cannot match indices before they are created, even if you use a pattern (logstash-*) that would match those new indices.
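One way around this is to declare the filtered alias in an index template, so every new index matching the pattern receives the alias at creation time. A sketch using the legacy index template API available in these versions (template, alias, and field names are illustrative):

```json
PUT _template/http-errors-alias
{
  "index_patterns": ["logstash-*"],
  "aliases": {
    "http-errors": {
      "filter": { "term": { "status": 500 } }
    }
  }
}
```

The template only applies to indices created after it exists, so already-existing indices would still need a one-off `_aliases` call.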

It was too good to be true. Any other ideas?

@fbaligand
Contributor

I agree with @TheBronx, this is a really useful missing feature.
Aliases are stored on indices, so they are not flexible enough.

BTW, in Data Transforms we can define a query, so it would be coherent to have this option in rollup jobs too.
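For comparison, a transform does accept a query on its source. A sketch (index, field, and transform names are illustrative):

```json
PUT _transform/checkout-summary
{
  "source": {
    "index": "logstash-*",
    "query": { "term": { "url": "checkout" } }
  },
  "dest": { "index": "checkout-summary" },
  "pivot": {
    "group_by": {
      "hour": {
        "date_histogram": { "field": "@timestamp", "fixed_interval": "1h" }
      }
    },
    "aggregations": {
      "avg_amount": { "avg": { "field": "amount" } }
    }
  }
}
```

This is essentially the shape being requested for rollup jobs: the same `source.query` idea, applied before aggregation.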

@polyfractal
Contributor

I'm going to re-open this ticket as a placeholder. We're working on a big refactor of Rollup (changing how search works, integrating with ILM, etc) so this request is something we can reconsider in light of the new framework. It's a fairly common request so far over the lifetime of Rollup v1.

That said, I think a lot of the difficulties remain; could be trappy for the "consumer" of the rollup data if they don't know it has been filtered, and I'm not sure how it would work/look under the new setup. But now's the time to think through those things, hence the re-open :)

@fbaligand
Contributor

Thanks for re-open!
For the difficulty you mention: first, the person who creates the rollup is often the same person who consumes it. And if the consumer is surprised, they can still talk to the producer :)
To me, this is not a difficulty, just a fact.

@gunplar

gunplar commented May 11, 2021

Hi everyone, +1 for this feature as well.
On a side note, I still need to implement this feature somehow on Elastic 7.10. Is it possible to create a filtered alias of a data stream (not just indices), and then have a cron job re-run the aliasing to keep the alias up to date?
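If I understand the versions correctly, aliases on data streams themselves only arrived in a later release, so on 7.10 the filter would have to be attached to the stream's backing indices. A cron job could periodically re-run something like this so newly created backing indices get picked up (a sketch; the backing-index pattern, alias, and field are illustrative, and whether hidden backing indices accept aliases on 7.10 should be verified):

```json
POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": ".ds-my-stream-*",
        "alias": "my-stream-errors",
        "filter": { "term": { "status": 500 } }
      }
    }
  ]
}
```

Each run resolves the wildcard against the backing indices that exist at that moment, which is exactly the point-in-time behavior discussed above, hence the cron job.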

6 participants