Query documents before rollup #38837


Open
TheBronx opened this issue Feb 13, 2019 · 10 comments
Labels
>enhancement :StorageEngine/Rollup Turn fine-grained time-based data into coarser-grained data

Comments

@TheBronx

Describe the feature:
When rolling up data, it would be nice to filter documents with a query. That is, instead of rolling up all documents on an index (or index pattern), aggregate only those that match the query.

The reason behind this is that once you roll up data, you cannot query it, and it would probably be too complex to store aggregated data in a way that supports certain queries. But filtering during the rollup job should be "easier" (I hope!), and that would be really useful.
For example, if we are storing HTTP requests on an index, we could create a few rollup jobs:

  • all documents, to see the overall traffic and maybe an average response time
  • documents matching q=status:500 to track errors over time
  • documents matching q=url:checkout to track interesting endpoints, maybe with a sum for the "amount" field too, to see the evolution of sales I don't know
  • ...

(Each of these rollup jobs would go to a different rollup index of course)
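To make the request concrete, a job for the second bullet might look like this. Note that the `query` field here is entirely hypothetical (it is the proposed addition, not part of the Rollup API), and all index, field, and job names are made up for illustration:

```json
PUT _rollup/job/http-errors-rollup
{
  "index_pattern": "logstash-*",
  "rollup_index": "rollup-http-errors",
  "cron": "0 0 * * * ?",
  "page_size": 1000,
  "query": { "term": { "status": 500 } },
  "groups": {
    "date_histogram": { "field": "@timestamp", "fixed_interval": "60m" }
  },
  "metrics": [
    { "field": "response_time", "metrics": ["avg"] }
  ]
}
```

Everything except `query` follows the existing rollup job shape; the idea is that only documents matching the query would be fed into the aggregation.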

This would be so powerful! What do you think?
Thank you!

I have found another issue here that sounds similar to me, but I am not sure, so please feel free to close this one if the idea behind it is the same: #34921
I also posted this on your discourse: https://discuss.elastic.co/t/filtering-documents-for-rollup/167417

@cbuescher cbuescher added >enhancement :StorageEngine/Rollup Turn fine-grained time-based data into coarser-grained data labels Feb 13, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-analytics-geo

@cbuescher
Member

@TheBronx thanks for opening this issue. From what I understand so far, what you are trying to do can already be achieved using Filtered Aliases. You would define different aliases for your subset of documents and then point the rollup job to those. I haven't tried this in practice though, maybe @polyfractal has ideas about this or knows alternative approaches?
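Concretely, that suggestion is something like the following (a sketch; the index pattern, alias name, and field are illustrative):

```json
POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "logstash-*",
        "alias": "http-errors",
        "filter": { "term": { "status": 500 } }
      }
    }
  ]
}
```

The rollup job's `index_pattern` would then point at `http-errors` instead of `logstash-*`, so only documents matching the alias filter are rolled up.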

@TheBronx
Author

Okay, it actually works!
Creating the alias is a bit less "dummy friendly" than filling in an input field in Kibana, haha, but on the other hand it works now 😄
The rollup job seems to be working fine, and the "overhead" of aliases is pretty much none, right?
I didn't know about index aliases, thank you so much @cbuescher

@cbuescher
Member

Great to hear! Maybe there are even simpler ways that @polyfractal knows about, so let's wait a bit for his thoughts, but I think after that we can close this.

@cbuescher cbuescher reopened this Feb 13, 2019
@polyfractal
Contributor

Filtered aliases would be the best (and I think only) way to do it right now. We made a decision to not allow filtering on the rollup job itself, to prevent a "mismatch" between the input data and the output rollup data. E.g. it might be confusing for a user consuming rollup data to see data missing, if they aren't aware that the job itself was filtered.

We may loosen that restriction in the future. But until then, a filtered alias would be the best way to do it.

the "overhead" of aliases is pretty much none right?

That's correct, the alias itself is essentially free, so the only extra cost is adding the filter itself :)

@TheBronx
Author

It is me again, I just found a problem with this approach 😢
The documentation for Elasticsearch aliases says:

In this case, the alias is a point-in-time alias that will group all current indices that match, it will not automatically update as new indices that match this pattern are added/removed.

And that is exactly what I did, because I am using Logstash:

[screenshot of the alias configuration]

The alias matches all the indices that existed when I created it (2019.02.13), but no more data is being aggregated after that. The rollup job runs every hour but it is not finding anything new, of course.

I would have to recreate the alias every day for this approach to work, right? Maybe this is not the best way to do it 😆
So if you are going to use filtered aliases in combination with rollup jobs, be careful: you cannot match indices before they are created, even if you use a pattern (logstash-*) that would match those new indices.
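One way around this is to declare the filtered alias in an index template, so every new index matching the pattern receives the alias at creation time. A sketch using the legacy index template API available in these versions (template, alias, and field names are illustrative):

```json
PUT _template/http-errors-alias
{
  "index_patterns": ["logstash-*"],
  "aliases": {
    "http-errors": {
      "filter": { "term": { "status": 500 } }
    }
  }
}
```

The template only applies to indices created after it exists, so already-existing indices would still need a one-off `_aliases` call.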

It was too good to be true. Any other ideas?

@fbaligand
Contributor

I agree with @TheBronx, this is a really useful missing feature.
Aliases are stored on indices, so they are not flexible enough.

BTW, in Data Transforms we can define a query, so it would be coherent to have this option in rollup jobs too.
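For comparison, a transform does accept a query on its source. A sketch (index, field, and transform names are illustrative):

```json
PUT _transform/checkout-summary
{
  "source": {
    "index": "logstash-*",
    "query": { "term": { "url": "checkout" } }
  },
  "dest": { "index": "checkout-summary" },
  "pivot": {
    "group_by": {
      "hour": {
        "date_histogram": { "field": "@timestamp", "fixed_interval": "1h" }
      }
    },
    "aggregations": {
      "avg_amount": { "avg": { "field": "amount" } }
    }
  }
}
```

This is essentially the shape being requested for rollup jobs: the same `source.query` idea, applied before aggregation.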

@polyfractal
Contributor

I'm going to re-open this ticket as a placeholder. We're working on a big refactor of Rollup (changing how search works, integrating with ILM, etc) so this request is something we can reconsider in light of the new framework. It's a fairly common request so far over the lifetime of Rollup v1.

That said, I think a lot of the difficulties remain; could be trappy for the "consumer" of the rollup data if they don't know it has been filtered, and I'm not sure how it would work/look under the new setup. But now's the time to think through those things, hence the re-open :)

@fbaligand
Contributor

Thanks for re-open!
For the difficulty you mention: first, the person who creates the rollup is often the same person who consumes it. And if the consumer is surprised, they can still talk to the producer :)
To me, this is not a difficulty, just a fact.

@gunplar

gunplar commented May 11, 2021

Hi everyone, +1 for this feature as well.
On a side note, I still need to implement this feature somehow on Elastic 7.10. Is it possible to create a filtered alias of a data stream (not just indices), and then have a cron job re-run the aliasing to keep the alias up to date?
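If I understand the versions correctly, aliases on data streams themselves only arrived in a later release, so on 7.10 the filter would have to be attached to the stream's backing indices. A cron job could periodically re-run something like this so newly created backing indices get picked up (a sketch; the backing-index pattern, alias, and field are illustrative, and whether hidden backing indices accept aliases on 7.10 should be verified):

```json
POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": ".ds-my-stream-*",
        "alias": "my-stream-errors",
        "filter": { "term": { "status": 500 } }
      }
    }
  ]
}
```

Each run resolves the wildcard against the backing indices that exist at that moment, which is exactly the point-in-time behavior discussed above, hence the cron job.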

6 participants