Index date field data with lower precision #64662
Conversation
Pinging @elastic/es-search (:Search/Search)
Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)
@jpountz, you might be interested in this :)
Definitely! This makes a lot of sense to me, as queries are very unlikely to care about millisecond accuracy, even when indexing with ….

In order to simplify sorting optimizations in Lucene, we'd like to be able to rely on the assumption that points and doc values store the same data, which is an assumption that breaks with your change. So maybe we should index the field in a hidden sub-field, like we do for ….

Regarding the rounding, I'd have a preference for rounding down rather than towards 0, i.e. using …
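To make the difference concrete, here is a minimal Java sketch (not taken from the PR) of the two rounding behaviours for a pre-epoch timestamp; the 1-minute unit is assumed purely for illustration:

```java
// Minimal sketch: rounding a pre-epoch timestamp to a 1-minute unit.
// Plain division truncates towards zero, Math.floorDiv rounds down.
long unitMillis = 60_000L;
long preEpochMillis = -61_000L;                                            // 1969-12-31T23:58:59Z
long towardsZero = (preEpochMillis / unitMillis) * unitMillis;             // -60_000 (moves the value forward in time)
long roundedDown = Math.floorDiv(preEpochMillis, unitMillis) * unitMillis; // -120_000 (stays at or before the value)
```

Rounding down keeps every indexed point at or before the original value, which is presumably what makes rewrites over the rounded points safe for pre-epoch dates.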
Great idea and an exciting change. I am very interested in how much disk space we can save with it, and also in potential speedups on queries.
Our plan is to move all the sort optimization from ES to Lucene. But I think we can change the Lucene sort optimization logic to incorporate lower precision by making range checks less selective. We can also change the Lucene API to account for lower precision, something like: …
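As a purely hypothetical illustration of what "less selective" could mean here (this is not the snippet the comment refers to), a sort optimization that prunes on the point index would have to compare against the rounded-down bound when the points only carry minute precision:

```java
// Hypothetical sketch, assuming points store values rounded down to 1 minute.
// A pruning check on the point index must round the competitive bound down too:
// every document whose true value is within the bound still passes, but documents
// whose true value falls later in the same minute also pass, so the check is
// less selective than with full-precision points.
long unitMillis = 60_000L;
long competitiveBoundMillis = 90_500L;  // bound maintained by the comparator
long pointBound = Math.floorDiv(competitiveBoundMillis, unitMillis) * unitMillis; // 60_000
// Only documents whose indexed (rounded) point value is greater than pointBound
// can safely be skipped.
```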
I like the approach too, but I am a bit concerned by the complexity that it brings.
For some use cases (observability), the rounding can be done by the client or in an ingest processor if millisecond precision is not needed (probes computed every 10s, for instance).
Do you have some numbers regarding the benefit of this approach? I'd expect significant differences to counterbalance the added complexity.
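For example, if probes are only emitted every 10 seconds, the client (or an ingest processor) could round the timestamp before indexing. A minimal java.time sketch, with the 10-second interval assumed purely for illustration:

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Minimal sketch of client-side rounding to 10 seconds before indexing;
// a post-epoch timestamp is assumed so the modulo rounds down.
Instant probe = Instant.parse("2020-11-05T12:34:56.789Z");
Instant rounded = probe.truncatedTo(ChronoUnit.SECONDS)
        .minusSeconds(probe.getEpochSecond() % 10);   // 2020-11-05T12:34:50Z
```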
The point here is that you do not lose precision: doc values contain the date with millisecond precision, and they are used in case a query requires that precision. I ran the event data track for Rally, and in that case it shows good savings in storage space: …
Having said that, I agree about the complexity it brings. It will break some optimisations (e.g. the min aggregation can no longer be computed from the index), and as the PR shows, it breaks quite a few tests.
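To illustrate that first point (this is an assumption about how the existing shortcut works, not code from this PR): the segment-level minimum that can be read from the points index would now be the rounded-down value, so a min aggregation could no longer be answered from the index alone and would have to fall back to doc values.

```java
import java.io.IOException;

import org.apache.lucene.document.LongPoint;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.PointValues;

// Hypothetical illustration; "timestamp" is a hypothetical field name.
// With minute-rounded points, this minimum is the rounded value, not the
// true millisecond minimum kept in doc values.
static long indexMinMillis(IndexReader reader) throws IOException {
    byte[] minPacked = PointValues.getMinPackedValue(reader, "timestamp");
    return LongPoint.decodeDimension(minPacked, 0);
}
```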
@nik9000 I'm putting this on your radar as well, as you've been looking into using the points index to speed up date histograms.
Oh, it already is on my radar! We merged the points optimization yesterday, so we can compare. I think the important bit of this PR is that it still uses the PointRangeQuery if the bounds line up with the rounding; I'm fairly sure it does right now, so it shouldn't slow down date_histogram. I think this is OK for me, though it'd be worth adding some tests to make sure the optimization stays intact. But that can come once we're more sure we're OK with this approach.
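A hedged sketch of the kind of check being described, assuming a 1-minute index precision and using a hypothetical helper: only when both bounds of the requested range line up with the rounding can the rounded point index answer the range exactly; otherwise the query has to go through the full-precision doc values.

```java
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.SortedNumericDocValuesField;
import org.apache.lucene.search.Query;

// Hypothetical sketch, not the PR's actual code. Bounds are inclusive epoch millis.
static Query dateRangeQuery(String field, long fromMillis, long toMillis) {
    long unitMillis = 60_000L; // assumed 1-minute index precision
    boolean aligned = Math.floorMod(fromMillis, unitMillis) == 0
            && Math.floorMod(toMillis + 1, unitMillis) == 0;
    if (aligned) {
        // Bucket-aligned bounds, as produced by date_histogram, match exactly the
        // same documents on the rounded points, so the points index can still be used.
        return LongPoint.newRangeQuery(field, fromMillis, toMillis);
    }
    // Misaligned bounds cannot be answered exactly by the rounded points,
    // so fall back to the full-precision doc values.
    return SortedNumericDocValuesField.newSlowRangeQuery(field, fromMillis, toMillis);
}
```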
too complex for now |
This PR is a prototype for indexing date data with lower precision. Higher-precision requests are resolved using doc values, which contain the full-precision information. The index currently uses 1-minute precision for both millisecond and nanosecond resolutions.
This change saves quite a bit of storage in the index and speeds up queries that do not need high precision.
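In Lucene terms, the idea boils down to something like the following sketch (the field name and the 1-minute unit are illustrative; this is not the PR's actual mapper code): the points get the rounded-down value while doc values keep the original milliseconds.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.SortedNumericDocValuesField;

// Sketch of the indexing side, assuming 1-minute precision for the points.
long unitMillis = 60_000L;
long timestampMillis = 1_604_580_896_789L;  // original value, millisecond precision
long roundedMillis = Math.floorDiv(timestampMillis, unitMillis) * unitMillis;

Document doc = new Document();
doc.add(new LongPoint("timestamp", roundedMillis));                      // lower precision in the BKD tree
doc.add(new SortedNumericDocValuesField("timestamp", timestampMillis));  // full precision in doc values
```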
The big issue currently is that it breaks some optimisations that exist around the code base for this type of data. My feeling is that we first need to refactor those optimisations into the resolution object; then we can really do it.