
Index date field data with lower precision #64662

Closed

Conversation

iverase
Contributor

@iverase commented Nov 5, 2020

This PR is a prototype for indexing date data with lower precision. Higher-precision requests are resolved using doc values, which contain the full-precision information. The index currently uses 1-minute precision for both millisecond and nanosecond resolutions.

This change saves quite a bit of storage in the index and speeds up queries that do not need high precision.

The big issue currently is that it breaks some optimisations that exist around the code base for this type of data. My feeling is that we first need to refactor those optimisations into the resolution object; then we can really do it.
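
As a rough illustration only (not the PR's actual mapping code; the field name and the choice of Lucene field types here are assumptions), the idea at the Lucene level is to index the timestamp rounded down to the minute as a point while keeping the exact millisecond value in doc values:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.SortedNumericDocValuesField;

class LowerPrecisionDateIndexing {
    static final long MINUTE_MS = 60_000L;

    // Points get the timestamp rounded down to the minute (smaller BKD tree),
    // doc values keep the exact millisecond value for precise queries/aggs.
    static Document dateDocument(long millis) {
        long minuteFloor = Math.floorDiv(millis, MINUTE_MS) * MINUTE_MS;
        Document doc = new Document();
        doc.add(new LongPoint("@timestamp", minuteFloor));
        doc.add(new SortedNumericDocValuesField("@timestamp", millis));
        return doc;
    }
}
```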

@iverase added the :Analytics/Aggregations and :Search/Search labels Nov 5, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-search (:Search/Search)

@elasticmachine
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

@elasticmachine added the Team:Search and Team:Analytics labels Nov 5, 2020
@iverase
Contributor Author

iverase commented Nov 5, 2020

@jpountz, you might be interested in this :)

@iverase marked this pull request as draft November 5, 2020 16:51
@jpountz
Contributor

jpountz commented Nov 5, 2020

Definitely! This makes a lot of sense to me as queries are very unlikely to care about the millisecond accuracy, even when indexing with date_nanos. You went further and seem to assume that minute accuracy would be good enough, which I need to think more about.

In order to simplify sorting optimizations in Lucene, we'd like to be able to rely on the assumption that points and doc values store the same data, which is an assumption that breaks with your change. So maybe we should put the indexed field in a hidden sub-field, like we do for index_phrases/index_prefixes. @mayya-sharipova @jimczi I wonder if you have thoughts about this.

Regarding the rounding, I'd have a preference for rounding down rather than towards 0, i.e. using Math.floorDiv instead of plain division. It would make things easier to reason about for me.
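
For illustration (not from the original comment), the two only differ for timestamps before the epoch, where truncating division rounds towards zero instead of down:

```java
class RoundingDirectionExample {
    public static void main(String[] args) {
        long beforeEpoch = -30_000L; // 1969-12-31T23:59:30Z
        long towardsZero = (beforeEpoch / 60_000L) * 60_000L;             // 0      -> 1970-01-01T00:00:00Z
        long roundedDown = Math.floorDiv(beforeEpoch, 60_000L) * 60_000L; // -60000 -> 1969-12-31T23:59:00Z
        System.out.println(towardsZero + " vs " + roundedDown);
    }
}
```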

@mayya-sharipova
Contributor

mayya-sharipova commented Nov 6, 2020

Great idea and exciting change! I am very interested in how much disk space we can save with it, and also in the potential speedups on queries.

The big issue currently is that it breaks some optimisations that exist around the code base for this type of data. My feeling is that we first need to refactor those optimisations into the resolution object; then we can really do it.

Our plan is to move all the sort optimization from ES to Lucene.
And indeed, currently, for the sort optimization to work we need to have the same data in points and doc values, which a user needs to indicate with sortField.setCanUsePoints().

But I think we can change the Lucene sort optimization logic to incorporate lower precision by making the range checks less selective. We could also change the Lucene API to account for lower precision, something like sortField.usePoints(6000).
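
A hedged sketch of that direction (a hypothetical helper, not an existing Lucene or Elasticsearch API): when sorting descending, the competitive lower bound used to skip documents via the points index has to be rounded down to the point precision, which keeps the skipping safe but less selective.

```java
class LowerPrecisionSortSketch {
    // Hypothetical helper: points store values rounded down to `precision`,
    // doc values keep exact values. For a descending sort, a document can only
    // be skipped when its point value falls below the competitive bound rounded
    // down to the same precision; anything at or above it must still be
    // verified against the exact doc value.
    static long pointCompetitiveLowerBound(long exactCompetitiveBound, long precision) {
        return Math.floorDiv(exactCompetitiveBound, precision) * precision;
    }
}
```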

Contributor

@jimczi left a comment

I like the approach too, but I am a bit concerned about the complexity that it brings.
For some use cases (observability), the rounding can be done by the client or in an ingest processor if millisecond precision is not needed (probes computed every 10s, for instance).
Do you have some numbers regarding the benefit of this approach? I'd expect significant differences to counter-balance the added complexity.

@iverase
Contributor Author

iverase commented Nov 10, 2020

the rounding can be done by the client or in an ingest processor

The point here is that you do not lose precision: doc values contain the date with millisecond precision and are used whenever a query requires such precision.
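
To make that concrete (an illustrative sketch with an assumed field name, not the PR's actual query code), the low-precision points can pre-filter candidates while the full-precision doc values verify the exact bounds, so the combined query stays millisecond-exact:

```java
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.SortedNumericDocValuesField;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

class LowerPrecisionRangeQuery {
    static final long MINUTE_MS = 60_000L;

    // [from, to] are inclusive millisecond bounds. The point query matches a
    // superset of documents at minute precision; the slow doc-values range
    // verifies the exact bounds, so their conjunction is still exact.
    static Query dateRange(long from, long to) {
        long fromMinute = Math.floorDiv(from, MINUTE_MS) * MINUTE_MS;
        long toMinute = Math.floorDiv(to, MINUTE_MS) * MINUTE_MS;
        Query approx = LongPoint.newRangeQuery("@timestamp", fromMinute, toMinute);
        Query exact = SortedNumericDocValuesField.newSlowRangeQuery("@timestamp", from, to);
        return new BooleanQuery.Builder()
                .add(approx, BooleanClause.Occur.FILTER)
                .add(exact, BooleanClause.Occur.FILTER)
                .build();
    }
}
```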

I ran the event data track with Rally, and in that case it shows good savings in storage space:

|                                                         Metric |         Task |    Baseline |   Contender |     Diff |   Unit |
|--------------------------------------------------------------:|-------------:|------------:|------------:|---------:|-------:|
|                    Cumulative indexing time of primary shards |              |     28.5395 |      29.624 |  1.08447 |    min |
|             Min cumulative indexing time across primary shard |              |     5.65355 |     5.87945 |   0.2259 |    min |
|          Median cumulative indexing time across primary shard |              |     5.67572 |     5.93058 |  0.25487 |    min |
|             Max cumulative indexing time across primary shard |              |     5.84443 |     5.97763 |   0.1332 |    min |
|           Cumulative indexing throttle time of primary shards |              |           0 |           0 |        0 |    min |
|    Min cumulative indexing throttle time across primary shard |              |           0 |           0 |        0 |    min |
| Median cumulative indexing throttle time across primary shard |              |           0 |           0 |        0 |    min |
|    Max cumulative indexing throttle time across primary shard |              |           0 |           0 |        0 |    min |
|                       Cumulative merge time of primary shards |              |     17.7744 |     17.2519 | -0.52255 |    min |
|                      Cumulative merge count of primary shards |              |         100 |          95 |       -5 |        |
|                Min cumulative merge time across primary shard |              |     3.16118 |     3.17163 |  0.01045 |    min |
|             Median cumulative merge time across primary shard |              |     3.65662 |      3.5197 | -0.13692 |    min |
|                Max cumulative merge time across primary shard |              |     3.94325 |      3.8275 | -0.11575 |    min |
|              Cumulative merge throttle time of primary shards |              |     9.25965 |     8.57765 |   -0.682 |    min |
|       Min cumulative merge throttle time across primary shard |              |     1.36503 |     1.42643 |   0.0614 |    min |
|    Median cumulative merge throttle time across primary shard |              |      1.9685 |     1.79625 | -0.17225 |    min |
|       Max cumulative merge throttle time across primary shard |              |      2.2988 |      2.0478 |   -0.251 |    min |
|                     Cumulative refresh time of primary shards |              |     2.11157 |     1.94973 | -0.16183 |    min |
|                    Cumulative refresh count of primary shards |              |         250 |         244 |       -6 |        |
|              Min cumulative refresh time across primary shard |              |    0.411033 |    0.376367 | -0.03467 |    min |
|           Median cumulative refresh time across primary shard |              |    0.420417 |    0.386233 | -0.03418 |    min |
|              Max cumulative refresh time across primary shard |              |    0.435333 |    0.408967 | -0.02637 |    min |
|                       Cumulative flush time of primary shards |              |     2.27405 |     1.75182 | -0.52223 |    min |
|                      Cumulative flush count of primary shards |              |          31 |          31 |        0 |        |
|                Min cumulative flush time across primary shard |              |    0.430533 |    0.313983 | -0.11655 |    min |
|             Median cumulative flush time across primary shard |              |    0.448967 |    0.356933 | -0.09203 |    min |
|                Max cumulative flush time across primary shard |              |      0.5132 |     0.38015 | -0.13305 |    min |
|                                       Total Young Gen GC time |              |      15.829 |      17.161 |    1.332 |      s |
|                                      Total Young Gen GC count |              |        2200 |        2247 |       47 |        |
|                                         Total Old Gen GC time |              |           0 |           0 |        0 |      s |
|                                        Total Old Gen GC count |              |           0 |           0 |        0 |        |
|                                                    Store size |              |     7.88101 |     6.73512 | -1.14589 |     GB |
|                                                 Translog size |              | 2.56114e-07 | 2.56114e-07 |        0 |     GB |
|                                        Heap used for segments |              |    0.907894 |    0.823696 |  -0.0842 |     MB |
|                                      Heap used for doc values |              |   0.0713234 |   0.0601692 | -0.01115 |     MB |
|                                           Heap used for terms |              |    0.712372 |    0.649933 | -0.06244 |     MB |
|                                           Heap used for norms |              |           0 |           0 |        0 |     MB |
|                                          Heap used for points |              |           0 |           0 |        0 |     MB |
|                                   Heap used for stored fields |              |    0.124199 |    0.113594 |  -0.0106 |     MB |
|                                                 Segment count |              |         251 |         229 |      -22 |        |
|                                                Min Throughput | index-append |     99769.6 |     99371.2 | -398.383 | docs/s |
|                                             Median Throughput | index-append |      101375 |      102459 |  1084.17 | docs/s |
|                                                Max Throughput | index-append |      101893 |      103086 |  1193.31 | docs/s |
|                                       50th percentile latency | index-append |     297.611 |     358.769 |  61.1581 |     ms |
|                                       90th percentile latency | index-append |      536.91 |     630.187 |  93.2769 |     ms |
|                                       99th percentile latency | index-append |     940.968 |     960.808 |  19.8405 |     ms |
|                                     99.9th percentile latency | index-append |     1082.01 |     1106.67 |   24.659 |     ms |
|                                      100th percentile latency | index-append |     1103.57 |      1539.8 |  436.226 |     ms |
|                                  50th percentile service time | index-append |     297.611 |     358.769 |  61.1581 |     ms |
|                                  90th percentile service time | index-append |      536.91 |     630.187 |  93.2769 |     ms |
|                                  99th percentile service time | index-append |     940.968 |     960.808 |  19.8405 |     ms |
|                                99.9th percentile service time | index-append |     1082.01 |     1106.67 |   24.659 |     ms |
|                                 100th percentile service time | index-append |     1103.57 |      1539.8 |  436.226 |     ms |
|                                                    error rate | index-append |           0 |           0 |        0 |      % |

Having said that, I agree about the complexity it brings. It will break some optimisations (e.g. the min aggregation can no longer be done using the index), and as the PR shows, it breaks quite a few tests.

@jpountz
Contributor

jpountz commented Nov 10, 2020

@nik9000 I'm putting this on your radar as well as you've been looking into using the points index to speed up date histograms.

@nik9000
Member

nik9000 commented Nov 10, 2020

@nik9000 I'm putting this on your radar as well as you've been looking into using the points index to speed up date histograms.

Oh, it already is on my radar! We merged the points optimization yesterday so we can compare. I think the important bit of this PR is that it still uses the PointRangeQuery if the bounds line up with the rounding. I'm fairly sure it does right now. So it shouldn't slow down date_histogram. I think this is ok for me. Though I think it'd be worth adding some tests around making sure the optimization stays intact. But that can come once we're more sure we're ok with this approach.
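
For reference, a hedged sketch of that alignment condition (illustrative only, not the PR's code): when the inclusive bounds cover whole minutes, the minute-precision point range matches exactly the same documents as the millisecond range, so no doc-values verification is needed.

```java
class AlignedBoundsCheck {
    static final long MINUTE_MS = 60_000L;

    // Inclusive [from, to] covers whole minute buckets when `from` starts a
    // minute and `to + 1` starts the next one; only then can the query fall
    // back to a plain point range at minute precision without losing accuracy.
    static boolean alignedToMinutes(long from, long to) {
        return Math.floorMod(from, MINUTE_MS) == 0 && Math.floorMod(to + 1, MINUTE_MS) == 0;
    }
}
```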

@iverase
Contributor Author

iverase commented Oct 6, 2021

too complex for now

@iverase closed this Oct 6, 2021