[RollupV2]: make RollupAction available and improve some features #82944

weizijun · 2022-01-24T14:21:11Z

Background

Rollup is very useful in TSDB case. TSDB can use rollup to do timestamp downsample.
The Rollup refactoring is under development at Refactor rollups meta (AKA Rollup V2) #42720, and it seems to be blocked.

I know rollups is a bit of a complicated beast of a project. Yes, there may be a lot of design details to discuss. But downsampling is very useful in our use case. So I'm continuing to develop rollup refactoring.

Detail

I think the RollupAction is the basement of the rollup refactoring. So I do some improvement to make the API available to use and add some feature in the RollupAction API.

Here's a list of what I've done:

Fix RollupAction API bugs:

Fix when write sparse metric column, it will cause write exception.
Fix when read sparse group column, it will cause NPE exception.
Fix when date_histogram field spare, calculate range query exception.
Fix when source index not found, it will case NPE exception.
Fix when rollup index is exists, it will continue to rollup until finished rollup, cause resize exception.
Fix when origin index's shard is more than 1, RollupIndexerAction get shard list error.

Add some features

Rollup group, metrics support wildcard fields.
Support rollup index reuse the origin index's settings.
Support get rollup tasks status, and support cancel rollup tasks.
Support rollup group terms configured _tsid field.
Support rollup time_series index in the new execution mode(Add a flag to control in index order execution mode to aggregations #82129). It greatly improves the performance of rollup tasks.
Logging optimize: log the execution cost, optimize index failed log, and so on.
Rollup ilm policy index optimize, remove the ilm police in tmp index, and add back in the resize settings
Add many tests, to make the code coverage more than 90%.

Since this is the first PR of my rollup feature, I won't change the code that might affect other features.

Some features are ready in our internal repository. Here is the TODO list in the next PRs:

Support rollup aggregate_metric_double type metrics, as rollup index use aggregate_metric_double to store metric type, so we can't rollup a rollupe index.
Support search rollup index, now when rollup a data stream index, the rollup index will also add to data stream's concrete indices, It will cause error search result. And the search rollup function will select the best indices for a search DSL that match the rollup config.

There are a few more things to discuss:

If the origin index must be a read-only index? if yes, it will add the check logic.
Now I add the logic, that rollup a standard index to time_series, as the rollup index is made of timestamp\dimensions\metrics. Is it ok to do this? And how to deal with the start_time and end_time, now I add the largest bounds as the standard index has no bounds.
Now, there is only one thread per node to execute rollup task. In some node with many cpus, it may slow down rollup tasks. But adding more threads to execute rollup task will take more cpu costs, Maybe the thread size can be adapted to the node.processors setting.
How to handle the failed index docs, record the exception or failed the rollup task?
How to set the rollup index number_of_shards? Now the shard count of rollup index is the same with origin index. But in many case, the size of rollup index is much small than the origin index. Is it necessary to set the custom rollup index number_of_shards.
How to do anomaly testing in Elasticsearch automatically? The node shut down case? The disk grows to 100% case? The rolluping shard reallocate to other node case?

And about the Rollup performance, In this PR, I don't change the process of unsorted index. But I think it will have some performance improvement. I will add a todo task to think more about it.

package revert fixup fixup

weizijun · 2022-01-24T15:09:14Z

...ugin/rollup/src/main/java/org/elasticsearch/xpack/rollup/v2/indexer/metrics/MetricField.java

+import java.util.ArrayList;
+import java.util.List;
+
+public abstract class MetricField {


In this PR, I don't add the AggregateDoubleMetricField child class, as I will modify some code of AggregateDoubleMetric, I will add the class, and support aggregate_metric_double rollup in the next PR.

weizijun · 2022-01-25T03:25:45Z

@elasticmachine update branch

…PR, and remove the local TimeSeriesIndexSearcher

elasticmachine · 2022-01-26T19:17:06Z

Pinging @elastic/es-analytics-geo (Team:Analytics)

* upstream/master: (100 commits) Avoid duplicate _type fields in v7 compat layer (elastic#83239) Bump bundled JDK to 17.0.2+8 (elastic#83243) [DOCS] Correct header syntax (elastic#83275) Add unit tests for indices.recovery.max_bytes_per_sec default values (elastic#83261) [DOCS] Add note that write indices are not replicated (elastic#82997) Add notes on indexing to kNN search guide (elastic#83188) Fix get-snapshot-api :docs:integTest (elastic#83273) FilterPathBasedFilter support match fieldname with dot (elastic#83178) Fix compilation issues in example-plugins (elastic#83258) fix ClusterStateListener javadoc (elastic#83246) Speed up Building Indices Lookup in Metadata (elastic#83241) Mute whole suite for elastic#82502 (elastic#83252) Make PeerFinder log messages happier (elastic#83222) [Docs] Add supported _terms_enum field types (elastic#83244) Add an aggregator for IPv4 and IPv6 subnets (elastic#82410) [CI] Fix 70_time_series/default sort yaml test failures (elastic#83217) Update test-failure Issue Template to include "needs:triage" label elastic#83226 Add an index->step cache to the PolicyStepsRegistry (elastic#82316) Improve support for joda datetime to java datetime transition in Painless (elastic#83099) Fix joda migration for week based methods in Painless (elastic#83232) ... # Conflicts: # x-pack/plugin/rollup/src/main/java/org/elasticsearch/xpack/rollup/v2/TransportRollupAction.java

csoulios

@weizijun thank you for submitting this PR. It adds a lot of functionality and fixes many issues.

For a start, I would like to say that the existing RollupAction implementation is
more a POC than a robust production-ready code. We took the decision to stop
working on this code until the TSDB project has taken shape, so that we
can make the most of the TSDB features.

Time series indices present all those characteristics that allow us to optimise downsampling for performance as well as user experience. Therefore, for the first release of rollups, we intend to support downsampling only for time series indices. Implementing downsampling for non time series indices is not in our immediate plans.

Rollup configuration

So, starting with the RollupAction, we plan to simplify the rollup configuration
by reducing the input parameters to the bare minimum, now that we are able to
infer most of them from the source index mapping.

For example, for time series indices the the timestamp field is always called
@timestamp. So, we must only add the downsampling interval and the time zone.

Also, for downsampling we are not allowed to modify the dimensions and and we must keep _tsid as is. So, the groups part of the rollup configuration will be shrinked to only the downsampling interval.

In the same way, we can remove the metrics part of the configuration by
extracting the metric fields from the index mapping.

This would simplify the Rollup API for downsampling to something like
the following:

POST /<index>/_rollup/<rollup_index>
{
    "calendar_interval": "1M",
    "time_zone": "UTC" // Default value is UTC
}

Performance

Regarding performance, we plan to use the TimeSeriesMetricsService (#82129) so that we iterate over the time series index. This allows us to leverage the fact that
the index is sorted per _tsid

Finally, this PR contains so many fixes, improvements and changes that I would like us
to add in smaller chunks

weizijun · 2022-02-07T11:00:16Z

@csoulios, Thank you so much for your answer.

Time series indices present all those characteristics that allow us to optimise downsampling for performance as well as user experience. Therefore, for the first release of rollups, we intend to support downsampling only for time series indices. Implementing downsampling for non time series indices is not in our immediate plans.

yeah, I do the performance improvement for time_series indices in TimeSeriesRollupShardIndexer, it use TimeSeriesIndexSearcher to iterator all doc in order. It can great faster than the normal indices. And it's okay about use TimeSeriesMetricsService to improve the performance, we can have more discussion.

This would simplify the Rollup API for downsampling to something like the following:
POST /<index>/_rollup/<rollup_index>
{
    "calendar_interval": "1M",
    "time_zone": "UTC" // Default value is UTC
}

About Rollup configuration, I think TSDB downsampling is a special case of Rollup. Downsampling terms group has only _tsid, and metrics are all fields with time_series_metric configured, and the date_histogram field is @timestamp. Users only need to configure interval and time_zone.

I prefer downsampling is a new API, like:

POST /<index>/_downsample/<rollup_index>
{
    "fixed_interval": "10m"
}

it equals the API:

POST /<index>/_rollup/<rollup_index>
{
  "groups": {
    "date_histogram": {
      "field": "@timestamp",
      "fixed_interval": "10m"
    },
    "terms": {
      "fields": [
        "_tsid"
      ]
    }
  },
  "metrics": [
    {
      "field": "m1",
      "metrics": [
        "min",
        "max",
        "sum",
        "value_count"
      ]
    },
    {
      "field": "m2",
      "metrics": [
        "min",
        "max",
        "sum",
        "value_count"
      ]
    }
  ]
}

The new DownsampleAction can reuse many code with RollupAction.
We Alibaba Cloud Elasticsearch has been practice to use RollupAction downsample TSDB indices. User only configure the interval settings, our service generate the RollupAction config, it's easy to use.

The code of this PR is our practice in Alibaba Cloud Elasticsearch, so I add many feature. I hope we can have more deep discuss, and to see how to implement the rollup and downsample feature.

* upstream/master: (166 commits) Bind host all instead of just _site_ when needed (elastic#83145) [DOCS] Fix min/max agg snippets for histograms (elastic#83695) [DOCS] Add deprecation notice for system indices (elastic#83688) Cache ILM policy name on IndexMetadata (elastic#83603) [DOCS] Fix 8.0 breaking changes sort order (elastic#83685) [ML] fix random sampling background query consistency (elastic#83676) Move internal APIs into their own namespace '_internal' Runtime fields core-with-mapped tests support tsdb (elastic#83577) Optimize calculating the presence of a quorum (elastic#83638) Use switch expressions in EnableAllocationDecider and NodeShutdownAllocationDecider (elastic#83641) Note libffi error message in tmpdir docs (elastic#83662) Fix TransportDesiredNodesActionsIT batch tests (elastic#83659) [DOCS] Remove unused upgrade doc files (elastic#83617) [ML] Wait for model process to stop in stop deployment (elastic#83644) [ML] Fix submit after shutdown in process worker service (elastic#83645) Remove req/resp classes associated with HLRC (elastic#83599) Introduce index.version.compatibility setting (elastic#83264) Rename InternalTestCluster#getMasterNodeInstance (elastic#83407) Mute TimeSeriesIndexSearcherTests testCollectInOrderAcrossSegments (elastic#83648) Add rollover add max_primary_shard_docs condition (elastic#80981) ... # Conflicts: # x-pack/plugin/rollup/build.gradle # x-pack/plugin/rollup/src/test/java/org/elasticsearch/xpack/rollup/v2/RollupActionSingleNodeTests.java

* upstream/master: (167 commits) Mute FrozenSearchableSnapshotsIntegTests#testCreateAndRestorePartialSearchableSnapshot Mute LdapSessionFactoryTests#testSslTrustIsReloaded Fix spotless violation from last commit Mute GeoGridTilerTestCase#testGeoGridSetValuesBoundingBoxes_UnboundedGeoShapeCellValues Small formatting clean up (elastic#84144) Always re-run Feature migrations which have encountered errors (elastic#83918) [DOCS] Clarify `orientation` usage for WKT and GeoJSON polygons (elastic#84025) Group field caps response by index mapping hash (elastic#83494) Shrink join queries in slow log (elastic#83914) TSDB: Reject the nested object fields that are configured time_series_dimension (elastic#83920) [DOCS] Remove note about partial response from Bulk API docs (elastic#84053) Allow regular data streams to be migrated to tsdb data streams. (elastic#83843) [DOCS] Fix `ignore_unavailable` parameter definition (elastic#84071) Make Metadata extend AbstractCollection (elastic#83791) Add API specs for OpenID Connect APIs Revert "Clean up for superuser role name references (elastic#83627)" (elastic#84096) Update Lucene analysis base url (elastic#84094) Avoid null threadContext in ResultDeduplicator (elastic#84093) Use static empty store files metadata (elastic#84034) Preserve context in snapshotDeletionListeners (elastic#84089) ... # Conflicts: # x-pack/plugin/rollup/build.gradle

csoulios · 2022-06-01T15:18:53Z

Closing this PR because it overlaps with the work at PR #85708

@weizijun and I have discussed about the dowsampling implementation and the path forward.
Thank you for this contribution.

add rollup improvement

faa1f59

package revert fixup fixup

elasticsearchmachine added v8.1.0 external-contributor Pull request authored by a developer outside the Elasticsearch team labels Jan 24, 2022

some improvement

d6c7793

weizijun commented Jan 24, 2022

View reviewed changes

elasticmachine and others added 3 commits January 24, 2022 20:25

Merge branch 'master' into rollup-index-improvement

e54a1e2

PR: elastic#82129 has merged, use the TimeSeriesIndexSearcher in the …

d266d9c

…PR, and remove the local TimeSeriesIndexSearcher

add some comments

531263c

nik9000 assigned csoulios Jan 26, 2022

nik9000 added the :StorageEngine/Rollup Turn fine-grained time-based data into coarser-grained data label Jan 26, 2022

elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Jan 26, 2022

mark-vieira added v8.2.0 and removed v8.1.0 labels Feb 2, 2022

csoulios reviewed Feb 4, 2022

View reviewed changes

weizijun added 3 commits February 9, 2022 17:28

Cleanup data steams teardown

81a36d2

changelog

76b7aee

csoulios added the :StorageEngine/TSDB You know, for Metrics label Feb 10, 2022

salvatore-campagna added v8.3.0 and removed v8.2.0 labels Mar 30, 2022

weizijun mentioned this pull request May 21, 2022

TSDB: Implement downsampling on time-series indices #85708

Merged

craigtaverner added v8.4.0 and removed v8.3.0 labels May 25, 2022

csoulios closed this Jun 1, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[RollupV2]: make RollupAction available and improve some features #82944

[RollupV2]: make RollupAction available and improve some features #82944

Uh oh!

weizijun commented Jan 24, 2022

Uh oh!

weizijun Jan 24, 2022

Uh oh!

weizijun commented Jan 25, 2022

Uh oh!

elasticmachine commented Jan 26, 2022

Uh oh!

csoulios left a comment

Uh oh!

weizijun commented Feb 7, 2022

Uh oh!

csoulios commented Jun 1, 2022

Uh oh!

Uh oh!

[RollupV2]: make RollupAction available and improve some features #82944

[RollupV2]: make RollupAction available and improve some features #82944

Uh oh!

Conversation

weizijun commented Jan 24, 2022

Background

Detail

Uh oh!

weizijun Jan 24, 2022

Choose a reason for hiding this comment

Uh oh!

weizijun commented Jan 25, 2022

Uh oh!

elasticmachine commented Jan 26, 2022

Uh oh!

csoulios left a comment

Choose a reason for hiding this comment

Rollup configuration

Performance

Uh oh!

weizijun commented Feb 7, 2022

Uh oh!

csoulios commented Jun 1, 2022

Uh oh!

Uh oh!