Skip to content

[RollupV2]: make RollupAction available and improve some features #82944

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 10 commits into from

Conversation

weizijun
Copy link
Contributor

Background

  1. Rollup is very useful in TSDB case. TSDB can use rollup to do timestamp downsample.
  2. The Rollup refactoring is under development at Refactor rollups meta (AKA Rollup V2) #42720, and it seems to be blocked.

I know rollups is a bit of a complicated beast of a project. Yes, there may be a lot of design details to discuss. But downsampling is very useful in our use case. So I'm continuing to develop rollup refactoring.

Detail

I think the RollupAction is the basement of the rollup refactoring. So I do some improvement to make the API available to use and add some feature in the RollupAction API.

Here's a list of what I've done:

  1. Fix RollupAction API bugs:
  • Fix when write sparse metric column, it will cause write exception.
  • Fix when read sparse group column, it will cause NPE exception.
  • Fix when date_histogram field spare, calculate range query exception.
  • Fix when source index not found, it will case NPE exception.
  • Fix when rollup index is exists, it will continue to rollup until finished rollup, cause resize exception.
  • Fix when origin index's shard is more than 1, RollupIndexerAction get shard list error.
  1. Add some features
  • Rollup group, metrics support wildcard fields.
  • Support rollup index reuse the origin index's settings.
  • Support get rollup tasks status, and support cancel rollup tasks.
  • Support rollup group terms configured _tsid field.
  • Support rollup time_series index in the new execution mode(Add a flag to control in index order execution mode to aggregations #82129). It greatly improves the performance of rollup tasks.
  • Logging optimize: log the execution cost, optimize index failed log, and so on.
  • Rollup ilm policy index optimize, remove the ilm police in tmp index, and add back in the resize settings
  • Add many tests, to make the code coverage more than 90%.

Since this is the first PR of my rollup feature, I won't change the code that might affect other features.

Some features are ready in our internal repository. Here is the TODO list in the next PRs:

  • Support rollup aggregate_metric_double type metrics, as rollup index use aggregate_metric_double to store metric type, so we can't rollup a rollupe index.
  • Support search rollup index, now when rollup a data stream index, the rollup index will also add to data stream's concrete indices, It will cause error search result. And the search rollup function will select the best indices for a search DSL that match the rollup config.

There are a few more things to discuss:

  • If the origin index must be a read-only index? if yes, it will add the check logic.
  • Now I add the logic, that rollup a standard index to time_series, as the rollup index is made of timestamp\dimensions\metrics. Is it ok to do this? And how to deal with the start_time and end_time, now I add the largest bounds as the standard index has no bounds.
  • Now, there is only one thread per node to execute rollup task. In some node with many cpus, it may slow down rollup tasks. But adding more threads to execute rollup task will take more cpu costs, Maybe the thread size can be adapted to the node.processors setting.
  • How to handle the failed index docs, record the exception or failed the rollup task?
  • How to set the rollup index number_of_shards? Now the shard count of rollup index is the same with origin index. But in many case, the size of rollup index is much small than the origin index. Is it necessary to set the custom rollup index number_of_shards.
  • How to do anomaly testing in Elasticsearch automatically? The node shut down case? The disk grows to 100% case? The rolluping shard reallocate to other node case?

And about the Rollup performance, In this PR, I don't change the process of unsorted index. But I think it will have some performance improvement. I will add a todo task to think more about it.

package revert

fixup

fixup
@elasticsearchmachine elasticsearchmachine added v8.1.0 external-contributor Pull request authored by a developer outside the Elasticsearch team labels Jan 24, 2022
import java.util.ArrayList;
import java.util.List;

public abstract class MetricField {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this PR, I don't add the AggregateDoubleMetricField child class, as I will modify some code of AggregateDoubleMetric, I will add the class, and support aggregate_metric_double rollup in the next PR.

@weizijun
Copy link
Contributor Author

@elasticmachine update branch

@nik9000 nik9000 added the :StorageEngine/Rollup Turn fine-grained time-based data into coarser-grained data label Jan 26, 2022
@elasticmachine elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Jan 26, 2022
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

* upstream/master: (100 commits)
  Avoid duplicate _type fields in v7 compat layer (elastic#83239)
  Bump bundled JDK to 17.0.2+8 (elastic#83243)
  [DOCS] Correct header syntax (elastic#83275)
  Add unit tests for indices.recovery.max_bytes_per_sec default values (elastic#83261)
  [DOCS] Add note that write indices are not replicated (elastic#82997)
  Add notes on indexing to kNN search guide (elastic#83188)
  Fix get-snapshot-api :docs:integTest (elastic#83273)
  FilterPathBasedFilter support match fieldname with dot (elastic#83178)
  Fix compilation issues in example-plugins (elastic#83258)
  fix ClusterStateListener javadoc (elastic#83246)
  Speed up Building Indices Lookup in Metadata (elastic#83241)
  Mute whole suite for elastic#82502 (elastic#83252)
  Make PeerFinder log messages happier (elastic#83222)
  [Docs] Add supported _terms_enum field types (elastic#83244)
  Add an aggregator for IPv4 and IPv6 subnets (elastic#82410)
  [CI] Fix 70_time_series/default sort yaml test failures (elastic#83217)
  Update test-failure Issue Template to include "needs:triage" label elastic#83226
  Add an index->step cache to the PolicyStepsRegistry (elastic#82316)
  Improve support for joda datetime to java datetime transition in Painless (elastic#83099)
  Fix joda migration for week based methods in Painless (elastic#83232)
  ...

# Conflicts:
#	x-pack/plugin/rollup/src/main/java/org/elasticsearch/xpack/rollup/v2/TransportRollupAction.java
@mark-vieira mark-vieira added v8.2.0 and removed v8.1.0 labels Feb 2, 2022
Copy link
Contributor

@csoulios csoulios left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@weizijun thank you for submitting this PR. It adds a lot of functionality and fixes many issues.

For a start, I would like to say that the existing RollupAction implementation is
more a POC than a robust production-ready code. We took the decision to stop
working on this code until the TSDB project has taken shape, so that we
can make the most of the TSDB features.

Time series indices present all those characteristics that allow us to optimise downsampling for performance as well as user experience. Therefore, for the first release of rollups, we intend to support downsampling only for time series indices. Implementing downsampling for non time series indices is not in our immediate plans.

Rollup configuration

So, starting with the RollupAction, we plan to simplify the rollup configuration
by reducing the input parameters to the bare minimum, now that we are able to
infer most of them from the source index mapping.

For example, for time series indices the the timestamp field is always called
@timestamp. So, we must only add the downsampling interval and the time zone.

Also, for downsampling we are not allowed to modify the dimensions and and we must keep _tsid as is. So, the groups part of the rollup configuration will be shrinked to only the downsampling interval.

In the same way, we can remove the metrics part of the configuration by
extracting the metric fields from the index mapping.

This would simplify the Rollup API for downsampling to something like
the following:

POST /<index>/_rollup/<rollup_index>
{
    "calendar_interval": "1M",
    "time_zone": "UTC" // Default value is UTC
}

Performance

Regarding performance, we plan to use the TimeSeriesMetricsService (#82129) so that we iterate over the time series index. This allows us to leverage the fact that
the index is sorted per _tsid

Finally, this PR contains so many fixes, improvements and changes that I would like us
to add in smaller chunks

@weizijun
Copy link
Contributor Author

weizijun commented Feb 7, 2022

@csoulios, Thank you so much for your answer.

Time series indices present all those characteristics that allow us to optimise downsampling for performance as well as user experience. Therefore, for the first release of rollups, we intend to support downsampling only for time series indices. Implementing downsampling for non time series indices is not in our immediate plans.

yeah, I do the performance improvement for time_series indices in TimeSeriesRollupShardIndexer, it use TimeSeriesIndexSearcher to iterator all doc in order. It can great faster than the normal indices. And it's okay about use TimeSeriesMetricsService to improve the performance, we can have more discussion.

This would simplify the Rollup API for downsampling to something like the following:

POST /<index>/_rollup/<rollup_index>
{
    "calendar_interval": "1M",
    "time_zone": "UTC" // Default value is UTC
}

About Rollup configuration, ​I think TSDB downsampling is a special case of Rollup. Downsampling terms group has only _tsid, and metrics are all fields with time_series_metric configured, and the date_histogram field is @timestamp. Users only need to configure interval and time_zone.

I prefer downsampling is a new API, like:

POST /<index>/_downsample/<rollup_index>
{
    "fixed_interval": "10m"
}

it equals the API:

POST /<index>/_rollup/<rollup_index>
{
  "groups": {
    "date_histogram": {
      "field": "@timestamp",
      "fixed_interval": "10m"
    },
    "terms": {
      "fields": [
        "_tsid"
      ]
    }
  },
  "metrics": [
    {
      "field": "m1",
      "metrics": [
        "min",
        "max",
        "sum",
        "value_count"
      ]
    },
    {
      "field": "m2",
      "metrics": [
        "min",
        "max",
        "sum",
        "value_count"
      ]
    }
  ]
}

The new DownsampleAction can reuse many code with RollupAction.
We Alibaba Cloud Elasticsearch has been practice to use RollupAction downsample TSDB indices. User only configure the interval settings, our service generate the RollupAction config, it's easy to use.

The code of this PR is our practice in Alibaba Cloud Elasticsearch, so I add many feature. I hope we can have more deep discuss, and to see how to implement the rollup and downsample feature.

* upstream/master: (166 commits)
  Bind host all instead of just _site_ when needed (elastic#83145)
  [DOCS] Fix min/max agg snippets for histograms (elastic#83695)
  [DOCS] Add deprecation notice for system indices (elastic#83688)
  Cache ILM policy name on IndexMetadata (elastic#83603)
  [DOCS] Fix 8.0 breaking changes sort order (elastic#83685)
  [ML] fix random sampling background query consistency (elastic#83676)
  Move internal APIs into their own namespace '_internal'
  Runtime fields core-with-mapped tests support tsdb (elastic#83577)
  Optimize calculating the presence of a quorum (elastic#83638)
  Use switch expressions in EnableAllocationDecider and NodeShutdownAllocationDecider (elastic#83641)
  Note libffi error message in tmpdir docs (elastic#83662)
  Fix TransportDesiredNodesActionsIT batch tests (elastic#83659)
  [DOCS] Remove unused upgrade doc files (elastic#83617)
  [ML] Wait for model process to stop in stop deployment (elastic#83644)
  [ML] Fix submit after shutdown in process worker service (elastic#83645)
  Remove req/resp classes associated with HLRC (elastic#83599)
  Introduce index.version.compatibility setting (elastic#83264)
  Rename InternalTestCluster#getMasterNodeInstance (elastic#83407)
  Mute TimeSeriesIndexSearcherTests testCollectInOrderAcrossSegments (elastic#83648)
  Add rollover add max_primary_shard_docs condition (elastic#80981)
  ...

# Conflicts:
#	x-pack/plugin/rollup/build.gradle
#	x-pack/plugin/rollup/src/test/java/org/elasticsearch/xpack/rollup/v2/RollupActionSingleNodeTests.java
@csoulios csoulios added the :StorageEngine/TSDB You know, for Metrics label Feb 10, 2022
* upstream/master: (167 commits)
  Mute FrozenSearchableSnapshotsIntegTests#testCreateAndRestorePartialSearchableSnapshot
  Mute LdapSessionFactoryTests#testSslTrustIsReloaded
  Fix spotless violation from last commit
  Mute GeoGridTilerTestCase#testGeoGridSetValuesBoundingBoxes_UnboundedGeoShapeCellValues
  Small formatting clean up (elastic#84144)
  Always re-run Feature migrations which have encountered errors (elastic#83918)
  [DOCS] Clarify `orientation` usage for WKT and GeoJSON polygons (elastic#84025)
  Group field caps response by index mapping hash (elastic#83494)
  Shrink join queries in slow log (elastic#83914)
  TSDB: Reject the nested object fields that are configured time_series_dimension (elastic#83920)
  [DOCS] Remove note about partial response from Bulk API docs (elastic#84053)
  Allow regular data streams to be migrated to tsdb data streams. (elastic#83843)
  [DOCS] Fix `ignore_unavailable` parameter definition (elastic#84071)
  Make Metadata extend AbstractCollection (elastic#83791)
  Add API specs for OpenID Connect APIs
  Revert "Clean up for superuser role name references (elastic#83627)" (elastic#84096)
  Update Lucene analysis base url (elastic#84094)
  Avoid null threadContext in ResultDeduplicator (elastic#84093)
  Use static empty store files metadata (elastic#84034)
  Preserve context in snapshotDeletionListeners (elastic#84089)
  ...

# Conflicts:
#	x-pack/plugin/rollup/build.gradle
@csoulios
Copy link
Contributor

csoulios commented Jun 1, 2022

Closing this PR because it overlaps with the work at PR #85708

@weizijun and I have discussed about the dowsampling implementation and the path forward.
Thank you for this contribution.

@csoulios csoulios closed this Jun 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
external-contributor Pull request authored by a developer outside the Elasticsearch team :StorageEngine/Rollup Turn fine-grained time-based data into coarser-grained data :StorageEngine/TSDB You know, for Metrics Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.4.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants