Commit 5b06581

debadair, andreidan, and dakrone authored
[DOCS] Add top-level Data management section. (#64185) (#64321)
* [DOCS] Add top-level Data management section.
* Edits
* Edits
* Fixed xrefs
* Apply suggestions from code review

Co-authored-by: Andrei Dan <[email protected]>
Co-authored-by: Lee Hinman <[email protected]>

* Update docs/reference/datatiers.asciidoc
* Update docs/reference/datatiers.asciidoc

Co-authored-by: Andrei Dan <[email protected]>
Co-authored-by: Lee Hinman <[email protected]>
Co-authored-by: Andrei Dan <[email protected]>
Co-authored-by: Lee Hinman <[email protected]>
1 parent af98730 commit 5b06581

3 files changed: +122 -77 lines changed

+33
@@ -0,0 +1,33 @@
+[role="xpack"]
+[[data-management]]
+= Data management
+
+[partintro]
+--
+The data you store in {es} generally falls into one of two categories:
+
+* Content: a collection of items you want to search, such as a catalog of products
+* Time series data: a stream of continuously-generated timestamped data, such as log entries
+
+Content might be frequently updated,
+but the value of the content remains relatively constant over time.
+You want to be able to retrieve items quickly regardless of how old they are.
+
+Time series data keeps accumulating over time, so you need strategies for
+balancing the value of the data against the cost of storing it.
+As it ages, it tends to become less important and less-frequently accessed,
+so you can move it to less expensive, less performant hardware.
+For your oldest data, what matters is that you have access to the data.
+It's ok if queries take longer to complete.
+
+To help you manage your data, {es} enables you to:
+
+* Define <<data-tiers, multiple tiers>> of data nodes with different performance characteristics.
+* Automatically transition indices through the data tiers according to your performance needs and retention policies
+with <<index-lifecycle-management, {ilm}>> ({ilm-init}).
+* Leverage <<searchable-snapshots, searchable snapshots>> stored in a remote repository to provide resiliency
+for your older indices while reducing operating costs and maintaining search performance.
+* Perform <<async-search-intro, asynchronous searches>> of data stored on less-performant hardware.
+--
+
+include::datatiers.asciidoc[]

docs/reference/datatiers.asciidoc

+87 -75
@@ -1,100 +1,112 @@
 [role="xpack"]
 [[data-tiers]]
-=== Data tiers
-
-Common data lifecycle management patterns revolve around transitioning indices
-through multiple collections of nodes with different hardware characteristics in order
-to fulfil evolving CRUD, search, and aggregation needs as indices age. The concept
-of a tiered hardware architecture is not new in {es}.
-<<index-lifecycle-management, Index Lifecycle Management>> is instrumental in
-implementing tiered architectures by automating the managemnt of indices according to
-performance, resiliency and data retention requirements.
-<<overview-index-lifecycle-management, Hot/warm/cold>> architectures are common
-for timeseries data such as logging and metrics.
-
-A data tier is a collection of nodes with the same role. Data tiers are an integrated
-solution offering better support for optimising cost and improving performance.
-Formalized data tiers in ES allow configuration of the lifecycle and location of data
-in a hot/warm/cold topology without requiring the use of custom node attributes.
-Each tier formalises specific characteristics and data behaviours.
-
-The node roles that can currently define data tiers are:
-
-* <<data-content-node, data_content>>
-* <<data-hot-node, data_hot>>
-* <<data-warm-node, data_warm>>
-* <<data-cold-node, data_cold>>
-
-The more generic <<data-node, data role>> is not a data tier role, but
-it is the default node role if no roles are configured. If a node has the
-<<data-node, data>> role we treat the node as if it has all of the tier
-roles assigned.
+== Data tiers
 
-[[content-tier]]
-==== Content tier
+A _data tier_ is a collection of nodes with the same data role that
+typically share the same hardware profile:
 
-The content tier is made of one or more nodes that have the <<data-content-node, data_content>>
-role. A content tier is designed to store and search user created content. Non-timeseries data
-doesn't necessarily follow the hot-warm-cold path. The hardware profiles are quite different to
-the <<hot-tier, hot tier>>. User created content prioritises high CPU to support complex
-queries and aggregations in a timely manner, as opposed to the <<hot-tier, hot tier>> which
-prioritises high IO.
-The content data has very long data retention characteristics and from a resiliency perspective
-the indices in this tier should be configured to use one or more replicas.
+* <<content-tier, Content tier>> nodes handle the indexing and query load for content such as a product catalog.
+* <<hot-tier, Hot tier>> nodes handle the indexing load for time series data such as logs or metrics
+and hold your most recent, most-frequently-accessed data.
+* <<warm-tier, Warm tier>> nodes hold time series data that is accessed less-frequently
+and rarely needs to be updated.
+* <<cold-tier, Cold tier>> nodes hold time series data that is accessed occasionally and not normally updated.
 
-NOTE: new indices that are not part of <<data-streams, data streams>> will be automatically allocated to the
-<<content-tier>>
+When you index documents directly to a specific index, they remain on content tier nodes indefinitely.
 
-[[hot-tier]]
-==== Hot tier
+When you index documents to a data stream, they initially reside on hot tier nodes.
+You can configure <<index-lifecycle-management, {ilm}>> ({ilm-init}) policies
+to automatically transition your time series data through the hot, warm, and cold tiers
+according to your performance, resiliency and data retention requirements.
+
+A node's <<data-node, data role>> is configured in `elasticsearch.yml`.
+For example, the highest-performance nodes in a cluster might be assigned to both the hot and content tiers:
 
-The hot tier is made of one or more nodes that have the <<data-hot-node, data_hot>> role.
-It is the {es} entry point for timeseries data. This tier needs to be fast both for reads
-and writes, requiring more hardware resources such as SSD drives. The hot tier is usually
-hosting the data from recent days. From a resiliency perspective the indices in this
+[source,yaml]
+--------------------------------------------------
+node.roles: ["data_hot", "data_content"]
+--------------------------------------------------
+
+[discrete]
+[[content-tier]]
+=== Content tier
+
+Data stored in the content tier is generally a collection of items such as a product catalog or article archive.
+Unlike time series data, the value of the content remains relatively constant over time,
+so it doesn't make sense to move it to a tier with different performance characteristics as it ages.
+Content data typically has long data retention requirements, and you want to be able to retrieve
+items quickly regardless of how old they are.
+
+Content tier nodes are usually optimized for query performance--they prioritize processing power over IO throughput
+so they can process complex searches and aggregations and return results quickly.
+While they are also responsible for indexing, content data is generally not ingested at as high a rate
+as time series data such as logs and metrics. From a resiliency perspective the indices in this
 tier should be configured to use one or more replicas.
 
-NOTE: new indices that are part of a <<data-streams, data stream>> will be automatically allocated to the
-<<hot-tier>>
+New indices are automatically allocated to the <<content-tier>> unless they are part of a data stream.
+
+[discrete]
+[[hot-tier]]
+=== Hot tier
 
+The hot tier is the {es} entry point for time series data and holds your most-recent,
+most-frequently-searched time series data.
+Nodes in the hot tier need to be fast for both reads and writes,
+which requires more hardware resources and faster storage (SSDs).
+For resiliency, indices in the hot tier should be configured to use one or more replicas.
+
+New indices that are part of a <<data-streams, data stream>> are automatically allocated to the
+hot tier.
+
+[discrete]
 [[warm-tier]]
-==== Warm tier
+=== Warm tier
 
-The warm tier is made of one or more nodes that have the <<data-warm-node, data_warm>> role.
-This tier is where data goes once it is not queried as frequently as in the <<hot-tier, hot tier>>.
-It is a medium-fast tier that still allows data updates. The warm tier is usually
-hosting the data from recent weeks. From a resiliency perspective the indices in this
-tier should be configured to use one or more replicas.
+Time series data can move to the warm tier once it is being queried less frequently
+than the recently-indexed data in the hot tier.
+The warm tier typically holds data from recent weeks.
+Updates are still allowed, but likely infrequent.
+Nodes in the warm tier generally don't need to be as fast as those in the hot tier.
+For resiliency, indices in the warm tier should be configured to use one or more replicas.
 
+[discrete]
 [[cold-tier]]
-==== Cold tier
+=== Cold tier
 
-The cold tier is made of one or more nodes that have the <<data-cold-node, data_cold>> role.
-Once the data in the <<warm-tier, warm tier>> is not updated anymore it can transition to the
-cold tier. The cold tier is still a responsive query tier but as the data transitions into this
-tier it can be compressed, shrunken, or configured to have zero replicas and be backed by
-a <<ilm-searchable-snapshot, snapshot>>. The cold tier is usually hosting the data from recent
-months or years.
+Once data in the warm tier is no longer being updated, it can move to the cold tier.
+The cold tier typically holds the data from recent months or years.
+The cold tier is still a responsive query tier, but data in the cold tier is not normally updated.
+As data transitions into the cold tier it can be compressed and shrunken.
+For resiliency, indices in the cold tier can rely on
+<<ilm-searchable-snapshot, searchable snapshots>>, eliminating the need for replicas.
 
 [discrete]
 [[data-tier-allocation]]
 === Data tier index allocation
 
-When an index is created {es} will automatically allocate the index to the <<content-tier, Content tier>>
-if the index is not part of a <<data-streams, data stream>> or to the <<hot-tier, Hot tier>> if the index
-is part of a <<data-streams, data stream>>.
-{es} will configure the <<tier-preference-allocation-filter, `index.routing.allocation.include._tier_preference`>>
-to `data_content` or `data_hot` respectively.
+When you create an index, by default {es} sets
+<<tier-preference-allocation-filter, `index.routing.allocation.include._tier_preference`>>
+to `data_content` to automatically allocate the index shards to the content tier.
+
+When {es} creates an index as part of a <<data-streams, data stream>>,
+by default {es} sets
+<<tier-preference-allocation-filter, `index.routing.allocation.include._tier_preference`>>
+to `data_hot` to automatically allocate the index shards to the hot tier.
 
-These heuristics can be overridden by specifying any <<shard-allocation-filtering, shard allocation filtering>>
+You can override the automatic tier-based allocation by specifying
+<<shard-allocation-filtering, shard allocation filtering>>
 settings in the create index request or index template that matches the new index.
-Specifying any configuration, including `null`, for `index.routing.allocation.include._tier_preference` will
-also opt out of the automatic new index allocation to tiers.
+
+You can also explicitly set `index.routing.allocation.include._tier_preference`
+to opt out of the default tier-based allocation.
+If you set the tier preference to `null`, {es} ignores the data tier roles during allocation.
+
 [discrete]
 [[data-tier-migration]]
-=== Data tier index migration
+=== Automatic data tier migration
 
-<<index-lifecycle-management, Index Lifecycle Management>> automates the transition of managed
-indices through the available data tiers using the `migrate` action which is injected
-in every phase, unless it's manually specified in the phase or an
-<<ilm-allocate-action, allocate action>> modifying the allocation rules is manually configured.
+{ilm-init} automatically transitions managed
+indices through the available data tiers using the <<ilm-migrate-action, migrate>> action.
+By default, this action is automatically injected in every phase.
+You can explicitly specify the migrate action to override the default behavior,
+or use the <<ilm-allocate-action, allocate action>> to manually specify allocation rules.
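
As a reading aid for the "Data tier index allocation" rules in the `datatiers.asciidoc` diff above, here is a minimal Python sketch of the default tier preference as the new text describes it. It models the documented behavior only and is not {es} code: the function name `resolve_tier_preference` and its parameters are hypothetical, and only the setting name and the `data_content` and `data_hot` values come from the diff.

```python
from typing import Optional

# Setting name taken from the diff; everything else here is illustrative.
TIER_PREFERENCE = "index.routing.allocation.include._tier_preference"


def resolve_tier_preference(
    is_data_stream_index: bool,
    requested_settings: Optional[dict] = None,
) -> Optional[str]:
    """Return the tier preference a new index would end up with.

    Hypothetical model of the documented defaults:
    - an explicit value (including None, standing in for `null`) wins;
    - otherwise data stream indices default to the hot tier;
    - all other new indices default to the content tier.
    """
    settings = requested_settings or {}
    if TIER_PREFERENCE in settings:
        # Explicitly set: opts out of the default tier-based allocation.
        return settings[TIER_PREFERENCE]
    return "data_hot" if is_data_stream_index else "data_content"


if __name__ == "__main__":
    print(resolve_tier_preference(False))
    print(resolve_tier_preference(True))
    print(resolve_tier_preference(True, {TIER_PREFERENCE: None}))
```

The sketch deliberately leaves out the other shard allocation filtering settings, which the text says can also override the automatic allocation.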

docs/reference/index.asciidoc

+2 -2
@@ -30,8 +30,6 @@ include::indices/index-templates.asciidoc[]
 
 include::data-streams/data-streams.asciidoc[]
 
-include::datatiers.asciidoc[]
-
 include::ingest.asciidoc[]
 
 include::search/search-your-data/search-your-data.asciidoc[]
@@ -46,6 +44,8 @@ include::sql/index.asciidoc[]
 
 include::scripting.asciidoc[]
 
+include::data-management.asciidoc[]
+
 include::ilm/index.asciidoc[]
 
 ifdef::permanently-unreleased-branch[]
