 [role="xpack"]
 [[data-tiers]]
-=== Data tiers
-
-Common data lifecycle management patterns revolve around transitioning indices
-through multiple collections of nodes with different hardware characteristics in order
-to fulfil evolving CRUD, search, and aggregation needs as indices age. The concept
-of a tiered hardware architecture is not new in {es}.
-<<index-lifecycle-management, Index Lifecycle Management>> is instrumental in
-implementing tiered architectures by automating the managemnt of indices according to
-performance, resiliency and data retention requirements.
-<<overview-index-lifecycle-management, Hot/warm/cold>> architectures are common
-for timeseries data such as logging and metrics.
-
-A data tier is a collection of nodes with the same role. Data tiers are an integrated
-solution offering better support for optimising cost and improving performance.
-Formalized data tiers in ES allow configuration of the lifecycle and location of data
-in a hot/warm/cold topology without requiring the use of custom node attributes.
-Each tier formalises specific characteristics and data behaviours.
-
-The node roles that can currently define data tiers are:
-
-* <<data-content-node, data_content>>
-* <<data-hot-node, data_hot>>
-* <<data-warm-node, data_warm>>
-* <<data-cold-node, data_cold>>
-
-The more generic <<data-node, data role>> is not a data tier role, but
-it is the default node role if no roles are configured. If a node has the
-<<data-node, data>> role we treat the node as if it has all of the tier
-roles assigned.
+== Data tiers
 
-[[content-tier]]
-==== Content tier
+A _data tier_ is a collection of nodes with the same data role that
+typically share the same hardware profile:
 
-The content tier is made of one or more nodes that have the <<data-content-node, data_content>>
-role. A content tier is designed to store and search user created content. Non-timeseries data
-doesn't necessarily follow the hot-warm-cold path. The hardware profiles are quite different to
-the <<hot-tier, hot tier>>. User created content prioritises high CPU to support complex
-queries and aggregations in a timely manner, as opposed to the <<hot-tier, hot tier>> which
-prioritises high IO.
-The content data has very long data retention characteristics and from a resiliency perspective
-the indices in this tier should be configured to use one or more replicas.
+* <<content-tier, Content tier>> nodes handle the indexing and query load for content such as a product catalog.
+* <<hot-tier, Hot tier>> nodes handle the indexing load for time series data such as logs or metrics
+and hold your most recent, most-frequently-accessed data.
+* <<warm-tier, Warm tier>> nodes hold time series data that is accessed less-frequently
+and rarely needs to be updated.
+* <<cold-tier, Cold tier>> nodes hold time series data that is accessed occasionally and not normally updated.
 
-NOTE: new indices that are not part of <<data-streams, data streams>> will be automatically allocated to the
-<<content-tier>>
+When you index documents directly to a specific index, they remain on content tier nodes indefinitely.
 
-[[hot-tier]]
-==== Hot tier
+When you index documents to a data stream, they initially reside on hot tier nodes.
+You can configure <<index-lifecycle-management, {ilm}>> ({ilm-init}) policies
+to automatically transition your time series data through the hot, warm, and cold tiers
+according to your performance, resiliency and data retention requirements.
+
+A node's <<data-node, data role>> is configured in `elasticsearch.yml`.
+For example, the highest-performance nodes in a cluster might be assigned to both the hot and content tiers:
 
-The hot tier is made of one or more nodes that have the <<data-hot-node, data_hot>> role.
-It is the {es} entry point for timeseries data. This tier needs to be fast both for reads
-and writes, requiring more hardware resources such as SSD drives. The hot tier is usually
-hosting the data from recent days. From a resiliency perspective the indices in this
+[source,yaml]
+--------------------------------------------------
+node.roles: ["data_hot", "data_content"]
+--------------------------------------------------
+
+[discrete]
+[[content-tier]]
+=== Content tier
+
+Data stored in the content tier is generally a collection of items such as a product catalog or article archive.
+Unlike time series data, the value of the content remains relatively constant over time,
+so it doesn't make sense to move it to a tier with different performance characteristics as it ages.
+Content data typically has long data retention requirements, and you want to be able to retrieve
+items quickly regardless of how old they are.
+
+Content tier nodes are usually optimized for query performance--they prioritize processing power over IO throughput
+so they can process complex searches and aggregations and return results quickly.
+While they are also responsible for indexing, content data is generally not ingested at as high a rate
+as time series data such as logs and metrics. From a resiliency perspective the indices in this
 tier should be configured to use one or more replicas.
 
-NOTE: new indices that are part of a <<data-streams, data stream>> will be automatically allocated to the
-<<hot-tier>>
+New indices are automatically allocated to the <<content-tier>> unless they are part of a data stream.
+
+[discrete]
+[[hot-tier]]
+=== Hot tier
 
+The hot tier is the {es} entry point for time series data and holds your most-recent,
+most-frequently-searched time series data.
+Nodes in the hot tier need to be fast for both reads and writes,
+which requires more hardware resources and faster storage (SSDs).
+For resiliency, indices in the hot tier should be configured to use one or more replicas.
+
+New indices that are part of a <<data-streams, data stream>> are automatically allocated to the
+hot tier.
+
+[discrete]
 [[warm-tier]]
-==== Warm tier
+=== Warm tier
 
-The warm tier is made of one or more nodes that have the <<data-warm-node, data_warm>> role.
-This tier is where data goes once it is not queried as frequently as in the <<hot-tier, hot tier>>.
-It is a medium-fast tier that still allows data updates. The warm tier is usually
-hosting the data from recent weeks. From a resiliency perspective the indices in this
-tier should be configured to use one or more replicas.
+Time series data can move to the warm tier once it is being queried less frequently
+than the recently-indexed data in the hot tier.
+The warm tier typically holds data from recent weeks.
+Updates are still allowed, but likely infrequent.
+Nodes in the warm tier generally don't need to be as fast as those in the hot tier.
+For resiliency, indices in the warm tier should be configured to use one or more replicas.
 
+[discrete]
 [[cold-tier]]
-==== Cold tier
+=== Cold tier
 
-The cold tier is made of one or more nodes that have the <<data-cold-node, data_cold>> role.
-Once the data in the <<warm-tier, warm tier>> is not updated anymore it can transition to the
-cold tier. The cold tier is still a responsive query tier but as the data transitions into this
-tier it can be compressed, shrunken, or configured to have zero replicas and be backed by
-a <<ilm-searchable-snapshot, snapshot>>. The cold tier is usually hosting the data from recent
-months or years.
+Once data in the warm tier is no longer being updated, it can move to the cold tier.
+The cold tier typically holds the data from recent months or years.
+The cold tier is still a responsive query tier, but data in the cold tier is not normally updated.
+As data transitions into the cold tier it can be compressed and shrunken.
+For resiliency, indices in the cold tier can rely on
+<<ilm-searchable-snapshot, searchable snapshots>>, eliminating the need for replicas.
 
 [discrete]
 [[data-tier-allocation]]
 === Data tier index allocation
 
-When an index is created {es} will automatically allocate the index to the <<content-tier, Content tier>>
-if the index is not part of a <<data-streams, data stream>> or to the <<hot-tier, Hot tier>> if the index
-is part of a <<data-streams, data stream>>.
-{es} will configure the <<tier-preference-allocation-filter, `index.routing.allocation.include._tier_preference`>>
-to `data_content` or `data_hot` respectively.
+When you create an index, by default {es} sets
+<<tier-preference-allocation-filter, `index.routing.allocation.include._tier_preference`>>
+to `data_content` to automatically allocate the index shards to the content tier.
+
+When {es} creates an index as part of a <<data-streams, data stream>>,
+by default {es} sets
+<<tier-preference-allocation-filter, `index.routing.allocation.include._tier_preference`>>
+to `data_hot` to automatically allocate the index shards to the hot tier.
 
-These heuristics can be overridden by specifying any <<shard-allocation-filtering, shard allocation filtering>>
+You can override the automatic tier-based allocation by specifying
+<<shard-allocation-filtering, shard allocation filtering>>
 settings in the create index request or index template that matches the new index.
-Specifying any configuration, including `null`, for `index.routing.allocation.include._tier_preference` will
-also opt out of the automatic new index allocation to tiers.
+
+You can also explicitly set `index.routing.allocation.include._tier_preference`
+to opt out of the default tier-based allocation.
+If you set the tier preference to `null`, {es} ignores the data tier roles during allocation.
+
 [discrete]
 [[data-tier-migration]]
-=== Data tier index migration
+=== Automatic data tier migration
 
-<<index-lifecycle-management, Index Lifecycle Management>> automates the transition of managed
-indices through the available data tiers using the `migrate` action which is injected
-in every phase, unless it's manually specified in the phase or an
-<<ilm-allocate-action, allocate action>> modifying the allocation rules is manually configured.
+{ilm-init} automatically transitions managed
+indices through the available data tiers using the <<ilm-migrate-action, migrate>> action.
+By default, this action is automatically injected in every phase.
+You can explicitly specify the migrate action to override the default behavior,
+or use the <<ilm-allocate-action, allocate action>> to manually specify allocation rules.
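
As a hedged illustration of the node-role configuration that the new text describes, the fragments below sketch `elasticsearch.yml` settings for two further nodes, one dedicated to the warm tier and one to the cold tier. The role names `data_warm` and `data_cold` are taken from the role list in the removed text. Giving each node a single role is an assumption made for illustration, not something this change prescribes.

[source,yaml]
--------------------------------------------------
# elasticsearch.yml on a hypothetical node dedicated to the warm tier
node.roles: ["data_warm"]
--------------------------------------------------

[source,yaml]
--------------------------------------------------
# elasticsearch.yml on a hypothetical node dedicated to the cold tier
node.roles: ["data_cold"]
--------------------------------------------------

Because new indices are allocated to the content tier or the hot tier by default, a cluster built from dedicated nodes like these would presumably also need at least one node with the `data_content` and `data_hot` roles, as in the example added by this change.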