Added additional entries for troubleshooting unhealthy cluster (#119914) (#120234)
* Added additional entries for troubleshooting unhealthy cluster
Reordered "Re-enable shard allocation" because it is not as common as the other causes
Added additional causes of yellow statuses
Changed the watermark command to include both the high and low watermarks so users can make their cluster operate once again.
* Drive-by copyedit with suggestions for concision and some formatting fixes.
* Concision and some formatting fixes.
* Colon added
* Update docs/reference/troubleshooting/common-issues/red-yellow-cluster-status.asciidoc
* Title change
* Update docs/reference/troubleshooting/common-issues/red-yellow-cluster-status.asciidoc
* Spelling fix
* Update docs/reference/troubleshooting/common-issues/red-yellow-cluster-status.asciidoc
* Update docs/reference/troubleshooting/common-issues/red-yellow-cluster-status.asciidoc
* Update docs/reference/troubleshooting/common-issues/red-yellow-cluster-status.asciidoc
* Update docs/reference/troubleshooting/common-issues/red-yellow-cluster-status.asciidoc
---------
Co-authored-by: Kofi B <[email protected]>
Co-authored-by: Liam Thompson <[email protected]>
Co-authored-by: shainaraskas <[email protected]>
{es} will never assign a replica to the same node as the primary shard. A single-node cluster will always have yellow status. To change to green, set <<dynamic-index-number-of-replicas,number_of_replicas>> to 0 for all indices.
-[source,console]
-----
-PUT _cluster/settings
-{
-  "persistent" : {
-    "cluster.routing.allocation.enable" : null
-  }
-}
-----

-See https://www.youtube.com/watch?v=MiKKUdZvwnI[this video] for walkthrough of troubleshooting "no allocations are allowed".
+Therefore, if the number of replicas equals or exceeds the number of nodes, some shards won't be allocated.
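For example, a minimal sketch of the change described above, which drops replicas for every index on a single-node cluster (narrow the target path to specific index names if you only want to change some of them):

[source,console]
----
PUT _settings
{
  "index.number_of_replicas": 0
}
----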

[discrete]
[[fix-cluster-status-recover-nodes]]
===== Recover lost nodes

Shards often become unassigned when a data node leaves the cluster. This can
-occur for several reasons, ranging from connectivity issues to hardware failure.
+occur for several reasons:

+* A manual node restart will cause a temporary unhealthy cluster state until the node recovers.

+* When a node becomes overloaded or fails, it can temporarily disrupt the cluster’s health, leading to an unhealthy state. Prolonged garbage collection (GC) pauses, caused by out-of-memory errors or high memory usage during intensive searches, can trigger this state. See <<fix-cluster-status-jvm,Reduce JVM memory pressure>> for more JVM-related issues.

+* Network issues can prevent reliable node communication, causing shards to become out of sync. Check the logs for repeated messages about nodes leaving and rejoining the cluster.

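While investigating the causes above, one quick way to confirm which nodes are currently part of the cluster is the cat nodes API; a minimal sketch:

[source,console]
----
GET _cat/nodes?v=true
----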
After you resolve the issue and recover the node, it will rejoin the cluster.
{es} will then automatically allocate any unassigned shards.

+You can monitor this process by <<cluster-health,checking your cluster health>>. The number of unallocated shards should progressively decrease until green status is reached.

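As an illustration of that monitoring step, a sketch of a health request that trims the response to the fields relevant here (overall status, node count, and unassigned shard count) using `filter_path`:

[source,console]
----
GET _cluster/health?filter_path=status,number_of_nodes,unassigned_shards
----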
To avoid wasting resources on temporary issues, {es} <<delayed-allocation,delays
allocation>> by one minute by default. If you've recovered a node and don’t want
to wait for the delay period, you can call the <<cluster-reroute,cluster reroute
@@ -155,7 +151,7 @@ replica, it remains unassigned. To fix this, you can:

* Change the `index.number_of_replicas` index setting to reduce the number of
replicas for each primary shard. We recommend keeping at least one replica per
-primary.
+primary for high availability.

[source,console]
----
@@ -166,7 +162,6 @@ PUT _settings
----
// TEST[s/^/PUT my-index\n/]

[discrete]
[[fix-cluster-status-disk-space]]
===== Free up or increase disk space
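Before choosing one of the options listed below, it can help to check how much disk each data node is actually using; the cat allocation API is one way to do that (a sketch):

[source,console]
----
GET _cat/allocation?v=true
----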
@@ -187,6 +182,8 @@ If your nodes are running low on disk space, you have a few options:

* Upgrade your nodes to increase disk space.

+* Add more nodes to the cluster.

* Delete unneeded indices to free up space. If you use {ilm-init}, you can
update your lifecycle policy to use <<ilm-searchable-snapshot,searchable
snapshots>> or add a delete phase. If you no longer need to search the data, you
@@ -219,11 +216,39 @@ watermark or set it to an explicit byte value.
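The watermark change referenced in the commit message falls inside this hunk, whose body is not shown here. Purely as an illustration of what a settings update covering both the low and high disk watermarks looks like, and not the exact values from this change, a sketch:

[source,console]
----
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "90%",
    "cluster.routing.allocation.disk.watermark.high": "95%"
  }
}
----

The percentage values above are placeholders; pick thresholds that suit your nodes' disk capacity.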