
Commit dc63fa1

Authored by georgewallace, thekofimensah, leemthompo, and shainaraskas
Added additional entries for troubleshooting unhealthy cluster (#119914) (#120234)
Added additional entries for troubleshooting unhealthy cluster (#119914) (#120234)

* Added additional entries for troubleshooting an unhealthy cluster. Reordered "Re-enable shard allocation" because it is not as common as the other causes. Added additional causes of yellow status. Changed the watermark command to include both the high and low watermarks so users can make their cluster operate once again.
* Drive-by copyedit with suggestions for concision and some formatting fixes.
* Concision and some formatting fixes.
* Colon added
* Title change
* Spelling fix
* Update docs/reference/troubleshooting/common-issues/red-yellow-cluster-status.asciidoc

---------

Co-authored-by: Kofi B <[email protected]>
Co-authored-by: Liam Thompson <[email protected]>
Co-authored-by: shainaraskas <[email protected]>
1 parent 82c70ac commit dc63fa1

File tree

1 file changed

+48
-23
lines changed


docs/reference/troubleshooting/common-issues/red-yellow-cluster-status.asciidoc

Lines changed: 48 additions & 23 deletions
@@ -78,35 +78,31 @@ A shard can become unassigned for several reasons. The following tips outline the
 most common causes and their solutions.
 
 [discrete]
-[[fix-cluster-status-reenable-allocation]]
-===== Re-enable shard allocation
+[[fix-cluster-status-only-one-node]]
+===== Single node cluster
 
-You typically disable allocation during a <<restart-cluster,restart>> or other
-cluster maintenance. If you forgot to re-enable allocation afterward, {es} will
-be unable to assign shards. To re-enable allocation, reset the
-`cluster.routing.allocation.enable` cluster setting.
+{es} will never assign a replica to the same node as the primary shard. A single-node cluster will always have yellow status. To change to green, set <<dynamic-index-number-of-replicas,number_of_replicas>> to 0 for all indices.
 
-[source,console]
-----
-PUT _cluster/settings
-{
-  "persistent" : {
-    "cluster.routing.allocation.enable" : null
-  }
-}
-----
-
-See https://www.youtube.com/watch?v=MiKKUdZvwnI[this video] for walkthrough of troubleshooting "no allocations are allowed".
+Therefore, if the number of replicas equals or exceeds the number of nodes, some shards won't be allocated.
 
 [discrete]
 [[fix-cluster-status-recover-nodes]]
 ===== Recover lost nodes
 
 Shards often become unassigned when a data node leaves the cluster. This can
-occur for several reasons, ranging from connectivity issues to hardware failure.
+occur for several reasons:
+
+* A manual node restart will cause a temporary unhealthy cluster state until the node recovers.
+
+* When a node becomes overloaded or fails, it can temporarily disrupt the cluster’s health, leading to an unhealthy state. Prolonged garbage collection (GC) pauses, caused by out-of-memory errors or high memory usage during intensive searches, can trigger this state. See <<fix-cluster-status-jvm,Reduce JVM memory pressure>> for more JVM-related issues.
+
+* Network issues can prevent reliable node communication, causing shards to become out of sync. Check the logs for repeated messages about nodes leaving and rejoining the cluster.
+
 After you resolve the issue and recover the node, it will rejoin the cluster.
 {es} will then automatically allocate any unassigned shards.
 
+You can monitor this process by <<cluster-health,checking your cluster health>>. The number of unallocated shards should progressively decrease until green status is reached.
+
 To avoid wasting resources on temporary issues, {es} <<delayed-allocation,delays
 allocation>> by one minute by default. If you've recovered a node and don’t want
 to wait for the delay period, you can call the <<cluster-reroute,cluster reroute
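The single-node rule added in the hunk above can be sketched in a few lines of Python (an illustrative model, not part of the diff or of {es} itself): since {es} never places a replica copy on the same node as its primary, an index needs at least `number_of_replicas + 1` nodes before every copy can be assigned, and any shortfall shows up as yellow status.

```python
def unassigned_replicas(nodes: int, primaries: int, replicas: int) -> int:
    """Count replica shard copies that cannot be assigned.

    Each of the `primaries` shards wants `replicas` copies, and every
    copy must live on a different node than its primary, so at most
    (nodes - 1) replica copies per primary can be placed.
    """
    placeable_per_primary = min(replicas, max(nodes - 1, 0))
    return primaries * (replicas - placeable_per_primary)


def cluster_color(nodes: int, primaries: int, replicas: int) -> str:
    """Yellow when replica copies are unassigned, green otherwise.

    Assumes all primaries are assigned; red status (an unassigned
    primary) is outside this sketch.
    """
    return "yellow" if unassigned_replicas(nodes, primaries, replicas) else "green"
```

For example, a single-node cluster with the default of one replica per primary stays yellow, and dropping `number_of_replicas` to 0 (as the hunk suggests) turns it green.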
@@ -155,7 +151,7 @@ replica, it remains unassigned. To fix this, you can:
 
 * Change the `index.number_of_replicas` index setting to reduce the number of
 replicas for each primary shard. We recommend keeping at least one replica per
-primary.
+primary for high availability.
 
 [source,console]
 ----
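The flip side of the "keep at least one replica" advice in the hunk above can be stated as a one-liner (an illustrative sketch, not from the diff): every copy of a shard must sit on a distinct node, so a green cluster needs at least `replicas + 1` nodes.

```python
def min_nodes_for_green(replicas: int) -> int:
    """Minimum node count for every shard copy to be assignable:
    one node for the primary plus one per replica copy."""
    return replicas + 1
```

So the recommended one replica per primary requires at least two nodes, which is why a single-node cluster must set `number_of_replicas` to 0 to reach green.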
@@ -166,7 +162,6 @@ PUT _settings
 ----
 // TEST[s/^/PUT my-index\n/]
 
-
 [discrete]
 [[fix-cluster-status-disk-space]]
 ===== Free up or increase disk space
@@ -187,6 +182,8 @@ If your nodes are running low on disk space, you have a few options:
 
 * Upgrade your nodes to increase disk space.
 
+* Add more nodes to the cluster.
+
 * Delete unneeded indices to free up space. If you use {ilm-init}, you can
 update your lifecycle policy to use <<ilm-searchable-snapshot,searchable
 snapshots>> or add a delete phase. If you no longer need to search the data, you
@@ -219,11 +216,39 @@ watermark or set it to an explicit byte value.
 PUT _cluster/settings
 {
   "persistent": {
-    "cluster.routing.allocation.disk.watermark.low": "30gb"
+    "cluster.routing.allocation.disk.watermark.low": "90%",
+    "cluster.routing.allocation.disk.watermark.high": "95%"
   }
 }
 ----
-// TEST[s/"30gb"/null/]
+// TEST[s/"90%"/null/]
+// TEST[s/"95%"/null/]
+
+[IMPORTANT]
+====
+This is usually a temporary solution and may cause instability if disk space is not freed up.
+====
+
+[discrete]
+[[fix-cluster-status-reenable-allocation]]
+===== Re-enable shard allocation
+
+You typically disable allocation during a <<restart-cluster,restart>> or other
+cluster maintenance. If you forgot to re-enable allocation afterward, {es} will
+be unable to assign shards. To re-enable allocation, reset the
+`cluster.routing.allocation.enable` cluster setting.
+
+[source,console]
+----
+PUT _cluster/settings
+{
+  "persistent" : {
+    "cluster.routing.allocation.enable" : null
+  }
+}
+----
+
+See https://www.youtube.com/watch?v=MiKKUdZvwnI[this video] for a walkthrough of troubleshooting "no allocations are allowed".
 
 [discrete]
 [[fix-cluster-status-jvm]]
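The low/high watermark pair raised in the hunk above can be modeled as a small decision function (an illustrative sketch, not {es} code; the 90%/95% defaults here mirror the example values in the diff, while {es}'s actual defaults are lower, 85%/90%): below the low watermark a node can still receive shards, between the two it stops receiving new shards, and above the high watermark {es} also tries to relocate shards off the node.

```python
def disk_allocation_decision(used_fraction: float,
                             low: float = 0.90,
                             high: float = 0.95) -> str:
    """Classify a node by disk usage against the two watermarks.

    Returns "allocate" (node may receive new shards),
    "no-new-shards" (above low watermark: allocation to this node stops),
    or "relocate-away" (above high watermark: shards are moved off).
    """
    if used_fraction < low:
        return "allocate"
    if used_fraction < high:
        return "no-new-shards"
    return "relocate-away"
```

Raising the watermarks, as the diff does, widens the "allocate" band, which is why the note above warns it is only a temporary fix if disk space is never actually freed.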
@@ -271,4 +296,4 @@ POST _cluster/reroute?metric=none
 // TEST[s/^/PUT my-index\n/]
 // TEST[catch:bad_request]
 
-See https://www.youtube.com/watch?v=6OAg9IyXFO4[this video] for a walkthrough of troubleshooting `no_valid_shard_copy`.
+See https://www.youtube.com/watch?v=6OAg9IyXFO4[this video] for a walkthrough of troubleshooting `no_valid_shard_copy`.
