Commit 7763a10

Move 'lost cluster state updates' issue to DONE
Relates elastic#34714.
1 parent 7f3b9c8 commit 7763a10

1 file changed

+27
-16
lines changed

docs/resiliency/index.asciidoc

Lines changed: 27 additions & 16 deletions
@@ -63,22 +63,6 @@ to create new scenarios. We have currently ported all published Jepsen scenarios
 framework. As the Jepsen tests evolve, we will continue porting new scenarios that are not covered yet. We are committed to investigating
 all new scenarios and will report issues that we find on this page and in our GitHub repository.
 
-[float]
-=== Repeated network partitions can cause cluster state updates to be lost (STATUS: ONGOING)
-
-During a networking partition, cluster state updates (like mapping changes or shard assignments)
-are committed if a majority of the master-eligible nodes received the update correctly. This means that the current master has access
-to enough nodes in the cluster to continue to operate correctly. When the network partition heals, the isolated nodes catch
-up with the current state and receive the previously missed changes. However, if a second partition happens while the cluster
-is still recovering from the previous one *and* the old master falls on the minority side, it may be that a new master is elected
-which has not yet catch up. If that happens, cluster state updates can be lost.
-
-This problem is mostly fixed by {GIT}20384[#20384] (v5.0.0), which takes committed cluster state updates into account during master
-election. This considerably reduces the chance of this rare problem occurring but does not fully mitigate it. If the second partition
-happens concurrently with a cluster state update and blocks the cluster state commit message from reaching a majority of nodes, it may be
-that the in flight update will be lost. If the now-isolated master can still acknowledge the cluster state update to the client this
-will amount to the loss of an acknowledged change. Fixing that last scenario needs considerable work. We are currently working on it but have no ETA yet.
-
 [float]
 === Better request retry mechanism when nodes are disconnected (STATUS: ONGOING)
 
@@ -170,6 +154,33 @@ shard.
 
 == Completed
 
+[float]
+=== Repeated network partitions can cause cluster state updates to be lost (STATUS: DONE, v7.0.0)
+
+During a networking partition, cluster state updates (like mapping changes or
+shard assignments) are committed if a majority of the master-eligible nodes
+received the update correctly. This means that the current master has access to
+enough nodes in the cluster to continue to operate correctly. When the network
+partition heals, the isolated nodes catch up with the current state and receive
+the previously missed changes. However, if a second partition happens while the
+cluster is still recovering from the previous one *and* the old master falls on
+the minority side, it may be that a new master is elected which has not yet
+catch up. If that happens, cluster state updates can be lost.
+
+This problem is mostly fixed by {GIT}20384[#20384] (v5.0.0), which takes
+committed cluster state updates into account during master election. This
+considerably reduces the chance of this rare problem occurring but does not
+fully mitigate it. If the second partition happens concurrently with a cluster
+state update and blocks the cluster state commit message from reaching a
+majority of nodes, it may be that the in flight update will be lost. If the
+now-isolated master can still acknowledge the cluster state update to the client
+this will amount to the loss of an acknowledged change.
+
+Fixing this last scenario was one of the goals of {GIT}32006[#32006] and its
+sub-issues. See particularly {GIT}32171[#32171] and
+https://github.com/elastic/elasticsearch-formal-models/blob/master/ZenWithTerms/tla/ZenWithTerms.tla[the
+TLA+ formal model] used to verify these changes.
+
 [float]
 === Port Jepsen tests dealing with loss of acknowledged writes to our testing framework (STATUS: DONE, V5.0.0)
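
The failure mode described in the moved section can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not Elasticsearch's election code: `Node`, `publish`, `elect_naive` and `elect_version_aware` are hypothetical names, a node's state is reduced to a single version number, and the remaining in-flight commit scenario is not modelled. It shows why an election that ignores applied state can lose a committed update after a second partition, while one that prefers the most up-to-date node on the majority side cannot.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    version: int = 0  # highest cluster state version this node has applied


def is_majority(count: int, total: int) -> bool:
    return count > total // 2


def publish(reachable: List[Node], cluster_size: int, version: int) -> bool:
    """Commit `version` only if the master's side (master included) is a majority."""
    if not is_majority(len(reachable), cluster_size):
        return False  # not committed, so it must not be acknowledged
    for node in reachable:
        node.version = version
    return True


def elect_naive(side: List[Node], cluster_size: int) -> Optional[Node]:
    """Worst case of an election that ignores state: the stalest node wins."""
    if not is_majority(len(side), cluster_size):
        return None
    return min(side, key=lambda n: n.version)


def elect_version_aware(side: List[Node], cluster_size: int) -> Optional[Node]:
    """Prefer the node with the newest applied state on a majority side."""
    if not is_majority(len(side), cluster_size):
        return None
    return max(side, key=lambda n: n.version)


a, b, c = Node("a"), Node("b"), Node("c")

# First partition isolates c; master a still reaches the majority {a, b}.
committed = publish([a, b], cluster_size=3, version=1)

# Second partition isolates a before c has caught up; {b, c} is the majority.
stale_master = elect_naive([b, c], cluster_size=3)
safe_master = elect_version_aware([b, c], cluster_size=3)
```

Any two majorities of the same node set share at least one node, so the side that elects a new master always contains a node that applied the committed update. The version-aware election picks that node; the naive one may not.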
