docs/resiliency/index.asciidoc (27 additions, 16 deletions)
@@ -63,22 +63,6 @@ to create new scenarios. We have currently ported all published Jepsen scenarios
 framework. As the Jepsen tests evolve, we will continue porting new scenarios that are not covered yet. We are committed to investigating
 all new scenarios and will report issues that we find on this page and in our GitHub repository.
 
-[float]
-=== Repeated network partitions can cause cluster state updates to be lost (STATUS: ONGOING)
-
-During a networking partition, cluster state updates (like mapping changes or shard assignments)
-are committed if a majority of the master-eligible nodes received the update correctly. This means that the current master has access
-to enough nodes in the cluster to continue to operate correctly. When the network partition heals, the isolated nodes catch
-up with the current state and receive the previously missed changes. However, if a second partition happens while the cluster
-is still recovering from the previous one *and* the old master falls on the minority side, it may be that a new master is elected
-which has not yet caught up. If that happens, cluster state updates can be lost.
-
-This problem is mostly fixed by {GIT}20384[#20384] (v5.0.0), which takes committed cluster state updates into account during master
-election. This considerably reduces the chance of this rare problem occurring but does not fully mitigate it. If the second partition
-happens concurrently with a cluster state update and blocks the cluster state commit message from reaching a majority of nodes, it may be
-that the in-flight update will be lost. If the now-isolated master can still acknowledge the cluster state update to the client, this
-will amount to the loss of an acknowledged change. Fixing that last scenario needs considerable work. We are currently working on it but have no ETA yet.
-
 [float]
 === Better request retry mechanism when nodes are disconnected (STATUS: ONGOING)
@@ -170,6 +154,33 @@ shard.
 
 == Completed
 
+[float]
+=== Repeated network partitions can cause cluster state updates to be lost (STATUS: DONE, v7.0.0)
+
+During a networking partition, cluster state updates (like mapping changes or
+shard assignments) are committed if a majority of the master-eligible nodes
+received the update correctly. This means that the current master has access to
+enough nodes in the cluster to continue to operate correctly. When the network
+partition heals, the isolated nodes catch up with the current state and receive
+the previously missed changes. However, if a second partition happens while the
+cluster is still recovering from the previous one *and* the old master falls on
+the minority side, it may be that a new master is elected which has not yet
+caught up. If that happens, cluster state updates can be lost.
+
+This problem is mostly fixed by {GIT}20384[#20384] (v5.0.0), which takes
+committed cluster state updates into account during master election. This
+considerably reduces the chance of this rare problem occurring but does not
+fully mitigate it. If the second partition happens concurrently with a cluster
+state update and blocks the cluster state commit message from reaching a
+majority of nodes, it may be that the in-flight update will be lost. If the
+now-isolated master can still acknowledge the cluster state update to the
+client, this will amount to the loss of an acknowledged change.
+
+Fixing this last scenario was one of the goals of {GIT}32006[#32006] and its
+sub-issues. See particularly {GIT}32171[#32171] and
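The commit rule the section above describes can be sketched in a few lines. This is a hypothetical illustration, not Elasticsearch code: `majority` and `publish_update` are invented names, and real cluster-state publishing involves two phases (publish then commit), acknowledgements, and timeouts that are omitted here. The sketch only shows why an update is durable once a quorum of master-eligible nodes receives the commit, and why a partition that keeps the commit from reaching a quorum can lose a change the old master already acknowledged.

```python
def majority(n: int) -> int:
    """Quorum size for n master-eligible nodes."""
    return n // 2 + 1

def publish_update(nodes: set, reachable: set, update: str):
    """Attempt to commit `update`; it is durable only if a quorum acks it.

    `nodes` is the set of master-eligible nodes; `reachable` is the subset
    the current master can still contact (it shrinks during a partition).
    Returns (committed, acking_nodes).
    """
    acks = nodes & reachable  # nodes that received the commit message
    committed = len(acks) >= majority(len(nodes))
    return committed, acks

nodes = {"n1", "n2", "n3", "n4", "n5"}

# Healthy cluster: all five nodes reachable, quorum of 3 is met.
ok, _ = publish_update(nodes, nodes, "mapping-change-1")
assert ok

# Partition strikes mid-commit: the master reaches only itself and one
# peer, so the commit never reaches a quorum. A new master elected on the
# majority side may lack this update, yet the isolated old master could
# still have acknowledged it to the client -- an acknowledged change lost.
ok, _ = publish_update(nodes, {"n1", "n2"}, "mapping-change-2")
assert not ok
```

The fix described in {GIT}20384[#20384] corresponds, in this simplified model, to preferring candidates whose state includes the latest committed update during election; the residual window is exactly the second case above, where commit and partition race.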