Commit fc1d9d0

ReplicationOperation should fail gracefully (#115341)
Problem: finishAsFailed could be called asynchronously in the middle of operations such as runPostReplicationActions, which try to sync the translog. finishAsFailed immediately triggers the failure of the resultListener, which releases the index shard primary operation permit. As a result, runPostReplicationActions may try to sync the translog without holding an operation permit.

Solution: We refactor the infrastructure of ReplicationOperation around pendingActions and the resultListener, replacing them with a RefCountingListener. This way, if there are async failures, they are aggregated, and the result listener is called exactly once, after all in-progress operations are done. For the specific error reported in issue #97183, this means that a call to onNoLongerPrimary (which can happen if we fail to fail a replica shard or to mark it as stale) will not immediately release the primary operation permit, so the assertion in the translog sync will be honored.

Fixes #97183
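The ref-counting idea described in the commit message can be illustrated with a small sketch. This is not the Elasticsearch RefCountingListener and does not reproduce its API; the class CountingListenerSketch, its methods (acquire, fail, close), and the event strings are all hypothetical names chosen for illustration. It only shows the general pattern: failures are recorded rather than delivered immediately, and the final callback (standing in for releasing the primary operation permit) runs once, after every acquired reference has been released.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/** Hypothetical, simplified ref-counting listener; NOT the Elasticsearch class. */
final class CountingListenerSketch {
    // Starts at 1: the creator holds one reference until close() is called.
    private final AtomicInteger refs = new AtomicInteger(1);
    private final List<Exception> failures = new ArrayList<>();
    private final Consumer<List<Exception>> onDone;

    CountingListenerSketch(Consumer<List<Exception>> onDone) {
        this.onDone = onDone;
    }

    /** Take a reference for one in-flight action; run the result once when it finishes. */
    Runnable acquire() {
        refs.incrementAndGet();
        return this::release;
    }

    /** Record a failure without completing; it is reported with the final callback. */
    void fail(Exception e) {
        synchronized (failures) {
            failures.add(e);
        }
    }

    /** Release the creator's own reference. */
    void close() {
        release();
    }

    private void release() {
        // Only the release that drops the count to zero invokes the callback.
        if (refs.decrementAndGet() == 0) {
            List<Exception> copy;
            synchronized (failures) {
                copy = List.copyOf(failures);
            }
            onDone.accept(copy);
        }
    }

    /** Illustrative scenario: a failure arrives while a post-replication step is still running. */
    static List<String> demo() {
        List<String> events = new ArrayList<>();
        CountingListenerSketch listener = new CountingListenerSketch(
            f -> events.add("permit released, failures=" + f.size()));

        Runnable postReplicationDone = listener.acquire(); // step in progress
        listener.fail(new IllegalStateException("no longer primary"));
        events.add("failure recorded");
        listener.close();            // creator is finished, but the step still holds a reference
        events.add("translog sync"); // the step still runs before the final callback
        postReplicationDone.run();   // last reference released: callback fires now
        return events;
    }
}
```

In this sketch the recorded failure does not trigger the callback on its own; the "translog sync" event is logged before "permit released", which mirrors the ordering guarantee the commit message describes.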
1 parent 8eb4d04 commit fc1d9d0

2 files changed: +214 -195 lines changed

0 commit comments