Fix two separate but related problems with watch retry handling #338

Merged: 2 commits into redis-rb:master on Mar 8, 2024

Conversation

@KJTsanaktsidis (Contributor) commented on Feb 20, 2024

This PR contains two commits which each fix an aspect of retry handling with WATCH/MULTI.


First patch: 9546682

  • It's important that all of a transaction actually happens on the same connection, with no transparent reconnection allowed inside ::RedisClient. So, we wrap a watch transaction in ensure_connected_cluster_scoped(retryable: false) (see the sketch after this list).
  • We don't need to call UNWATCH on connection errors (since the connection state is already broken). redis-rb and RedisClient don't do this either.
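To make this concrete, here's a rough sketch of the shape of the first patch (schematic, not the actual diff; ensure_connected_cluster_scoped(retryable: false) comes from the description above, while the watch_transaction wrapper and its call sites are illustrative):

```ruby
# Schematic only: `client` stands for the ::RedisClient that owns the watched
# slot, and ensure_connected_cluster_scoped is the hook named above.
def watch_transaction(client, keys)
  # retryable: false pins the whole WATCH ... EXEC exchange to one physical
  # connection; a transparent reconnect would silently drop the WATCH state.
  client.ensure_connected_cluster_scoped(retryable: false) do
    client.call('WATCH', *keys)
    begin
      yield
    rescue ::RedisClient::ConnectionError
      # The connection is already broken and its WATCH state went with it,
      # so sending UNWATCH here would be pointless (redis-rb skips it too).
      raise
    rescue StandardError
      client.call('UNWATCH')
      raise
    end
  end
end
```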

Second patch: 99de3d3

If we have a WATCH, then the MULTI must exec on exactly that node; it should not be allowed for that to be redirected to a different node, because then the MULTI isn't on the same connection as the WATCH anymore!

The first part of this patch fixes this by disabling redirection handling in transaction.rb if we are in a watch. I also added a test test_the_state_of_cluster_resharding_with_reexecuted_watch for this.

That means the user block is also re-executed in this case, which is actually what we want. If the WATCH is re-executed, it's vital that the user code performing Redis reads runs again too, so it can decide afresh what to put in the transaction (based on potentially updated information).
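As a hedged illustration of why that matters, here's a hypothetical caller (the watch/multi block API shown is schematic, not the gem's exact surface). If the WATCH has to be replayed, the GET is replayed too, so the transaction is always built from the value the current WATCH protects:

```ruby
# Hypothetical caller; only the control flow mirrors the point above.
cluster.watch(['{acct}balance']) do |tx|
  # Re-executed on every retry, so the decision below is always based on the
  # value seen under the *current* WATCH, never under a stale one.
  balance = Integer(cluster.call('GET', '{acct}balance') || 0)

  tx.multi do |multi|
    multi.call('DECRBY', '{acct}balance', 10) if balance >= 10
  end
end
```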

However, the second part of this patch is a lot trickier...

This change causes a different test, test_the_state_of_cluster_resharding_with_transaction_and_watch, to break. That test asserts that:

  • if we watch something where the slot is in the middle of migrating between nodes,
  • and all the keys we wanted to watch are on the new node,
  • that the client correctly retries the transaction with ASKING against the new node

Disabling redirection handling obviously makes this stop working, and it is possible to handle this case correctly. We need to record whether or not we had to issue an ASKING on the WATCH for the transaction, and if so, pre-emptively issue an ASKING on the MULTI too. That's because this slot is not yet actually assigned to the node we're connected to (it's IMPORTING).
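Roughly, that bookkeeping would look like the sketch below (an assumption-laden illustration of the idea, not the actual implementation; client_for_ask_redirect is a hypothetical helper):

```ruby
# Illustrative only. If the WATCH succeeded solely thanks to an ASK redirect,
# the slot is still IMPORTING on the node we are now pinned to, so MULTI needs
# its own ASKING prefix on that same connection.
asked = false
begin
  client.call('WATCH', *keys)
rescue ::RedisClient::CommandError => e
  raise unless e.message.start_with?('ASK')

  client = client_for_ask_redirect(e) # hypothetical helper: the node named in the ASK error
  client.call('ASKING')
  client.call('WATCH', *keys)
  asked = true
end

# ... the user block runs here, reading through `client` ...

client.call('ASKING') if asked # the slot is not assigned to this node yet (it's IMPORTING)
client.call('MULTI')
```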

It may well not be worth it, and I'm also OK with just failing WATCH/MULTI on slots which are currently migrating. That would imply:

  • Keeping test_the_state_of_cluster_resharding_with_reexecuted_watch
  • Deleting test_the_state_of_cluster_resharding_with_transaction_and_watch

KJ Tsanaktsidis added 2 commits February 20, 2024 18:25
@supercaracal (Member) left a comment:


Thank you for your very clear report and for fixing the bug. I'm sorry for my delayed review.

@supercaracal merged commit 7eff2f2 into redis-rb:master on Mar 8, 2024
@KJTsanaktsidis (Contributor, Author):
Thank you!
