Fix two separate but related problems with watch retry handling #338

Merged: 2 commits into redis-rb:master on Mar 8, 2024

Conversation

@KJTsanaktsidis (Contributor) commented on Feb 20, 2024

This PR contains two commits which each fix an aspect of retry handling with WATCH/MULTI.


First patch: 9546682

  • It's important that all of a transaction actually happens on the same connection, with no transparent reconnection allowed inside ::RedisClient. So, we wrap a watch transaction in ensure_connected_cluster_scoped(retryable: false) (see the sketch after this list).
  • We don't need to call UNWATCH on connection errors (since the connection state is already broken). redis-rb and RedisClient don't do this either.
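To make this concrete, here's a rough sketch of the shape of the first patch (schematic, not the actual diff; ensure_connected_cluster_scoped(retryable: false) comes from the description above, while the watch_transaction wrapper and its call sites are illustrative):

```ruby
# Schematic only: `client` stands for the ::RedisClient that owns the watched
# slot, and ensure_connected_cluster_scoped is the hook named above.
def watch_transaction(client, keys)
  # retryable: false pins the whole WATCH ... EXEC exchange to one physical
  # connection; a transparent reconnect would silently drop the WATCH state.
  client.ensure_connected_cluster_scoped(retryable: false) do
    client.call('WATCH', *keys)
    begin
      yield
    rescue ::RedisClient::ConnectionError
      # The connection is already broken and its WATCH state went with it,
      # so sending UNWATCH here would be pointless (redis-rb skips it too).
      raise
    rescue StandardError
      client.call('UNWATCH')
      raise
    end
  end
end
```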

Second patch: 99de3d3

If we have a WATCH, then the MULTI must exec on exactly that node; it should not be allowed for that to be redirected to a different node, because then the MULTI isn't on the same connection as the WATCH anymore!

The first part of this patch fixes this by disabling redirection handling in transaction.rb if we are in a watch. I also added a test test_the_state_of_cluster_resharding_with_reexecuted_watch for this.

That means the user block is also re-executed in this case, which is actually what we want. If the WATCH is re-executed, it's vital that the user code performing Redis reads runs again too, so it can decide afresh what to put in the transaction (based on potentially updated information).
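As a hedged illustration of why that matters, here's a hypothetical caller (the watch/multi block API shown is schematic, not the gem's exact surface). If the WATCH has to be replayed, the GET is replayed too, so the transaction is always built from the value the current WATCH protects:

```ruby
# Hypothetical caller; only the control flow mirrors the point above.
cluster.watch(['{acct}balance']) do |tx|
  # Re-executed on every retry, so the decision below is always based on the
  # value seen under the *current* WATCH, never under a stale one.
  balance = Integer(cluster.call('GET', '{acct}balance') || 0)

  tx.multi do |multi|
    multi.call('DECRBY', '{acct}balance', 10) if balance >= 10
  end
end
```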

However, the second part of this patch is a lot trickier...

This change causes a different test, test_the_state_of_cluster_resharding_with_transaction_and_watch, to break. That test asserts that:

  • if we watch something where the slot is in the middle of migrating between nodes,
  • and all the keys we wanted to watch are on the new node,
  • that the client correctly retries the transaction with ASKING against the new node

Disabling redirection handling obviously makes this stop working, and it is possible to handle this case correctly. We need to record whether or not we had to issue an ASKING on the WATCH for the transaction, and if so, pre-emptively issue an ASKING on the MULTI too. That's because this slot is not yet actually assigned to the node we're connected to (it's IMPORTING).
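Roughly, that bookkeeping would look like the sketch below (an assumption-laden illustration of the idea, not the actual implementation; client_for_ask_redirect is a hypothetical helper):

```ruby
# Illustrative only. If the WATCH succeeded solely thanks to an ASK redirect,
# the slot is still IMPORTING on the node we are now pinned to, so MULTI needs
# its own ASKING prefix on that same connection.
asked = false
begin
  client.call('WATCH', *keys)
rescue ::RedisClient::CommandError => e
  raise unless e.message.start_with?('ASK')

  client = client_for_ask_redirect(e) # hypothetical helper: the node named in the ASK error
  client.call('ASKING')
  client.call('WATCH', *keys)
  asked = true
end

# ... the user block runs here, reading through `client` ...

client.call('ASKING') if asked # the slot is not assigned to this node yet (it's IMPORTING)
client.call('MULTI')
```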

It may well not be worth it, and I'm also OK with just failing WATCH/MULTI on slots which are currently migrating. That would imply:

  • Keeping test_the_state_of_cluster_resharding_with_reexecuted_watch
  • Deleting test_the_state_of_cluster_resharding_with_transaction_and_watch

KJ Tsanaktsidis added 2 commits February 20, 2024 18:25
@supercaracal (Member) left a comment:


Thank you for your very clear report and for fixing the bug. I'm sorry for my delayed review.

@supercaracal merged commit 7eff2f2 into redis-rb:master on Mar 8, 2024
@KJTsanaktsidis (Contributor, Author):
Thank you!
