Caching of cluster slots doesn't self-heal once it doesn't know the whole cluster slot mapping, even though cluster self healed #3620

TimLovellSmith · 2025-04-27T12:51:38Z

With redis-py version 5.2.1

The error message
SlotNotCoveredError('Slot \"4890\" not covered by the cluster. \"require_full_coverage=False\"

is raised, it seems from

Line 742 in 77e8db2

raise SlotNotCoveredError(f'Slot "{slot}" is not covered by the cluster.')

Unfortunately, if you have a client instance that is raising this error, it's not going to self-heal in an important case where it needs to:

That case is: the cluster is actually healthy and serving that slot (verified via the cluster slot mapping; note also that operations succeed on some client machines but not others), but the client happens to be missing that slot from its local slots cache!

From a hasty review of the code, it appears that redis-py currently assumes that whenever a slot is missing from its cache, the slot mapping gets back into the cache in one of two ways: by reconnecting and recreating the cluster mapping, or by receiving a MOVED error.

The problem is that the MOVED error never arrives. The cluster is already healthy and serving the slot, and the client never even sends the request, so the request can't land on a wrong node and get a MOVED error response from the server.

What would be needed instead would be something like:
- deliberately route requests to the wrong node, in order to receive the MOVED error and update the cluster mapping, when we don't know it

or

- have an event handler to refresh the cluster mapping when it's not completely known
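
To make the second option concrete, here is a minimal sketch of "refresh the mapping on a cache miss, then retry". It deliberately does not use redis-py: `FakeCluster`, `ToyClient`, `refresh_slots` and `node_for_slot_self_healing` are hypothetical names made up for illustration, and `SlotNotCoveredError` is redefined locally as a stand-in so the snippet runs on its own. It is an assumption about what a fix could look like, not the actual redis-py code.

```python
class SlotNotCoveredError(Exception):
    """Local stand-in for the client's error, so the sketch needs no redis-py."""


class FakeCluster:
    """Toy server side: holds the authoritative slot -> node mapping."""

    def __init__(self, slots):
        self.slots = dict(slots)

    def cluster_slots(self):
        # Return a copy, as if the client had queried the cluster topology.
        return dict(self.slots)


class ToyClient:
    """Toy client with a local slots cache that may be incomplete."""

    def __init__(self, cluster, cached_slots):
        self.cluster = cluster
        self.slots_cache = dict(cached_slots)
        self.refresh_count = 0

    def refresh_slots(self):
        # Rebuild the local cache from the cluster's current mapping.
        self.slots_cache = self.cluster.cluster_slots()
        self.refresh_count += 1

    def node_for_slot(self, slot):
        # Behaviour described in the issue: a cache miss raises right away,
        # no request is sent, so no MOVED reply can ever repair the cache.
        try:
            return self.slots_cache[slot]
        except KeyError:
            raise SlotNotCoveredError(
                f'Slot "{slot}" is not covered by the cluster.'
            ) from None

    def node_for_slot_self_healing(self, slot, max_refreshes=1):
        # Proposed behaviour: on a miss, refresh the mapping and retry.
        # Only give up if the slot is still missing after the last refresh.
        for attempt in range(max_refreshes + 1):
            try:
                return self.node_for_slot(slot)
            except SlotNotCoveredError:
                if attempt == max_refreshes:
                    raise
                self.refresh_slots()


if __name__ == "__main__":
    cluster = FakeCluster({1: "node-a", 4890: "node-b"})
    client = ToyClient(cluster, cached_slots={1: "node-a"})  # 4890 missing locally
    print(client.node_for_slot_self_healing(4890))
```

With a plain `node_for_slot` the client stays stuck on slot 4890 forever, even though the cluster serves it. The retrying variant pays for one extra topology lookup on a miss and still raises if the slot really is uncovered after the refresh.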

petyaslavova · 2025-04-28T06:46:46Z

Hi @TimLovellSmith, I ran into the same issue last week and am currently working on a fix.

petyaslavova · 2025-04-29T16:41:42Z

Closed with PR #3621

petyaslavova self-assigned this Apr 28, 2025

petyaslavova mentioned this issue Apr 28, 2025

When SlotNotCoveredError is raised, the cluster topology should be reinitialized as part of error handling and retrying of the commands. #3621

Merged

petyaslavova closed this as completed Apr 29, 2025
