Skip to content

Reconnect Strategy doesn't work if connection is lost #2948

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
edrose opened this issue May 7, 2025 · 8 comments
Open

Reconnect Strategy doesn't work if connection is lost #2948

edrose opened this issue May 7, 2025 · 8 comments
Assignees
Labels

Comments

@edrose
Copy link

edrose commented May 7, 2025

Description

I'm having some issues with the reconnect strategy in this library. It seems to work as per the documentation on the initial connection to Redis, however the behaviour changes if the connection to Redis is lost at a later stage.

The documentation states that there are three different types that can be returned:

  1. A number can be returned to indicate that the client should retry after the specified delay
  2. An error can be returned to terminate the connection
  3. false can be returned to prevent reconnection and hold the client in some kind of broken state for testing

Behaviour on Initial Connection

  • The RedisSocket.connect() function is called
  • This calls the private function RedisSocket.#connect() which deals with retries
  • There is a line in packages/client/lib/client/socket.ts:218 where the output of the reconnect strategy function is evaluated - if it's a number then a retry is scheduled; if it's not a number then it throws whatever it is.
  • Assuming the value is not false (which I assume is a noop in javascript) it bubbles up to the initial call to RedisSocket.connect() and the promise called by the library user rejects with the error

This is the behaviour that is documented.


Behaviour when connection is lost

  • The socket "error" event fires, which calls into RedisSocket.#onSocketError()
  • RedisSocket.#onSocketError() re-emits the error event and also emits "reconnecting" before calling this.#connect() on line 269
  • Crucially, that line functionally looks like this: this.#connect().catch(() => {}));

Any error in connecting is completely ignored. This means that is has the same effect as if the reconnect strategy function returned false which (quoting the documentation here) "should only be used for testing purposes". This silently leaves the client in a broken state.

What I expect to happen

At the very least expect the "error" event to be emitted with the error returned by the reconnect strategy. Even better would be for a "terminated" or similar signal to be emitted to tell the library that the retry strategy failed and allow it to do it's own recovery.

Node.js Version

18.20.6

Redis Server Version

7.3.241

Node Redis Version

5.0.1

Platform

Linux

Logs

@edrose edrose added the Bug label May 7, 2025
@nkaradzhov nkaradzhov self-assigned this May 7, 2025
@nkaradzhov
Copy link
Collaborator

@edrose I dont see the issue here. Even though this.#connect().catch(() => {})) swallows the error, #connect will still emit an error event, which will in turn trigger #onSocketError, continuing in a loop until the reconnection strategy returns something different than a number

@edrose
Copy link
Author

edrose commented May 20, 2025

True, internally there is nothing wrong with the logic - it does exactly what it's told to by the reconnection strategy. However as a user of the library I need to be able to detect when the library has given up retrying, otherwise I just have a dead client.

Listening for 'error' events isn't sufficient because those don't tell me if the client has given up retrying or not. If it has given up then further logic is needed to re-initialise the client, maybe re-fetch some configuration or even do a full restart.


The context is that we have a webapp and Redis running in a Kubernetes cluster on AWS spot instances. Every so often we get spot interruptions that cause our Redis instances to move to different nodes. When this occurs they go down, then come back up with a new IP address. Since we have replication this can occur with no downtime, but it does cause error events to be emitted whenever it happens, even if that doesn't cause an interruption in service.

To detect and recover from a persistent error I need to be able to tell the difference between "an error occurred and I'm going to try to fix it" and "an error occurred and I can't do anything more".

@nkaradzhov
Copy link
Collaborator

@edrose AFAIK the only thing that controls when the library will choose to stop retrying is the reconnectStrategy, which is controlled by the user. So, you have to return false or Error from your custom reconnectStrategy. At that point, you could invoke any other custom logic you wish. Here is a minimal example:

import { GenericContainer } from 'testcontainers';
import { createClient } from 'redis';

const container = await new GenericContainer('redis')
  .withExposedPorts(6379)
  .start();

const makeCappedRetryStrategy = (maxRetries: number, cleanup: () => void) => {
  return (retries: number, cause: Error) => {
    console.log(`retrying... ${retries}`);
    if (retries >= maxRetries) {
      cleanup();
      return false;
    }
    return 1000;
  }
}

const client = createClient({
  url: `redis://localhost:${container.getMappedPort(6379)}`,
  socket: {
    reconnectStrategy: makeCappedRetryStrategy(3, () => {
      console.log('cleanup function called');
    })
  }
});

client.on('error', () => {});
await client.connect();
await container.exec('redis-cli shutdown');

In principal, Im not against firing a 'terminated' event, but fundamentally, the behavior is controlled by the reconnectStrategy.

@nkaradzhov
Copy link
Collaborator

@edrose what do you think?

@edrose
Copy link
Author

edrose commented May 22, 2025

I don't think that code works when .duplicate() is used. So if duplicate clients are created for pub/sub then the reconnect strategy will also duplicate, which duplicates whatever work is done when the reconnect strategy function fails. There isn't any way of knowing which client called the callback function, so extra care must be taken to ensure that extra logic isn't called multiple times.

I think the pattern of using the side-effects of a callback function is also just asking for bugs to occur, and I would never do extra work in the callback unless the documentation explicitly states that it's acceptable. As a user of the library I don't know what state the Redis client is in when it calls the callback, so I wouldn't assume that it's safe to modify the state of the client whilst in that callback context.

As a bare minimum I would be using process.nextTick() to do any extra logic in the callback if I was forced to, but I still think it would be much cleaner and safer if an event is emitted.

@nkaradzhov
Copy link
Collaborator

I see your point. I think its sensible to add the event. @bobymicroby do you see any issues for us emitting a 'terminated' event on client when reconnectStrategy returns false ( which is when we kill the socket and client is unusable from that point on ) ?

@bobymicroby
Copy link
Member

The code should work when duplicate is used as every client gets its own "retry" function instance. If it doesn't, it's a bug we missed, but from what I see there shouldn't be a problem. The retryStrategy function that the user supplies has no side effects and is pure. It receives the strategy(retries, cause) - retries so far and the latest error - so you are free to make informed decisions and to decide if you want to keep retrying or not based on that.

@bobymicroby
Copy link
Member

What we can do ( just thinking out loud, not sure if it makes sense ) is wrap the last error in RetryStrategyExhausted(originalError) before emitting it here: https://github.com/bobymicroby/node-redis/blob/065eb5e9145db9152923a44b63a356c55f76cf3b/packages/client/lib/client/socket.ts#L192. Unfortunately this can break clients which expect the raw error and will need more consideration.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants