
Handling connection timeouts / dropped network / db failover #1075


Closed

jessedpate opened this issue Jul 11, 2016 · 33 comments

@jessedpate

Hello,

I'm working on setting up a health check within our system, and have noticed that our Postgres implementation doesn't seem to handle connection timeouts, dropped network, or a db failover event very gracefully (at least in the way it works when using a Multi-AZ deployment in Amazon RDS). The issue is very similar to one that was fixed in the mysql package a while back -- mysqljs/mysql#821.

We effectively have a class that looks similar to this:

const _ = require('lodash');
const Promise = require('bluebird');
const pg = require('pg');

// `Service` and `Errors` are application-specific and not shown here.
class PostgresSQLInterface extends Service {
    constructor(dbConnectionOptions, logger) {
        const defaults = {
            host: 'localhost',
            port: '5432',

            user: '',
            password: null,

            max: 10, // Poolsize
            min: 10,

            keepAlive: true,
            idleTimeoutMillis: 10000
        };

        this.state = {
            config: _.merge({}, defaults, dbConnectionOptions)
        };

        this.createDatabasePool();
    }

    createDatabasePool() {
        this.state.pool = new pg.Pool(this.state.config);

        this.state.pool.on('connect', () => {
            this.state.healthy = true;
            console.log('HEALTHY');
        });

        this.state.pool.on('error', () => {
            this.state.healthy = false;
            console.log('UNHEALTHY');
        });
    }

    *getDatabaseConnection() {
        try {
            const connection = yield this.state.pool.connect();

            return connection;
        }
        catch (error) {
            throw new Errors.DatabaseError("Could not connect to Postgres", {
                originalError: error
            });
        }
    }
}
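(For context, a hypothetical caller might look like the following; the wiring is illustrative only, and assumes the generator methods are run as bluebird coroutines, which is presumably why bluebird is required above.)

const db = new PostgresSQLInterface({ host: 'db.example.com' }, console);

Promise.coroutine(db.getDatabaseConnection.bind(db))()
    .then((connection) => {
        // ... run queries, then return the client to the pool
        connection.release();
    })
    .catch((error) => console.error(error));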

In a normal environment, when our application connects, I see it spit out "HEALTHY" once for each connection it makes, as expected. However, there are a few issues:

1.) If the connection is severed (I turn off my wireless, kill my VPN, or trigger a reboot on the RDS instance), no error events are raised even though the documentation for pg-pool states an error is "Emitted whenever an idle client in the pool encounters an error. This is common when your PostgreSQL server shuts down, reboots, or a network partition otherwise causes it to become unavailable while your pool has connected clients." My expectation would be to see a string of "UNHEALTHY" statements.

2.) Similarly, if I initiate a failover in RDS, no error events are raised but I am unable to run queries. This may be due to how AWS handles the DNS entries around multi-az deployments, but I cannot test this while issue #1 remains unresolved.

3.) Calling getDatabaseConnection when no replies are received from postgres (sudo iptables -A INPUT -p tcp --source-port 5432 -j DROP, sudo iptables -F afterwards to restore) hangs on grabbing a connection, as there is no timeout setting.

Am I missing a setting in my configuration, or is this a situation that just hasn't come up yet for someone else? If it's the latter, I'd be more than happy to create a fix for the issue, but want to make sure I'm not overlooking something first.

Node version: 6.2.x
PG version: 6.0.2
Postgres version: 9.5.x

Thanks!

@brianc
Owner

brianc commented Jul 11, 2016

Hey - thanks for bringing this up! There are some tests around the pool emitting an error event on idle connections, but it looks like there are situations where the event isn't being emitted properly - that's unfortunate! I don't think you're missing anything in your config - I'm interested in any more research & fix you find about this issue. If there's anything I can do to help out on the fix or diagnose more or answer questions please let me know!

@jessedpate
Author

Thanks Brian! I'll dive into it a little today and let you know if I run into any questions.

@vitaly-t
Contributor

Is this an issue with the new pool?

I did thorough stress-testing of lost connectivity with versions 4.x - 5.1, and it would never fall over, as long as you provided pg.on('error', handler).

@jessedpate
Author

After diving in some, this doesn't seem to be an issue with pool itself. The default pg.Client connection isn't catching the problem and emitting the error event -- which makes sense, because in at least one case the problem occurs when the server goes away without sending a RST packet. It's difficult to respond to an error that isn't sent to you.

The options to account for this are basically:
1.) Ping the server at regular intervals to make sure it's still there/reachable
2.) Implement a timeout when attempting to acquire a connection (a rough sketch below)
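A sketch of option 2, using bluebird's Promise.race around pool.connect(); the helper name and the 5-second value are illustrative only, and note that this rejects without cancelling the underlying connect attempt:

const Promise = require('bluebird');

function getConnectionWithTimeout(pool, timeoutMillis) {
  const timeout = new Promise((resolve, reject) => {
    setTimeout(() => reject(new Error('Timed out acquiring a connection')), timeoutMillis);
  });

  return Promise.race([pool.connect(), timeout]);
}

getConnectionWithTimeout(pool, 5000)
  .then((connection) => { /* run queries */ })
  .catch((error) => { /* mark unhealthy */ });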

@vitaly-t
Contributor

@jessedpate when the server is gone, you are getting a connection error right away, so there is no need for a timeout in this case.

And when you are executing queries through the pool, you can just log the connection errors. Doing it separately can be an unnecessary overhead, since every query would report the connectivity issue.

@jessedpate
Author

You would think that would be the case!

var Pool = require('pg').Pool // require added for completeness

describe.only('events', function () {
  this.timeout(20000)
  it('emits error when a connection is dropped', function (done) {
    console.log('creating pool')
    var pool = new Pool({
      min: 1,
      host: 'some.remote.host',
      user: 'username',
      database: 'db_name',
      password: 'some_password'
    })
    console.log('pool created')
    pool.on('connect', function () {
      console.log('received pool connect emit')
    })
    pool.on('error', function () {
      done()
    })
  })
})

The above test times out if I turn off my network connection, or if I use iptables to drop return packets on 5432. Running on a local server, it did what I expected when I shut down the database (since it received a hang up from the server).

Now obviously, in a production environment, a full network outage is not likely to happen (and if it does, we've probably got bigger problems). However, the second scenario (dropping replies on 5432) very closely simulates what happens when an Amazon RDS instance is rebooted or fails over. When that happens, queries (and therefore requests from users) end up hanging until the upstream server decides it has waited long enough.

@psypersky

I was trying the Pool and started my app locally with Postgres off, and couldn't get an error in this case.

My code is very simple:

import { Pool } from 'pg';

var pool;
try {
  pool = new Pool();
}
catch(e) {
  console.error('Error creating pool', e);
}

pool.on('error', function(error, client) {
  console.error('got pool error', error);
});

pool.query('select NOW()')
.then(res => {
  console.log('Got db res', res.rows[0]);
})
.catch(err => {
  console.error('query error', err.message, err.stack);
});

@ReinsBrain

I'm bumping up against this too. If we're looking for a solution, I'd like to suggest allowing the connection to have a "statement_timeout" configuration option:

https://www.postgresql.org/docs/current/static/runtime-config-client.html#GUC-STATEMENT-TIMEOUT
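For what it's worth, a statement timeout can already be approximated per connection by issuing a SET on each new pooled client; a rough sketch (the 5000ms value is arbitrary, and this bounds server-side query time rather than helping with a silently dead TCP connection):

pool.on('connect', (client) => {
  // statement_timeout is enforced by the Postgres server itself
  client.query('SET statement_timeout = 5000', (error) => {
    if (error) console.error('could not set statement_timeout', error);
  });
});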

@vitaly-t
Contributor

vitaly-t commented Jul 21, 2016

I believe this issue was somehow introduced with the new Pool, because up to 5.1 there was no problem handling database disconnection, that's a fact. One could just provide a handler on the library itself: pg.on('error', handler), and it always worked.

@jadbox
Contributor

jadbox commented Jul 21, 2016

At least with Redshift, I do not get any error events from either Pool or Client instances.

@brianc
Owner

brianc commented Jul 21, 2016

I'm definitely open to a pull request to address this - I haven't experienced this problem in my own code (yet). Likewise, if you can put together a repeatable, self-contained test file that reproduces the issue, that might help me get to this sooner. Unfortunately, modifying IP tables or shutting down the network connection won't really work in an automated test environment, so if there's some way you can cause the same error w/o modifying things externally then we're in business.


@jessedpate
Author

I've found a few ways to mock this out in a test environment. I am currently trying to unblock a few people at work, but I should be able to look at putting in a PR for this soon.


@brianc
Owner

brianc commented Jul 21, 2016

awesoooooooome


@ReinsBrain

I'm not so sure the problem I have now is really related to this thread. We have an nginx proxy sitting in front of our middle tier, and its timeout is not really that long - shorter, I suspect, than the default client timeout on a connection to Postgres. That's why I wanted to propose the ability to configure a client connection timeout... I'd like to be able to set the client timeout shorter than the nginx timeout, to ensure that we're not sending back 500s to clients when a transaction just took too long to complete... is this worthy/welcome of starting another issue?

@jessedpate
Author

Ultimately, I think the solution to this issue will end up solving your use case as well -- when you're dealing with (potentially) no response from a server, the best way to handle it is by allowing for the setting of a timeout.


@Neocid

Neocid commented Aug 15, 2016

We have the same issue.

AWS with Postgres in HA. The client opens a connection pool to the database. When the database fails over, connections are still established but all queries fail.

@allspiritseve

@jessedpate did you make any further progress on this issue?

@vitaly-t
Contributor

vitaly-t commented Oct 4, 2016

I would be surprised. As many times as I've tried, the steps for reproducing the issue are super-unreliable.

This is why we've made no progress here. Sometimes it takes me 20 attempts at killing the connection physically to be able to see the problem.

It seems that simply closing the connection from PostgreSQL itself doesn't quite do the trick...

By "from PostgreSQL" I'm referring to running this command:

SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE pid <> pg_backend_pid()      -- don't kill my own connection!
  AND datname = 'database_name';   -- don't kill connections to other databases

You really have to kill the connection physically to be able to see the problem.

Something tells me it may have a bearing on Node.js freaking out about a dead port... but then again, the issue is related to version 5.2+ only, so probably not...

P.S. I'm not giving up, I will try to reproduce it again when I find time.

@xsellier

xsellier commented Nov 8, 2016

Hey there,

I had the very same issue until I found out why.
On the Debian server, the TCP keepalive settings were at their default values, i.e.:

net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200

I updated them to something more web-server-friendly:

net.ipv4.tcp_keepalive_intvl = 15
net.ipv4.tcp_keepalive_probes = 2
net.ipv4.tcp_keepalive_time = 60

So far so good: my issue was fixed. I hope it can help you folks! (documentation)

@diegoperini

This issue is critical. Are there any established workarounds until the fix is ready?

@xsellier

@diegoperini If you update the sysctl configuration with

net.ipv4.tcp_keepalive_intvl = 15
net.ipv4.tcp_keepalive_probes = 2
net.ipv4.tcp_keepalive_time = 60

and if you use the db pool connection, everything should be OK.

@davepacheco

@xsellier That workaround is useful, but it affects a lot of other components on the same system. It's not really a substitute for the client library handling this case appropriately.

@srijs

srijs commented Jan 23, 2017

Hi!

I have a way to reproduce the issue with RDS failover in an AWS setting (using RDS and EC2). The core of the problem is that the network interface of the primary RDS instance stops responding to packets as soon as the failover is initiated, without sending RST packets. Therefore the connection will be stuck in the ESTABLISHED state for a long time.

To work around it, I have patched this library to set an idle timeout on the underlying socket. The downside of this is that it affects long-running queries as well. It works for us since we don't rely on long-running queries in our production environment.

Would the library maintainers be open to accept a PR exposing an option to set the idle timeout on the socket to solve this issue?
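(For anyone who wants to experiment without patching: Client accepts a pre-built socket via the stream config option, as noted later in this thread, so a sketch of the same idea looks like this; the 20-second value and connection details are placeholders.)

const net = require('net');
const { Client } = require('pg');

const stream = new net.Socket();
stream.setTimeout(20000); // fires on idle sockets, including during long-running queries
stream.on('timeout', () => stream.destroy(new Error('socket idle timeout')));

const client = new Client({
  host: 'db.example.com',
  user: 'app',
  database: 'db_name',
  stream: stream
});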

@jessedpate
Author

jessedpate commented Jan 23, 2017 via email

@srijs

srijs commented Jan 23, 2017

An alternative solution, which we could realise with the native client, is to set the keepalives_idle, keepalives_interval and keepalives_count settings to more aggressive values (or expose them as options) by passing them as query parameters in the connection string.
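Assuming those query parameters are forwarded to libpq as described, a sketch with the native bindings might look like this (the values are arbitrary):

const pg = require('pg').native;

// keepalives_* are libpq connection parameters, applied at the TCP level.
const client = new pg.Client(
  'postgres://user:password@db.example.com/db_name' +
  '?keepalives=1&keepalives_idle=30&keepalives_interval=10&keepalives_count=3'
);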

@vitaly-t
Contributor

@srijs none of this explains what exactly broke it, as connections worked perfectly up to version 5.1.

@brianc brianc added this to the pg@7.0 milestone May 24, 2017
@brianc
Owner

brianc commented May 24, 2017

Adding keep-alive configuration & better network partition handling to the 7.0 milestone.

@okischuang

Using keep-alive to control the connection isn't a perfect solution to this problem, since it will affect other applications in the same OS environment.
I vote for a client connection timeout. I've implemented and patched it in our production environment, and it seems to solve our HA problem.
I'm very willing to send the PR, but I need some time to familiarize myself with the contribution guidelines here.

@brianc
Owner

brianc commented Jun 6, 2017

@okischuang send the patch on over! I'd love to take a look at it.

also - what do you mean, keep-alive would affect other applications in the same OS environment?

@vitaly-t
Contributor

vitaly-t commented Jun 6, 2017

Re-posting from another issue, as it is relevant here...


From my recent conversation with a developer who was testing restoring connections through the driver...

Simply by disrupting TCP packets on the network, he managed to kill the process where node-postgres 6.2.3 was used, while this didn't happen with v5.1.

So the issue with restoring broken connections is 100% there, just needs to be found.

@brianc
Owner

brianc commented Jun 9, 2017

I think I've found the fix for this. When the backend server disconnects or the socket is closed unexpectedly, the client would emit an end event but not an error event. This resulted in silent disconnections, pools full of disconnected clients, inability to handle db failover, etc. I've put a patch out at [email protected] - please check it out & lemme know if it fixes your issue!
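With that patch in place, the error event quoted from the pg-pool docs at the top of this thread should actually fire, so a health check along the original poster's lines becomes workable; a minimal sketch:

let healthy = true;

pool.on('error', (error) => {
  // Fires when an idle client loses its backend; the pool discards the
  // broken client, so flag unhealthy and let new queries check out
  // fresh connections.
  healthy = false;
  console.error('idle client error', error.message);
});

pool.on('connect', () => {
  healthy = true;
});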

@brianc
Owner

brianc commented Jun 14, 2017

Closing for now as I think this has been taken care of! 💃

@je-al

je-al commented Oct 4, 2019

Hey, sorry for reviving such an old issue, but from my own testing attempting to recover from Multi-AZ RDS failover scenarios (which I'm simulating by issuing a docker pause on a db instance and then updating a DNS record to point to another), I would think this is actually not solved. I believe (I'm currently attempting to get confirmation from AWS) that what @jessedpate claims here is right, which would mean we actually need a way to hook in a timeout at the socket level.

Our current workaround amounts to:

stream.setTimeout(20000);
stream.on('timeout', () => {
  stream.end();
  connection.emit('end');
});

added to lib/connection.js.

This is actually possible with the current implementation, since the library allows passing an already existing socket object in the Client configuration, but some libraries (knex) don't actually have hook points available to modify the creation of the connection (at least from what I could see), and that had us jumping through some hoops to achieve the same behaviour (afterCreate on tarn's config; see the sketch below).

All in all, I think it'd be worthwhile to actually allow for configuration of the property, since, as mentioned above, it's available on other drivers.
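For reference, the knex workaround described above might look roughly like this; reaching into conn.connection.stream relies on pg internals, so treat it as an assumption rather than a supported API:

const knex = require('knex')({
  client: 'pg',
  connection: { host: 'db.example.com', user: 'app', database: 'db_name' },
  pool: {
    afterCreate: (conn, done) => {
      // conn is the raw pg client; its socket lives at conn.connection.stream
      conn.connection.stream.setTimeout(20000);
      conn.connection.stream.on('timeout', () => {
        conn.connection.stream.end();
        conn.connection.emit('end');
      });
      done(null, conn);
    }
  }
});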
