
Handling connection timeouts / dropped network / db failover #1075


Closed

jessedpate opened this issue Jul 11, 2016 · 33 comments

@jessedpate

Hello,

I'm working on setting up a health check within our system, and have noticed that our Postgres implementation doesn't seem to handle connection timeouts, dropped network, or a db failover event very gracefully (at least in the way it works when using a Multi-AZ deployment in Amazon RDS). The issue is very similar to one that was fixed in the mysql package a while back -- mysqljs/mysql#821.

We effectively have a class that looks similar to this:

const _ = require('lodash');
const Promise = require('bluebird');
const pg = require('pg');

// `Service` and `Errors` are application-specific and not shown here.
class PostgresSQLInterface extends Service {
    constructor(dbConnectionOptions, logger) {
        const defaults = {
            host: 'localhost',
            port: '5432',

            user: '',
            password: null,

            max: 10, // Poolsize
            min: 10,

            keepAlive: true,
            idleTimeoutMillis: 10000
        };

        this.state = {
            config: _.merge({}, defaults, dbConnectionOptions)
        };

        this.createDatabasePool();
    }

    createDatabasePool() {
        this.state.pool = new pg.Pool(this.state.config);

        this.state.pool.on('connect', () => {
            this.state.healthy = true;
            console.log('HEALTHY');
        });

        this.state.pool.on('error', () => {
            this.state.healthy = false;
            console.log('UNHEALTHY');
        });
    }

    *getDatabaseConnection() {
        try {
            const connection = yield this.state.pool.connect();

            return connection;
        }
        catch (error) {
            throw new Errors.DatabaseError("Could not connect to Postgres", {
                originalError: error
            });
        }
    }
}
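(For context, a hypothetical caller might look like the following; the wiring is illustrative only, and assumes the generator methods are run as bluebird coroutines, which is presumably why bluebird is required above.)

const db = new PostgresSQLInterface({ host: 'db.example.com' }, console);

Promise.coroutine(db.getDatabaseConnection.bind(db))()
    .then((connection) => {
        // ... run queries, then return the client to the pool
        connection.release();
    })
    .catch((error) => console.error(error));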

In a normal environment, when our application connects, I see it spit out "HEALTHY" once for each connection it makes, as expected. However, there are a few issues:

1.) If the connection is severed (I turn off my wireless, kill my VPN, or trigger a reboot on the RDS instance), no error events are raised even though the documentation for pg-pool states an error is "Emitted whenever an idle client in the pool encounters an error. This is common when your PostgreSQL server shuts down, reboots, or a network partition otherwise causes it to become unavailable while your pool has connected clients." My expectation would be to see a string of "UNHEALTHY" statements.

2.) Similarly, if I initiate a failover in RDS, no error events are raised but I am unable to run queries. This may be due to how AWS handles the DNS entries around multi-az deployments, but I cannot test this while issue #1 remains unresolved.

3.) Calling getDatabaseConnection when no replies are received from postgres (sudo iptables -A INPUT -p tcp --source-port 5432 -j DROP, sudo iptables -F afterwards to restore) hangs on grabbing a connection, as there is no timeout setting.

Am I missing a setting in my configuration, or is this a situation that just hasn't come up yet for someone else? If it's the latter, I'd be more than happy to create a fix for the issue, but want to make sure I'm not overlooking something first.

Node version: 6.2.x
PG version: 6.0.2
Postgres version: 9.5.x

Thanks!

@brianc
Owner

brianc commented Jul 11, 2016

Hey - thanks for bringing this up! There are some tests around the pool emitting an error event on idle connections, but it looks like there are situations where the event isn't being emitted properly - that's unfortunate! I don't think you're missing anything in your config - I'm interested in any more research & fix you find about this issue. If there's anything I can do to help out on the fix or diagnose more or answer questions please let me know!

@jessedpate
Author

Thanks Brian! I'll dive into it a little today and let you know if I run into any questions.

@vitaly-t
Contributor

Is this an issue with the new pool?

I did thorough stress-testing of lost connectivity with versions 4.x - 5.1, and it would never fall over, as long as you provided pg.on('error', handler).

@jessedpate
Author

After diving in some, this doesn't seem to be an issue with pool itself. The default pg.Client connection isn't catching the problem and emitting the error event -- which makes sense, because in at least one case the problem occurs when the server goes away without sending a RST packet. It's difficult to respond to an error that isn't sent to you.

The options to account for this are basically:
1.) Ping the server at regular intervals to make sure it's still there/reachable
2.) Implement a timeout when attempting to acquire a connection (a rough sketch below)
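A sketch of option 2, using bluebird's Promise.race around pool.connect(); the helper name and the 5-second value are illustrative only, and note that this rejects without cancelling the underlying connect attempt:

const Promise = require('bluebird');

function getConnectionWithTimeout(pool, timeoutMillis) {
  const timeout = new Promise((resolve, reject) => {
    setTimeout(() => reject(new Error('Timed out acquiring a connection')), timeoutMillis);
  });

  return Promise.race([pool.connect(), timeout]);
}

getConnectionWithTimeout(pool, 5000)
  .then((connection) => { /* run queries */ })
  .catch((error) => { /* mark unhealthy */ });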

@vitaly-t
Contributor

@jessedpate when the server is gone, you are getting a connection error right away, so there is no need for a timeout in this case.

And when you are executing queries through the pool, you can just log the connection errors. Doing it separately can be an unnecessary overhead, since every query would report the connectivity issue.

@jessedpate
Author

You would think that would be the case!

var Pool = require('pg').Pool // require added for completeness

describe.only('events', function () {
  this.timeout(20000)
  it('emits error when a connection is dropped', function (done) {
    console.log('creating pool')
    var pool = new Pool({
      min: 1,
      host: 'some.remote.host',
      user: 'username',
      database: 'db_name',
      password: 'some_password'
    })
    console.log('pool created')
    pool.on('connect', function () {
      console.log('received pool connect emit')
    })
    pool.on('error', function () {
      done()
    })
  })
})

The above test times out if I turn off my network connection, or if I use iptables to drop return packets on 5432. Running on a local server, it did what I expected when I shut down the database (since it received a hang up from the server).

Now obviously, in a production environment, a full network outage is not likely to happen (and if it does, we've probably got bigger problems). However, the second scenario (dropping replies on 5432) very closely simulates what happens when an Amazon RDS instance is rebooted or fails over. When that happens, queries (and therefore requests from users) end up hanging until the upstream server decides it has waited long enough.

@psypersky

I was trying the Pool and started my app locally with Postgres off, and couldn't get an error in this case.

My code is very simple:

import { Pool } from 'pg';

var pool;
try {
  pool = new Pool();
}
catch(e) {
  console.error('Error creating pool', e);
}

pool.on('error', function(error, client) {
  console.error('got pool error', error);
});

pool.query('select NOW()')
.then(res => {
  console.log('Got db res', res.rows[0]);
})
.catch(err => {
  console.error('query error', err.message, err.stack);
});

@ReinsBrain

I'm bumping up against this too. If we're looking for a solution, I'd like to suggest allowing the connection to have a "statement_timeout" configuration option:

https://www.postgresql.org/docs/current/static/runtime-config-client.html#GUC-STATEMENT-TIMEOUT
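For what it's worth, a statement timeout can already be approximated per connection by issuing a SET on each new pooled client; a rough sketch (the 5000ms value is arbitrary, and this bounds server-side query time rather than helping with a silently dead TCP connection):

pool.on('connect', (client) => {
  // statement_timeout is enforced by the Postgres server itself
  client.query('SET statement_timeout = 5000', (error) => {
    if (error) console.error('could not set statement_timeout', error);
  });
});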

@vitaly-t
Contributor

vitaly-t commented Jul 21, 2016

I believe this issue was somehow introduced with the new Pool, because up to 5.1 there was no problem handling database disconnection, that's a fact. One could just provide a handler on the library itself: pg.on('error', handler), and it always worked.

@jadbox
Contributor

jadbox commented Jul 21, 2016

At least with Redshift, I do not get any error events from either Pool or Client instances.

@brianc
Owner

brianc commented Jul 21, 2016

I'm definitely open to a pull request to address this - I haven't experienced this problem in my own code (yet). Likewise, if you can put together a repeatable, self-contained test file that reproduces the issue, that might help me get to this sooner. Unfortunately, modifying IP tables or shutting down the network connection won't really work in an automated test environment, so if there's some way you can cause the same error w/o modifying things externally then we're in business.


@jessedpate
Author

I've found a few ways to mock this out in a test environment. I am currently trying to unblock a few people at work, but I should be able to look at putting in a PR for this soon.


@brianc
Owner

brianc commented Jul 21, 2016

awesoooooooome


@ReinsBrain

I'm not so sure the problem I have now is really related to this thread. We have an nginx proxy sitting in front of our middle tier, and its timeout is not really that long - shorter, I suspect, than the default client timeout on a connection to Postgres. That's why I wanted to propose the ability to configure a client connection timeout... I'd like to be able to set the client timeout shorter than the nginx timeout, to ensure that we're not sending back 500s to clients when a transaction just took too long to complete... is this worthy/welcome of starting another issue?

@jessedpate
Author

Ultimately, I think the solution to this issue will end up solving your use case as well -- when you're dealing with (potentially) no response from a server, the best way to handle it is by allowing for the setting of a timeout.


@Neocid

Neocid commented Aug 15, 2016

We have the same issue.

AWS with Postgres in HA. The client opens a connection pool to the database. When the database fails over, connections are still established but all queries fail.

@allspiritseve

@jessedpate did you make any further progress on this issue?

@vitaly-t
Contributor

vitaly-t commented Oct 4, 2016

I would be surprised. As many times as I've tried, the steps for reproducing the issue are super-unreliable.

This is why we've made no progress here. Sometimes it takes me 20 attempts at killing the connection physically to be able to see the problem.

It seems that simply closing the connection from PostgreSQL itself doesn't quite do the trick...

By "from PostgreSQL" I'm referring to running this command:

SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE pid <> pg_backend_pid()      -- don't kill my own connection!
  AND datname = 'database_name';   -- don't kill connections to other databases

You really have to kill the connection physically to be able to see the problem.

Something tells me it may have a bearing on Node.js freaking out about a dead port... but then again, the issue is related to version 5.2+ only, so probably not...

P.S. I'm not giving up, I will try to reproduce it again when I find time.

@xsellier

xsellier commented Nov 8, 2016

Hey there,

I had the very same issue until I found out why.
On the Debian server, the TCP keepalive settings were at their default values, i.e.:

net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200

I updated them to something more web-server-friendly:

net.ipv4.tcp_keepalive_intvl = 15
net.ipv4.tcp_keepalive_probes = 2
net.ipv4.tcp_keepalive_time = 60

So far so good: my issue was fixed. I hope it can help you folks! (documentation)

@diegoperini

This issue is critical. Are there any established workarounds until the fix is ready?

@xsellier

@diegoperini If you update the sysctl configuration with

net.ipv4.tcp_keepalive_intvl = 15
net.ipv4.tcp_keepalive_probes = 2
net.ipv4.tcp_keepalive_time = 60

and if you use the db pool connection, everything should be OK.

@davepacheco

@xsellier That workaround is useful, but it affects a lot of other components on the same system. It's not really a substitute for the client library handling this case appropriately.

@srijs

srijs commented Jan 23, 2017

Hi!

I have a way to reproduce the issue with RDS failover in an AWS setting (using RDS and EC2). The core of the problem is that the network interface of the primary RDS instance stops responding to packets as soon as the failover is initiated, without sending RST packets. Therefore the connection will be stuck in the ESTABLISHED state for a long time.

To work around it, I have patched this library to set an idle timeout on the underlying socket. The downside of this is that it affects long-running queries as well. It works for us since we don't rely on long-running queries in our production environment.

Would the library maintainers be open to accept a PR exposing an option to set the idle timeout on the socket to solve this issue?
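(For anyone who wants to experiment without patching: Client accepts a pre-built socket via the stream config option, as noted later in this thread, so a sketch of the same idea looks like this; the 20-second value and connection details are placeholders.)

const net = require('net');
const { Client } = require('pg');

const stream = new net.Socket();
stream.setTimeout(20000); // fires on idle sockets, including during long-running queries
stream.on('timeout', () => stream.destroy(new Error('socket idle timeout')));

const client = new Client({
  host: 'db.example.com',
  user: 'app',
  database: 'db_name',
  stream: stream
});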

@jessedpate
Author

jessedpate commented Jan 23, 2017 via email

@srijs

srijs commented Jan 23, 2017

An alternative solution, which we could realise with the native client, is to set the keepalives_idle, keepalives_interval and keepalives_count settings to more aggressive values (or expose them as options) by passing them as query parameters in the connection string.
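Assuming those query parameters are forwarded to libpq as described, a sketch with the native bindings might look like this (the values are arbitrary):

const pg = require('pg').native;

// keepalives_* are libpq connection parameters, applied at the TCP level.
const client = new pg.Client(
  'postgres://user:password@db.example.com/db_name' +
  '?keepalives=1&keepalives_idle=30&keepalives_interval=10&keepalives_count=3'
);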

@vitaly-t
Contributor

@srijs none of this explains what exactly broke it, as connections worked perfectly up to version 5.1.

@brianc brianc added this to the pg@7.0 milestone May 24, 2017
@brianc
Owner

brianc commented May 24, 2017

Adding keep-alive configuration & better network partition handling to the 7.0 milestone.

@okischuang

Using keep-alive to control the connection isn't a perfect solution to this problem, since it will affect other applications in the same OS environment.
I vote for a client connection timeout. I've implemented and patched it in our production environment, and it seems to solve our HA problem.
I'm very willing to send the PR, but I need some time to familiarize myself with the contribution guidelines here.

@brianc
Owner

brianc commented Jun 6, 2017

@okischuang send the patch on over! I'd love to take a look at it.

also - what do you mean, keep-alive would affect other applications in the same OS environment?

@vitaly-t
Contributor

vitaly-t commented Jun 6, 2017

Re-posting from another issue, as it is relevant here...


From my recent conversation with a developer who was testing restoring connections through the driver...

Simply by disrupting TCP packets on the network, he managed to kill the process where node-postgres 6.2.3 was used, while this didn't happen with v5.1.

So the issue with restoring broken connections is 100% there, just needs to be found.

@brianc
Owner

brianc commented Jun 9, 2017

I think I've found the fix for this. When the backend server disconnects or the socket is closed unexpectedly, the client would emit an end event but not an error event. This resulted in silent disconnections, pools full of disconnected clients, inability to handle db failover, etc. I've put a patch out at [email protected] - please check it out & lemme know if it fixes your issue!
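With that patch in place, the error event quoted from the pg-pool docs at the top of this thread should actually fire, so a health check along the original poster's lines becomes workable; a minimal sketch:

let healthy = true;

pool.on('error', (error) => {
  // Fires when an idle client loses its backend; the pool discards the
  // broken client, so flag unhealthy and let new queries check out
  // fresh connections.
  healthy = false;
  console.error('idle client error', error.message);
});

pool.on('connect', () => {
  healthy = true;
});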

@brianc
Owner

brianc commented Jun 14, 2017

Closing for now as I think this has been taken care of! 💃

@je-al

je-al commented Oct 4, 2019

Hey, sorry for reviving such an old issue, but from my own testing attempting to recover from Multi-AZ RDS failover scenarios (which I'm simulating by issuing a docker pause on a db instance and then updating a DNS record to point to another), I would think this is actually not solved. I believe (I'm currently attempting to get confirmation from AWS) that what @jessedpate claims here is right, which would mean we actually need a way to hook in a timeout at the socket level.

Our current workaround amounts to:

stream.setTimeout(20000);
stream.on('timeout', () => {
  stream.end();
  connection.emit('end');
});

added to lib/connection.js.

This is actually possible with the current implementation, since the library allows passing an already existing socket object in the Client configuration, but some libraries (knex) don't actually have hook points available to modify the creation of the connection (at least from what I could see), and that had us jumping through some hoops to achieve the same behaviour (afterCreate on tarn's config; see the sketch below).

All in all, I think it'd be worthwhile to actually allow for configuration of the property, since, as mentioned above, it's available on other drivers.
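For reference, the knex workaround described above might look roughly like this; reaching into conn.connection.stream relies on pg internals, so treat it as an assumption rather than a supported API:

const knex = require('knex')({
  client: 'pg',
  connection: { host: 'db.example.com', user: 'app', database: 'db_name' },
  pool: {
    afterCreate: (conn, done) => {
      // conn is the raw pg client; its socket lives at conn.connection.stream
      conn.connection.stream.setTimeout(20000);
      conn.connection.stream.on('timeout', () => {
        conn.connection.stream.end();
        conn.connection.emit('end');
      });
      done(null, conn);
    }
  }
});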
