Handling connection timeouts / dropped network / db failover #1075
Comments
Hey - thanks for bringing this up! There are some tests around the pool emitting an error event on idle connections, but it looks like there are situations where the event isn't being emitted properly - that's unfortunate! I don't think you're missing anything in your config - I'm interested in any more research & fix you find about this issue. If there's anything I can do to help out on the fix or diagnose more or answer questions please let me know! |
Thanks Brian! I'll dive into it a little today and let you know if I run into any questions. |
Is this an issue with the new pool? I did very good stress-testing for lost connectivity with versions 4.x - 5.1, and it would never fall over, as long as you provide |
After diving in some, this doesn't seem to be an issue with pool itself. The default pg.Client connection isn't catching the problem and emitting the error event -- which makes sense, because in at least one case the problem occurs when the server goes away without sending a RST packet. It's difficult to respond to an error that isn't sent to you. The options to account for this are basically: |
@jessedpate when the server is gone, you are getting a connection error right away, so there is no need for a timeout in this case. And when you are executing queries through the pool, you can just log the connection errors. Doing it separately can be an unnecessary overhead, since every query would report the connectivity issue. |
You would think that would be the case!

```js
describe.only('events', function () {
  this.timeout(20000)

  it('emits error when a connection is dropped', function (done) {
    console.log('creating pool')
    var pool = new Pool({
      min: 1,
      host: 'some.remote.host',
      user: 'username',
      database: 'db_name',
      password: 'some_password'
    })
    console.log('pool created')

    pool.on('connect', function () {
      console.log('received pool connect emit')
    })

    pool.on('error', function () {
      done()
    })
  })
})
```

The above test times out if I turn off my network connection, or if I use iptables to drop return packets on 5432. Running against a local server, it did what I expected when I shut down the database (since it received a hang-up from the server). Now obviously, in a production environment, a full network outage is not likely to happen (and if it does, we've probably got bigger problems). However, the second scenario (dropping replies on 5432) very closely simulates what happens when an Amazon RDS instance is rebooted or fails over. When that happens, queries (and therefore requests from users) end up hanging until the upstream server decides it has waited long enough. |
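One way to keep user requests from hanging while this remains unfixed is to race each query against an application-level timeout. This is not part of pg or pg-pool - just a minimal sketch, assuming a pg-pool version whose query() returns a promise and a pool instance named pool:

```js
// Minimal sketch: race a pooled query against a timer so callers fail fast
// instead of hanging when the server silently stops responding.
function queryWithTimeout (pool, text, values, timeoutMs) {
  var timer
  var timeout = new Promise(function (resolve, reject) {
    timer = setTimeout(function () {
      reject(new Error('query timed out after ' + timeoutMs + 'ms'))
    }, timeoutMs)
  })
  return Promise.race([pool.query(text, values), timeout]).then(
    function (result) {
      clearTimeout(timer)
      return result
    },
    function (err) {
      clearTimeout(timer)
      throw err
    }
  )
}

// usage:
// queryWithTimeout(pool, 'SELECT 1', [], 5000)
//   .then(function () { console.log('HEALTHY') })
//   .catch(function (err) { console.log('UNHEALTHY', err.message) })
```

Note that this only stops the caller from waiting; the underlying connection can still sit in ESTABLISHED until TCP gives up, so it is a stopgap rather than a substitute for keep-alive or a socket timeout.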
I was trying out the Pool and started my app on localhost with Postgres off, and couldn't get an error in this case. My code is very simple:
|
I'm bumping up against this too. If we're looking for a solution, I'd like to offer a suggestion to look at allowing the connection to have a "statement_timeout" configuration option: https://www.postgresql.org/docs/current/static/runtime-config-client.html#GUC-STATEMENT-TIMEOUT |
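statement_timeout can already be set per session today, without any driver support, by issuing a SET once a connection is established. A minimal sketch, assuming a pg-pool instance named pool (the 5000 ms value is illustrative):

```js
// Sketch: have the server abort any statement that runs longer than 5 seconds
// on every connection the pool hands out. The value is illustrative.
pool.on('connect', function (client) {
  client.query('SET statement_timeout TO 5000', function (err) {
    if (err) {
      console.error('could not set statement_timeout', err)
    }
  })
})
```

The caveat is that statement_timeout is enforced by the server, so it only helps while the server is still reachable; it does nothing for the dropped-packet scenarios described above.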
I believe this issue was somehow introduced with the new |
At least with Redshift, I do not get any error events from either |
I'm definitely open to a pull request to address this - I haven't |
I've found a few ways to mock this out in a test environment. I am |
awesoooooooome |
I'm not so sure if the problem I have now is really related to this thread. We have an nginx proxy sitting in front of our middle tier, and its timeout is not really that long - shorter, I suspect, than the default client timeout on a connection to Postgres. That's why I wanted to propose the ability to configure the client connection timeout... I'd like to be able to set the client timeout shorter than the nginx timeout, to ensure that we're not sending back 500s to clients when a transaction just took too long to complete... is this worthy/welcome as another issue? |
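For what it's worth, timeouts like this did eventually become configurable. The sketch below assumes a reasonably current pg / pg-pool release; exact option availability per version is an assumption:

```js
// Sketch, assuming a current pg / pg-pool release: bound both connection
// acquisition and query duration so the driver gives up before an upstream
// proxy (e.g. nginx) times the request out. Connection details are illustrative.
var pool = new Pool({
  host: 'db.internal',
  user: 'app',
  database: 'app_db',
  connectionTimeoutMillis: 5000, // give up if no connection is acquired in 5s
  query_timeout: 10000,          // client-side per-query timeout
  statement_timeout: 10000       // server-side per-statement timeout
})
```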
Ultimately, I think the solution to this issue will end up solving your use case. |
We do have the same issue. AWS with Postgres in HA. |
@jessedpate did you make any further progress on this issue? |
I would be surprised. As many times as I've tried, steps for reproducing the issue are super-unreliable. This is why we've made no progress here. Sometimes it takes me 20 attempts at killing the connection physically before I can see the problem. It seems that simply closing the port from PostgreSQL itself doesn't quite do the trick... By "from PostgreSQL" I'm referring to running this command:

```sql
SELECT
  pg_terminate_backend(pid)
FROM
  pg_stat_activity
WHERE
  -- don't kill my own connection!
  pid <> pg_backend_pid()
  -- don't kill the connections to other databases
  AND datname = 'database_name'
```

You really have to kill the connection physically to be able to see the problem. Something tells me it may have a bearing on Node.js freaking out about a dead port... but then again, the issue is related to version 5.2+ only, so probably not...

P.S. I'm not giving up, I will try to reproduce it again when I find time. |
Hey there, I had the very same issue until I found out why.
I updated it to something more web-server-friendly:
So far so good, my issue was fixed. I hope it can help you folks! (documentation) |
This issue is critical. Are there any established workarounds until the fix is ready? |
@diegoperini If you update the sysctl configuration with
and if you use the db pool connection, everything should be OK. |
@xsellier That workaround is useful, but it affects a lot of other components on the same system. It's not really a substitute for the client library handling this case appropriately. |
Hi! I have a way to reproduce the issue with RDS failover in an AWS setting (using RDS and EC2). The core of the problem is that the network interface of the primary RDS instance stops responding to packets as soon as the failover is initiated, without sending RST packets. Therefore the connection will be stuck in the ESTABLISHED state for a long time. To work around it, I have patched this library to set an idle timeout on the underlying socket. The downside of this is that it affects long-running queries as well. It works for us since we don't rely on long-running queries in our production environment. Would the library maintainers be open to accept a PR exposing an option to set the idle timeout on the socket to solve this issue? |
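For readers unfamiliar with the mechanism being proposed: Node sockets already support an idle timeout that fires when no data has been seen for a given period. A generic sketch of the idea (not the actual patch; the host, port and 30 s value are illustrative):

```js
// Generic sketch of a socket idle timeout: if nothing is received for 30s,
// destroy the connection and surface an error. As noted above, this also
// interrupts legitimately long-running queries.
var net = require('net')

var socket = net.connect({ host: 'db.internal', port: 5432 })
socket.setTimeout(30000)
socket.on('timeout', function () {
  // 'timeout' does not close the socket by itself; destroy it with an error
  // so anything listening for 'error' can react.
  socket.destroy(new Error('idle timeout: no data received for 30s'))
})
socket.on('error', function (err) {
  console.error('socket error:', err.message)
})
```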
That is the approach other libraries have implemented (mongo, mysql, elastic search, et cetera). There really isn't another way around the issue as best as I've ever been able to see. |
An alternative solution which we could realise with the native client is to set the |
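The sentence above is cut off, but it presumably refers to libpq's TCP keepalive settings, which the native bindings can pass straight through in the connection string. A sketch using pg-native directly; the values and connection details are illustrative:

```js
// Sketch: libpq accepts keepalive parameters in the connection string, so the
// native client can enable them without any changes to this library.
var Client = require('pg-native')

var client = new Client()
client.connect(
  'host=db.internal user=app dbname=app_db ' +
  'keepalives=1 keepalives_idle=30 keepalives_interval=10 keepalives_count=3',
  function (err) {
    if (err) return console.error('connect failed', err)
    client.query('SELECT now()', function (err, rows) {
      console.log(err, rows)
    })
  }
)
```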
@srijs this all doesn't explain what exactly broke this, as connections worked perfectly up till version 5.1 |
Adding keep-alive configuration & better network partition handling to the 7.0 milestone. |
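In the releases that followed, keep-alive became configurable on the client itself, which avoids touching the OS-wide sysctl settings. A sketch, assuming a pg version that includes these options (connection details and values are illustrative):

```js
// Sketch: enable TCP keepalive per client socket instead of tuning the kernel
// for every process on the box.
var pool = new Pool({
  host: 'db.internal',
  user: 'app',
  database: 'app_db',
  keepAlive: true,                   // send TCP keepalive probes
  keepAliveInitialDelayMillis: 10000 // start probing after 10s of idleness
})
```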
Using keep-alive to control the connection isn't a perfect solution to this problem, as it will affect other applications in the same OS environment. |
@okischuang send the patch on over! I'd love to take a look at it. Also - what do you mean keep-alive would affect other applications in the same OS environment? |
Re-posting from another issue, as it is relevant here... From my recent conversation with a developer who was testing restoring connections through the driver... Simply by disrupting TCP packets on the network, he managed to kill the process. So the issue with restoring broken connections is 100% there, it just needs to be found. |
I think I've found the fix for this. When the backend server disconnects or the socket is closed unexpectedly, the client would emit an |
Closing for now as I think this has been taken care of! 💃 |
Hey, sorry for reviving such an old issue but, from my own testing attempting to recover from Multi-AZ RDS failover scenarios (which I'm simulating by issuing a Our current workaround amounts to:
added to This is actually possible with the current implementation, since the library allows passing an already existing socket object in the All in all, I think it'd be worthwhile to actually allow for configuration of the property, since as mentioned above, it's available on other drivers. |
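Since parts of the comment above were lost, here is a hedged sketch of the workaround it describes: supplying a pre-configured socket through the client's stream configuration so keepalive (or an idle timeout) is armed before the driver connects. The stream property and its exact behaviour are assumptions about the version in use:

```js
// Sketch of the described workaround: hand the client our own net.Socket via
// the `stream` config property (assumed to be supported by the version in use)
// and arm TCP keepalive on it ourselves. Connection details are illustrative.
var net = require('net')
var pg = require('pg')

var socket = new net.Socket()
socket.once('connect', function () {
  socket.setKeepAlive(true, 30000) // probe after 30s of idleness
})

var client = new pg.Client({
  host: 'db.internal',
  user: 'app',
  database: 'app_db',
  stream: socket
})

client.connect(function (err) {
  if (err) console.error('connect failed', err)
})
```

One socket instance can only back one client, so a pool would need a fresh socket per client (newer releases reportedly accept a factory function for stream, which makes that easier).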
Hello,
I'm working on setting up a health check within our system, and have noticed that our Postgres implementation doesn't seem to handle connection timeouts, dropped network, or a db failover event very gracefully (at least in the way it works when using a multi-az deployment in Amazon RDS). The issue is very similar to one that was fixed in the mysql package a while back -- mysqljs/mysql#821.
We effectively have a class that looks similar to this:
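The class sample itself did not survive in this copy of the issue; the sketch below is a hypothetical reconstruction based only on the description that follows (getDatabaseConnection and the HEALTHY/UNHEALTHY logging come from the text, everything else is assumed):

```js
// Hypothetical reconstruction - not the reporter's actual code.
var pg = require('pg')

function Database (config) {
  this.pool = new pg.Pool(config)

  this.pool.on('connect', function () {
    console.log('HEALTHY')
  })

  // Expected to fire when an idle client loses its connection, per the
  // pg-pool documentation quoted in item 1 below.
  this.pool.on('error', function (err) {
    console.log('UNHEALTHY', err.message)
  })
}

Database.prototype.getDatabaseConnection = function (callback) {
  // Hangs indefinitely when the server stops replying, since there is no
  // connection timeout (item 3 below).
  this.pool.connect(callback)
}
```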
In a normal environment, when our application connects, I see it spit out "HEALTHY" x the number of connections it made, as expected. However, there are a few issues:
1.) If the connection is severed (I turn off my wireless, kill my VPN, or trigger a reboot on the RDS instance), no error events are raised even though the documentation for pg-pool states an error is "Emitted whenever an idle client in the pool encounters an error. This is common when your PostgreSQL server shuts down, reboots, or a network partition otherwise causes it to become unavailable while your pool has connected clients." My expectation would be to see a string of "UNHEALTHY" statements.
2.) Similarly, if I initiate a failover in RDS, no error events are raised but I am unable to run queries. This may be due to how AWS handles the DNS entries around multi-az deployments, but I cannot test this while issue #1 remains unresolved.
3.) Calling getDatabaseConnection when no replies are received from postgres (sudo iptables -A INPUT -p tcp --source-port 5432 -j DROP, sudo iptables -F afterwards to restore) hangs on grabbing a connection, as there is no timeout setting.
Am I missing a setting in my configuration, or is this a situation that just hasn't come up yet for someone else? If it's the latter, I'd be more than happy to create a fix for the issue, but want to make sure I'm not overlooking something first.
Node version: 6.2.x
PG version: 6.0.2
Postgres version: 9.5.x
Thanks!