
Redshift Long Running Query hang indefinitely: Query completes on RedShift #1863

Open
OTooleMichael opened this issue Mar 27, 2019 · 6 comments

Comments

@OTooleMichael

OTooleMichael commented Mar 27, 2019

First off, love the module and use it often for large ETL processes for multiple clients; it's certainly the best DB connection module on npm (and I often have to connect to literally every type of DB in the same project).

After upgrading from v6.4 to v7.9, long-running queries began to hang indefinitely despite completing successfully on the Redshift (RS) instance. I can't find a similar issue.

Other queries can still be run against RS (and return results)

  • during long running query
  • and after long running query's completion on the RS server.

No errors thrown, no evidence of success or failure. Queries work fine if they are shorter.

I've tried listening to every error event I can find in the API and using pg.Client rather than pg.Pool but there is no difference.

I'm out of ideas; code below.
Help much appreciated.

const { Pool } = require('pg'); // "pg": "^7.8.2"
const env = require('../../envs').get;
const DEFAULT_RS_AUTH = env('REDSHIFT_DEFAULTS');
const REDSHIFT_USERS = env('REDSHIFT_USERS');
const Pools = {};

function getPgConfig(userId) {
  const user = REDSHIFT_USERS[userId];
  const config = Object.assign({}, DEFAULT_RS_AUTH);
  config.user = user.user;
  config.password = user.password;
  config.idleTimeoutMillis = 0;
  config.keepAlive = true; // this setting isn't in the API docs btw (but makes no difference here)
  return config;
}

function initPool(userId) {
  const config = getPgConfig(userId);
  Pools[userId] = new Pool(config);
  Pools[userId].on('error', (err, client) => { // never fired
    console.error('Unexpected error on idle client', err, userId);
    process.exit(-1);
  });
}

function getPool(userId) {
  if (!Pools[userId]) { // init the pool if it doesn't exist (manage multiple pools)
    initPool(userId);
  }
  return Pools[userId];
}

async function runQueriesList() {
  const queries = ['SQL1', 'SQL2', '....', 'SQL25', '....'];
  for (const sql of queries) {
    // queries 1 through 24 run fine, all complete in less than 3 mins
    // SQL25 runs and completes on Redshift in 3 mins +
    // SQL25 never fails or returns in Node,
    // no errors thrown, no timeout, other queries can be run with the same pool
    const res = await getPool('etl_user').query(sql);
    console.log(res);
  }
}
runQueriesList();
@boromisp
Contributor

boromisp commented Mar 27, 2019

I'm not sure what changes in the library would have surfaced this issue. Could it be that there were other changes (a different OS?) in your environment?

This sounds like it could be related to the fact that AWS silently drops TCP connections after 4-5 minutes of inactivity.

To keep the connection alive, the TCP keepalive has to be set up with at most 200 seconds of idle timeout.

You should be able to do this with the currently released version by setting the keepAlive parameter to true and modifying the system defaults.

The #1847 PR adds the keepAliveIdleMillis parameter, to enable setting this in the application.
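For reference, a minimal sketch of that setup (the sysctl values here are illustrative assumptions for a Linux host, not recommendations):

```javascript
const { Pool } = require('pg');

// keepAlive: true enables TCP keepalive on each client socket; on the released
// version the probe timing then comes from the OS defaults, which on Linux can
// be lowered with e.g.:
//   sysctl -w net.ipv4.tcp_keepalive_time=200
//   sysctl -w net.ipv4.tcp_keepalive_intvl=30
const pool = new Pool({
  // ...host, user, password, database...
  keepAlive: true,
});
```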

@talha5389

talha5389 commented Mar 27, 2019

This also seems to be happening to me.

I am running a single query. It takes about 18-19 minutes. pgAdmin shows the connection dropped and the query result successfully committed, yet the promise for the pool client's query method is never resolved.

I tried

  1. Setting min = 1 as per Connection terminated unexpectedly #1611
  2. Using native bindings as per Connection terminated unexpectedly #1611
  3. Setting idleTimeoutMillis to huge value (12 hours)
  4. Tested with keepAlive = true. Still same behavior

As mentioned by @OTooleMichael , it is working for small queries in my case.

Other things I have noticed (they might have nothing to do with the issue):

  1. If I run from Windows directly to AWS Postgres, it seems to work
  2. If I run from Windows via Docker (node:10.15.3), after some time it gave the error Connection Terminated Unexpectedly
  3. If I run from a Linux-based machine, either via Kubernetes or Ubuntu, it never returns even though pgAdmin shows the connection being terminated and the query successfully committed. The promise never resolves

@talha5389

talha5389 commented Mar 27, 2019

I tested with both keepAlive: true and the OS TCP keepalive settings as indicated by @boromisp .
That seems to have solved the issue.

@OTooleMichael
Author

OTooleMichael commented Mar 27, 2019

The above did help resolve the issue on my local machine (although no settings have changed locally save for the pg module), as per:

keepAlive parameter to true, and modifying the system defaults

@boromisp I'm looking forward to the merge request. However, will it cover throwing an error when a TCP connection is closed from the other side? That, I think, is the main bug I faced here. The complete silence made it hard to track down.

A promise/callback remaining forever open doesn't make sense, and will do things like deplete the pool, right?

Also, until it's merged, will this be a workaround?

function initPool(userId) {
  const config = getPgConfig(userId);
  Pools[userId] = new Pool(config);
  Pools[userId].once('connect', function setKeepAlive(client) {
    // overwrite the connection keepAlive to be short
    client.connection.stream.setKeepAlive(true, 200 * 1000);
  });
}

@boromisp
Contributor

@OTooleMichael It seems OK to me.

(Disclaimer: I'm not an expert; these are just my thoughts on the topic.)

Related discussions: pgsql-odbc, pgsql-hackers/1, pgsql-hackers/2
TL;DR: it's not a solved problem, but future versions of PostgreSQL / libpq might be better at detecting and handling different error conditions.

As I understand it, the current version of PostgreSQL relies on the transport layer being well behaved. If the connection is closed gracefully by either party, both the client and the server will handle it as expected.

Unfortunately, the server could simply disappear or become unresponsive. If setting more aggressive TCP keepalives solves the problem, then most likely a firewall somewhere along the way drops the connection. If you read through the linked discussions, you will find other edge cases, when only some of the components become unresponsive, while others are still working.

After reading this blog post (and later skimming the linked discussions), I don't think there is a way to detect a "broken" connection reliably without application-level health checks. The (Linux-specific) TCP_USER_TIMEOUT socket option mentioned in the linked blog post cannot be used in Node.js, and it wouldn't necessarily handle every edge case. Setting aggressive keepalive parameters could help keep the connection alive, but it would not detect broken connections.

Quick fixes, some of which you could implement depending on your use case:

Simple queries could be run with an appropriate query_timeout. The query_timeout is an undocumented parameter that can be set in the connection parameters or for individual queries. It applies to the time between sending the query to the server and getting the full response back to the user.
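A sketch of both forms (the connection parameters and timeout values are placeholder assumptions):

```javascript
const { Pool } = require('pg');

// Pool-wide: every query on this pool errors out if the full response
// isn't back within 2 minutes.
const pool = new Pool({
  // ...connection params...
  query_timeout: 120000,
});

// Per-query: override for a single statement by passing a config object.
pool
  .query({ text: 'SELECT count(*) FROM some_table', query_timeout: 30000 })
  .then((res) => console.log(res.rows))
  .catch((err) => console.error(err)); // a timeout surfaces here as an error
```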

If you use a persistent connection to wait for notifications, then regularly running SELECT 1 with a short query_timeout on this connection could work.
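That heartbeat could look something like this (a sketch; the interval, the timeout, and the reconnect policy are assumptions left to the caller):

```javascript
// Assumes `client` is the dedicated pg client kept open for notifications,
// created with e.g. new Client({ ...params, query_timeout: 5000 }).
function startHeartbeat(client, intervalMs = 60000) {
  return setInterval(() => {
    client.query('SELECT 1').catch((err) => {
      // The ping failed or timed out: treat the connection as broken and
      // destroy the socket so any pending promises settle; reconnect as needed.
      console.error('heartbeat failed:', err.message);
      client.connection.stream.destroy();
    });
  }, intervalMs);
}
```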

Where it gets more complicated is legitimately long-running queries. In those cases the connection will have no activity for extended periods, and/or it takes a long time to receive all the rows.

If you expect to receive a lot of rows, you could listen on, and set up a timeout for, the Query object's row event.

And finally, you could use the client.processID and the pg_stat_activity view to regularly check, in a separate connection, whether the query is still running. If the promise hasn't resolved after the query stopped running, you can destroy the socket (and by doing so, force the promise to settle).

As an aside, whenever you abort a query locally because the connection seems broken, it might be a good idea to use pg_terminate_backend to make sure the server also knows about it.
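Those last two suggestions combine naturally into a watchdog run on a second connection. A sketch, where `pool` is a separate monitoring pool and the interval is an arbitrary assumption:

```javascript
// `client` holds the long-running query; `pool` is used only for monitoring.
function watchBackend(pool, client, intervalMs = 30000) {
  const timer = setInterval(async () => {
    const { rows } = await pool.query(
      'SELECT state FROM pg_stat_activity WHERE pid = $1',
      [client.processID]
    );
    if (rows.length === 0 || rows[0].state !== 'active') {
      clearInterval(timer);
      // The query is no longer running server-side; destroy the socket so the
      // pending promise settles, and terminate the backend so the server-side
      // session is cleaned up too.
      client.connection.stream.destroy();
      await pool.query('SELECT pg_terminate_backend($1)', [client.processID]);
    }
  }, intervalMs);
  return timer;
}
```

The caller would clear the interval itself once the query's promise settles normally.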

And just a reminder: whenever we use timeouts to abort queries or close connections, there will be a race condition on whether or not the query succeeds.

@bradchristensen

You should be able to do this with the currently released version, by setting the keepAlive parameter to true, and modifying the system defaults.

The #1847 PR adds the keepAliveIdleMillis parameter, to enable setting this in the application.

Just came here to say that this fixed the issue for me - thanks! I'm not using RedShift but an AWS-based managed TimescaleDB instance. I'm also running the application/queries on my machine inside a Docker container, i.e. from outside AWS, so I'm guessing that's what's doing it as per the docs in the link above.

I had initially tried using keepAlive: true on its own, and I didn't realise until reading this thread (which took me a few days to find) that I needed to explicitly set the keepAliveInitialDelayMillis property for it to actually have any effect - I assumed it would have some sane default value. I've used a value of 10000, which seems to do the trick.
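For anyone else landing here, the working combination from the comment above looks like this (the rest of the config is a placeholder):

```javascript
const { Pool } = require('pg');

const pool = new Pool({
  // ...connection params...
  keepAlive: true,
  // Without this, keepAlive: true alone falls back to the OS default initial
  // delay (often 2 hours on Linux), which is far longer than AWS's idle cutoff.
  keepAliveInitialDelayMillis: 10000,
});
```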
