Skip to content

Connection terminated unexpectedly only in Node.js v20 #3083

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
gajus opened this issue Oct 23, 2023 · 19 comments
Closed

Connection terminated unexpectedly only in Node.js v20 #3083

gajus opened this issue Oct 23, 2023 · 19 comments

Comments

@gajus
Copy link
Contributor

gajus commented Oct 23, 2023

Has anyone encountered an issue migrating from Node.js v18 to v20?

I am trying isolate exactly what's breaking, but so far unsuccessful.

All tests are passing in Node.js v18, but as soon as we try to run them in v20, they start to fail with:

   FAIL  test/contra-temporal-job-worker/workflows/generatePDFInvoice.test.ts > generatePDFInvoice for custom UNPAID invoice after milestone autogenerated invoice
  Error: Connection terminated unexpectedly
   ❯ Connection.<anonymous> ../../node_modules/.pnpm/[email protected]/node_modules/pg/lib/client.js:132:73
   ❯ Object.onceWrapper node:events:628:28
   ❯ Connection.emit node:events:514:28
   ❯ Connection.emit node:domain:488:12
   ❯ Socket.<anonymous> ../../node_modules/.pnpm/[email protected]/node_modules/pg/lib/connection.js:63:12
   ❯ Socket.emit node:events:514:28
   ❯ Socket.emit node:domain:488:12
   ❯ TCP.<anonymous> node:net:337:12

It happens to about ~1% of all tests (out of about a thousand).

It is not happening to the same tests.

Only happens when running a large number of parallel tests.

@gajus
Copy link
Contributor Author

gajus commented Oct 23, 2023

Running out of ideas here. It appears like a connection inside of a pool is getting terminated prematurely. The query happens immediately after the pool is created, so it is not a timeout question. Appears like so weird race condition. Odd that it is only happening in Node.js v20

@brianc
Copy link
Owner

brianc commented Oct 23, 2023

Sorry that's happening! Do you have any self-contained code to reproduce?

@gajus
Copy link
Contributor Author

gajus commented Oct 23, 2023

Sorry that's happening! Do you have any self-contained code to reproduce?

Been breaking my neck the whole day trying to isolate this. So far I've got nothing that I can share.

I will close this until I can share something more concrete.

@gajus gajus closed this as completed Oct 23, 2023
@sehrope
Copy link
Contributor

sehrope commented Oct 24, 2023

Have you tried bisecting the node versions or pg versions to see when the break happens?

Given the randomness of the error and how hard it is to reproduce, I suggest starting with v20.0.0 (vs. v18.LATEST) as a start to see if that's where the error is starting.

Nothing comes to mind in this module or node proper that would cause a break, but major versions are when internals like compiler versions, libc requirements, V8 runtimes, and many other things get updated. It could be any one of them or a combination of them that's leading to some new race condition.

@brianc
Copy link
Owner

brianc commented Oct 24, 2023

I will close this until I can share something more concrete.

alright...I'm standing by!

@gajus
Copy link
Contributor Author

gajus commented Oct 24, 2023

What's strange is that it happens only on 1 query.

const setupPool = await createPool(CAS_POSTGRES_DSN, {
  captureStackTrace: true,
  connectionTimeout: 5_000,
  maximumPoolSize: 1,
});

const { TEST_RUN_ID } = globalThis;

if (!TEST_RUN_ID) {
  throw new Error('TEST_RUN_ID is not set.');
}

const database = 'test_' + TEST_RUN_ID.replaceAll(/[^\da-z]/gu, '');

// https://github.com/brianc/node-postgres/issues/3083
await retry(
  async () => {
    await setupPool.query(sql.typeAlias('void')`
      CREATE DATABASE ${sql.identifier([database])}
      TEMPLATE contra_template
    `);
  },
  { delayMs: 100, maxRetries: 3 },
);

None of the other queries are affected by this. But the one that's responsible for setting up the test database keeps erroring with "Connection terminated unexpectedly". I cannot replicate the same error even once on v18.

Adding a retry here fixed our CI from flaking, but the issue definitely persists.

@gajus
Copy link
Contributor Author

gajus commented Oct 25, 2023

So still no clue after debugging this for 24 hours.

I've enabled logs of everything, but haven't caught anything definitive, esp. that would link to version update.

We are also seeing the same issue in Node.js v21. Nothing if we downgrade to v18.

One interesting data point that surfaced is ...

2023-10-25T20:19:23.8096820Z [32mcontra-api:test:vitest: [0m [31m×[39m test/contra-api/integration/queries/suggestedTools.test.ts[2m > [22mgets suggested tools with exclude argument
2023-10-25T20:19:23.8105822Z [32mcontra-api:test:vitest: [0m[31m   → Connection terminated unexpectedly[39m
2023-10-25T20:19:23.8106865Z [32mcontra-api:test:vitest: [0m[31m   → Connection terminated unexpectedly[39m
2023-10-25T20:19:24.0586996Z [32mcontra-api:test:vitest: [0m[2m2023-10-25T20:19:24.058146Z[0m [32m INFO[0m [2mtemporal_sdk_core::worker[0m[2m:[0m Initiated shutdown [3mtask_queue[0m[2m=[0mtest-121bf9d3-ce4f-41aa-8ebc-7c1578920e87 [3mnamespace[0m[2m=[0mdefault

Burst of unexpected terminations happened at 2023-10-25T20:19:23

At 20:19:22 databased logged this:

2023-10-25T20:20:54.0516892Z  2023-10-25 20:19:22.330 UTC [3178] LOG:  duration: 102.453 ms  statement: 
2023-10-25T20:20:54.0517065Z  	      CREATE DATABASE "test_13d1bd99713443258e7a08e4292960be"
2023-10-25T20:20:54.0517169Z  	      TEMPLATE contra_template
2023-10-25T20:20:54.0517260Z  	    
2023-10-25T20:20:54.0517610Z  2023-10-25 20:19:22.631 UTC [3181] FATAL:  terminating autovacuum process due to administrator command
2023-10-25T20:20:54.0517980Z  2023-10-25 20:19:22.631 UTC [3181] CONTEXT:  while scanning block 731 of relation "pg_catalog.pg_shdepend"
2023-10-25T20:20:54.0518198Z  2023-10-25 20:19:22.762 UTC [3174] LOG:  duration: 131.581 ms  statement: 
2023-10-25T20:20:54.0518378Z  	      DROP DATABASE "test_c2dd3de68fa2407485e615cd9098499e"
2023-10-25T20:20:54.0518457Z  	    
2023-10-25T20:20:54.0518681Z  2023-10-25 20:19:22.874 UTC [3192] LOG:  duration: 101.949 ms  statement: 
2023-10-25T20:20:54.0518854Z  	      CREATE DATABASE "test_cf5f40eb783e46a3ac887dc2949706b4"
2023-10-25T20:20:54.0518964Z  	      TEMPLATE contra_template

I don’t have an educated guess here, but I am curious if it is somehow related to the FATAL error we are seeing because the database is being dropped while it being vacuumed.

We don’t need autovacuum in these test database anyways, so I am removing it. Will try to validate this hypothesis.

@sehrope
Copy link
Contributor

sehrope commented Oct 25, 2023

The node client crashing when the database fast shutdowns is a known thing: #1324 (comment)

Without an error event handler, the client will crash the node process if the connection gets killed. That's only if the connection is checked out as the pool attaches error handlers, but they're only in place while the connection is in the pool. To handle all errant situations, you need to add error handlers as the connection crashed error comes as an event, not in response to a promise rejection or callback.

Maybe this is purely a race condition in the tests with the newer node versions running faster (or slower?) and that's triggering that test case at the exact moment when the DB is being dropped.

@gajus
Copy link
Contributor Author

gajus commented Oct 25, 2023

Thank you. This gives me something to work with.

@gajus
Copy link
Contributor Author

gajus commented Oct 25, 2023

@sehrope Are you aware of a good way to replicate this in a test environment? I would like to add a test to Slonik to make sure that we are handling it.

@sehrope
Copy link
Contributor

sehrope commented Oct 25, 2023

Looks like you tried five years ago 😄 #1700

Try killing the connection with pg_terminate_backend or crashing the server itself to kill the socket. Not sure if the former works but the latter would definitely work. Just make sure to do it when the connection is checked out of the pool without an error handler.

@gajus
Copy link
Contributor Author

gajus commented Oct 25, 2023

oh lol.

The pg_terminate_backend is properly handled in Slonik though.

That would have caused a different error to propagate.

  BackendTerminatedError {
    cause: error (DatabaseError) @Error {
      code: '57P01',
      column: undefined,
      constraint: undefined,
      dataType: undefined,
      detail: undefined,
      file: 'postgres.c',
      hint: undefined,
      internalPosition: undefined,
      internalQuery: undefined,
      length: 116,
      line: '3218',
      position: undefined,
      routine: 'ProcessInterrupts',
      schema: undefined,
      severity: 'FATAL',
      table: undefined,
      where: undefined,
      message: 'terminating connection due to administrator command',
    },
    notices: [],
  }

I realized I can replicate this by force killing the process that's running Postgres.

  Rejected promise returned by test. Reason:

  Error {
    notices: [],
    message: 'Connection terminated unexpectedly',
  }

  › node_modules/pg/lib/client.js:526:17
  › async executionRoutine (src/connectionMethods/query.ts:20:5)
  › async executeQuery (src/routines/executeQuery.ts:241:18)
  › async query (src/connectionMethods/query.ts:54:10)
  › async src/binders/bindPool.ts:191:18
  › async createConnection (src/factories/createConnection.ts:127:14)
  › async Object.query (src/binders/bindPool.ts:185:14)
  › async test/slonik/integration/pg.ts:90:3

To handle all errant situations, you need to add error handlers as the connection crashed error comes as an event, not in response to a promise rejection or callback.

But it looks like I am already attaching error handlers to the pool and to the client?

@gajus
Copy link
Contributor Author

gajus commented Oct 25, 2023

@gajus
Copy link
Contributor Author

gajus commented Oct 25, 2023

I am not entirely sure what's the best way to handle this.

Is it safe to retry a query in this scenario?

I would think so. Need to do more research.

@charmander
Copy link
Collaborator

Is it safe to retry a query in this scenario?

Depends on the query. It may or may not have completed.

@JafetGuerra
Copy link

JafetGuerra commented Feb 8, 2024

I am having the exact same error here. Sometimes is when reaching out to a Redis instance and sometimes is when I have to reach out the DB itself (Mysql). Could not find a way to solve this without downgrade NodeJs version. It seems that is something with node indeed.

@timurturbil
Copy link

@JafetGuerra what is node.js version you downgrade to?

@JafetGuerra
Copy link

JafetGuerra commented Jun 11, 2024

@timurturbil I am using NodeJs 18 in some projects and NodeJs 16 in only one. I believe that downgrading to 18 should work.

@turjoy-real
Copy link

In my case, this error occurred because the server from which I was calling the APIs was serving through nginx http/1.1 to my node@20 API call.

Updating the the API server to nginx http/2.x resolved the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants