This repository was archived by the owner on Aug 4, 2023. It is now read-only.

Kibana Instrumentation and APM Server transport error (ECONNRESET): socket hang up Log Messages #127

Closed
astorm opened this issue Jan 8, 2021 · 14 comments
Labels
agent-nodejs Make available for APM Agents project planning.

Comments

@astorm
Contributor

astorm commented Jan 8, 2021

We've received reports that some users are seeing the following error message

APM Server transport error (ECONNRESET): socket hang up

These users are using the Elastic Node.js Agent to instrument their Kibana development instances. This issue is a general catch-all thread for information about these errors and our attempts to get a stable, working reproduction in order to further diagnose the issue.

@github-actions github-actions bot added the agent-nodejs label Jan 8, 2021
@astorm
Contributor Author

astorm commented Jan 8, 2021

Kibana's dev server starts up with multiple Node.js processes

[Screenshot: process list showing Kibana's dev server running multiple Node.js processes]

It's unclear if all these processes are started via the cluster module, or if some are started via a traditional child_process.fork().

It's also unclear if all these processes are serving Kibana and being instrumented by the agent, or if some processes are independent of that.

Finally, it's unclear what's meant by a "kibana restart" -- does this mean some (or all) of the child processes are restarted?

Understanding Kibana's process model will be critical to understanding this bug. Without the cluster module, each running Node.js process will have its own apm-agent attached, with its own TCP connections to APM Server. However, if these processes are created with the cluster module, they share network resources and TCP connections.
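To make the distinction concrete, here is a minimal sketch of the two process models (an illustration only -- this is not Kibana's startup code, and the ports and STANDALONE flag are invented). cluster workers route listen() through the master and share its listening handle, while a plain child_process.fork() child is fully independent, so each model has different implications for which process owns which connection:

// spawn-models.js -- hypothetical illustration of the two process models
const cluster = require('cluster')
const { fork } = require('child_process')
const http = require('http')

if (process.env.STANDALONE) {
  // Model 2: a plain forked child. Nothing is shared with the parent; an
  // apm-agent started here would own its own TCP connections to APM Server.
  http.createServer((req, res) => res.end('standalone')).listen(5602)
} else if (cluster.isMaster) {
  // Each spawned process would call require('elastic-apm-node').start()
  // on its own, giving each its own agent instance.
  cluster.fork()
  fork(__filename, [], { env: { ...process.env, STANDALONE: '1' } })
} else {
  // Model 1: a cluster worker. The listen() call is routed through the
  // master, so workers share the master's listening handle.
  http.createServer((req, res) => res.end('worker')).listen(5601)
}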

Our current working theory on this bug is that during the process cycling of a restart (waves hands vaguely) bad things happen with the processes and the TCP connections while things are settling. (Theory: one process closes the connection, but other processes try to use that connection.)

In addition to solving this for Kibana, this also points to a general need to expand our multi-process support.

@astorm
Contributor Author

astorm commented Jan 8, 2021

Another aspect to consider here -- users have reported they're using APM Server in the cloud when this error occurs. This means their agent configuration looks something like

var apm = require('elastic-apm-node').start({
  // Set custom APM Server URL (default: http://localhost:8200)
  serverUrl: 'https://long-fake-string.apm.us-east-1.aws.cloud.es.io:443',
  // could be GCP or Azure as well
})

Understanding what sort of load balancing layers exist between the Agent and the APM Server in the cloud will be important in diagnosing this issue.
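One way to probe for an idle-timeout in whatever sits between the agent and the server is a small script like the following (a hedged sketch, not part of the agent; the hostname is the placeholder from the config above). It opens a TLS connection, stays idle, and reports when the other end closes it:

// idle-probe.js -- hypothetical probe; the hostname is a placeholder
const tls = require('tls')

const host = 'long-fake-string.apm.us-east-1.aws.cloud.es.io'
const start = Date.now()

const socket = tls.connect({ host, port: 443, servername: host }, () => {
  console.log('secure connection established after', Date.now() - start, 'ms')
  // Now stay idle: if a load balancer kills idle connections, 'close'
  // (or 'error' with ECONNRESET) tells us roughly when.
})

socket.on('close', () => console.log('closed after', Date.now() - start, 'ms'))
socket.on('error', (err) => console.log('error:', err.code))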

@dgieselaar
Member

From Slack:

Here's what I think is happening:

  • The error itself is caused by the agent aborting the request after serverTimeout has been reached (15s).
  • The agent writes data to a stream and pipes that stream to a request; it will close this stream every 10s, which should close the request as well.
  • In some cases, the socket for the outgoing HTTP request is created shortly before the Kibana development server's file watcher starts (e.g. watching for changes (8010 files)).
  • In some cases, mostly when a secure connection has not been established before the file watcher starts, the stream is closed before the socket has established a (secure) connection. When this happens, the request never ends and eventually times out (see the diagnostic sketch after this list).
  • I'm not sure why a secure connection is not established. If connect fires before the file watcher starts, it will usually get a secureConnect later. In other cases, it sends a ClientHello but never receives a ServerHello. I've tried fiddling with keepAliveMsecs but wasn't able to consistently fix it.
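A minimal diagnostic sketch of that window (an illustration only, not the agent's code; the URL and the 10ms delay are placeholders): it logs the outgoing socket's lifecycle and ends the request body shortly after creation, so you can see whether secureConnect arrives before the stream is closed:

// tls-window.js -- hypothetical diagnostic, not the agent's implementation
const https = require('https')

const req = https.request('https://example.com', { method: 'POST' }, (res) => {
  console.log('response status:', res.statusCode)
  res.resume()
})

req.on('socket', (socket) => {
  socket.on('connect', () => console.log('TCP connect'))
  socket.on('secureConnect', () => console.log('TLS secureConnect'))
  socket.on('close', () => console.log('socket closed'))
})

req.on('error', (err) => console.log('request error:', err.code, err.message))

// End the body stream almost immediately. If this lands before
// secureConnect (e.g. because the event loop is busy with a file watcher
// starting up), we are in the window described above.
setTimeout(() => req.end(), 10)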

@watson
Contributor

watson commented Jan 11, 2021

Let me know if I can help diagnose this problem. I wrote the stream implementation here and I know it's quite complicated and not easy to understand, so if there's anything I can do to help don't hesitate to ask 😃

@dgieselaar
Member

I see this happening on startup, but also on restarts, and potentially at any time during the lifecycle of the proxy server, though I haven't been able to confirm the latter yet.

@watson
Contributor

watson commented Jan 11, 2021

@dgieselaar Do you know if it happens outside of Kibana as well, or have you only seen this in Kibana so far? If only in Kibana, do you know if it also happens if connecting to a non-proxied APM Server?

@dgieselaar
Member

I've only seen it in Kibana in development mode with a proxy Kibana server (which is the proxy I'm referring to). I've not tried any other ways of running Kibana.

@trentm
Member

trentm commented Jan 11, 2021

Dario and Tyler have been using Kibana's master branch, which IIUC no longer uses cluster as of elastic/kibana@fd1328f

@dgieselaar
Member

I was able to consistently reproduce this by delaying the initialisation of the stream by about ~1.5s. I did this in a very gross manner: adding a timeout before initialising the StreamChopper instance (rough sketch below). There is probably a better way, and the right delay is probably dependent on the machine. But for it to consistently reproduce, the stream has to be created before the file watcher log message (watching for changes), and the socket should only connect after this message.
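For reference, the "very gross" hack might look roughly like this (a sketch under assumptions: the option values and surrounding wiring are invented, not the http-client's actual source):

// Hypothetical: delaying StreamChopper initialisation to reproduce the bug.
const StreamChopper = require('stream-chopper')

function startChopper () {
  const chopper = new StreamChopper({
    time: 10000,                 // chop (end) the stream every 10s
    type: StreamChopper.overflow
  })
  chopper.on('stream', (stream, next) => {
    // Normally the client pipes `stream` into an HTTP request to APM Server.
    stream.resume()
    stream.on('end', next)
  })
  return chopper
}

// The delay: initialise ~1.5s late so the stream is created just before the
// file watcher log message, with the socket connecting only after it.
setTimeout(() => {
  startChopper().write('example payload')
}, 1500)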

@tylersmalley

Thanks all for helping out with this. Here is a bit more information:

While in development, I was able to see the socket hang up without a Kibana server restart or any other change.

server    log   [13:37:44.590] [info][plugins][watcher] Your basic license does not support watcher. Please upgrade your license.
server    log   [13:37:44.594] [info][crossClusterReplication][plugins] Your basic license does not support crossClusterReplication. Please upgrade your license.
server    log   [13:37:44.595] [info][kibana-monitoring][monitoring][monitoring][plugins] Starting monitoring stats collection
server    log   [13:37:45.353] [info][listening] Server running at http://localhost:5601/ued
server    log   [13:37:45.866] [info][server][Kibana][http] http server running at http://localhost:5601/ued
APM Server transport error (ECONNRESET): socket hang up
APM Server transport error (ECONNRESET): socket hang up

There have been discussions around this being related to the Kibana server restarts, so I decided to work on reproducing outside that environment.

I have been able to reproduce using an 8.0.0 snapshot build of Kibana.

I am wondering if this is an issue with the APM server in Cloud. Is there anything that would be helpful to make that determination or rule it out?

@tylersmalley

@dgieselaar has informed me that the error in my previous comment was due to the apmRequestTimeout being the same as the serverTimeout.
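For context, a hedged sketch of the relevant agent options (in elastic-apm-node the intake-request duration is apiRequestTime; the values shown are the library defaults): if the socket timeout equals the time the agent deliberately holds an intake request open, a healthy request can be destroyed right at the deadline and surface as ECONNRESET:

// Sketch: keep serverTimeout comfortably larger than apiRequestTime.
require('elastic-apm-node').start({
  serverUrl: 'https://long-fake-string.apm.us-east-1.aws.cloud.es.io:443',
  apiRequestTime: '10s', // how long the agent keeps one intake request open
  serverTimeout: '30s'   // socket timeout for requests to APM Server
})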

@dgieselaar
Member

I have recently started sending data to a different APM Server instance, also in cloud, and am not seeing this error anymore, at least not as often.

@trentm
Member

trentm commented Apr 20, 2021

@dgieselaar v3.14.0 of the agent includes a fix for the blocking-behaviour issue we were seeing with the agent talking to APM server. I have elastic/kibana#97509 open to update Kibana to use the new agent. Would you be able/willing, at some point, to try to reproduce those same errors you were seeing with the updated agent?

@trentm
Member

trentm commented Aug 3, 2023

I don't know for certain, but I've not heard any more "APM Server transport error (ECONNRESET): socket hang up" issues come up in the intervening time (2y+). I'm hoping that the changes in #144 resolved this issue.

I'm closing now. We can re-open this or an issue on https://github.com/elastic/apm-agent-nodejs later if the issue re-occurs. (Note that in elastic/apm-agent-nodejs#3507 the http-client code was moved to the apm-agent-nodejs repo.)

@trentm trentm closed this as completed Aug 3, 2023