Skip to content

Transactions not always ended if client closes socket prematurely #1411

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
watson opened this issue Sep 28, 2019 · 7 comments · Fixed by #1439
Closed

Transactions not always ended if client closes socket prematurely #1411

watson opened this issue Sep 28, 2019 · 7 comments · Fixed by #1439
Assignees
Labels

Comments

@watson
Copy link
Contributor

watson commented Sep 28, 2019

The first test in test/instrumentation/modules/http/aborted-requests-enabled.js randomly hangs and never completes on Jenkins. It happens every 5 times or so we run a test on Jenkins and it happens on multiple different versions of Node.js.

This might be related to #1350

@watson watson added the ci label Sep 28, 2019
@watson watson self-assigned this Sep 28, 2019
@watson
Copy link
Contributor Author

watson commented Sep 28, 2019

I've tracked it down to the transaction never ending, which in turn means that it's not sent to the mock APM Server, which then means that we never reach expected number of assertions (t.plan(9)) in the test.

The reason why it never ends is that we're listening for the prefinish event on the ServerResponse object:

// Handle case where res.end is called after an error occurred on the
// stream (e.g. if the underlying socket was prematurely closed)
res.on('prefinish', function () {
trans.end()
})

The place this event is fired inside of Node core is in the OutgoingMessage.prototype._finish function. This function is normally called when calling res.end(), but only if there's no data in the queue (detected by the this.outputData.length === 0 check).

The problem is that our code seems to contain a race condition which triggers if data is written to the response object too soon after the underlying socket has been aborted. So at the time when the data is written, the ServerResponse object doesn't yet know the socket is actually closed and queues the data (the last part is just a guess of mine).

Or tests normally doesn't trigger this race condition because the sockets operate quite quickly. But on Jenkins there's probably something related to how we use Docker, which means it takes a long time for the ServerResponse to get notified about the socket being closed. In fact it sometimes takes more than 10ms according to my tests on Jenkins, as opposed to around 1ms on my own machine. And since we only wait 10ms before we write to the response again after the client request has been aborted, we risk triggering the race condition.

I can reproduce this error locally if I change the 10ms timeout to 0ms:

clientReq.abort()
setTimeout(function () {
res.write('Hello') // server emits clientError if written in same tick as abort
setTimeout(function () {
res.end(' World')
}, 10)
}, 10)

We could just "fix" our tests by changing the 10ms timeout to a 100ms timeout. But that would still mean that we have a race condition. If that race condition triggers, a transaction will never be recorded for the request.

So why don't we just end the transaction when the ServerResponse closes? We could do that, and that would be more foolproof, but it would mean that the reported length of the transaction would be cut short in cases where the underlying socket was closed prematurely. The server code will still continue to treat the response as one you can write too even though the data is just thrown away, so from the point of view of the server, the transaction hasn't ended yet. So ending the transaction when the socket ends, would skew the avg. transaction length metrics and report smaller response times than in reality.

Possible solutions

  1. End the transaction early and pretend all is good
    • Pros:
      • Dead easy to implement
    • Cons:
      • Avg. response times will be reported lower than they actually are
      • Sample transactions might be one of these transactions and will look weird as it's ended midway without any explanation
  2. End the transaction early, but set a custom result, e.g. closed (👈 my personal favorite currently)
    • Pros:
      • Easy to implement
      • Actually highlights what's going on instead of pretending that the transaction completed successfully
    • Cons:
      • If not all agents do this, it might look weird that the Node.js agent implements it this way
  3. Fix our current implementation so there's no race condition
    • Pros:
      • No change in behavior
    • Cons:
      • Hard to implement
      • Doesn't actually highlight what's going on

@watson watson added bug and removed ci labels Sep 28, 2019
@watson watson changed the title Aborted request tests randomly fail on Jenkins Transactions not always ended if client closes socket prematurely Sep 28, 2019
@Qard
Copy link
Contributor

Qard commented Sep 28, 2019

Perhaps we should get input from other agents? @elastic/apm-agent-devs

@axw
Copy link
Member

axw commented Oct 7, 2019

Perhaps I'm missing something, but this specific issue seems to closely related to how Node.js works.

FWIW, in Go the server will be notified when the socket is closed, but the application code needs to actively watch for this and return if it wants to. Only once the application code returns will the transaction be ended.

Option 3 seems like to right thing to do to me, but I have a couple of questions:

  • What do you mean by "Doesn't actually highlight what's going on"? As in, there's no indication that the request was aborted? Maybe we could combine options 2 and 3: fix the current implementation, but also set the result to closed or aborted?
  • How would you go about fixing the current implementation? What makes it hard?

@watson
Copy link
Contributor Author

watson commented Oct 7, 2019

@axw

this specific issue seems to closely related to how Node.js works.

Maybe yes... at least the streams of Node.js doesn't make this easy.

FWIW, in Go the server will be notified when the socket is closed, but the application code needs to actively watch for this and return if it wants to. Only once the application code returns will the transaction be ended.

In that case, it sounds like the Go agent "follows" the implementation of option 3 (which is also what I would expect for all the other agents).

What HTTP status code/result do you report for those types of request transactions where the underlying socket was closed by the client? The Node.js agent just reports whatever the reponse sets it to - e.g. HTTP 2xx.

Option 3 seems like to right thing to do to me

It might be pretty tricky to implement which is why I leaned more towards option 2. But if we can implement it and also combine it with option 2 as you suggest, then it would be the best of both worlds I guess (see below).

What do you mean by "Doesn't actually highlight what's going on"? As in, there's no indication that the request was aborted? Maybe we could combine options 2 and 3: fix the current implementation, but also set the result to closed or aborted?

Yes, that's what I meant. Combining 2 and 3 would solve this issue 👍

How would you go about fixing the current implementation? What makes it hard?

The issue is that our current implementation relies on the event prefinish which wasn't such a foolproof solution as I had originally thought. We relied on the event being emitted when the response was ended. But in certain scenarios it's not (see comment above). I'm currently not sure how to work around this, so more research is needed to figure out if option 3 is actually a valid option. This is why I thought option 2 might be a good compromise.

Currently, our implementation relies on the response object being a stream, and we "simply" listen for when that stream ends. This is what makes the implementation complicated. I guess we could instead listen for when someone calls response.end() and then end the transaction then. That would get around the stream related issues, but that might then wrongly end a transaction before the actual HTTP response is ended as any buffering happening internally in the response object after calling response.end() would be ignored (which is why we made our current implementation the way we did).

@axw
Copy link
Member

axw commented Oct 8, 2019

What HTTP status code/result do you report for those types of request transactions where the underlying socket was closed by the client? The Node.js agent just reports whatever the reponse sets it to - e.g. HTTP 2xx.

At the moment the Go agent does the same. Having an indication that the request was cancelled seems like a useful change, I'm a bit unsure of whether changing the result is the right thing to do, since it'll affect throughput graphs. OTOH, being able to filter them easily may be useful for isolating outliers.

From your earlier analysis:

The problem is that our code seems to contain a race condition which triggers if data is written to the response object too soon after the underlying socket has been aborted. So at the time when the data is written, the ServerResponse object doesn't yet know the socket is actually closed and queues the data (the last part is just a guess of mine).

Yes, this does appear to be the case. I tested with v8.11.0 and v10.16.0; the former marks the connection as destroyed immediately, the latter does not.

Currently, our implementation relies on the response object being a stream, and we "simply" listen for when that stream ends. This is what makes the implementation complicated. I guess we could instead listen for when someone calls response.end() and then end the transaction then. That would get around the stream related issues, but that might then wrongly end a transaction before the actual HTTP response is ended as any buffering happening internally in the response object after calling response.end() would be ignored (which is why we made our current implementation the way we did).

How about we hook into OutgoingMessage.end, ending the transaction immediately if the request is aborted (req.aborted), since data won't be buffered any more, and otherwise wait for prefinish?

@watson
Copy link
Contributor Author

watson commented Oct 8, 2019

How about we hook into OutgoingMessage.end, ending the transaction immediately if the request is aborted (req.aborted), since data won't be buffered any more, and otherwise wait for prefinish?

That sounds like a great idea. Unless there's some hidden edge case that I haven't thought of yet, it might just work 👍

watson added a commit to watson/apm-agent-nodejs that referenced this issue Oct 8, 2019
This fixes a race condition that in some cases meant that a request
transaction was never ended if the underlying TCP socket was closed by
the client and the server didn't discover this in time.

Fixes elastic#1411
watson added a commit that referenced this issue Oct 9, 2019
This fixes a race condition that in some cases meant that a request
transaction was never ended if the underlying TCP socket was closed by
the client and the server didn't discover this in time.

Fixes #1411
watson added a commit to watson/apm-agent-nodejs that referenced this issue Oct 9, 2019
…c#1439)

This fixes a race condition that in some cases meant that a request
transaction was never ended if the underlying TCP socket was closed by
the client and the server didn't discover this in time.

Fixes elastic#1411
@watson
Copy link
Contributor Author

watson commented Oct 9, 2019

I created a separate issue regarding marking the transaction as aborted: elastic/apm#154

v1v pushed a commit to v1v/apm-agent-nodejs that referenced this issue Oct 11, 2019
…c#1439)

This fixes a race condition that in some cases meant that a request
transaction was never ended if the underlying TCP socket was closed by
the client and the server didn't discover this in time.

Fixes elastic#1411
Qard pushed a commit that referenced this issue Oct 16, 2019
This fixes a race condition that in some cases meant that a request
transaction was never ended if the underlying TCP socket was closed by
the client and the server didn't discover this in time.

Fixes #1411
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants