Transactions not always ended if client closes socket prematurely #1411

watson · 2019-09-28T08:46:01Z

The first test in test/instrumentation/modules/http/aborted-requests-enabled.js randomly hangs and never completes on Jenkins. It happens every 5 times or so we run a test on Jenkins and it happens on multiple different versions of Node.js.

This might be related to #1350

The text was updated successfully, but these errors were encountered:

watson · 2019-09-28T09:24:28Z

I've tracked it down to the transaction never ending, which in turn means that it's not sent to the mock APM Server, which then means that we never reach expected number of assertions (t.plan(9)) in the test.

The reason why it never ends is that we're listening for the prefinish event on the ServerResponse object:

apm-agent-nodejs/lib/instrumentation/http-shared.js

Lines 50 to 54 in 3e360ce

    
           // Handle case where res.end is called after an error occurred on the 
        
           // stream (e.g. if the underlying socket was prematurely closed) 
        
           res.on('prefinish', function () { 
        
             trans.end() 
        
           })

The place this event is fired inside of Node core is in the OutgoingMessage.prototype._finish function. This function is normally called when calling res.end(), but only if there's no data in the queue (detected by the this.outputData.length === 0 check).

The problem is that our code seems to contain a race condition which triggers if data is written to the response object too soon after the underlying socket has been aborted. So at the time when the data is written, the ServerResponse object doesn't yet know the socket is actually closed and queues the data (the last part is just a guess of mine).

Or tests normally doesn't trigger this race condition because the sockets operate quite quickly. But on Jenkins there's probably something related to how we use Docker, which means it takes a long time for the ServerResponse to get notified about the socket being closed. In fact it sometimes takes more than 10ms according to my tests on Jenkins, as opposed to around 1ms on my own machine. And since we only wait 10ms before we write to the response again after the client request has been aborted, we risk triggering the race condition.

I can reproduce this error locally if I change the 10ms timeout to 0ms:

apm-agent-nodejs/test/instrumentation/modules/http/aborted-requests-enabled.js

Lines 39 to 45 in 3e360ce

    
           clientReq.abort() 
        
           setTimeout(function () { 
        
             res.write('Hello') // server emits clientError if written in same tick as abort 
        
             setTimeout(function () { 
        
               res.end(' World') 
        
             }, 10) 
        
           }, 10)

We could just "fix" our tests by changing the 10ms timeout to a 100ms timeout. But that would still mean that we have a race condition. If that race condition triggers, a transaction will never be recorded for the request.

So why don't we just end the transaction when the ServerResponse closes? We could do that, and that would be more foolproof, but it would mean that the reported length of the transaction would be cut short in cases where the underlying socket was closed prematurely. The server code will still continue to treat the response as one you can write too even though the data is just thrown away, so from the point of view of the server, the transaction hasn't ended yet. So ending the transaction when the socket ends, would skew the avg. transaction length metrics and report smaller response times than in reality.

Possible solutions

End the transaction early and pretend all is good
- Pros:
  - Dead easy to implement
- Cons:
  - Avg. response times will be reported lower than they actually are
  - Sample transactions might be one of these transactions and will look weird as it's ended midway without any explanation
End the transaction early, but set a custom result, e.g. closed (👈 my personal favorite currently)
- Pros:
  - Easy to implement
  - Actually highlights what's going on instead of pretending that the transaction completed successfully
- Cons:
  - If not all agents do this, it might look weird that the Node.js agent implements it this way
Fix our current implementation so there's no race condition
- Pros:
  - No change in behavior
- Cons:
  - Hard to implement
  - Doesn't actually highlight what's going on

Qard · 2019-09-28T22:30:26Z

Perhaps we should get input from other agents? @elastic/apm-agent-devs

axw · 2019-10-07T07:34:30Z

Perhaps I'm missing something, but this specific issue seems to closely related to how Node.js works.

FWIW, in Go the server will be notified when the socket is closed, but the application code needs to actively watch for this and return if it wants to. Only once the application code returns will the transaction be ended.

Option 3 seems like to right thing to do to me, but I have a couple of questions:

What do you mean by "Doesn't actually highlight what's going on"? As in, there's no indication that the request was aborted? Maybe we could combine options 2 and 3: fix the current implementation, but also set the result to closed or aborted?
How would you go about fixing the current implementation? What makes it hard?

watson · 2019-10-07T09:56:18Z

@axw

this specific issue seems to closely related to how Node.js works.

Maybe yes... at least the streams of Node.js doesn't make this easy.

FWIW, in Go the server will be notified when the socket is closed, but the application code needs to actively watch for this and return if it wants to. Only once the application code returns will the transaction be ended.

In that case, it sounds like the Go agent "follows" the implementation of option 3 (which is also what I would expect for all the other agents).

What HTTP status code/result do you report for those types of request transactions where the underlying socket was closed by the client? The Node.js agent just reports whatever the reponse sets it to - e.g. HTTP 2xx.

Option 3 seems like to right thing to do to me

It might be pretty tricky to implement which is why I leaned more towards option 2. But if we can implement it and also combine it with option 2 as you suggest, then it would be the best of both worlds I guess (see below).

What do you mean by "Doesn't actually highlight what's going on"? As in, there's no indication that the request was aborted? Maybe we could combine options 2 and 3: fix the current implementation, but also set the result to closed or aborted?

Yes, that's what I meant. Combining 2 and 3 would solve this issue 👍

How would you go about fixing the current implementation? What makes it hard?

The issue is that our current implementation relies on the event prefinish which wasn't such a foolproof solution as I had originally thought. We relied on the event being emitted when the response was ended. But in certain scenarios it's not (see comment above). I'm currently not sure how to work around this, so more research is needed to figure out if option 3 is actually a valid option. This is why I thought option 2 might be a good compromise.

Currently, our implementation relies on the response object being a stream, and we "simply" listen for when that stream ends. This is what makes the implementation complicated. I guess we could instead listen for when someone calls response.end() and then end the transaction then. That would get around the stream related issues, but that might then wrongly end a transaction before the actual HTTP response is ended as any buffering happening internally in the response object after calling response.end() would be ignored (which is why we made our current implementation the way we did).

axw · 2019-10-08T05:18:41Z

What HTTP status code/result do you report for those types of request transactions where the underlying socket was closed by the client? The Node.js agent just reports whatever the reponse sets it to - e.g. HTTP 2xx.

At the moment the Go agent does the same. Having an indication that the request was cancelled seems like a useful change, I'm a bit unsure of whether changing the result is the right thing to do, since it'll affect throughput graphs. OTOH, being able to filter them easily may be useful for isolating outliers.

From your earlier analysis:

The problem is that our code seems to contain a race condition which triggers if data is written to the response object too soon after the underlying socket has been aborted. So at the time when the data is written, the ServerResponse object doesn't yet know the socket is actually closed and queues the data (the last part is just a guess of mine).

Yes, this does appear to be the case. I tested with v8.11.0 and v10.16.0; the former marks the connection as destroyed immediately, the latter does not.

Currently, our implementation relies on the response object being a stream, and we "simply" listen for when that stream ends. This is what makes the implementation complicated. I guess we could instead listen for when someone calls response.end() and then end the transaction then. That would get around the stream related issues, but that might then wrongly end a transaction before the actual HTTP response is ended as any buffering happening internally in the response object after calling response.end() would be ignored (which is why we made our current implementation the way we did).

How about we hook into OutgoingMessage.end, ending the transaction immediately if the request is aborted (req.aborted), since data won't be buffered any more, and otherwise wait for prefinish?

watson · 2019-10-08T06:19:29Z

How about we hook into OutgoingMessage.end, ending the transaction immediately if the request is aborted (req.aborted), since data won't be buffered any more, and otherwise wait for prefinish?

That sounds like a great idea. Unless there's some hidden edge case that I haven't thought of yet, it might just work 👍

This fixes a race condition that in some cases meant that a request transaction was never ended if the underlying TCP socket was closed by the client and the server didn't discover this in time. Fixes elastic#1411

This fixes a race condition that in some cases meant that a request transaction was never ended if the underlying TCP socket was closed by the client and the server didn't discover this in time. Fixes #1411

…c#1439) This fixes a race condition that in some cases meant that a request transaction was never ended if the underlying TCP socket was closed by the client and the server didn't discover this in time. Fixes elastic#1411

watson · 2019-10-09T07:09:37Z

I created a separate issue regarding marking the transaction as aborted: elastic/apm#154

…c#1439) This fixes a race condition that in some cases meant that a request transaction was never ended if the underlying TCP socket was closed by the client and the server didn't discover this in time. Fixes elastic#1411

This fixes a race condition that in some cases meant that a request transaction was never ended if the underlying TCP socket was closed by the client and the server didn't discover this in time. Fixes #1411

watson added the ci label Sep 28, 2019

watson self-assigned this Sep 28, 2019

watson added the [zube]: In Progress label Sep 28, 2019

watson added bug and removed ci labels Sep 28, 2019

watson changed the title ~~Aborted request tests randomly fail on Jenkins~~ Transactions not always ended if client closes socket prematurely Sep 28, 2019

watson mentioned this issue Oct 8, 2019

fix: always end transaction when socket is closed prematurely #1439

Merged

watson added [zube]: In Review and removed [zube]: In Progress labels Oct 8, 2019

watson closed this as completed in #1439 Oct 9, 2019

zube bot added [zube]: Done and removed [zube]: In Review labels Oct 9, 2019

watson mentioned this issue Oct 9, 2019

If client closes socket prematurely, mark transaction in a special way elastic/apm#154

Open

graphaelli removed the [zube]: Done label Jun 8, 2020

BCook98 mentioned this issue Mar 29, 2021

Transactions not ended on prematurely closed socket in Hapi <= v17 #2018

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Transactions not always ended if client closes socket prematurely #1411

Transactions not always ended if client closes socket prematurely #1411

watson commented Sep 28, 2019 •

edited by zube bot

Loading

watson commented Sep 28, 2019 •

edited

Loading

Qard commented Sep 28, 2019

axw commented Oct 7, 2019

watson commented Oct 7, 2019

axw commented Oct 8, 2019

watson commented Oct 8, 2019

watson commented Oct 9, 2019

Transactions not always ended if client closes socket prematurely #1411

Transactions not always ended if client closes socket prematurely #1411

Comments

watson commented Sep 28, 2019 • edited by zube bot Loading

watson commented Sep 28, 2019 • edited Loading

Possible solutions

Qard commented Sep 28, 2019

axw commented Oct 7, 2019

watson commented Oct 7, 2019

axw commented Oct 8, 2019

watson commented Oct 8, 2019

watson commented Oct 9, 2019

watson commented Sep 28, 2019 •

edited by zube bot

Loading

watson commented Sep 28, 2019 •

edited

Loading