
Capture panics from selector execution #334


Merged: 3 commits into main, Apr 19, 2022

Conversation

hannahhoward (Collaborator)

Goals

Prevent the whole system from going down because of an error during selector execution

Implementation

  • Adds a minimal panic recovery library
  • Sets the go routine that runs selector traversal to recover from panics with this library
  • Passes the dependencies through the code base so we can do this (I know, we need some kind of global object so we're not doing this over and over, but not now)
  • Adds an option for calling code to supply its own callback to run when Graphsync recovers from a panic
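As a sketch of the option mentioned in the last bullet: the callback is plumbed through a functional option. The names below (config, Option, WithPanicCallback, newConfig) are illustrative, not necessarily the exact graphsync API.

```go
package main

import "fmt"

// PanicCallback receives the recovered object and a debug stack
// trace whenever a panic is recovered.
type PanicCallback func(recoverObj interface{}, debugStackTrace string)

// config and Option sketch the functional-options plumbing; the
// real gsConfig fields may be named differently.
type config struct {
	panicCallback PanicCallback
}

type Option func(*config)

// WithPanicCallback lets calling code plug in its own handler to
// run when a panic is recovered.
func WithPanicCallback(cb PanicCallback) Option {
	return func(c *config) { c.panicCallback = cb }
}

func newConfig(opts ...Option) *config {
	c := &config{}
	for _, o := range opts {
		o(c)
	}
	return c
}

func main() {
	c := newConfig(WithPanicCallback(func(obj interface{}, stack string) {
		fmt.Println("panic recovered:", obj)
	}))
	// simulate a recovery notifying the user-supplied callback
	c.panicCallback("boom", "stack trace ...")
}
```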

For discussion

  • I really am not sure how to test this!
  • This is minimally scoped to the selector traversal go routine for now. It does not cover request/response execution that occurs through the TaskQueue; I would like to hold off on that for now, particularly if we're merging soon.

```diff
@@ -232,7 +244,7 @@ func New(parent context.Context, network gsnet.GraphSyncNetwork,

 asyncLoader := asyncloader.New(ctx, linkSystem)
 requestQueue := taskqueue.NewTaskQueue(ctx)
-requestManager := requestmanager.New(ctx, asyncLoader, linkSystem, outgoingRequestHooks, incomingResponseHooks, networkErrorListeners, outgoingRequestProcessingListeners, requestQueue, network.ConnectionManager(), gsConfig.maxLinksPerOutgoingRequest)
+requestManager := requestmanager.New(ctx, asyncLoader, linkSystem, outgoingRequestHooks, incomingResponseHooks, networkErrorListeners, outgoingRequestProcessingListeners, requestQueue, network.ConnectionManager(), gsConfig.maxLinksPerOutgoingRequest, gsConfig.panicCallback)
```
Collaborator review comment:

this constructor is.. unwieldy

panics/panics.go (outdated), comment on lines 8 to 26:
```go
// CallBackFn is a function that will get called with information about the panic
type CallBackFn func(recoverObj interface{}, debugStackTrace string)

// PanicHandler is a function that can be called with defer to recover from panics and pass them to a callback
// it returns an error if a recovery was needed
type PanicHandler func() error

// MakeHandler makes a handler that recovers from panics and passes them to the given callback
func MakeHandler(cb CallBackFn) PanicHandler {
	return func() error {
		obj := recover()
		if obj == nil {
			return nil
		}
		stack := string(debug.Stack())
		if cb != nil {
			cb(obj, stack)
		}
		return RecoveredPanicErr{
			PanicObj: obj,
			// (snippet truncated here in the review view)
		}
	}
}
```
Collaborator review comment:

is it worth the separate package / factory / complexity for calling recover in one place?

hannahhoward and others added 2 commits March 31, 2022 14:39
* fix panic handler, which requires recover() to be at the deferred
  callsite rather than indirected
* add a test to simulate a panic in a codec, or storage, or somewhere deep
  in ipld-prime
@rvagg rvagg force-pushed the feat/capture-panics-in-selectors branch from 374b1a3 to acd24bd Compare March 31, 2022 05:57
@rvagg (Member) commented Mar 31, 2022

Approving changes as a formality because I've taken over this from @hannahhoward who can't provide a formal review, but it still needs a review or two from @hannahhoward and/or @willscott.

So, I've rebased this and fixed it up for the current main, then expanded its use to the bindnode recovers I added in #368, and then expanded it further to cover all encodes and decodes as well. The networking piece needs a separate option: you have to call gsnet.NewFromLibp2pHost(host, gsnet.PanicCallback(...)) to get handling down there, since it's not directly connected to the per-request traversal path. I think this is reasonable; now you have both low-level panic capturing and per-request traversal panic capturing.

I implemented a test case which covers the traversal piece, and in the process discovered that the originally implemented method didn't work! Apparently you have to call recover() directly within the deferred callsite's frame, not in a function called from that frame. So I've made the interface panicHandler(recover()), which isn't as clean as not having recover() everywhere, but it apparently has to be so.
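The working pattern can be sketched as follows. recover() is called directly in the deferred function's own frame and only its result is passed down; the names (panicHandler, runTraversal) mirror the shape described above but are illustrative, not the exact graphsync code.

```go
package main

import "fmt"

// panicHandler converts the result of recover() into an error.
// It never calls recover() itself: recover() only has effect when
// invoked directly by a deferred function, so it must be called at
// the defer site and its result passed in.
func panicHandler(obj interface{}) error {
	if obj == nil {
		return nil
	}
	return fmt.Errorf("recovered panic: %v", obj)
}

// runTraversal simulates a traversal goroutine body that panics
// somewhere deep (e.g. in a codec) and converts it to an error.
func runTraversal() (err error) {
	defer func() {
		// recover() is called in the deferred frame itself
		if rerr := panicHandler(recover()); rerr != nil {
			err = rerr
		}
	}()
	panic("boom deep in a codec")
}

func main() {
	fmt.Println(runTraversal()) // prints "recovered panic: boom deep in a codec"
}
```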

Further to discussion here and elsewhere about whether having a callback is overkill: I think it's a reasonable design choice, since we're converting the panics to standard errors, but panics indicate programmer error and you really want to handle them differently. This almost gives the best of both worlds: they aren't fatal, but you can optionally plug in, inspect for them, and log them appropriately if necessary. I say "almost" because the best outcome would be more in-your-face; fatals are fatal for a very good reason, but we're dealing with sensitive services here, so that's unfortunately not very OK.

PTAL.

@willscott (Collaborator)

this is fairly granular in terms of wrapping the panic handler around each operation that has the potential to panic, like decoding a node. Do we have a sense of whether there's a performance impact at that granularity? Nodes are small and there are a lot of them in one transfer; if we could have this recovery in a higher-level function that only runs once per transfer, I'd be less worried about potential performance impacts.

@rvagg (Member) commented Apr 1, 2022

Yeah, good questions. I'll put in a bit more work trying to trace these to higher-level places, and there's a chance there's even some overlap here, which I was pondering after I stepped away from the code. But some of the messaging stuff, particularly on the receiving side, needs to be fairly granular to be safe.

Any idea about what costs are involved in adding a defer + panic?

@willscott (Collaborator)

@mvdan - do you know what the cost of a recover would be when there hasn't been a panic? How problematic would that be to do at a block level?

@rvagg (Member) commented Apr 1, 2022

Some detailed notes after digging into this; I don't actually expect anyone else to read this, but I need to record it for my own use. The take-away is that I could aggregate the bindnode and dagcbor recoveries up the stack, closer to the networking handling. The main question I have is whether it's wise to put a recovery in network.libp2pGraphSyncNetwork#handleNewStream(), wrapping the entire stream, such that a panic resulting from any part of reading from a stream will error at the top level and cancel the stream, rather than just erroring an individual message. Or should I do it per message?


Non-test & benchmark paths to low-level functions currently recovering:

  • SafeWrap - EncodeMetadata() (protocol v1) & ToNet() (protocol v2) - both called (eventually) by network.msgToStream(), which has a messageHandlerSelector which we can easily give a PanicHandler in the current setup in this PR. There's at least one EncodeNode() call that has a recovery that we'd be aggregating by moving the recovery higher for this, and potentially many more because each extension also has an EncodeNode(), so we'd aggregate a bunch of recovery calls into one. For v2 we aggregate only 2 recoveries into 1 by doing this (there's an EncodeNodeInto() that couples with the Unwrap).
  • SafeUnwrap - DecodeMetadata() (protocol v1) & FromMsgReader() (protocol v2) - both called (eventually) by network.libp2pGraphSyncNetwork#handleNewStream() - where we have easy access to a PanicHandler but if we handle it in this then we'd error out the whole stream if we got an error in one message; is this OK? Otherwise we push down into the v1 & v2 FromMsgReader() functions. Similar to SafeWrap we aggregate potentially many DecodeNode() calls with recoveries as well, which might be a good saving. For v2 we only aggregate the two calls again, SafeUnwrap() and its associated DecodeNode().
  • EncodeNode() - aside from the v1 messaging, there's a call in RequestManager#validateRequest() to try and encode a selector and check whether there's an error or not doing this. I'm not really sure this is necessary, we could just skip to a selector.CompileSelector() which is supposed to manually strictly validate the DMT for a valid selector. So I reckon we could just remove this call entirely and save some complexity.
  • EncodeNodeInto() - other than being used by EncodeNode(), it's used by protocol v2 ToNet() coupled with SafeWrap(), which we could aggregate as per above.
  • DecodeNode() - only the 2 uses in protocol v1 - selector encoding and extension encoding, which could be aggregated as per above.
  • DecodeNodeInto() - other than being used by DecodeNode(), it's being used by protocol v2 FromMsgReader() which could be aggregated as per above.

@mvdan (Contributor) commented Apr 1, 2022

My intuition with defer-recovers is that you shouldn't need to worry about the cost if you're also spawning a goroutine, as both have an amount of overhead within the same order of magnitude. I'd probably be careful about sprinkling defer-recovers at func/block levels within a single goroutine; if the defers kick in relatively often, I wouldn't be surprised if there is some noticeable overhead. It all depends on how fast your code is, though - typically the overhead of defer/recover will be in the order of microseconds, so if you're doing any I/O, it probably wouldn't make a difference.

TL;DR you should measure, but I would certainly be careful with sprinkled recovers :)
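Following the "you should measure" advice, here's a quick sketch using the standard library's testing.Benchmark; the workload is a stand-in for per-node work such as decoding, not graphsync code.

```go
package main

import (
	"fmt"
	"testing"
)

// work is a trivial stand-in for per-node work such as decoding
func work() int {
	s := 0
	for i := 0; i < 100; i++ {
		s += i
	}
	return s
}

func main() {
	// baseline: the work alone
	bare := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			_ = work()
		}
	})
	// the same work wrapped in a closure with a defer-recover,
	// as a per-node recovery would be
	wrapped := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			func() {
				defer func() { _ = recover() }()
				_ = work()
			}()
		}
	})
	fmt.Printf("bare:    %d ns/op\nwrapped: %d ns/op\n", bare.NsPerOp(), wrapped.NsPerOp())
}
```

The absolute numbers are machine-dependent; the point is only to compare the two columns before deciding whether per-node recovery is affordable.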

@mvdan (Contributor) commented Apr 1, 2022

I should also add that the overhead of a single defer should be practically negligible these days; see how golang/go#14939 was fixed. But that's about defers in general, not about recover.

@hannahhoward (Collaborator, Author)

honestly, my gut is to push panics up. The odds are high that they don't happen often. When they do, what we care about is graceful recovery without a crash, not making sure we lose as little progress/data as possible. I would honestly put:

  1. One recovery in the traversal thread, as this PR does.
  2. One recovery in the message queue internal thread (for EncodeMetadata & the message serialization functions)
  3. A handleNewStream recovery (for DecodeMetadata and message deserialization functions)

I agree having EncodeNode in validateRequest is not ideal and I would prefer to remove it.

@hannahhoward (Collaborator, Author)

To revise my suggestion above, I think for #2, the whole message queue thread is too large a surface area and we should use your suggestion of putting it in messageHandlerSelector.

So:

  1. Recovery in traversal thread
  2. Recovery in handleNewStream
  3. Recovery in messageHandlerSelector

@rvagg rvagg force-pushed the feat/capture-panics-in-selectors branch from 1f7440c to 28225d4 Compare April 7, 2022 01:54
@rvagg (Member) commented Apr 7, 2022

Updated. As per that last comment, we now have:

  1. Recovery in traversal thread
  2. Recovery in handleNewStream
  3. Recovery in messageHandlerSelector

I've lifted the recovery out of the weeds in ipldutil/* encode/decode/wrap/unwrap and it's all handled in network/* at the top level.

Implications:

  1. For receiving streams: if a peer sends a message that causes a panic, the stream is going to be terminated
  2. For sending messages, if a panic happens then it's handled on a per-message basis, not for the whole stream
  3. There's one call to ipldutil.EncodeNode(selectorSpec), in RequestManager.validateRequest(), that's not covered by any of this because it sits outside the direct messaging stack. I don't believe this should be a problem: we don't have active concerns about encoding normal nodes, and selectors are a pretty vanilla case. What I've done here, however, is move the selector.ParseSelector() call above the ipldutil.EncodeNode() call at that position, because ParseSelector() does a manual validation of the node structure and is pretty comprehensive in walking it.
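The stream-level behaviour in point 1 can be sketched as follows: one deferred recover wraps the whole message loop, so a panic from decoding any one message errors out (and would reset) the whole stream rather than the process. The function names and message type here are illustrative, not the actual network package.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// decode simulates a codec that can panic on malformed input,
// standing in for message deserialization deep in the stack.
func decode(m string) {
	if m == "malformed" {
		panic("codec: unexpected byte")
	}
}

// processStream wraps an entire stream's message loop in a single
// deferred recover; the optional callback lets calling code observe
// the panic before it is converted to a normal error.
func processStream(msgs []string, panicCallback func(obj interface{}, stack string)) (err error) {
	defer func() {
		if obj := recover(); obj != nil {
			if panicCallback != nil {
				panicCallback(obj, string(debug.Stack()))
			}
			err = fmt.Errorf("stream handler recovered panic: %v", obj)
		}
	}()
	for _, m := range msgs {
		decode(m)
	}
	return nil
}

func main() {
	fmt.Println(processStream([]string{"ok", "malformed"}, nil))
	fmt.Println(processStream([]string{"ok", "ok"}, nil))
}
```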

@rvagg rvagg marked this pull request as draft April 7, 2022 02:02
@rvagg (Member) commented Apr 7, 2022

ooof, I've noticed some hangovers from previous bindnode panic recovery which still need to be removed; so I've switched it to draft, no need to review until I switch it back (hopefully later today)

@rvagg rvagg force-pushed the feat/capture-panics-in-selectors branch from 28225d4 to 0e7ddc2 Compare April 7, 2022 02:43
Comment on lines +220 to +222:

```go
log.Debugf("graphsync net handleNewStream recovered error from %s error: %s", s.Conn().RemotePeer(), rerr)
_ = s.Reset()
go gsnet.receiver.ReceiveError(p, rerr)
```
Member review comment:
@hannahhoward I wouldn't mind a sanity-check that this is appropriate behaviour on a panic, it's roughly the same as for a standard error from the message handler.

@rvagg (Member), Apr 7, 2022:

.. there's potential for p to be unset at this call, since it's only set after the FromMsgReader call below; an alternative might be to move the p = assignment up one line to have a better chance of it being set (it could also be unset because of a panic somewhere else, like in NewVarintReaderSize()).

I moved the p = line up one, there's much less potential for it to be unset unless there's a panic elsewhere in the system.

@rvagg rvagg force-pushed the feat/capture-panics-in-selectors branch from 0e7ddc2 to 644df25 Compare April 7, 2022 02:51
@rvagg rvagg force-pushed the feat/capture-panics-in-selectors branch from 644df25 to 2dc0292 Compare April 7, 2022 02:52
@rvagg rvagg marked this pull request as ready for review April 7, 2022 02:53
@willscott (Collaborator) left a comment:

this level of handling looks reasonable to me

@rvagg rvagg merged commit 49f40a5 into main Apr 19, 2022
@rvagg rvagg deleted the feat/capture-panics-in-selectors branch April 19, 2022 05:19
marten-seemann pushed a commit that referenced this pull request Mar 2, 2023
This PR logs at debug level a few messages that seem only useful for debugging.  These messages otherwise occur in the logs frequently without much value.