
Capture panics from selector execution #334


Merged: 3 commits into main, Apr 19, 2022

Conversation

hannahhoward (Collaborator)

Goals

Prevent the whole system from going down because of an error during selector execution

Implementation

  • Adds a minimal panic recovery library
  • Sets the go routine that runs selector traversal to recover from panics with this library
  • Passes the dependencies through the code base so we can do this (I know, we need some kind of global object so we're not doing this over and over, but not now)
  • Adds an option for calling code to supply its own callback to run when Graphsync recovers from a panic
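As a sketch of the option mentioned in the last bullet: the callback is plumbed through a functional option. The names below (config, Option, WithPanicCallback, newConfig) are illustrative, not necessarily the exact graphsync API.

```go
package main

import "fmt"

// PanicCallback receives the recovered object and a debug stack
// trace whenever a panic is recovered.
type PanicCallback func(recoverObj interface{}, debugStackTrace string)

// config and Option sketch the functional-options plumbing; the
// real gsConfig fields may be named differently.
type config struct {
	panicCallback PanicCallback
}

type Option func(*config)

// WithPanicCallback lets calling code plug in its own handler to
// run when a panic is recovered.
func WithPanicCallback(cb PanicCallback) Option {
	return func(c *config) { c.panicCallback = cb }
}

func newConfig(opts ...Option) *config {
	c := &config{}
	for _, o := range opts {
		o(c)
	}
	return c
}

func main() {
	c := newConfig(WithPanicCallback(func(obj interface{}, stack string) {
		fmt.Println("panic recovered:", obj)
	}))
	// simulate a recovery notifying the user-supplied callback
	c.panicCallback("boom", "stack trace ...")
}
```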

For discussion

  • I really am not sure how to test this!
  • This is minimally scoped to the selector traversal go routine for now. It does not cover request/response execution that occurs through the TaskQueue; I would like to hold off on that for now, particularly if we're merging soon.

```diff
@@ -232,7 +244,7 @@ func New(parent context.Context, network gsnet.GraphSyncNetwork,

 asyncLoader := asyncloader.New(ctx, linkSystem)
 requestQueue := taskqueue.NewTaskQueue(ctx)
-requestManager := requestmanager.New(ctx, asyncLoader, linkSystem, outgoingRequestHooks, incomingResponseHooks, networkErrorListeners, outgoingRequestProcessingListeners, requestQueue, network.ConnectionManager(), gsConfig.maxLinksPerOutgoingRequest)
+requestManager := requestmanager.New(ctx, asyncLoader, linkSystem, outgoingRequestHooks, incomingResponseHooks, networkErrorListeners, outgoingRequestProcessingListeners, requestQueue, network.ConnectionManager(), gsConfig.maxLinksPerOutgoingRequest, gsConfig.panicCallback)
```
Collaborator review comment:

this constructor is.. unwieldy

panics/panics.go (outdated), comment on lines 8 to 26:
```go
// CallBackFn is a function that will get called with information about the panic
type CallBackFn func(recoverObj interface{}, debugStackTrace string)

// PanicHandler is a function that can be called with defer to recover from panics and pass them to a callback
// it returns an error if a recovery was needed
type PanicHandler func() error

// MakeHandler makes a handler that recovers from panics and passes them to the given callback
func MakeHandler(cb CallBackFn) PanicHandler {
	return func() error {
		obj := recover()
		if obj == nil {
			return nil
		}
		stack := string(debug.Stack())
		if cb != nil {
			cb(obj, stack)
		}
		return RecoveredPanicErr{
			PanicObj: obj,
			// (snippet truncated here in the review view)
		}
	}
}
```
Collaborator review comment:

is it worth the separate package / factory / complexity for calling recover in one place?

hannahhoward and others added 2 commits March 31, 2022 14:39
* fix panic handler, which requires recover() to be at the deferred
  callsite rather than indirected
* add a test to simulate a panic in a codec, or storage, or somewhere deep
  in ipld-prime
@rvagg rvagg force-pushed the feat/capture-panics-in-selectors branch from 374b1a3 to acd24bd Compare March 31, 2022 05:57
@rvagg (Member) commented Mar 31, 2022

Approving changes as a formality because I've taken over this from @hannahhoward who can't provide a formal review, but it still needs a review or two from @hannahhoward and/or @willscott.

So, I've rebased this and fixed it up for the current main, then expanded its use to the bindnode recovers I added in #368, and then expanded it further to cover all encodes and decodes as well. The networking piece needs a separate option: you have to call gsnet.NewFromLibp2pHost(host, gsnet.PanicCallback(...)) to get handling down there, since it's not directly connected to the per-request traversal path. I think this is reasonable; now you have both low-level panic capturing and per-request traversal panic capturing.

I implemented a test case which covers the traversal piece, and in the process discovered that the originally implemented method didn't work! Apparently you have to call recover() directly within the deferred callsite's frame, not in a function called from that frame. So I've made the interface panicHandler(recover()), which isn't as clean as not having recover() everywhere, but it apparently has to be so.
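The working pattern can be sketched as follows. recover() is called directly in the deferred function's own frame and only its result is passed down; the names (panicHandler, runTraversal) mirror the shape described above but are illustrative, not the exact graphsync code.

```go
package main

import "fmt"

// panicHandler converts the result of recover() into an error.
// It never calls recover() itself: recover() only has effect when
// invoked directly by a deferred function, so it must be called at
// the defer site and its result passed in.
func panicHandler(obj interface{}) error {
	if obj == nil {
		return nil
	}
	return fmt.Errorf("recovered panic: %v", obj)
}

// runTraversal simulates a traversal goroutine body that panics
// somewhere deep (e.g. in a codec) and converts it to an error.
func runTraversal() (err error) {
	defer func() {
		// recover() is called in the deferred frame itself
		if rerr := panicHandler(recover()); rerr != nil {
			err = rerr
		}
	}()
	panic("boom deep in a codec")
}

func main() {
	fmt.Println(runTraversal()) // prints "recovered panic: boom deep in a codec"
}
```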

Further to discussion here and elsewhere about whether having a callback is overkill: I think it's a reasonable design choice, since we're converting the panics to standard errors, but panics indicate programmer error and you really want to handle them differently. This almost gives the best of both worlds: they aren't fatal, but you can optionally plug in, inspect for them, and log them appropriately if necessary. I say "almost" because the best outcome would be more in-your-face; fatals are fatal for a very good reason, but we're dealing with sensitive services here, so that's unfortunately not very OK.

PTAL.

@willscott (Collaborator)

this is fairly granular in terms of wrapping the panic handler around each operation that has the potential to panic, like decoding a node. Do we have a sense of whether there's a performance impact at that granularity? Nodes are small and there are a lot of them in one transfer; if we could have this recovery in a higher-level function that only runs once per transfer, I'd be less worried about potential performance impacts.

@rvagg (Member) commented Apr 1, 2022

Yeah, good questions. I'll put in a bit more work trying to trace these to higher-level places, and there's a chance there's even some overlap here, which I was pondering after I stepped away from the code. But some of the messaging stuff, particularly on the receiving side, needs to be fairly granular to be safe.

Any idea about what costs are involved in adding a defer + panic?

@willscott (Collaborator)

@mvdan - do you know what the cost of a recover would be when there hasn't been a panic? How problematic would that be to do at a block level?

@rvagg (Member) commented Apr 1, 2022

Some detailed notes after digging into this; I don't actually expect anyone else to read this, but I need to record it for my own use. The take-away is that I could aggregate the bindnode and dagcbor recoveries up the stack, closer to the networking handling. The main question I have is whether it's wise to put a recovery in network.libp2pGraphSyncNetwork#handleNewStream(), wrapping the entire stream, such that a panic resulting from any part of reading from a stream will error at the top level and cancel the stream, rather than just erroring an individual message. Or should I do it per message?


Non-test & benchmark paths to low-level functions currently recovering:

  • SafeWrap - EncodeMetadata() (protocol v1) & ToNet() (protocol v2) - both called (eventually) by network.msgToStream(), which has a messageHandlerSelector which we can easily give a PanicHandler in the current setup in this PR. There's at least one EncodeNode() call that has a recovery that we'd be aggregating by moving the recovery higher for this, and potentially many more because each extension also has an EncodeNode(), so we'd aggregate a bunch of recovery calls into one. For v2 we aggregate only 2 recoveries into 1 by doing this (there's an EncodeNodeInto() that couples with the Unwrap).
  • SafeUnwrap - DecodeMetadata() (protocol v1) & FromMsgReader() (protocol v2) - both called (eventually) by network.libp2pGraphSyncNetwork#handleNewStream() - where we have easy access to a PanicHandler but if we handle it in this then we'd error out the whole stream if we got an error in one message; is this OK? Otherwise we push down into the v1 & v2 FromMsgReader() functions. Similar to SafeWrap we aggregate potentially many DecodeNode() calls with recoveries as well, which might be a good saving. For v2 we only aggregate the two calls again, SafeUnwrap() and its associated DecodeNode().
  • EncodeNode() - aside from the v1 messaging, there's a call in RequestManager#validateRequest() to try and encode a selector and check whether there's an error or not doing this. I'm not really sure this is necessary, we could just skip to a selector.CompileSelector() which is supposed to manually strictly validate the DMT for a valid selector. So I reckon we could just remove this call entirely and save some complexity.
  • EncodeNodeInto() - other than being used by EncodeNode(), it's used by protocol v2 ToNet() coupled with SafeWrap(), which we could aggregate as per above.
  • DecodeNode() - only the 2 uses in protocol v1 - selector encoding and extension encoding, which could be aggregated as per above.
  • DecodeNodeInto() - other than being used by DecodeNode(), it's being used by protocol v2 FromMsgReader() which could be aggregated as per above.

@mvdan (Contributor) commented Apr 1, 2022

My intuition with defer-recovers is that you shouldn't need to worry about the cost if you're also spawning a goroutine, as both have an amount of overhead within the same order of magnitude. I'd probably be careful about sprinkling defer-recovers at func/block levels within a single goroutine; if the defers kick in relatively often, I wouldn't be surprised if there is some noticeable overhead. It all depends on how fast your code is, though - typically the overhead of defer/recover will be in the order of microseconds, so if you're doing any I/O, it probably wouldn't make a difference.

TL;DR you should measure, but I would certainly be careful with sprinkled recovers :)
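Following the "you should measure" advice, here's a quick sketch using the standard library's testing.Benchmark; the workload is a stand-in for per-node work such as decoding, not graphsync code.

```go
package main

import (
	"fmt"
	"testing"
)

// work is a trivial stand-in for per-node work such as decoding
func work() int {
	s := 0
	for i := 0; i < 100; i++ {
		s += i
	}
	return s
}

func main() {
	// baseline: the work alone
	bare := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			_ = work()
		}
	})
	// the same work wrapped in a closure with a defer-recover,
	// as a per-node recovery would be
	wrapped := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			func() {
				defer func() { _ = recover() }()
				_ = work()
			}()
		}
	})
	fmt.Printf("bare:    %d ns/op\nwrapped: %d ns/op\n", bare.NsPerOp(), wrapped.NsPerOp())
}
```

The absolute numbers are machine-dependent; the point is only to compare the two columns before deciding whether per-node recovery is affordable.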

@mvdan (Contributor) commented Apr 1, 2022

I should also add that the overhead of a single defer should be practically negligible these days; see how golang/go#14939 was fixed. But that's about defers in general, not about recover.

@hannahhoward (Collaborator, Author)

honestly, my gut is to push panics up. The odds are high that they don't happen often. When they do, what we care about is graceful recovery without a crash, not making sure we lose as little progress/data as possible. I would honestly put:

  1. One recovery in the traversal thread, as this PR does.
  2. One recovery in the message queue internal thread (for EncodeMetadata & the message serialization functions)
  3. A handleNewStream recovery (for DecodeMetadata and message deserialization functions)

I agree having EncodeNode in validateRequest is not ideal and I would prefer to remove it.

@hannahhoward (Collaborator, Author)

To revise my suggestion above, I think for #2, the whole message queue thread is too large a surface area and we should use your suggestion of putting it in messageHandlerSelector.

So:

  1. Recovery in traversal thread
  2. Recovery in handleNewStream
  3. Recovery in messageHandlerSelector

@rvagg rvagg force-pushed the feat/capture-panics-in-selectors branch from 1f7440c to 28225d4 Compare April 7, 2022 01:54
@rvagg (Member) commented Apr 7, 2022

Updated. As per that last comment, we now have:

  1. Recovery in traversal thread
  2. Recovery in handleNewStream
  3. Recovery in messageHandlerSelector

I've lifted the recovery out of the weeds in ipldutil/* encode/decode/wrap/unwrap and it's all handled in network/* at the top level.

Implications:

  1. For receiving streams: if a peer sends a message that causes a panic, the stream is going to be terminated
  2. For sending messages, if a panic happens then it's handled on a per-message basis, not for the whole stream
  3. There's one call to ipldutil.EncodeNode(selectorSpec), in RequestManager.validateRequest(), that's not covered by any of this because it sits outside the direct messaging stack. I don't believe this should be a problem: we don't have active concerns about encoding normal nodes, and selectors are a pretty vanilla case. What I've done here, however, is move the selector.ParseSelector() call above the ipldutil.EncodeNode() call at that position, because ParseSelector() does a manual validation of the node structure and is pretty comprehensive in walking it.
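The stream-level behaviour in point 1 can be sketched as follows: one deferred recover wraps the whole message loop, so a panic from decoding any one message errors out (and would reset) the whole stream rather than the process. The function names and message type here are illustrative, not the actual network package.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// decode simulates a codec that can panic on malformed input,
// standing in for message deserialization deep in the stack.
func decode(m string) {
	if m == "malformed" {
		panic("codec: unexpected byte")
	}
}

// processStream wraps an entire stream's message loop in a single
// deferred recover; the optional callback lets calling code observe
// the panic before it is converted to a normal error.
func processStream(msgs []string, panicCallback func(obj interface{}, stack string)) (err error) {
	defer func() {
		if obj := recover(); obj != nil {
			if panicCallback != nil {
				panicCallback(obj, string(debug.Stack()))
			}
			err = fmt.Errorf("stream handler recovered panic: %v", obj)
		}
	}()
	for _, m := range msgs {
		decode(m)
	}
	return nil
}

func main() {
	fmt.Println(processStream([]string{"ok", "malformed"}, nil))
	fmt.Println(processStream([]string{"ok", "ok"}, nil))
}
```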

@rvagg rvagg marked this pull request as draft April 7, 2022 02:02
@rvagg (Member) commented Apr 7, 2022

ooof, I've noticed some hangovers from previous bindnode panic recovery which still need to be removed; so I've switched it to draft, no need to review until I switch it back (hopefully later today)

@rvagg rvagg force-pushed the feat/capture-panics-in-selectors branch from 28225d4 to 0e7ddc2 Compare April 7, 2022 02:43
Comment on lines +220 to +222:

```go
log.Debugf("graphsync net handleNewStream recovered error from %s error: %s", s.Conn().RemotePeer(), rerr)
_ = s.Reset()
go gsnet.receiver.ReceiveError(p, rerr)
```
Member review comment:
@hannahhoward I wouldn't mind a sanity-check that this is appropriate behaviour on a panic, it's roughly the same as for a standard error from the message handler.

@rvagg (Member), Apr 7, 2022:

.. there's potential for p to be unset at this call, since it's only set after the FromMsgReader call below; an alternative might be to move the p = assignment up one line to have a better chance of it being set (it could also be unset because of a panic somewhere else, like in NewVarintReaderSize()).

I moved the p = line up one, there's much less potential for it to be unset unless there's a panic elsewhere in the system.

@rvagg rvagg force-pushed the feat/capture-panics-in-selectors branch from 0e7ddc2 to 644df25 Compare April 7, 2022 02:51
@rvagg rvagg force-pushed the feat/capture-panics-in-selectors branch from 644df25 to 2dc0292 Compare April 7, 2022 02:52
@rvagg rvagg marked this pull request as ready for review April 7, 2022 02:53
@willscott (Collaborator) left a comment:

this level of handling looks reasonable to me

@rvagg rvagg merged commit 49f40a5 into main Apr 19, 2022
@rvagg rvagg deleted the feat/capture-panics-in-selectors branch April 19, 2022 05:19
marten-seemann pushed a commit that referenced this pull request Mar 2, 2023
This PR logs at debug level a few messages that seem only useful for debugging.  These messages otherwise occur in the logs frequently without much value.