Capture panics from selector execution #334
Conversation
impl/graphsync.go (Outdated)

```diff
@@ -232,7 +244,7 @@ func New(parent context.Context, network gsnet.GraphSyncNetwork,
 	asyncLoader := asyncloader.New(ctx, linkSystem)
 	requestQueue := taskqueue.NewTaskQueue(ctx)
-	requestManager := requestmanager.New(ctx, asyncLoader, linkSystem, outgoingRequestHooks, incomingResponseHooks, networkErrorListeners, outgoingRequestProcessingListeners, requestQueue, network.ConnectionManager(), gsConfig.maxLinksPerOutgoingRequest)
+	requestManager := requestmanager.New(ctx, asyncLoader, linkSystem, outgoingRequestHooks, incomingResponseHooks, networkErrorListeners, outgoingRequestProcessingListeners, requestQueue, network.ConnectionManager(), gsConfig.maxLinksPerOutgoingRequest, gsConfig.panicCallback)
```
this constructor is.. unwieldy
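One common way to tame a constructor like this, sketched below as a hypothetical alternative rather than anything this PR adopts (all names here are invented), is to bundle the arguments into a params struct so each new dependency becomes a named field instead of another positional parameter:

```go
package requestmanager

import "context"

// Params is a hypothetical bundling of the constructor's many arguments;
// none of these names are taken from the actual PR.
type Params struct {
	Ctx                context.Context
	MaxLinksPerRequest uint64
	// PanicCallback receives the recovered object and a stack trace,
	// mirroring the CallBackFn shape discussed below.
	PanicCallback func(recoverObj interface{}, debugStackTrace string)
	// ...hooks, listeners, the task queue, and the connection manager
	// would get named fields here too...
}

// RequestManager is a stub standing in for the real type.
type RequestManager struct{ params Params }

// NewFromParams constructs a RequestManager from bundled parameters, so
// adding a field later doesn't ripple through every call site.
func NewFromParams(p Params) *RequestManager {
	return &RequestManager{params: p}
}
```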
panics/panics.go (Outdated)

```go
// CallBackFn is a function that will get called with information about the panic
type CallBackFn func(recoverObj interface{}, debugStackTrace string)

// PanicHandler is a function that can be called with defer to recover from panics
// and pass them to a callback; it returns an error if a recovery was needed
type PanicHandler func() error

// MakeHandler makes a handler that recovers from panics and passes them to the given callback
func MakeHandler(cb CallBackFn) PanicHandler {
	return func() error {
		obj := recover()
		if obj == nil {
			return nil
		}
		stack := string(debug.Stack())
		if cb != nil {
			cb(obj, stack)
		}
		return RecoveredPanicErr{
			PanicObj: obj,
```
is it worth the separate package / factory / complexity for calling recover in one place?
* fix panic handler, which requires recover() to be at the deferred callsite rather than indirected
* add a test to simulate a panic in a codec, or storage, or somewhere deep in ipld-prime
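The first fix reflects Go's documented recover semantics: recover only stops an unwinding panic when called directly by the deferred function, not by a function that the deferred function calls. A standalone illustration, not code from this PR:

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// handler calls recover(), but when invoked *from* a deferred function it is
// one call removed from the defer, so recover() returns nil and the panic
// keeps unwinding.
func handler() error {
	if obj := recover(); obj != nil {
		return fmt.Errorf("recovered: %v", obj)
	}
	return nil
}

func brokenCall() (err error) {
	defer func() { err = handler() }() // indirected recover: does NOT catch
	panic("boom")
}

// fixedCall puts recover() directly at the deferred callsite, which is what
// the commit above changes the panic handler to do.
func fixedCall(cb func(obj interface{}, stack string)) (err error) {
	defer func() {
		if obj := recover(); obj != nil {
			if cb != nil {
				cb(obj, string(debug.Stack()))
			}
			err = fmt.Errorf("recovered panic: %v", obj)
		}
	}()
	panic("boom")
}

func main() {
	fmt.Println(fixedCall(nil)) // prints "recovered panic: boom"
	// Calling brokenCall() would crash the program instead of returning an error.
}
```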
Force-pushed from 374b1a3 to acd24bd
Approving changes as a formality because I've taken over this from @hannahhoward, who can't provide a formal review, but it still needs a review or two from @hannahhoward and/or @willscott. So, I've rebased this and fixed it up for the current codebase. I implemented a test case which covers the traversal piece, and in the process discovered that the originally implemented method didn't work! Apparently you have to call recover() directly at the deferred callsite rather than indirecting it through another function. Further to discussion here and elsewhere about whether having a callback is overkill: I think it's a reasonable design choice since we're converting the panics to standard errors, but panics indicate programmer error and you really want to be handling them differently. This almost gives the best of both worlds, because they aren't fatal but you can optionally plug in, inspect for them, and log them appropriately if necessary. I say "almost" because the best outcome would be more in-your-face; fatals are fatal for a very good reason, but we're dealing with sensitive services here, so that's unfortunately not very OK. PTAL.
this is fairly granular in terms of wrapping the panic handler around each operation that has the potential to panic, like decoding a node. Do we have a sense of whether there's a performance impact to doing it at that granularity? Nodes are small and there are a lot of them in one transfer, so if we could have this recovery in a higher-level function that only runs once per transfer, I'd be less worried about potential performance impacts.
Yeah, good questions. I'll put in a bit more work trying to trace these to higher-level places, and there's a chance there's even some overlap here, which I was pondering after I stepped away from the code. But some of the messaging stuff, particularly on the receiving side, needs to be fairly granular to be safe. Any idea about what costs are involved in adding a defer + recover?
@mvdan - do you know what the cost of a `defer` + `recover` is?
Some detailed notes after digging into this; I don't actually expect anyone else to read this, but I need to record it for my own use. The take-away is that I could aggregate the bindnode and dagcbor recoveries up the stack, closer to the network handling. The main question I have is where exactly it's wise to put that recovery. Non-test and benchmark paths to the low-level functions currently recovering:
My intuition with defer-recovers is that you shouldn't need to worry about the cost if you're also spawning a goroutine, as both have an amount of overhead within the same order of magnitude. I'd probably be careful about sprinkling defer-recovers at func/block level within a single goroutine; if the defers kick in relatively often, I wouldn't be surprised if there is some noticeable overhead. It all depends on how fast your code is, though; typically the overhead of defer/recover will be on the order of microseconds, so if you're doing any I/O it probably wouldn't make a difference. TL;DR: you should measure, but I would certainly be careful with sprinkled recovers :)
I should also add that the overhead of a single defer should be practically negligible these days; see how golang/go#14939 was fixed. But that's about defers in general, not about recover.
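A quick way to put numbers on this, per mvdan's "you should measure", is a pair of benchmarks comparing a bare call against the same call wrapped in defer + recover; a sketch only (results vary by Go version and workload):

```go
package bench

import "testing"

// work stands in for a small operation like decoding one node.
func work() int { return 42 }

// BenchmarkBare measures the call with no recovery wrapper.
func BenchmarkBare(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = work()
	}
}

// BenchmarkDeferRecover measures the same call wrapped in a per-call
// defer+recover, the granular pattern discussed above.
func BenchmarkDeferRecover(b *testing.B) {
	for i := 0; i < b.N; i++ {
		func() {
			defer func() { _ = recover() }()
			_ = work()
		}()
	}
}
```

Run with `go test -bench=.` and compare ns/op between the two cases.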
honestly, my gut says to push panics up. The odds are high that they don't happen often. When they do, what we care about is graceful recovery without a crash, not making sure we lose as little progress/data as possible. I would honestly put:
I agree having `EncodeNode` in `validateRequest` is not ideal, and I would prefer to remove it.
To revise my suggestion above, I think for #2 the whole message queue thread is too large a surface area, and we should use your suggestion of putting it in `messageHandlerSelector`. So:
Force-pushed from 1f7440c to 28225d4
Updated. As per that last comment, we now have:
I've lifted the recovery out of the weeds in `ipldutil/*` encode/decode/wrap/unwrap, and it's all handled in `network/*` at the top level. Implications:
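As a rough sketch of the shape this describes, with every name invented for illustration: a single deferred recovery wraps the whole per-stream handling loop, so a panic anywhere in decoding unwinds to one place and is converted to an ordinary error:

```go
package network

import (
	"fmt"
	"runtime/debug"
)

// handleStream sketches "recovery at the top level": one deferred recover
// around the whole read/decode loop instead of one around every decode call.
// decodeNext, onError, and the callback are illustrative stand-ins, not the
// PR's actual API.
func handleStream(decodeNext func() error, onError func(error), cb func(obj interface{}, stack string)) {
	defer func() {
		if obj := recover(); obj != nil {
			if cb != nil {
				cb(obj, string(debug.Stack()))
			}
			onError(fmt.Errorf("recovered panic while handling stream: %v", obj))
		}
	}()
	for {
		// a panic raised anywhere inside decodeNext (codec, storage, deep in
		// ipld-prime) unwinds straight to the defer above
		if err := decodeNext(); err != nil {
			onError(err)
			return
		}
	}
}
```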
oof, I've noticed some hangovers from the previous bindnode panic recovery which still need to be removed; so I've switched this to draft, no need to review until I switch it back (hopefully later today)
Force-pushed from 28225d4 to 0e7ddc2
```go
log.Debugf("graphsync net handleNewStream recovered error from %s error: %s", s.Conn().RemotePeer(), rerr)
_ = s.Reset()
go gsnet.receiver.ReceiveError(p, rerr)
```
@hannahhoward I wouldn't mind a sanity check that this is appropriate behaviour on a panic; it's roughly the same as for a standard error from the message handler.
.. there's potential for `p` to be unset at this call, since it's only set after the `FromMsgReader` call below; an alternative might be to move the `p =` up one line to have a better chance of it being set (it could also be unset because of a panic somewhere else, like in `NewVarintReaderSize()`).
I moved the `p =` line up one; there's now much less potential for it to be unset unless there's a panic elsewhere in the system.
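The underlying principle, shown here with hypothetical names: any state the deferred recovery reports (here, the peer) has to be assigned before the first call that can panic, or the recovery sees the zero value:

```go
package main

import "fmt"

// process shows why moving the assignment up matters: if a panic fires
// before peer is set, the deferred recovery reports an empty peer.
func process(read func() (string, error)) {
	var peer string
	defer func() {
		if obj := recover(); obj != nil {
			fmt.Printf("recovered (peer=%q): %v\n", peer, obj)
		}
	}()
	peer = "QmExamplePeer" // assigned early, before anything that can panic
	if _, err := read(); err != nil {
		panic(err) // simulates a panic deep in the reader
	}
}

func main() {
	process(func() (string, error) { return "", fmt.Errorf("boom") })
}
```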
Force-pushed from 0e7ddc2 to 644df25
Force-pushed from 644df25 to 2dc0292
this level of handling looks reasonable to me
This PR logs at debug level a few messages that seem useful only for debugging; these messages otherwise occur frequently in the logs without adding much value.
Goals
Prevent the whole system from going down because of an error in selector execution
Implementation
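A minimal sketch of how a consumer might plug in the panic callback; the option name `WithPanicCallback` is hypothetical (the merged option may be named differently), and only the callback signature comes from `panics/panics.go` above:

```go
package example

import (
	"context"
	"log"

	graphsync "github.com/ipfs/go-graphsync"
	gsimpl "github.com/ipfs/go-graphsync/impl"
	gsnet "github.com/ipfs/go-graphsync/network"
	ipld "github.com/ipld/go-ipld-prime"
)

// newExchange sketches wiring a panic callback into graphsync. The option
// name WithPanicCallback is HYPOTHETICAL; the callback signature mirrors
// panics.CallBackFn from this PR.
func newExchange(ctx context.Context, net gsnet.GraphSyncNetwork, ls ipld.LinkSystem) graphsync.GraphExchange {
	panicCb := func(recoverObj interface{}, debugStackTrace string) {
		// surface recovered panics loudly so programmer errors aren't lost
		log.Printf("graphsync recovered panic: %v\n%s", recoverObj, debugStackTrace)
	}
	return gsimpl.New(ctx, net, ls, gsimpl.WithPanicCallback(panicCb))
}
```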
For discussion