-
Notifications
You must be signed in to change notification settings - Fork 404
Disconnect peers on timer ticks to unblock channel state machine #2293
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Disconnect peers on timer ticks to unblock channel state machine #2293
Conversation
Codecov ReportPatch coverage:
❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more. Additional details and impacted files@@ Coverage Diff @@
## main #2293 +/- ##
==========================================
- Coverage 90.79% 90.70% -0.09%
==========================================
Files 104 104
Lines 53033 53162 +129
Branches 53033 53162 +129
==========================================
+ Hits 48153 48223 +70
- Misses 4880 4939 +59
☔ View full report in Codecov by Sentry. |
15c311e
to
6353683
Compare
Grr, sorry for the delay in responding here. So I'm not sure what to do about expected-CS. If we have an inbound HTLC that we remove, send a CS, receive an RAA and then don't receive the expected CS so we can RAA we'll (a) no longer have any state about this HTLC in the channel - its been removed from our state as we never need to care about it in the off-chain state side of things and (b) still force-close the channel if the HTLC times out, as the ChannelMonitor sees that if we broadcast we'll still have the HTLC which we need to time-out. At least the two state-machine-deadlocks I saw with lnd peers we were waiting on an RAA, and because we're not blocked on an RAA we can make progress (in the form of sending our own CS and then presumably if they're still hung we'll decide to d/c because we're awaiting-RAA), but its kinda awkward there's still a case there where we should d/c but won't. For now its probably fine, and its better that we do d/c if we have to than nothing, at least because actually fixing this would require state machine tweaks that I don't really want to have to think hard about. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
6353683
to
105e9ac
Compare
105e9ac
to
fe3a962
Compare
MSRV build is sad. |
fe3a962
to
793c2cf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks good I think, mostly nits.
let alice_init = msgs::Init { features: nodes[0].node.init_features(), remote_network_address: None }; | ||
nodes[1].node.peer_connected(&nodes[0].node.get_our_node_id(), &alice_init, true).unwrap(); | ||
|
||
// Upon reconnection, Alice sends her `ChannelReestablish` to Bob. Alice, however, hasn't |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add another (or just extend the test) to include a channel_ready
message? That should emulate the exact behavior we've seen out of lnd.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't really matter? They can send channel_ready
if they wish, the issue is not receiving channel_reestablish
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't really matter? They can send channel_ready
if they wish, the issue is not receiving channel_reestablish
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair, it doesn't matter much I just wanted to match exactly what we've seen in the wild cause I can't readily repro with any peers so its nice to get an exact test to make sure we're good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks good I think, mostly nits.
d767d1d
to
09c051f
Compare
Ugh needs rebase, LGTM, though. |
The inner structs of each enum variant already implemented them and we plan to pass in `Message`s to `enqueue_message` in a future commit.
`enqueue_message` simply adds the message to the outbound queue, it still needs to be written to the socket with `do_attempt_write_data`. However, since we immediately return an error causing the socket to be closed, the message never actually gets sent.
At times, we've noticed that channels with `lnd` counterparties do not receive messages we expect to in a timely manner (or at all) after sending them a `ChannelReestablish` upon reconnection, or a `CommitmentSigned` message. This can block the channel state machine from making progress, eventually leading to force closes, if any pending HTLCs are committed and their expiration is met. It seems common wisdom for `lnd` node operators to periodically restart their node/reconnect to their peers, allowing them to start from a fresh state such that the message we expect to receive hopefully gets sent. We can achieve the same end result by disconnecting peers ourselves (regardless of whether they're a `lnd` node), which we opt to implement here by awaiting their response within two timer ticks.
09c051f
to
5bf7fac
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Nothing else from my side.
At times, we've noticed that channels with
lnd
counterparties do not receive messages we expect to in a timely manner (or at all) after sending them aChannelReestablish
upon reconnection, or aCommitmentSigned
message. This can block the channel state machine from making progress, eventually leading to force closes, if any pending HTLCs are committed and their expiration is met.It seems common wisdom for
lnd
node operators to periodically restart their node/reconnect to their peers, allowing them to start from a fresh state such that the message we expect to receive hopefully gets sent. We can achieve the same end result by disconnecting peers ourselves (regardless of whether they're alnd
node), which we opt to implement here by awaiting their response within two timer ticks.Fixes #2282.