Disconnect peers on timer ticks to unblock channel state machine #2293

wpaulino · 2023-05-13T19:03:17Z

At times, we've noticed that channels with lnd counterparties do not receive messages we expect to in a timely manner (or at all) after sending them a ChannelReestablish upon reconnection, or a CommitmentSigned message. This can block the channel state machine from making progress, eventually leading to force closes, if any pending HTLCs are committed and their expiration is met.

It seems common wisdom for lnd node operators to periodically restart their node/reconnect to their peers, allowing them to start from a fresh state such that the message we expect to receive hopefully gets sent. We can achieve the same end result by disconnecting peers ourselves (regardless of whether they're a lnd node), which we opt to implement here by awaiting their response within two timer ticks.

Fixes #2282.

codecov-commenter · 2023-05-13T20:12:05Z

Codecov Report

Patch coverage: 57.66% and project coverage change: -0.09 ⚠️

Comparison is base (4dce209) 90.79% compared to head (09c051f) 90.70%.

❗ Current head 09c051f differs from pull request most recent head 5bf7fac. Consider uploading reports for the commit 5bf7fac to get more accurate results

❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2293      +/-   ##
==========================================
- Coverage   90.79%   90.70%   -0.09%     
==========================================
  Files         104      104              
  Lines       53033    53162     +129     
  Branches    53033    53162     +129     
==========================================
+ Hits        48153    48223      +70     
- Misses       4880     4939      +59

Impacted Files	Coverage Δ
lightning/src/ln/msgs.rs	`85.07% <0.00%> (-0.06%)`	⬇️
lightning/src/ln/wire.rs	`49.09% <2.17%> (-7.76%)`	⬇️
lightning/src/ln/peer_handler.rs	`58.85% <23.07%> (-0.52%)`	⬇️
lightning/src/ln/channel.rs	`89.78% <86.66%> (-0.02%)`	⬇️
lightning/src/ln/functional_tests.rs	`98.25% <98.48%> (+<0.01%)`	⬆️
lightning/src/ln/channelmanager.rs	`87.13% <100.00%> (+0.02%)`	⬆️

... and 6 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

TheBlueMatt · 2023-05-14T00:18:34Z

Grr, sorry for the delay in responding here. So I'm not sure what to do about expected-CS.

If we have an inbound HTLC that we remove, send a CS, receive an RAA and then don't receive the expected CS so we can RAA we'll (a) no longer have any state about this HTLC in the channel - its been removed from our state as we never need to care about it in the off-chain state side of things and (b) still force-close the channel if the HTLC times out, as the ChannelMonitor sees that if we broadcast we'll still have the HTLC which we need to time-out.

At least the two state-machine-deadlocks I saw with lnd peers we were waiting on an RAA, and because we're not blocked on an RAA we can make progress (in the form of sending our own CS and then presumably if they're still hung we'll decide to d/c because we're awaiting-RAA), but its kinda awkward there's still a case there where we should d/c but won't.

For now its probably fine, and its better that we do d/c if we have to than nothing, at least because actually fixing this would require state machine tweaks that I don't really want to have to think hard about.

TheBlueMatt

Thanks!

lightning/src/ln/channel.rs

lightning/src/ln/channelmanager.rs

lightning/src/ln/channel.rs

lightning/src/ln/functional_tests.rs

TheBlueMatt · 2023-05-20T21:36:49Z

MSRV build is sad.

TheBlueMatt

Code looks good I think, mostly nits.

lightning/src/ln/channel.rs

lightning/src/ln/peer_handler.rs

lightning/src/ln/channel.rs

lightning/src/ln/channelmanager.rs

TheBlueMatt · 2023-05-21T18:32:05Z

lightning/src/ln/functional_tests.rs

+	let alice_init = msgs::Init { features: nodes[0].node.init_features(), remote_network_address: None };
+	nodes[1].node.peer_connected(&nodes[0].node.get_our_node_id(), &alice_init, true).unwrap();
+
+	// Upon reconnection, Alice sends her `ChannelReestablish` to Bob. Alice, however, hasn't


Can you add another (or just extend the test) to include a channel_ready message? That should emulate the exact behavior we've seen out of lnd.

It doesn't really matter? They can send channel_ready if they wish, the issue is not receiving channel_reestablish.

Fair, it doesn't matter much I just wanted to match exactly what we've seen in the wild cause I can't readily repro with any peers so its nice to get an exact test to make sure we're good.

TheBlueMatt

Code looks good I think, mostly nits.

TheBlueMatt · 2023-05-26T19:20:11Z

Ugh needs rebase, LGTM, though.

The inner structs of each enum variant already implemented them and we plan to pass in `Message`s to `enqueue_message` in a future commit.

`enqueue_message` simply adds the message to the outbound queue, it still needs to be written to the socket with `do_attempt_write_data`. However, since we immediately return an error causing the socket to be closed, the message never actually gets sent.

At times, we've noticed that channels with `lnd` counterparties do not receive messages we expect to in a timely manner (or at all) after sending them a `ChannelReestablish` upon reconnection, or a `CommitmentSigned` message. This can block the channel state machine from making progress, eventually leading to force closes, if any pending HTLCs are committed and their expiration is met. It seems common wisdom for `lnd` node operators to periodically restart their node/reconnect to their peers, allowing them to start from a fresh state such that the message we expect to receive hopefully gets sent. We can achieve the same end result by disconnecting peers ourselves (regardless of whether they're a `lnd` node), which we opt to implement here by awaiting their response within two timer ticks.

dunxen

Looks good. Nothing else from my side.

wpaulino added this to the 0.0.116 milestone May 13, 2023

wpaulino force-pushed the disconnect-peers-timer-tick branch from 15c311e to 6353683 Compare May 13, 2023 21:59

TheBlueMatt reviewed May 14, 2023

View reviewed changes

lightning/src/ln/channel.rs Outdated Show resolved Hide resolved

lightning/src/ln/channelmanager.rs Outdated Show resolved Hide resolved

lightning/src/ln/channel.rs Outdated Show resolved Hide resolved

wpaulino force-pushed the disconnect-peers-timer-tick branch from 6353683 to 105e9ac Compare May 15, 2023 18:27

wpaulino added the Seeking Code Review label May 16, 2023

dunxen reviewed May 17, 2023

View reviewed changes

lightning/src/ln/channel.rs Show resolved Hide resolved

lightning/src/ln/functional_tests.rs Show resolved Hide resolved

wpaulino force-pushed the disconnect-peers-timer-tick branch from 105e9ac to fe3a962 Compare May 18, 2023 19:15

wpaulino force-pushed the disconnect-peers-timer-tick branch from fe3a962 to 793c2cf Compare May 21, 2023 00:28

wpaulino removed the Seeking Code Review label May 21, 2023

TheBlueMatt reviewed May 21, 2023

View reviewed changes

wpaulino force-pushed the disconnect-peers-timer-tick branch 2 times, most recently from d767d1d to 09c051f Compare May 24, 2023 17:33

wpaulino added 4 commits May 26, 2023 14:40

Implement enqueue_message trait requirements on Message enum

55a7a03

The inner structs of each enum variant already implemented them and we plan to pass in `Message`s to `enqueue_message` in a future commit.

Add new DisconnectPeerWithWarning variant to ErrorAction

5ba2f80

wpaulino force-pushed the disconnect-peers-timer-tick branch from 09c051f to 5bf7fac Compare May 26, 2023 21:41

TheBlueMatt approved these changes May 29, 2023

View reviewed changes

wpaulino requested a review from dunxen May 30, 2023 17:55

dunxen approved these changes May 30, 2023

View reviewed changes

TheBlueMatt merged commit eec5ec6 into lightningdevkit:main May 30, 2023

wpaulino deleted the disconnect-peers-timer-tick branch May 30, 2023 19:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Disconnect peers on timer ticks to unblock channel state machine #2293

Disconnect peers on timer ticks to unblock channel state machine #2293

wpaulino commented May 13, 2023

codecov-commenter commented May 13, 2023 •

edited

Loading

TheBlueMatt commented May 14, 2023

TheBlueMatt left a comment

TheBlueMatt commented May 20, 2023

TheBlueMatt left a comment

TheBlueMatt May 21, 2023

wpaulino May 22, 2023

wpaulino May 22, 2023

TheBlueMatt May 24, 2023

TheBlueMatt left a comment

TheBlueMatt commented May 26, 2023

dunxen left a comment

Disconnect peers on timer ticks to unblock channel state machine #2293

Disconnect peers on timer ticks to unblock channel state machine #2293

Conversation

wpaulino commented May 13, 2023

codecov-commenter commented May 13, 2023 • edited Loading

Codecov Report

TheBlueMatt commented May 14, 2023

TheBlueMatt left a comment

Choose a reason for hiding this comment

TheBlueMatt commented May 20, 2023

TheBlueMatt left a comment

Choose a reason for hiding this comment

TheBlueMatt May 21, 2023

Choose a reason for hiding this comment

wpaulino May 22, 2023

Choose a reason for hiding this comment

wpaulino May 22, 2023

Choose a reason for hiding this comment

TheBlueMatt May 24, 2023

Choose a reason for hiding this comment

TheBlueMatt left a comment

Choose a reason for hiding this comment

TheBlueMatt commented May 26, 2023

dunxen left a comment

Choose a reason for hiding this comment

codecov-commenter commented May 13, 2023 •

edited

Loading