Introduce structured, span-based observability through Logger interface #3766

wvanlint · 2025-05-02T21:26:19Z

This change allows users to create hierarchical span objects through the Logger interface for specific computations, such as the handling of HTLCs. These span objects will be held in LDK across the corresponding lifetimes before being dropped, providing insight into durations and latencies.

This API is designed to be compatible with https://docs.rs/opentelemetry/latest/opentelemetry/trace/trait.Tracer.html, but can also be used more directly to derive time-based metrics or to log durations.

Hierarchical RAII spans are currently added for:

HTLC lifetimes, HTLC state transitions, inbound to outbound HTLC forwarding (see functional_tests.rs).
Ping/pong request-response pairs.
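
To make the RAII idea concrete, here is a minimal sketch of a span that ends when it is dropped. It is not the API from this PR: `SpanLogger`, `RecordingLogger`, `TimedSpan`, `Finished` and `trace_htlc` are hypothetical names chosen for illustration, and the span names are plain strings rather than LDK types.

```rust
use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the span-creating part of a Logger interface.
pub trait SpanLogger {
    /// User-defined span object; the span ends when this value is dropped.
    type UserSpan;
    fn start(&self, name: &'static str, parent: Option<&Self::UserSpan>) -> Self::UserSpan;
}

/// A finished span: its name, its parent's name (if any), and how long it lived.
#[derive(Debug, Clone)]
pub struct Finished {
    pub name: &'static str,
    pub parent: Option<&'static str>,
    pub elapsed: Duration,
}

/// Example logger that collects finished spans in memory.
#[derive(Default, Clone)]
pub struct RecordingLogger {
    finished: Arc<Mutex<Vec<Finished>>>,
}

impl RecordingLogger {
    /// Returns a copy of all spans that have ended so far, in end order.
    pub fn finished(&self) -> Vec<Finished> {
        self.finished.lock().unwrap().clone()
    }
}

/// RAII span: records its duration into the logger's sink on drop.
pub struct TimedSpan {
    name: &'static str,
    parent: Option<&'static str>,
    started: Instant,
    sink: Arc<Mutex<Vec<Finished>>>,
}

impl Drop for TimedSpan {
    fn drop(&mut self) {
        self.sink.lock().unwrap().push(Finished {
            name: self.name,
            parent: self.parent,
            elapsed: self.started.elapsed(),
        });
    }
}

impl SpanLogger for RecordingLogger {
    type UserSpan = TimedSpan;

    fn start(&self, name: &'static str, parent: Option<&TimedSpan>) -> TimedSpan {
        TimedSpan {
            name,
            parent: parent.map(|p| p.name),
            started: Instant::now(),
            sink: Arc::clone(&self.finished),
        }
    }
}

/// Simulates holding a span for an HTLC's lifetime with a nested child span.
pub fn trace_htlc<L: SpanLogger>(logger: &L) {
    let htlc = logger.start("htlc", None);
    {
        let _transition = logger.start("htlc_state_transition", Some(&htlc));
        // Work for the state transition would happen here.
    } // `_transition` is dropped here, which ends the child span.
    drop(htlc); // Dropping the parent ends the HTLC span.
}
```

Because the span object is user-defined, an implementation could wrap an OpenTelemetry span instead of the in-memory recorder assumed here; the calling code only has to hold the value and drop it.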

ldk-reviews-bot · 2025-05-02T21:26:21Z

👋 I see @valentinewallace was un-assigned.
If you'd like another reviewer assignment, please click here.

codecov · 2025-05-02T22:27:40Z

Codecov Report

Attention: Patch coverage is 84.86395% with 89 lines in your changes missing coverage. Please review.

Project coverage is 89.38%. Comparing base (89f5217) to head (bd59e5e).

Files with missing lines	Patch %	Lines
lightning/src/ln/channel.rs	84.84%	32 Missing and 28 partials ⚠️
lightning/src/ln/channelmanager.rs	76.36%	13 Missing ⚠️
lightning/src/util/logger.rs	73.91%	6 Missing ⚠️
lightning/src/chain/channelmonitor.rs	0.00%	3 Missing ⚠️
lightning/src/ln/invoice_utils.rs	0.00%	3 Missing ⚠️
lightning-dns-resolver/src/lib.rs	0.00%	1 Missing ⚠️
lightning-net-tokio/src/lib.rs	0.00%	1 Missing ⚠️
lightning/src/ln/onion_payment.rs	83.33%	0 Missing and 1 partial ⚠️
lightning/src/util/test_utils.rs	96.00%	1 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff            @@
##             main    #3766    +/-   ##
========================================
  Coverage   89.37%   89.38%            
========================================
  Files         157      157            
  Lines      124095   124443   +348     
  Branches   124095   124443   +348     
========================================
+ Hits       110915   111228   +313     
- Misses      10469    10495    +26     
- Partials     2711     2720     +9


TheBlueMatt

Oops, okay so there was def a miscommunication when we spoke about switching to an RAII span concept 😅. At a high level I wasn't anticipating we'd give spans for entire HTLC forwards - I was assuming we'd only do spans for basically function runtime in LDK - "how long did it take us to process events", "how long did it take us to process the pending forwards", etc. From there, we might give spans for things like monitor updates (including the parent idea, which is cool), so you could answer questions like "how much of the time spent processing pending forwards is spent in monitor updates".

By focusing on answering questions about specific function runtime the "parent" thing is super well-defined. But once we switch to non-functional spans, it becomes a bit murky - a monitor update might block several payment forwards, so should it have all of them as "parents"?

Of course, using only functional blocks as spans makes it harder to answer the types of questions I think you specifically want, e.g. "how long is it taking us to forward most HTLCs", but as it stands it's hard to take that and answer "why is it taking us so long to forward some HTLCs".

We could also split the difference and have functional spans for LDK things, but also non-functional spans for outside-LDK things - i.e. the above spans plus a span for how long each async monitor update took to persist, as well as how long it took to get a response from a peer once we send a message that needs a response. Sadly the latter is a bit annoying to build: commitment_signed -> revoke_and_ack latency is pretty easy to measure, though once we get the RAA we're still waiting on another commitment_signed in many cases. There may be a way to do it specific to an HTLC kinda like you have here but just around messages themselves.

WDYT?
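
As a rough illustration of the kind of question function-scoped spans with a single, well-defined parent can answer, the sketch below computes how much of a parent span's time was spent in one kind of child span. `SpanRecord` and `share_of_parent` are hypothetical names, and the span labels used with them are illustrative strings, not identifiers from LDK.

```rust
use std::time::Duration;

/// One finished span, as a consumer of the spans might record it.
/// Each span has at most one parent, as with function-scoped spans.
#[derive(Debug, Clone, Copy)]
pub struct SpanRecord {
    pub name: &'static str,
    pub parent: Option<&'static str>,
    pub elapsed: Duration,
}

/// Fraction of the total time spent in spans named `parent` that was spent
/// in their direct children named `child`. Returns `None` when no time was
/// recorded for `parent`.
pub fn share_of_parent(records: &[SpanRecord], parent: &str, child: &str) -> Option<f64> {
    // Total time across all spans with the parent's name.
    let parent_total: Duration = records
        .iter()
        .filter(|r| r.name == parent)
        .map(|r| r.elapsed)
        .sum();
    if parent_total.is_zero() {
        return None;
    }
    // Total time across child spans that were started under that parent.
    let child_total: Duration = records
        .iter()
        .filter(|r| r.name == child && r.parent.map_or(false, |p| p == parent))
        .map(|r| r.elapsed)
        .sum();
    Some(child_total.as_secs_f64() / parent_total.as_secs_f64())
}
```

With one 100 ms "process_pending_forwards" span and a 40 ms "monitor_update" child under it, the share comes out at roughly 0.4. A monitor update that blocks several forwards does not fit this single-parent record, which is the murkiness described above.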

TheBlueMatt · 2025-05-05T14:33:06Z

lightning/src/ln/channel.rs

+	cltv_expiry: u32,
+	payment_hash: PaymentHash,
+	state_wrapper: InboundHTLCStateWrapper,
+	span: BoxedSpan,


Oh man it's a bit weird to start storing user objects in all our structs. I was kinda anticipating an RAII struct of our own that calls an end method on the Logger, though thinking about it now it's definitely obvious that letting the user define the struct is easier for them (I guess otherwise they'd have to keep a map of pending spans and drop the opentelemetry::Tracer by hand?).

Yeah exactly, this was to avoid reconciliation on the user side.

wvanlint · 2025-05-05T22:08:52Z

Haha no worries, I imagined it wasn't the same but wanted to write this up as a starting point.

I agree on the difference between function scope spans and higher-level state machine spans and how they answer different questions. Given the amount of logic that depends on messages, timers, and asynchronous persistence completions, I think we would need both complementing each other to have a complete understanding.

Function scope spans are clear and easy to optimize for, but don't cross certain boundaries and would not give a complete picture. User trait implementations are already easy to measure but it's difficult to decide when it's impactful to improve them.
Higher-level spans around the Lightning state machine would give an end-to-end view for Lightning payments and can identify rough areas of investigation.

If we're building higher-level state machine spans on top of messages outside of LDK as in https://discord.com/channels/915026692102316113/978829624635195422/1296877062015029258, would it lead to rebuilding/duplicating something similar to the Lightning state machine to get the full picture? It seems like channel.rs already has all the necessary information.

I'm also curious what we should optimize for. I'm assuming we would like to have maximal observability, while minimizing overhead. Is maintaining span objects in the Lightning state machine too invasive? Would there be a more minimal event API that can be added to channel.rs?

Having start(span: Span, parent: Option<&Span>) -> () and end(span: Span) -> () methods without the RAII concept could be less invasive, possibly with some helpers. We could also still verify in TestLogger whether all spans are ended correctly during tests.
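
A minimal sketch of that non-RAII alternative, under stated assumptions: the `Span` enum, the `SpanEvents` trait and this `TestLogger` are hypothetical stand-ins (the real `Span` type is not defined in this thread), and the logger keeps the map of pending spans on the user side so a test can check that every started span was ended.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Hypothetical span identifier used only for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Span {
    Htlc { htlc_id: u64 },
    HtlcStateTransition { htlc_id: u64 },
}

/// Explicit start/end events instead of an RAII object held by the caller.
pub trait SpanEvents {
    fn start(&self, span: Span, parent: Option<&Span>);
    fn end(&self, span: Span);
}

/// Test logger that tracks open spans so unbalanced start/end calls show up.
#[derive(Default)]
pub struct TestLogger {
    open: Mutex<HashMap<Span, (Option<Span>, Instant)>>,
    closed: Mutex<Vec<(Span, Duration)>>,
}

impl SpanEvents for TestLogger {
    fn start(&self, span: Span, parent: Option<&Span>) {
        let previous = self
            .open
            .lock()
            .unwrap()
            .insert(span, (parent.copied(), Instant::now()));
        assert!(previous.is_none(), "span {:?} started twice", span);
    }

    fn end(&self, span: Span) {
        let (_parent, started) = self
            .open
            .lock()
            .unwrap()
            .remove(&span)
            .expect("ended a span that was never started");
        self.closed.lock().unwrap().push((span, started.elapsed()));
    }
}

impl TestLogger {
    /// Spans that were started but not yet ended.
    pub fn unended(&self) -> Vec<Span> {
        self.open.lock().unwrap().keys().copied().collect()
    }

    /// Number of spans that were both started and ended.
    pub fn ended_count(&self) -> usize {
        self.closed.lock().unwrap().len()
    }
}
```

A test could assert that `unended()` is empty at teardown. The trade-off is that the reconciliation the RAII version avoids (the map of pending spans) now lives in every logger implementation.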

wvanlint force-pushed the trace_spans branch 2 times, most recently from d115f88 to 40a6022 Compare May 2, 2025 22:17

wvanlint force-pushed the trace_spans branch 2 times, most recently from 086fbe5 to f7c48f7 Compare May 3, 2025 01:04

wvanlint force-pushed the trace_spans branch from f7c48f7 to bd59e5e Compare May 5, 2025 06:31

wvanlint marked this pull request as ready for review May 5, 2025 06:58

ldk-reviews-bot requested a review from valentinewallace May 5, 2025 06:59

TheBlueMatt reviewed May 5, 2025


TheBlueMatt removed the request for review from valentinewallace May 5, 2025 14:33
