Description
The objective is to define a minimal set of metrics that the swarm should expose. There are a lot of interesting things we could measure (and we should in the long term, see #1356), but we need to start somewhere.
I propose to define the minimal set as those metrics that allow us to measure the effect of our smarter dialing logic (#1785).
Suggested metrics:
- conns opened (grouped by: direction, transport / security / muxer)
- conns closed (grouped by: direction, transport / security / muxer)
- conn duration histogram: duration between handshake completion and closing (grouped by: transport / security / muxer)
- dial duration histogram
- dial errors (grouped by: error type (timeout, cancellation, other), transport / security / muxer)
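To make the list above concrete, here is a minimal, stdlib-only sketch of how these metrics could be held in memory. All names (`ConnLabels`, `Histogram`, `SwarmMetrics`, the bucket bounds) are hypothetical and not taken from the swarm code; a real implementation would more likely use a metrics library. For brevity the duration histograms are not split by transport / security / muxer, and dial errors are left out.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// ConnLabels holds the grouping dimensions suggested for the conn counters.
// The field values used below ("tcp", "noise", ...) are only examples.
type ConnLabels struct {
	Direction string // "inbound" or "outbound"
	Transport string
	Security  string
	Muxer     string
}

// Histogram counts observations (in seconds) into buckets with fixed,
// ascending upper bounds. The last slot of Counts is the overflow bucket.
type Histogram struct {
	Bounds []float64
	Counts []uint64
}

func NewHistogram(bounds []float64) *Histogram {
	return &Histogram{Bounds: bounds, Counts: make([]uint64, len(bounds)+1)}
}

// Observe increments the first bucket whose upper bound is >= seconds.
func (h *Histogram) Observe(seconds float64) {
	for i, b := range h.Bounds {
		if seconds <= b {
			h.Counts[i]++
			return
		}
	}
	h.Counts[len(h.Bounds)]++
}

// SwarmMetrics is a hypothetical container for the minimal metric set.
type SwarmMetrics struct {
	mu           sync.Mutex
	ConnsOpened  map[ConnLabels]uint64
	ConnsClosed  map[ConnLabels]uint64
	ConnDuration *Histogram // handshake completion until close
	DialDuration *Histogram
}

func NewSwarmMetrics() *SwarmMetrics {
	bounds := []float64{0.1, 1, 10, 60} // seconds; arbitrary example buckets
	return &SwarmMetrics{
		ConnsOpened:  make(map[ConnLabels]uint64),
		ConnsClosed:  make(map[ConnLabels]uint64),
		ConnDuration: NewHistogram(bounds),
		DialDuration: NewHistogram(bounds),
	}
}

// ConnOpened records a connection whose handshake has completed.
func (m *SwarmMetrics) ConnOpened(l ConnLabels) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.ConnsOpened[l]++
}

// ConnClosed records a closed connection and how long it was open.
func (m *SwarmMetrics) ConnClosed(l ConnLabels, lifetime time.Duration) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.ConnsClosed[l]++
	m.ConnDuration.Observe(lifetime.Seconds())
}

func main() {
	m := NewSwarmMetrics()
	l := ConnLabels{Direction: "inbound", Transport: "tcp", Security: "noise", Muxer: "yamux"}
	m.ConnOpened(l)
	m.ConnClosed(l, 50*time.Millisecond)
	fmt.Println(m.ConnsOpened[l], m.ConnsClosed[l], m.ConnDuration.Counts)
}
```

Keying the counters by a comparable struct keeps each label combination as a separate series, which is what the "grouped by" annotations in the list ask for.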
When implementing smarter dialing, we expect:
- dial errors to decrease dramatically (specifically cancellations)
- dial duration to increase only slightly (otherwise the dialing logic wouldn't be very smart)
- the number of very short-lived incoming connections to decrease (as other nodes upgrade)
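Observing the expected drop in cancellations requires sorting each dial error into one of the three error types (timeout, cancellation, other). The following is a rough sketch of one way to do that, assuming dial errors wrap the standard `context` errors; `classifyDialError` and `sampleErrs` are hypothetical names, and the sample errors are made up for demonstration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// classifyDialError maps a dial error onto the three suggested error types.
// It assumes cancellations and timeouts surface as (wrapped) context errors,
// which may not hold for every transport.
func classifyDialError(err error) string {
	switch {
	case err == nil:
		return ""
	case errors.Is(err, context.Canceled):
		return "cancellation"
	case errors.Is(err, context.DeadlineExceeded):
		return "timeout"
	default:
		return "other"
	}
}

// sampleErrs are made-up dial failures used only for demonstration.
var sampleErrs = []error{
	fmt.Errorf("dial tcp: %w", context.Canceled),
	fmt.Errorf("dial quic: %w", context.DeadlineExceeded),
	errors.New("connection refused"),
}

func main() {
	// Count errors per type, as a dial error counter would.
	counts := make(map[string]int)
	for _, err := range sampleErrs {
		counts[classifyDialError(err)]++
	}
	fmt.Println(counts)
}
```

Using `errors.Is` rather than direct comparison means errors that wrap a context error with extra detail are still counted under the right type.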