p2p: make dial faster by streamlined discovery process #31678

Open · wants to merge 16 commits into master from dial-turbo

Conversation

@cskiraly (Contributor) commented Apr 19, 2025

This PR improves the speed of discv4- and discv5-based discovery. It builds on #31592 and is to be rebased after that PR is merged.

Our dial process is rate-limited, but until now discovery was too slow during startup to keep dial supplied with candidates. The bottleneck was therefore discovery, not dial's rate limit, which resulted in a slow ramp-up of the outgoing peer count.

This PR makes discovery fast enough to serve dial's needs. Dial's rate limit still bounds how fast discovery output is consumed, so we are not risking being too aggressive in discovery.

The PR adds:

  • a pre-fetch buffer on discovery sources, eliminating slowdowns due to timeouts and the rate mismatch between the two processes (see the sketch below)
  • multiple parallel lookups, to make sure enough dial candidates are available at all times
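
For illustration, here is a minimal sketch of the pre-fetch idea, assuming a wrapper around enode.Iterator; the names (bufferIter, newBufferIter) and details are illustrative rather than the exact code added by this PR. A background goroutine drains the wrapped discovery iterator into a bounded channel, so dial can keep pulling candidates while the next lookup step is still in flight:

```go
package example

import "github.com/ethereum/go-ethereum/p2p/enode"

// bufferIter is an illustrative pre-fetch wrapper around a discovery
// iterator. A background goroutine fills a bounded channel from the
// source, decoupling the producer (lookups) from the consumer (dial).
type bufferIter struct {
	src    enode.Iterator
	buf    chan *enode.Node
	closed chan struct{}
	cur    *enode.Node
}

func newBufferIter(src enode.Iterator, size int) *bufferIter {
	it := &bufferIter{
		src:    src,
		buf:    make(chan *enode.Node, size),
		closed: make(chan struct{}),
	}
	go func() {
		defer close(it.buf) // unblocks a consumer waiting in Next
		for src.Next() {
			select {
			case it.buf <- src.Node():
			case <-it.closed:
				return
			}
		}
	}()
	return it
}

func (it *bufferIter) Next() bool {
	n, ok := <-it.buf
	it.cur = n
	return ok
}

func (it *bufferIter) Node() *enode.Node { return it.cur }

func (it *bufferIter) Close() {
	close(it.closed)
	it.src.Close()
}
```

With a buffer of a few dozen nodes, lookups keep refilling it in the background at their own pace, and running several lookups in parallel (discoveryParallelLookups) keeps the buffer from running dry when one lookup stalls on timeouts.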

@cskiraly (Contributor, Author) commented:
First hour of dial progress before this PR: number of outgoing peers
[chart]

First hour of dial progress after this PR (and #31592): number of outgoing peers
[chart]

@cskiraly (Contributor, Author) commented:
Discovery traffic before:
[chart]

And after:
[chart]

This PR increases the initial UDP traffic, mostly because of the pre-fetch buffer. We might dial that down a bit, although I don't think the amount of traffic generated (~70 KB/s) is excessive. In the long term, average discovery traffic should be the same.

@cskiraly force-pushed the dial-turbo branch 2 times, most recently from a5b1b9d to 016bf44 on April 24, 2025 20:32
@cskiraly (Contributor, Author) commented:
@fjl I was thinking about the increased UDP traffic. There is the case of a very small network, smaller than maxpeers, for example a small test network with only a few nodes. In that case discovery might run at full steam continuously, looking for new dial candidates, and in this specific case this PR would increase the background UDP traffic by roughly 3x.

If we think this is a problem, we could use discoveryParallelLookups=2 to be more conservative. Or should we introduce an explicit "smallnet" flag and tune parameters accordingly? Or can we just expect people to set maxpeers wisely for a small network? I would say the last option is fine, but it needs some modifications to work. See below.

A short description of what is supposed to happen, in detail. There are two cases:

  • a small net using the large public disc-v4/v5 DHT.
  • a small net using a private disc-v4/v5 DHT.
    Before #31592 (p2p: Filter Discv4 dial candidates based on forkID), these behaved differently, but after adding the forkID filter on discv4, both should behave the same, except for the number of peers discarded during discovery.

In discovery, we only have duplicate filtering within a single lookup, through the seen-cache mechanism. However, the lookupIterator will start new lookups, and these can return the same nodes again. There is no duplicate filtering between sequential lookups, or between the parallel lookups introduced here by discoveryParallelLookups. So discovery can run at full steam, feeding "new" dial candidates that are in fact the same as the old ones.
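
To make that gap concrete: cross-lookup duplicate filtering, which neither the current code nor this PR adds, would amount to something like the following seen-set wrapper. This is purely illustrative and the names are made up:

```go
package example

import "github.com/ethereum/go-ethereum/p2p/enode"

// dedupIter suppresses nodes already returned by earlier or parallel
// lookups. Illustrative only: discovery does not do this across lookups.
type dedupIter struct {
	src  enode.Iterator
	seen map[enode.ID]struct{}
}

func newDedupIter(src enode.Iterator) *dedupIter {
	return &dedupIter{src: src, seen: make(map[enode.ID]struct{})}
}

func (it *dedupIter) Next() bool {
	for it.src.Next() {
		id := it.src.Node().ID()
		if _, ok := it.seen[id]; !ok {
			it.seen[id] = struct{}{}
			return true
		}
	}
	return false
}

func (it *dedupIter) Node() *enode.Node { return it.src.Node() }
func (it *dedupIter) Close()            { it.src.Close() }
```

In practice the seen set would need expiry, since dial legitimately re-dials nodes after dialHistoryExpiration, which is part of why a simple filter at this layer is not an obvious win.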

Dial pulls from this, and has its own duplicate detection through dialHistoryExpiration. When it finds a duplicate within the expiration time, it throws the candidate away and pulls more from discovery. It only throttles down when it starts getting close to maxpeers, but that will not happen if maxpeers is significantly larger than the network size.

If instead maxpeers is set correctly, dial will try to slow down by reducing its slots. But even with a single slot it will pull and discard in a loop, keeping discovery at full steam. Should we introduce some time-based mechanism here, e.g. throttling if checkDial fails too many times in a row?
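
A rough sketch of what such a time-based throttle could look like; none of these names exist in the dialer today and the thresholds are placeholders:

```go
package example

import "time"

// dialThrottle is an illustrative sketch of the time-based mechanism
// discussed above, not an existing component of the dial scheduler.
type dialThrottle struct {
	failStreak int
	maxStreak  int           // e.g. 16 rejected candidates in a row
	backoff    time.Duration // e.g. 10 * time.Second
}

// note records one checkDial outcome. After too many consecutive
// rejections it sleeps, slowing the candidate pull loop and, through
// back-pressure, the discovery iterators feeding it.
func (t *dialThrottle) note(accepted bool) {
	if accepted {
		t.failStreak = 0
		return
	}
	t.failStreak++
	if t.failStreak >= t.maxStreak {
		time.Sleep(t.backoff)
		t.failStreak = 0
	}
}
```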

cskiraly added 16 commits April 28, 2025 17:36
Simple version that does the filtering, but lacks pipelining, waiting for ENRs to be retrieved one by one.

Signed-off-by: Csaba Kiraly <[email protected]>
It is not guaranteed that Next will be called until exhaustion after Close has been called. Hence, we need to override the passed channel with an empty one.

Signed-off-by: Csaba Kiraly <[email protected]>
Signed-off-by: Csaba Kiraly <[email protected]>
Signed-off-by: Csaba Kiraly <[email protected]>

# Conflicts:
#	eth/backend.go
BufferIter wraps an iterator and prefetches up to a given
number of nodes from it.

Signed-off-by: Csaba Kiraly <[email protected]>
When Close was called while Next was active, there was a race on the closed channel. If Close finished before closed was received, this happened:

goroutine 22716 [chan send, 5 minutes]:
github.com/ethereum/go-ethereum/p2p/enode.NewBufferIter.func1()
	github.com/ethereum/go-ethereum/p2p/enode/iter.go:219 +0x77
created by github.com/ethereum/go-ethereum/p2p/enode.NewBufferIter in goroutine 1
	github.com/ethereum/go-ethereum/p2p/enode/iter.go:216 +0xd8

goroutine 22714 [chan receive (nil chan), 1 minutes]:
github.com/ethereum/go-ethereum/p2p/enode.(*BufferIter).Next(0xc00505fea0)
	github.com/ethereum/go-ethereum/p2p/enode/iter.go:226 +0x2d
github.com/ethereum/go-ethereum/p2p/enode.AsyncFilter.func1()
	github.com/ethereum/go-ethereum/p2p/enode/iter.go:151 +0x5f
created by github.com/ethereum/go-ethereum/p2p/enode.AsyncFilter in goroutine 1
	github.com/ethereum/go-ethereum/p2p/enode/iter.go:146 +0x156

Signed-off-by: Csaba Kiraly <[email protected]>
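
For context, the standard Go pattern that avoids both hangs in the trace above is to let every potentially blocking channel operation also select on the close signal, and never to replace the buffer channel with nil. This is a sketch of that pattern, not the actual fix applied in p2p/enode/iter.go:

```go
package example

import "github.com/ethereum/go-ethereum/p2p/enode"

// safeBufferIter sketches close-safe buffering. Two rules prevent the
// hangs seen in the trace: the feeding goroutine never sends into buf
// without also selecting on closed (the "chan send" hang), and Next never
// blocks on a channel that may be abandoned (the "nil chan" hang).
type safeBufferIter struct {
	src    enode.Iterator
	buf    chan *enode.Node
	closed chan struct{}
	cur    *enode.Node
}

func (it *safeBufferIter) feed() {
	defer close(it.buf)
	for it.src.Next() {
		select {
		case it.buf <- it.src.Node():
		case <-it.closed:
			return
		}
	}
}

func (it *safeBufferIter) Next() bool {
	select {
	case n, ok := <-it.buf:
		it.cur = n
		return ok
	case <-it.closed:
		return false
	}
}

func (it *safeBufferIter) Node() *enode.Node { return it.cur }

func (it *safeBufferIter) Close() {
	close(it.closed) // wakes both feed and any blocked Next
	it.src.Close()
}
```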
@fjl self-assigned this on Apr 29, 2025
@cskiraly (Contributor, Author) commented:
Testing on a small network (Sepolia), this is the first hour of operation before this PR:
[chart]
And after this PR:
[chart]
