p2p: make dial faster by streamlined discovery process #31678

Open · wants to merge 16 commits into master from dial-turbo

Conversation

@cskiraly (Contributor) commented Apr 19, 2025

This PR improves the speed of discv4- and discv5-based discovery. It builds on #31592 and is to be rebased after that PR is merged.

Our dial process is rate-limited, but until now discovery was too slow during startup to keep dial supplied with candidates. The bottleneck was therefore discovery, not dial's rate limit, which resulted in a slow ramp-up of the outgoing peer count.

This PR makes discovery fast enough to serve dial's needs. Dial's rate limit still bounds how fast discovery output is consumed, so we are not risking being too aggressive in discovery.

The PR adds:

  • a pre-fetch buffer on discovery sources, eliminating slowdowns due to timeouts and the rate mismatch between the two processes (see the sketch below)
  • multiple parallel lookups, to make sure enough dial candidates are available at all times
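
For illustration, here is a minimal sketch of the pre-fetch idea, assuming a wrapper around enode.Iterator; the names (bufferIter, newBufferIter) and details are illustrative rather than the exact code added by this PR. A background goroutine drains the wrapped discovery iterator into a bounded channel, so dial can keep pulling candidates while the next lookup step is still in flight:

```go
package example

import "github.com/ethereum/go-ethereum/p2p/enode"

// bufferIter is an illustrative pre-fetch wrapper around a discovery
// iterator. A background goroutine fills a bounded channel from the
// source, decoupling the producer (lookups) from the consumer (dial).
type bufferIter struct {
	src    enode.Iterator
	buf    chan *enode.Node
	closed chan struct{}
	cur    *enode.Node
}

func newBufferIter(src enode.Iterator, size int) *bufferIter {
	it := &bufferIter{
		src:    src,
		buf:    make(chan *enode.Node, size),
		closed: make(chan struct{}),
	}
	go func() {
		defer close(it.buf) // unblocks a consumer waiting in Next
		for src.Next() {
			select {
			case it.buf <- src.Node():
			case <-it.closed:
				return
			}
		}
	}()
	return it
}

func (it *bufferIter) Next() bool {
	n, ok := <-it.buf
	it.cur = n
	return ok
}

func (it *bufferIter) Node() *enode.Node { return it.cur }

func (it *bufferIter) Close() {
	close(it.closed)
	it.src.Close()
}
```

With a buffer of a few dozen nodes, lookups keep refilling it in the background at their own pace, and running several lookups in parallel (discoveryParallelLookups) keeps the buffer from running dry when one lookup stalls on timeouts.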

@cskiraly (Contributor, Author) commented:
First hour of dial progress before this PR: number of outgoing peers
[chart]

First hour of dial progress after this PR (and #31592): number of outgoing peers
[chart]

@cskiraly (Contributor, Author) commented:
Discovery traffic before:
[chart]

And after:
[chart]

This PR increases the initial UDP traffic, mostly because of the pre-fetch buffer. We might dial that down a bit, although I don't think the amount of traffic generated (~70 KB/s) is excessive. In the long term, average discovery traffic should be the same.

@cskiraly force-pushed the dial-turbo branch 2 times, most recently from a5b1b9d to 016bf44 on April 24, 2025 20:32
@cskiraly (Contributor, Author) commented:
@fjl I was thinking about the increased UDP traffic. There is the case of a very small network, smaller than maxpeers, for example a small test network with only a few nodes. In that case discovery might run at full steam continuously, looking for new dial candidates, and in this specific case this PR would increase the background UDP traffic by roughly 3x.

If we think this is a problem, we could use discoveryParallelLookups=2 to be more conservative. Or should we introduce an explicit "smallnet" flag and tune parameters accordingly? Or can we just expect people to set maxpeers wisely for a small network? I would say the last option is fine, but it needs some modifications to work. See below.

A short description of what is supposed to happen, in detail. There are two cases:

  • a small net using the large public disc-v4/v5 DHT.
  • a small net using a private disc-v4/v5 DHT.
    Before #31592 (p2p: Filter Discv4 dial candidates based on forkID), these behaved differently, but after adding the forkID filter on discv4, both should behave the same, except for the number of peers discarded during discovery.

In discovery, we only have duplicate filtering within a single lookup, through the seen-cache mechanism. However, the lookupIterator will start new lookups, and these can return the same nodes again. There is no duplicate filtering between sequential lookups, or between the parallel lookups introduced here by discoveryParallelLookups. So discovery can run at full steam, feeding "new" dial candidates that are in fact the same as the old ones.
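
To make that gap concrete: cross-lookup duplicate filtering, which neither the current code nor this PR adds, would amount to something like the following seen-set wrapper. This is purely illustrative and the names are made up:

```go
package example

import "github.com/ethereum/go-ethereum/p2p/enode"

// dedupIter suppresses nodes already returned by earlier or parallel
// lookups. Illustrative only: discovery does not do this across lookups.
type dedupIter struct {
	src  enode.Iterator
	seen map[enode.ID]struct{}
}

func newDedupIter(src enode.Iterator) *dedupIter {
	return &dedupIter{src: src, seen: make(map[enode.ID]struct{})}
}

func (it *dedupIter) Next() bool {
	for it.src.Next() {
		id := it.src.Node().ID()
		if _, ok := it.seen[id]; !ok {
			it.seen[id] = struct{}{}
			return true
		}
	}
	return false
}

func (it *dedupIter) Node() *enode.Node { return it.src.Node() }
func (it *dedupIter) Close()            { it.src.Close() }
```

In practice the seen set would need expiry, since dial legitimately re-dials nodes after dialHistoryExpiration, which is part of why a simple filter at this layer is not an obvious win.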

Dial pulls from this, and has its own duplicate detection through dialHistoryExpiration. When it finds a duplicate within the expiration time, it throws the candidate away and pulls more from discovery. It only throttles down when it starts getting close to maxpeers, but that will not happen if maxpeers is significantly larger than the network size.

If instead maxpeers is set correctly, dial will try to slow down by reducing its slots. But even with a single slot it will pull and discard in a loop, keeping discovery at full steam. Should we introduce some time-based mechanism here, e.g. throttling if checkDial fails too many times in a row?
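
A rough sketch of what such a time-based throttle could look like; none of these names exist in the dialer today and the thresholds are placeholders:

```go
package example

import "time"

// dialThrottle is an illustrative sketch of the time-based mechanism
// discussed above, not an existing component of the dial scheduler.
type dialThrottle struct {
	failStreak int
	maxStreak  int           // e.g. 16 rejected candidates in a row
	backoff    time.Duration // e.g. 10 * time.Second
}

// note records one checkDial outcome. After too many consecutive
// rejections it sleeps, slowing the candidate pull loop and, through
// back-pressure, the discovery iterators feeding it.
func (t *dialThrottle) note(accepted bool) {
	if accepted {
		t.failStreak = 0
		return
	}
	t.failStreak++
	if t.failStreak >= t.maxStreak {
		time.Sleep(t.backoff)
		t.failStreak = 0
	}
}
```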

cskiraly added 16 commits April 28, 2025 17:36
Simple version that does the filtering, but lacks pipelining, waiting for ENRs to be retrieved one by one.

Signed-off-by: Csaba Kiraly <[email protected]>
It is not guaranteed that Next will be called until exhaustion after Close has been called. Hence, we need to override the passed channel with an empty one.

Signed-off-by: Csaba Kiraly <[email protected]>
Signed-off-by: Csaba Kiraly <[email protected]>
Signed-off-by: Csaba Kiraly <[email protected]>

# Conflicts:
#	eth/backend.go
BufferIter wraps an iterator and prefetches up to a given
number of nodes from it.

Signed-off-by: Csaba Kiraly <[email protected]>
When Close was called while Next was active, there was a race on the closed channel. If Close finished before closed was received, this happened:

goroutine 22716 [chan send, 5 minutes]:
github.com/ethereum/go-ethereum/p2p/enode.NewBufferIter.func1()
	github.com/ethereum/go-ethereum/p2p/enode/iter.go:219 +0x77
created by github.com/ethereum/go-ethereum/p2p/enode.NewBufferIter in goroutine 1
	github.com/ethereum/go-ethereum/p2p/enode/iter.go:216 +0xd8

goroutine 22714 [chan receive (nil chan), 1 minutes]:
github.com/ethereum/go-ethereum/p2p/enode.(*BufferIter).Next(0xc00505fea0)
	github.com/ethereum/go-ethereum/p2p/enode/iter.go:226 +0x2d
github.com/ethereum/go-ethereum/p2p/enode.AsyncFilter.func1()
	github.com/ethereum/go-ethereum/p2p/enode/iter.go:151 +0x5f
created by github.com/ethereum/go-ethereum/p2p/enode.AsyncFilter in goroutine 1
	github.com/ethereum/go-ethereum/p2p/enode/iter.go:146 +0x156

Signed-off-by: Csaba Kiraly <[email protected]>
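
For context, the standard Go pattern that avoids both hangs in the trace above is to let every potentially blocking channel operation also select on the close signal, and never to replace the buffer channel with nil. This is a sketch of that pattern, not the actual fix applied in p2p/enode/iter.go:

```go
package example

import "github.com/ethereum/go-ethereum/p2p/enode"

// safeBufferIter sketches close-safe buffering. Two rules prevent the
// hangs seen in the trace: the feeding goroutine never sends into buf
// without also selecting on closed (the "chan send" hang), and Next never
// blocks on a channel that may be abandoned (the "nil chan" hang).
type safeBufferIter struct {
	src    enode.Iterator
	buf    chan *enode.Node
	closed chan struct{}
	cur    *enode.Node
}

func (it *safeBufferIter) feed() {
	defer close(it.buf)
	for it.src.Next() {
		select {
		case it.buf <- it.src.Node():
		case <-it.closed:
			return
		}
	}
}

func (it *safeBufferIter) Next() bool {
	select {
	case n, ok := <-it.buf:
		it.cur = n
		return ok
	case <-it.closed:
		return false
	}
}

func (it *safeBufferIter) Node() *enode.Node { return it.cur }

func (it *safeBufferIter) Close() {
	close(it.closed) // wakes both feed and any blocked Next
	it.src.Close()
}
```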
@fjl self-assigned this on Apr 29, 2025
@cskiraly (Contributor, Author) commented:
Testing on a small network (Sepolia), this is the first hour of operation before this PR:
[chart]
And after this PR:
[chart]
