p2p: make dial faster by streamlined discovery process #31678
Conversation
First hour of dial progress before this PR: [chart: number of outgoing peers]
First hour of dial progress after this PR (and #31592): [chart: number of outgoing peers]
Force-pushed from a5b1b9d to 016bf44
@fjl I was thinking about the increased UDP traffic. There is the case of a very small network, smaller than maxpeers, for example a small test network with only a few nodes. In that case discovery might go full steam continuously, looking for new dial candidates, and this PR would increase the background UDP traffic by roughly 3x.

If we think this is a problem, we might use discoveryParallelLookups=2 to be more conservative. Or should we introduce an explicit "smallnet" flag and tune parameters accordingly? Or can we just expect people to set maxpeers wisely for a small network? I would say the last option would be fine, but we need some modifications to make that work; see below.

A short description of what happens (or is supposed to happen) in detail. There are two cases:
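For context, a rough sketch of what running multiple lookups in parallel can look like, using the existing enode.NewFairMix merging iterator; discoveryParallelLookups and disc.RandomNodes() are stand-ins for the actual wiring, not this PR's exact code:

```go
// Sketch only: merge several independent lookup iterators so that
// multiple lookups (and their ENR retrievals) run concurrently.
// Each RandomNodes() call is assumed to drive its own lookup loop.
mix := enode.NewFairMix(100 * time.Millisecond) // assumed timeout
for i := 0; i < discoveryParallelLookups; i++ {
	mix.AddSource(disc.RandomNodes())
}
// The dial side then consumes mix via Next()/Node() as usual.
```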
In discovery, we only have duplicate filtering within a single lookup. Dial pulls from this stream and has its own duplicate detection. If instead maxpeers is set correctly, dial will try to slow down by reducing slots. But even with a single slot, it will pull and discard in a loop, making discovery go full steam. Should we introduce some time-based mechanism here, e.g. throttle if checkDial fails too many times in a row?
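One possible shape for such a time-based throttle, as a hedged sketch; checkDial here is a stand-in callback, and the threshold and cool-down values are assumptions, not the PR's actual code:

```go
const (
	maxConsecutiveFails = 16              // assumed threshold
	dialCoolDown        = 5 * time.Second // assumed pause
)

// dialLoop pulls candidates, but backs off when checkDial rejects
// many in a row, so a starved dialer cannot spin discovery at full
// steam just to discard everything it gets.
func dialLoop(it enode.Iterator, checkDial func(*enode.Node) error, dial func(*enode.Node)) {
	fails := 0
	for it.Next() {
		n := it.Node()
		if err := checkDial(n); err != nil {
			fails++
			if fails >= maxConsecutiveFails {
				time.Sleep(dialCoolDown) // stop pulling for a while
				fails = 0
			}
			continue
		}
		fails = 0
		dial(n)
	}
}
```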
Simple version that does the filtering, but misses pipelining, waiting for ENRs to be retrieved one-by-one. Signed-off-by: Csaba Kiraly <[email protected]>
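For reference, sequential filtering in the spirit of the existing enode.Filter looks roughly like this; because check may block on a network round-trip (an ENR request), candidates are handled strictly one-by-one:

```go
// Sketch of a sequential filter iterator: Next does not advance to
// the following candidate until check on the current one returns.
type filterIter struct {
	enode.Iterator
	check func(*enode.Node) bool
}

func (f *filterIter) Next() bool {
	for f.Iterator.Next() {
		if f.check(f.Node()) {
			return true
		}
	}
	return false
}
```

Pipelining would instead run several check calls concurrently and hand over whichever candidates pass first.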
It is not guaranteed that Next will be called until exhaustion after Close was called. Hence, we need to empty and override the passed channel. Signed-off-by: Csaba Kiraly <[email protected]>
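A minimal sketch of that idea, assuming a producer goroutine that closes its output channel when it exits (names are illustrative, not the PR's code):

```go
// closeAndDrain unblocks a producer that may be parked on a send:
// it signals shutdown, then discards anything already queued until
// the producer closes out. Without this, a consumer that stops
// calling Next leaves the producer blocked forever.
func closeAndDrain(out chan *enode.Node, stop chan struct{}) {
	close(stop)
	for range out {
		// discard buffered nodes; loop ends when out is closed
	}
}
```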
Signed-off-by: Csaba Kiraly <[email protected]>
# Conflicts:
#	eth/backend.go
BufferIter wraps an iterator and prefetches up to a given number of nodes from it. Signed-off-by: Csaba Kiraly <[email protected]>
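A self-contained sketch of such a wrapper over the enode.Iterator interface (illustrative, not the PR's exact implementation):

```go
// bufferIter prefetches up to size nodes from the wrapped iterator
// in a background goroutine, so lookups run ahead of the consumer.
type bufferIter struct {
	buf    chan *enode.Node
	closed chan struct{}
	cur    *enode.Node
}

func newBufferIter(it enode.Iterator, size int) *bufferIter {
	b := &bufferIter{
		buf:    make(chan *enode.Node, size),
		closed: make(chan struct{}),
	}
	go func() {
		defer close(b.buf) // lets Next and Close terminate
		defer it.Close()
		for it.Next() {
			select {
			case b.buf <- it.Node():
			case <-b.closed:
				return
			}
		}
	}()
	return b
}

func (b *bufferIter) Next() bool {
	n, ok := <-b.buf
	b.cur = n
	return ok
}

func (b *bufferIter) Node() *enode.Node { return b.cur }

// Close must be called at most once in this sketch.
func (b *bufferIter) Close() {
	close(b.closed)
	for range b.buf { // drain so the producer can exit
	}
}
```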
When Close was called while Next was active, there was a race on the closed channel. If Close finished before closed was received, this happened:

goroutine 22716 [chan send, 5 minutes]:
github.com/ethereum/go-ethereum/p2p/enode.NewBufferIter.func1()
	github.com/ethereum/go-ethereum/p2p/enode/iter.go:219 +0x77
created by github.com/ethereum/go-ethereum/p2p/enode.NewBufferIter in goroutine 1
	github.com/ethereum/go-ethereum/p2p/enode/iter.go:216 +0xd8

goroutine 22714 [chan receive (nil chan), 1 minutes]:
github.com/ethereum/go-ethereum/p2p/enode.(*BufferIter).Next(0xc00505fea0)
	github.com/ethereum/go-ethereum/p2p/enode/iter.go:226 +0x2d
github.com/ethereum/go-ethereum/p2p/enode.AsyncFilter.func1()
	github.com/ethereum/go-ethereum/p2p/enode/iter.go:151 +0x5f
created by github.com/ethereum/go-ethereum/p2p/enode.AsyncFilter in goroutine 1
	github.com/ethereum/go-ethereum/p2p/enode/iter.go:146 +0x156

Signed-off-by: Csaba Kiraly <[email protected]>
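The fix, sketched with illustrative names: the producer must never do a bare send on the buffer; selecting on the close signal means Close can always unblock it:

```go
// produce forwards nodes from it into buf. A bare send
// `buf <- it.Node()` is the stuck "chan send" goroutine above;
// with the select, the send aborts as soon as closed is signaled.
func produce(it enode.Iterator, buf chan *enode.Node, closed chan struct{}) {
	defer close(buf)
	defer it.Close()
	for it.Next() {
		select {
		case buf <- it.Node():
		case <-closed:
			return
		}
	}
}
```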
This PR improves the speed of discv4- and discv5-based discovery. It builds on #31592 and is to be rebased after that PR is merged.

Our dial process is rate-limited, but until now, during startup, discovery was too slow to serve dial. The bottleneck was discovery, not the rate limit in dial, resulting in a slow ramp-up of the outgoing peer count.

This PR makes discovery fast enough to serve dial's needs. Dial's rate limit still limits the discovery process, so we are not risking being too aggressive in our discovery.
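The coupling works through backpressure, roughly like this (newBufferIter refers to the illustrative sketch above; disc.RandomNodes(), the buffer size, and dialCandidate are stand-ins):

```go
// When dial's rate limit bites and Next is not called, the prefetch
// buffer fills up and the lookup producer blocks, so discovery
// cannot run arbitrarily far ahead of dial.
src := newBufferIter(disc.RandomNodes(), 8) // assumed buffer size
for src.Next() {
	dialCandidate(src.Node()) // rate-limited consumer (hypothetical)
}
```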
The PR adds: