
Commit 4a2cc73

runtime: don't attempt to steal from idle Ps
Work stealing is a scalability bottleneck in the scheduler. Since each P
has a work queue, work stealing must look at every P to determine if
there is any work.

The number of Ps scales linearly with GOMAXPROCS (i.e., the number of Ps
_is_ GOMAXPROCS), thus this work scales linearly with GOMAXPROCS.

Work stealing is a later attempt by a P to find work before it goes
idle. Since the P has no work of its own, extra costs here tend not to
directly affect application-level benchmarks. Where they show up is
extra CPU usage by the process as a whole. These costs get particularly
expensive for applications that transition between blocked and running
frequently.

Long term, we need a more scalable approach in general, but for now we
can make a simple observation: idle Ps ([1]) cannot possibly have
anything in their runq, so we need not bother checking at all.

We track idle Ps via a new global bitmap, updated in pidleput/pidleget.
This is already a slow path (requires sched.lock), so we don't expect
high contention there.

Using a single bitmap avoids the need to touch every P to read p.status.
Currently, the bitmap approach is not significantly better than reading
p.status. However, in a future CL I'd like to apply a similar
optimization to timers. Once done, findrunnable would not touch most Ps
at all (in mostly idle programs), which will avoid memory latency to
pull those Ps into cache.

When reading this bitmap, we are racing with Ps going in and out of
idle, so there are a few cases to consider:

1. _Prunning -> _Pidle: Running P goes idle after we check the bitmap.
In this case, we will try to steal (and find nothing) so there is no
harm.

2. _Pidle -> _Prunning while spinning: A P that starts running may queue
new work that we miss. This is OK: (a) that P cannot go back to sleep
without completing its work, and (b) more fundamentally, we will recheck
after we drop our P.

3. _Pidle -> _Prunning after spinning: After spinning, we really can
miss work from a newly woken P. (a) above still applies here as well,
but this is also the same delicate dance case described in findrunnable:
if nothing is spinning anymore, the other P will unpark a thread to run
the work it submits.

Benchmark results from WakeupParallel/syscall/pair/race/1ms (see
golang.org/cl/228577):

name                   old msec     new msec     delta
Perf-task-clock-8      250 ± 1%     247 ± 4%       ~      (p=0.690 n=5+5)
Perf-task-clock-16     258 ± 2%     259 ± 2%       ~      (p=0.841 n=5+5)
Perf-task-clock-32     284 ± 2%     270 ± 4%     -4.94%   (p=0.032 n=5+5)
Perf-task-clock-64     326 ± 3%     303 ± 2%     -6.92%   (p=0.008 n=5+5)
Perf-task-clock-128    407 ± 2%     363 ± 5%    -10.69%   (p=0.008 n=5+5)
Perf-task-clock-256    561 ± 1%     481 ± 1%    -14.20%   (p=0.016 n=4+5)
Perf-task-clock-512    840 ± 5%     683 ± 2%    -18.70%   (p=0.008 n=5+5)
Perf-task-clock-1024   1.38k ±14%   1.07k ± 2%  -21.85%   (p=0.008 n=5+5)

[1] "Idle Ps" here refers to _Pidle Ps in the sched.pidle list. In other
contexts, Ps may temporarily transition through _Pidle (e.g., in
handoffp); those Ps may have work.

Updates #28808
Updates #18237

Change-Id: Ieeb958bd72e7d8fb375b0b1f414e8d7378b14e29
Reviewed-on: https://go-review.googlesource.com/c/go/+/259578
Run-TryBot: Michael Pratt <[email protected]>
TryBot-Result: Go Bot <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
Reviewed-by: Austin Clements <[email protected]>
Trust: Michael Pratt <[email protected]>
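The bitmap mechanics described above (one bit per P, atomic read on the steal path, atomic set/clear under sched.lock) can be sketched outside the runtime. This is a hypothetical illustration using `sync/atomic` rather than the runtime's internal atomics; the runtime's `pIdleMask` uses `atomic.Or`/`atomic.And` directly, so the CAS loops below stand in for those:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// idleMask mirrors the pIdleMask idea: bit i set means P i is idle
// (on the sched.pidle list) and thus cannot have work in its runq.
type idleMask []uint32

// read reports whether P id is marked idle. Lock-free; callers
// tolerate the races described in the commit message.
func (m idleMask) read(id uint32) bool {
	word := id / 32
	bit := uint32(1) << (id % 32)
	return atomic.LoadUint32(&m[word])&bit != 0
}

// set marks P id as idle (atomic OR emulated with a CAS loop).
func (m idleMask) set(id uint32) {
	word := id / 32
	bit := uint32(1) << (id % 32)
	for {
		old := atomic.LoadUint32(&m[word])
		if atomic.CompareAndSwapUint32(&m[word], old, old|bit) {
			return
		}
	}
}

// clear marks P id as non-idle (atomic AND-NOT via CAS loop).
func (m idleMask) clear(id uint32) {
	word := id / 32
	bit := uint32(1) << (id % 32)
	for {
		old := atomic.LoadUint32(&m[word])
		if atomic.CompareAndSwapUint32(&m[word], old, old&^bit) {
			return
		}
	}
}

func main() {
	const gomaxprocs = 40
	m := make(idleMask, (gomaxprocs+31)/32) // 2 words cover 40 Ps
	m.set(3)
	m.set(35)
	fmt.Println(m.read(3), m.read(4), m.read(35)) // true false true
	m.clear(3)
	fmt.Println(m.read(3)) // false
}
```

In the runtime itself, set and clear are only called with sched.lock held (in pidleput/pidleget), so the single-writer invariant is stronger than this sketch requires; only read races with concurrent updates.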
1 parent 44dbeaf commit 4a2cc73

File tree

2 files changed (+82, -8 lines)

src/runtime/proc.go

Lines changed: 72 additions & 6 deletions
@@ -2295,8 +2295,12 @@ top:
 			if _p_ == p2 {
 				continue
 			}
-			if gp := runqsteal(_p_, p2, stealRunNextG); gp != nil {
-				return gp, false
+
+			// Don't bother to attempt to steal if p2 is idle.
+			if !idlepMask.read(enum.position()) {
+				if gp := runqsteal(_p_, p2, stealRunNextG); gp != nil {
+					return gp, false
+				}
 			}

 			// Consider stealing timers from p2.
@@ -2307,8 +2311,13 @@ top:
 			// and is not marked for preemption. If p2 is running
 			// and not being preempted we assume it will handle its
 			// own timers.
+			//
 			// If we're still looking for work after checking all
 			// the P's, then go ahead and steal from an active P.
+			//
+			// TODO(prattmic): Maintain a global look-aside similar
+			// to idlepMask to avoid looking at p2 if it can't
+			// possibly have timers.
 			if i > 2 || (i > 1 && shouldStealTimers(p2)) {
 				tnow, w, ran := checkTimers(p2, now)
 				now = tnow
@@ -2379,6 +2388,9 @@ stop:
 	// safe-points. We don't need to snapshot the contents because
 	// everything up to cap(allp) is immutable.
 	allpSnapshot := allp
+	// Also snapshot idlepMask. Value changes are OK, but we can't allow
+	// len to change out from under us.
+	idlepMaskSnapshot := idlepMask

 	// return P and block
 	lock(&sched.lock)
@@ -2419,8 +2431,8 @@ stop:
 	}

 	// check all runqueues once again
-	for _, _p_ := range allpSnapshot {
-		if !runqempty(_p_) {
+	for id, _p_ := range allpSnapshot {
+		if !idlepMaskSnapshot.read(uint32(id)) && !runqempty(_p_) {
 			lock(&sched.lock)
 			_p_ = pidleget()
 			unlock(&sched.lock)
@@ -4398,6 +4410,8 @@ func procresize(nprocs int32) *p {
 	}
 	sched.procresizetime = now

+	maskWords := (nprocs+31) / 32
+
 	// Grow allp if necessary.
 	if nprocs > int32(len(allp)) {
 		// Synchronize with retake, which could be running
@@ -4412,6 +4426,15 @@ func procresize(nprocs int32) *p {
 		copy(nallp, allp[:cap(allp)])
 		allp = nallp
 	}
+
+	if maskWords <= int32(cap(idlepMask)) {
+		idlepMask = idlepMask[:maskWords]
+	} else {
+		nidlepMask := make([]uint32, maskWords)
+		// No need to copy beyond len, old Ps are irrelevant.
+		copy(nidlepMask, idlepMask)
+		idlepMask = nidlepMask
+	}
 	unlock(&allpLock)
 }

@@ -4470,6 +4493,7 @@ func procresize(nprocs int32) *p {
 	if int32(len(allp)) != nprocs {
 		lock(&allpLock)
 		allp = allp[:nprocs]
+		idlepMask = idlepMask[:maskWords]
 		unlock(&allpLock)
 	}

@@ -5153,8 +5177,46 @@ func globrunqget(_p_ *p, max int32) *g {
 	return gp
 }

-// Put p to on _Pidle list.
+// pIdleMask is a bitmap of Ps in the _Pidle list, one bit per P.
+type pIdleMask []uint32
+
+// read returns true if P id is in the _Pidle list, and thus cannot have work.
+func (p pIdleMask) read(id uint32) bool {
+	word := id / 32
+	mask := uint32(1) << (id % 32)
+	return (atomic.Load(&p[word]) & mask) != 0
+}
+
+// set sets P id as idle in mask.
+//
+// Must be called only for a P owned by the caller. In order to maintain
+// consistency, a P going idle must set the idle mask simultaneously with
+// updates to the idle P list under sched.lock, otherwise a racing pidleget
+// may clear the mask before pidleput sets the mask, corrupting the bitmap.
+//
+// N.B., procresize takes ownership of all Ps in stopTheWorldWithSema.
+func (p pIdleMask) set(id int32) {
+	word := id / 32
+	mask := uint32(1) << (id % 32)
+	atomic.Or(&p[word], mask)
+}
+
+// clear sets P id as non-idle in mask.
+//
+// See comment on set.
+func (p pIdleMask) clear(id int32) {
+	word := id / 32
+	mask := uint32(1) << (id % 32)
+	atomic.And(&p[word], ^mask)
+}
+
+// pidleput puts p on the _Pidle list.
+//
+// This releases ownership of p. Once sched.lock is released it is no longer
+// safe to use p.
+//
 // sched.lock must be held.
+//
 // May run during STW, so write barriers are not allowed.
 //go:nowritebarrierrec
 func pidleput(_p_ *p) {
@@ -5163,20 +5225,24 @@ func pidleput(_p_ *p) {
 	if !runqempty(_p_) {
 		throw("pidleput: P has non-empty run queue")
 	}
+	idlepMask.set(_p_.id)
 	_p_.link = sched.pidle
 	sched.pidle.set(_p_)
 	atomic.Xadd(&sched.npidle, 1) // TODO: fast atomic
 }

-// Try get a p from _Pidle list.
+// pidleget tries to get a p from the _Pidle list, acquiring ownership.
+//
 // sched.lock must be held.
+//
 // May run during STW, so write barriers are not allowed.
 //go:nowritebarrierrec
 func pidleget() *p {
 	assertLockHeld(&sched.lock)

 	_p_ := sched.pidle.ptr()
 	if _p_ != nil {
+		idlepMask.clear(_p_.id)
 		sched.pidle = _p_.link
 		atomic.Xadd(&sched.npidle, -1) // TODO: fast atomic
 	}
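The procresize hunks above size the mask at one bit per P, `(nprocs+31)/32` words, reslicing in place when capacity suffices and reallocating otherwise. A standalone sketch of that arithmetic (`resizeMask` is a hypothetical helper, not runtime code):

```go
package main

import "fmt"

// resizeMask grows or shrinks a one-bit-per-P mask to cover nprocs Ps,
// mirroring procresize: reslice when capacity suffices, otherwise
// allocate a larger word slice and copy the existing words. As in the
// runtime, contents beyond the old len are irrelevant (old Ps are gone).
func resizeMask(mask []uint32, nprocs int32) []uint32 {
	maskWords := (nprocs + 31) / 32
	if maskWords <= int32(cap(mask)) {
		return mask[:maskWords]
	}
	nmask := make([]uint32, maskWords)
	copy(nmask, mask)
	return nmask
}

func main() {
	m := resizeMask(nil, 1) // 1 P needs 1 word
	fmt.Println(len(m))
	m = resizeMask(m, 64) // 64 Ps need 2 words
	fmt.Println(len(m))
	m = resizeMask(m, 33) // shrinking to 33 Ps still needs 2 words
	fmt.Println(len(m))
}
```

Note the shrink path is a pure reslice: the bits for discarded Ps stay in the backing array but are unreachable, which is fine because procresize re-idles the surviving Ps under sched.lock anyway.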

src/runtime/runtime2.go

Lines changed: 10 additions & 2 deletions
@@ -1035,14 +1035,22 @@ func (w waitReason) String() string {
 var (
 	allglen    uintptr
 	allm       *m
-	allp       []*p  // len(allp) == gomaxprocs; may change at safe points, otherwise immutable
-	allpLock   mutex // Protects P-less reads of allp and all writes
 	gomaxprocs int32
 	ncpu       int32
 	forcegc    forcegcstate
 	sched      schedt
 	newprocs   int32

+	// allpLock protects P-less reads and size changes of allp and
+	// idlepMask, and all writes to allp.
+	allpLock mutex
+	// len(allp) == gomaxprocs; may change at safe points, otherwise
+	// immutable.
+	allp []*p
+	// Bitmask of Ps in _Pidle list, one bit per P. Reads and writes must
+	// be atomic. Length may change at safe points.
+	idlepMask pIdleMask
+
 	// Information about what cpu features are available.
 	// Packages outside the runtime should not use these
 	// as they are not an external api.