runtime: TestGoexitCrash failure on linux-ppc64le-buildlet #34575
@danscales @mknyszek @aclements: could someone from the runtime team take a look at this failure and at least assess whether it's something we should fix in 1.14?
I will take a look and see if I can reproduce it or make a guess as to what might be happening.
I am able to reproduce it once every few times when I run the test command on the ppc64-linux-buildlet (each run itself executes the test 500 times).
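The original command isn't shown in this copy of the thread; a representative invocation, assuming the standard `-run` and `-count` flags of `go test`, would be something like:

```
go test -run=TestGoexitCrash -count=500 runtime
```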
When it fails, the test seems to hang (somehow the normal "deadlock" detection of "no goroutines and no main thread because Goexit was called" doesn't happen), and then something forces a SIGQUIT after 60 seconds. The same test command run locally on amd64 never fails. The actual test program (which is supposed to deadlock when all other goroutines end and main does the Goexit) looks roughly like the sketch below.
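A minimal, self-contained sketch of a program with that shape (illustrative only; the real test program lives in the runtime's testdata and may differ in detail):

```go
package main

import (
	"runtime"
	"time"
)

func main() {
	// A short-lived goroutine, so main is not the only goroutine at first.
	go func() {
		time.Sleep(time.Millisecond)
	}()

	// Allocate something with a finalizer and force a GC cycle; per the
	// discussion below, the hang only reproduces with runtime.GC() present.
	i := 0
	runtime.SetFinalizer(&i, func(p *int) {})
	runtime.GC()

	// main exits via Goexit. Once every other goroutine has finished, the
	// runtime is expected to report:
	//   "no goroutines (main called runtime.Goexit) - deadlock!"
	runtime.Goexit()
}
```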
I can still reproduce the problem if I comment out other parts of the test program, though not if the runtime.GC() call is removed. I'll keep investigating and check whether it is present in 1.13. However, it seems unlikely that this has to be fixed for 1.14, since it is so rare and only happens when all other goroutines finish and main exits via Goexit (which is most likely a programming mistake).
OK, it definitely also happens in go1.13. I haven't been able to reproduce it in go1.12 yet, so it might be a slight regression in the go1.13 timeframe.
Is it more likely if you run with GOGC=20 or something very low? I ran into problems with this test when stress testing other changes, and I think it was because of the stress testing, not the changes (but I'm not at a computer right now). My hunch was that some GC-related goroutines or maybe the scavenger were interfering with deadlock detection.

@mknyszek, I recall you looking into a similar problem with the scavenger a while ago, but I don't remember the outcome of that.
It turns out that the problem is present in the latest release of go1.13, but not in the very first releases of 1.13 back in August, so I did a git bisect. The change it came up with (not saying this is definitive at all) was:

runtime: redefine scavenge goal in terms of heap_inuse [mknyszek]

So, good guess that it might be GC/scavenger-related. Also, as I mentioned, it isn't reproducible at any commit if runtime.GC() is removed. Not sure why this would show up only on ppc64. Will update further when I get a chance to try out GOGC=20.
@danscales thanks for the bisection, that helps a lot.

I printed out some diagnostic information and ran the program until it hung. In the cases where it hung, the scavenger had turned on but consistently found no work to do, and so fell into its exponential back-off case. Since the program is no longer making progress at that point, the scavenger will never get memory to scavenge, so it sits there for all eternity, preventing the deadlock detector from firing. There are indeed cases where the scavenger turned on but there was no hang, because it found memory to scavenge and achieved its goal, thereby turning off and letting deadlock detection kick in. The fact that removing runtime.GC() avoids the problem presumably ties into this as well.

So, now the question is why the scavenger turns on at all if there's no work to do. My first guess was an accounting problem in the pacing: there's a 5-page (40960 byte) discrepancy in the numbers the scavenger is working from, which indicates either a bug in the scavenging/treap code or 5 pages not being accounted for correctly. I'll dig further.
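To make that failure mode concrete, here is a small sketch of the back-off behaviour described above; it is not the runtime's scavenger code, and `findWork`/`backoff` are made-up names:

```go
package main

import (
	"fmt"
	"time"
)

// findWork stands in for the scavenger asking "is there memory I can return
// to the OS?"; in the hung runs the answer is always no.
func findWork() int { return 0 }

func main() {
	backoff := time.Millisecond
	// The real loop never exits; it is bounded here only so the demo stops.
	for i := 0; i < 8; i++ {
		if released := findWork(); released > 0 {
			backoff = time.Millisecond // progress: reset the back-off
			continue
		}
		// No work found: sleep with exponential back-off. A background
		// goroutine parked in a loop like this keeps the process in a
		// state where the expected deadlock report never appears.
		fmt.Println("no work, sleeping for", backoff)
		time.Sleep(backoff)
		backoff *= 2
	}
}
```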
OK, it appears to be neither of the things I thought it would be.

It turns out there's a chance the computed rate could end up as +Inf, which means that "retained-want" (in the scavenger's calculation) is going to be some nonsense number. This is the crux of the problem. The reason we see this on ppc64 and not on other platforms is the higher system page size of 64 KiB: with small heaps it's more likely that we end up with less work than a single page, so we fall into a situation where the scavenger should scavenge something (according to one set of sane pacing parameters) but calculates that it's always ahead of schedule because of the nonsense number, and it gets stuck in a loop.

The fix is easy: never let the nonsense number happen, by either always rounding up to one physical page worth of work, or by turning off the scavenger when there isn't at least one physical page worth of work. Both avoid the divide by zero that causes the +Inf, and the nonsense number later on.

This is not a performance problem or anything else in real running applications, because the nonsense could dissipate in the following GC cycle, or the scavenger will harmlessly back off and do nothing (when there's really so little work to do that it doesn't matter). It is a problem for deadlock detection, though, which is useful for teaching, so I'll put up a fix. Should we backport this?
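A tiny sketch of the arithmetic with hypothetical numbers (not the runtime's actual pacing formula): when the outstanding work is smaller than one 64 KiB physical page, a page count computed by integer division rounds to zero, and a rate derived by dividing by it becomes +Inf, so every "ahead of schedule?" comparison succeeds.

```go
package main

import "fmt"

func main() {
	const physPageSize = 64 << 10 // 64 KiB, as on linux/ppc64le

	// Hypothetical: the scavenge goal is only 40960 bytes (5 runtime pages)
	// below what is currently retained, i.e. less than one physical page.
	workBytes := uint64(40960)
	workPages := workBytes / physPageSize // integer division: rounds to 0

	// Float division by zero does not panic in Go; it yields +Inf, the
	// "nonsense number" that then poisons the pacing.
	nsPerPage := float64(1e9) / float64(workPages)
	fmt.Println(workPages, nsPerPage) // 0 +Inf

	// Any finite elapsed time compares as ahead of an infinite budget, so
	// the scavenger always concludes it has nothing to do right now.
	fmt.Println(12345.0 < nsPerPage) // true
}
```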
Uploaded a fix for this. With the patch, I can run the test 100,000 consecutive times on linux/ppc64 without issue.
Change https://golang.org/cl/203517 mentions this issue.
Moved to the Go 1.14 milestone just because the fix is already up for review, is small, and close to landing.
Observed on the linux-ppc64le-buildlet builder (https://build.golang.org/log/b524842fe441d0c1d47adad4cde878daed7bfc76).

CC @danscales @mknyszek; see previously #31966