runtime: TestGoexitCrash failure on linux-ppc64le-buildlet #34575
@danscales @mknyszek @aclements: could someone from the runtime team take a look at this failure and at least assess whether it's something we should fix in 1.14?
I will take a look and see if I can reproduce it or make a guess as to what might be happening.
I am able to reproduce it once every few times when I run the test command on the ppc64-linux-buildlet (each run itself executes the test 500 times).
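The original command isn't shown in this copy of the thread; a representative invocation, assuming the standard `-run` and `-count` flags of `go test`, would be something like:

```
go test -run=TestGoexitCrash -count=500 runtime
```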
When it fails, the test seems to hang (somehow the normal "deadlock" detection of "no goroutines and no main thread because Goexit was called" doesn't happen), and then something forces a SIGQUIT after 60 seconds. The same test command run locally on amd64 never fails. The actual test program (which is supposed to deadlock when all other goroutines end and main does the Goexit) looks roughly like the sketch below.
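A minimal, self-contained sketch of a program with that shape (illustrative only; the real test program lives in the runtime's testdata and may differ in detail):

```go
package main

import (
	"runtime"
	"time"
)

func main() {
	// A short-lived goroutine, so main is not the only goroutine at first.
	go func() {
		time.Sleep(time.Millisecond)
	}()

	// Allocate something with a finalizer and force a GC cycle; per the
	// discussion below, the hang only reproduces with runtime.GC() present.
	i := 0
	runtime.SetFinalizer(&i, func(p *int) {})
	runtime.GC()

	// main exits via Goexit. Once every other goroutine has finished, the
	// runtime is expected to report:
	//   "no goroutines (main called runtime.Goexit) - deadlock!"
	runtime.Goexit()
}
```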
I can still reproduce the problem if I comment out other parts of the test program, though not if the runtime.GC() call is removed. I'll keep investigating and check whether it is present in 1.13. However, it seems unlikely that this has to be fixed for 1.14, since it is so rare and only happens when all other goroutines finish and main exits via Goexit (which is most likely a programming mistake).
OK, it definitely also happens in go1.13. I haven't been able to reproduce it in go1.12 yet, so it might be a slight regression in the go1.13 timeframe.
Is it more likely if you run with GOGC=20 or something very low? I ran into problems with this test when stress testing other changes, and I think it was because of the stress testing, not the changes (but I'm not at a computer right now). My hunch was that some GC-related goroutines or maybe the scavenger were interfering with deadlock detection.

@mknyszek, I recall you looking into a similar problem with the scavenger a while ago, but I don't remember the outcome of that.
It turns out that the problem is present in the latest release of go1.13, but not in the very first releases of 1.13 back in August, so I did a git bisect. The change it came up with (not saying this is definitive at all) was:

runtime: redefine scavenge goal in terms of heap_inuse [mknyszek]

So, good guess that it might be GC/scavenger-related. Also, as I mentioned, it isn't reproducible at any commit if runtime.GC() is removed. Not sure why this would show up only on ppc64. Will update further when I get a chance to try out GOGC=20.
@danscales thanks for the bisection, that helps a lot.

I printed out some diagnostic information and ran the program until it hung. In the cases where it hung, the scavenger had turned on but consistently found no work to do, and so fell into its exponential back-off case. Since the program is no longer making progress at that point, the scavenger will never get memory to scavenge, so it sits there for all eternity, preventing the deadlock detector from firing. There are indeed cases where the scavenger turned on but there was no hang, because it found memory to scavenge and achieved its goal, thereby turning off and letting deadlock detection kick in. The fact that removing runtime.GC() avoids the problem presumably ties into this as well.

So, now the question is why the scavenger turns on at all if there's no work to do. My first guess was an accounting problem in the pacing: there's a 5-page (40960 byte) discrepancy in the numbers the scavenger is working from, which indicates either a bug in the scavenging/treap code or 5 pages not being accounted for correctly. I'll dig further.
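To make that failure mode concrete, here is a small sketch of the back-off behaviour described above; it is not the runtime's scavenger code, and `findWork`/`backoff` are made-up names:

```go
package main

import (
	"fmt"
	"time"
)

// findWork stands in for the scavenger asking "is there memory I can return
// to the OS?"; in the hung runs the answer is always no.
func findWork() int { return 0 }

func main() {
	backoff := time.Millisecond
	// The real loop never exits; it is bounded here only so the demo stops.
	for i := 0; i < 8; i++ {
		if released := findWork(); released > 0 {
			backoff = time.Millisecond // progress: reset the back-off
			continue
		}
		// No work found: sleep with exponential back-off. A background
		// goroutine parked in a loop like this keeps the process in a
		// state where the expected deadlock report never appears.
		fmt.Println("no work, sleeping for", backoff)
		time.Sleep(backoff)
		backoff *= 2
	}
}
```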
OK, it appears to be neither of the things I thought it would be.

It turns out there's a chance the computed rate could end up as +Inf, which means that "retained-want" (in the scavenger's calculation) is going to be some nonsense number. This is the crux of the problem. The reason we see this on ppc64 and not on other platforms is the higher system page size of 64 KiB: with small heaps it's more likely that we end up with less work than a single page, so we fall into a situation where the scavenger should scavenge something (according to one set of sane pacing parameters) but calculates that it's always ahead of schedule because of the nonsense number, and it gets stuck in a loop.

The fix is easy: never let the nonsense number happen, by either always rounding up to one physical page worth of work, or by turning off the scavenger when there isn't at least one physical page worth of work. Both avoid the divide by zero that causes the +Inf, and the nonsense number later on.

This is not a performance problem or anything else in real running applications, because the nonsense could dissipate in the following GC cycle, or the scavenger will harmlessly back off and do nothing (when there's really so little work to do that it doesn't matter). It is a problem for deadlock detection, though, which is useful for teaching, so I'll put up a fix. Should we backport this?
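A tiny sketch of the arithmetic with hypothetical numbers (not the runtime's actual pacing formula): when the outstanding work is smaller than one 64 KiB physical page, a page count computed by integer division rounds to zero, and a rate derived by dividing by it becomes +Inf, so every "ahead of schedule?" comparison succeeds.

```go
package main

import "fmt"

func main() {
	const physPageSize = 64 << 10 // 64 KiB, as on linux/ppc64le

	// Hypothetical: the scavenge goal is only 40960 bytes (5 runtime pages)
	// below what is currently retained, i.e. less than one physical page.
	workBytes := uint64(40960)
	workPages := workBytes / physPageSize // integer division: rounds to 0

	// Float division by zero does not panic in Go; it yields +Inf, the
	// "nonsense number" that then poisons the pacing.
	nsPerPage := float64(1e9) / float64(workPages)
	fmt.Println(workPages, nsPerPage) // 0 +Inf

	// Any finite elapsed time compares as ahead of an infinite budget, so
	// the scavenger always concludes it has nothing to do right now.
	fmt.Println(12345.0 < nsPerPage) // true
}
```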
Uploaded a fix for this. With the patch, I can run the test 100,000 consecutive times on linux/ppc64 without issue.
Change https://golang.org/cl/203517 mentions this issue.
Moved to the Go 1.14 milestone just because the fix is already up for review, is small, and close to landing.
Observed on the linux-ppc64le-buildlet builder (https://build.golang.org/log/b524842fe441d0c1d47adad4cde878daed7bfc76).

CC @danscales @mknyszek; see previously #31966