runtime: segmentation fault on linux/amd64 #49370

mknyszek · 2021-11-05T01:28:28Z

There's been a slew of segmentation fault (core dumped) failures (without additional context) on various linux/amd64 builders as of 961aab2 (AFAICT).

2021-11-04T21:52:51-1e0c3b2/linux-amd64
2021-11-04T21:52:51-1e0c3b2/linux-amd64-bullseye
2021-11-04T21:52:36-8ad0a7e/linux-amd64
2021-11-04T21:52:36-8ad0a7e/linux-amd64-fedora
2021-11-04T21:52:36-8ad0a7e/linux-amd64-staticlockranking
2021-11-04T21:52:06-37634ee/linux-amd64-fedora
2021-11-04T21:33:23-71fc881/linux-amd64-unified
2021-11-04T20:01:10-9b2dd1f/linux-amd64-sid
2021-11-04T20:00:54-961aab2/linux-amd64-unified

Notably, the failure always seems to pop up in the same place in the runtime tests (around 5.2 to 5.5 seconds), probably because it's the same single test failing. I've been unable to reproduce it on my VM, or on the gomotes, despite many all.bash runs.

My guess is that this is another failure related to a bad test that previously never had a GC run during the test, but now because of 961aab2 and the smaller minimum heap size, does (I have fixed 4 failures like this so far; it's not proof, but it does appear to be part of the pattern).

The text was updated successfully, but these errors were encountered:

mknyszek · 2021-11-05T17:24:35Z

I'm trying to reproduce this with 10 gomotes, ignoring other flakes and failures on the dashboard. I upgraded my tooling for this so I should be able to run it indefinitely and get back a core file when it does actually fail.

mknyszek · 2021-11-05T18:27:53Z

I reproduced it but failed to produce a core.

mknyszek · 2021-11-05T21:56:11Z

Reproduced again, no core file. Trying a different strategy to try to get the core file to be produced...

mknyszek · 2021-11-05T23:09:16Z

Great news! I have cores. The problem is I have 97 of them, but all of them are indeed from this test run. So one of them should capture the failure. In theory.

mknyszek · 2021-11-05T23:24:22Z

OK, that was easier than I thought. Backtrace:

#0  runtime.gcWriteBarrier ()
    at /usr/local/google/home/mknyszek/tmp/dlgo-linux-amd64/go/src/runtime/asm_amd64.s:1622
#1  0x00000000004726e7 in runtime.gcWriteBarrierCX ()
    at /usr/local/google/home/mknyszek/tmp/dlgo-linux-amd64/go/src/runtime/asm_amd64.s:1697
#2  0x00000000004679d4 in runtime.(*debugCallHandler).inject (h=0x657350, info=0xc000115bf0,
    ctxt=<optimized out>, gp2=<optimized out>)
    at /usr/local/google/home/mknyszek/tmp/dlgo-linux-amd64/go/src/runtime/export_debug_test.go:133
#3  0x0000000000475172 in runtime.(*debugCallHandler).inject-fm ()
    at /usr/local/google/home/mknyszek/tmp/dlgo-linux-amd64/go/src/runtime/export_debug_test.go:109
#4  0x000000000044fdc4 in runtime.sighandler (sig=5, info=<optimized out>, ctxt=<optimized out>,
    gp=0xc0000f8048) at /usr/local/google/home/mknyszek/tmp/dlgo-linux-amd64/go/src/runtime/signal_unix.go:606
#5  0x000000000044f8e5 in runtime.sigtrampgo (sig=5, info=0xc000115bf0, ctx=0xc000115ac0)
    at /usr/local/google/home/mknyszek/tmp/dlgo-linux-amd64/go/src/runtime/signal_unix.go:465
#6  0x0000000000474d8e in runtime.sigtrampgo (sig=5, info=0xc000115bf0, ctx=0xc000115ac0) at <autogenerated>:1
#7  0x00000000004741dd in runtime.sigtramp ()
    at /usr/local/google/home/mknyszek/tmp/dlgo-linux-amd64/go/src/runtime/sys_linux_amd64.s:361

Sadly, this is something I'm aware of. The debug call injection tests are riddled with write barriers that are not safe to execute. The quick fix here is to disable the GC for these tests. Notably some of these tests call runtime.GC in the call injection handler -- that's totally fine, the issue is if a GC is currently executing while the handler is running.

Note also that this is a test-only failure. Real-world implementations of the debug call protocol (like in delve) don't have this problem, because the handler is implemented in another process. In the past I made an effort to remove these write barriers, but it's surprisingly difficult. There are a lot of them, and arranging for something different for all of them would be a pain.

I propose disabling the GOGC-based GC during the test to unblock the longtest builder, and then taking some time later to rewrite this entire handler to be careful not to use write barriers (and to be annotated with go:nowritebarrierrec). Another useful note: in theory the handler should already have had write barriers turned off! It's called from a signal handler. Unfortunately it's called indirectly, so the static analysis that enforces this fails.

mknyszek · 2021-11-05T23:24:53Z

CC @prattmic who I discussed this with in the past.

gopherbot · 2021-11-05T23:41:22Z

Change https://golang.org/cl/361896 mentions this issue: runtime: disable GC during debug call tests

gopherbot · 2021-12-06T23:04:26Z

Change https://golang.org/cl/369751 mentions this issue: runtime: clean up redundant calls to SetGCPercent in debug_test.go

SetGCPercent(-1) is called by several tests in debug_test.go (followed by a call to runtime.GC) due to #49370. However, startDebugCallWorker already actually has this, just without the runtime.GC call (allowing an in-progress GC to still mess up the test). This CL consolidates SetGCPercent into startDebugDebugCallWorker where applicable. Change-Id: Ifa12d6a911f1506e252d3ddf03004cf2ab3f4ee4 Reviewed-on: https://go-review.googlesource.com/c/go/+/369751 Trust: Michael Knyszek <[email protected]> Run-TryBot: Michael Knyszek <[email protected]> TryBot-Result: Gopher Robot <[email protected]> Reviewed-by: David Chase <[email protected]>

gopherbot · 2021-12-07T05:31:36Z

Change https://golang.org/cl/369815 mentions this issue: runtime: fix comments on the behavior of SetGCPercent

mknyszek · 2021-12-07T05:32:04Z

In investigating the root causes for #49680, #49695, and #45867, I discovered that the logic behind the fix here was erroneous. In fact, I suspect that the cause of these failures was that just one of the tests was missing a call to SetGCPercent(-1) and that the additional runtime.GC calls are actually pointless. I'm not going to pull them out for 1.18 because that risks introducing flaky tests, but I sent a CL that leaves TODOs to remove them.

Fixes for #49680, #49695, #45867, and #49370 all assumed that SetGCPercent(-1) doesn't block until the GC's mark phase is done, but it actually does. The cause of 3 of those 4 failures comes from the fact that at the beginning of the sweep phase, the GC does try to preempt every P once, and this may run concurrently with test code. In the fourth case, the issue was likely that only *one* of the debug_test.go tests was missing a call to SetGCPercent(-1). Just to be safe, leave a TODO there for now to remove the extraneous runtime.GC calls, but leave the calls in. Updates #49680, #49695, #45867, and #49370. Change-Id: Ibf4e64addfba18312526968bcf40f1f5d54eb3f1 Reviewed-on: https://go-review.googlesource.com/c/go/+/369815 Reviewed-by: Austin Clements <[email protected]> Trust: Michael Knyszek <[email protected]> Run-TryBot: Michael Knyszek <[email protected]> TryBot-Result: Gopher Robot <[email protected]>

mknyszek added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Linux release-blocker labels Nov 5, 2021

mknyszek added this to the Go1.18 milestone Nov 5, 2021

mknyszek self-assigned this Nov 6, 2021

gopherbot closed this as completed in 7ee3f14 Nov 8, 2021

mknyszek mentioned this issue Nov 8, 2021

runtime: debug call injection tests contain many unsafe write barriers #49169

Open

bcmills mentioned this issue Mar 10, 2022

runtime: test failure with segmentation fault (and no other output) on linux-ppc64le-buildlet #51600

Open

rsc unassigned mknyszek Jun 23, 2022

golang locked and limited conversation to collaborators Jun 23, 2023

gopherbot added the FrozenDueToAge label Jun 23, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

runtime: segmentation fault on linux/amd64 #49370

runtime: segmentation fault on linux/amd64 #49370

mknyszek commented Nov 5, 2021

mknyszek commented Nov 5, 2021

Uh oh!

mknyszek commented Nov 5, 2021

Uh oh!

mknyszek commented Nov 5, 2021

Uh oh!

mknyszek commented Nov 5, 2021

Uh oh!

mknyszek commented Nov 5, 2021

Uh oh!

mknyszek commented Nov 5, 2021

Uh oh!

gopherbot commented Nov 5, 2021

Uh oh!

gopherbot commented Dec 6, 2021

Uh oh!

gopherbot commented Dec 7, 2021

Uh oh!

mknyszek commented Dec 7, 2021

Uh oh!

runtime: segmentation fault on linux/amd64 #49370

runtime: segmentation fault on linux/amd64 #49370

Comments

mknyszek commented Nov 5, 2021

mknyszek commented Nov 5, 2021

Uh oh!

mknyszek commented Nov 5, 2021

Uh oh!

mknyszek commented Nov 5, 2021

Uh oh!

mknyszek commented Nov 5, 2021

Uh oh!

mknyszek commented Nov 5, 2021

Uh oh!

mknyszek commented Nov 5, 2021

Uh oh!

gopherbot commented Nov 5, 2021

Uh oh!

gopherbot commented Dec 6, 2021

Uh oh!

gopherbot commented Dec 7, 2021

Uh oh!

mknyszek commented Dec 7, 2021

Uh oh!