runtime: segmentation fault on linux/amd64 #49370
Comments
I'm trying to reproduce this with 10 gomotes, ignoring other flakes and failures on the dashboard. I upgraded my tooling for this so I should be able to run it indefinitely and get back a core file when it does actually fail.
I reproduced it but failed to produce a core.
Reproduced again, no core file. Trying a different strategy to try to get the core file to be produced...
Great news! I have cores. The problem is I have 97 of them, but all of them are indeed from this test run. So one of them should capture the failure. In theory.
OK, that was easier than I thought. Backtrace:
Sadly, this is something I'm aware of. The debug call injection tests are riddled with write barriers that are not safe to execute. The quick fix here is to disable the GC for these tests. Notably some of these tests call

Note also that this is a test-only failure. Real-world implementations of the debug call protocol (like in delve) don't have this problem, because the handler is implemented in another process.

In the past I made an effort to remove these write barriers, but it's surprisingly difficult. There are a lot of them, and arranging for something different for all of them would be a pain.

I propose disabling the GOGC-based GC during the test to unblock the longtest builder, and then taking some time later to rewrite this entire handler to be careful not to use write barriers (and to be annotated with
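As a rough illustration of the quick fix proposed above (turning off the GOGC-based GC around the sensitive part of a test), here is a minimal sketch. The helpers `withGCDisabled` and `gcPercent` are hypothetical names made up for this example and are not taken from the runtime tests; only `debug.SetGCPercent` is a real standard library call.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// withGCDisabled is a hypothetical helper (not part of the Go tree).
// It turns off the GOGC-based collector while f runs and restores the
// previous setting afterwards, even if f panics.
func withGCDisabled(f func()) {
	prev := debug.SetGCPercent(-1) // returns the previous GOGC setting
	defer debug.SetGCPercent(prev)
	f()
}

// gcPercent reads the current GOGC setting. There is no plain getter,
// so this sets the value to -1 and immediately puts the old value back.
func gcPercent() int {
	p := debug.SetGCPercent(-1)
	debug.SetGCPercent(p)
	return p
}

func main() {
	fmt.Println("GOGC before:", gcPercent())
	withGCDisabled(func() {
		// Code that must not race with a GC cycle would go here.
		fmt.Println("GOGC inside:", gcPercent())
	})
	fmt.Println("GOGC after:", gcPercent())
}
```

Restoring the old value in a `defer` keeps the setting from leaking into later tests in the same process, which is one way a test can end up depending on GC being off by accident.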
CC @prattmic who I discussed this with in the past.
Change https://golang.org/cl/361896 mentions this issue:
Change https://golang.org/cl/369751 mentions this issue:
SetGCPercent(-1) is called by several tests in debug_test.go (followed by a call to runtime.GC) due to #49370. However, startDebugCallWorker already has this, just without the runtime.GC call (allowing an in-progress GC to still mess up the test).

This CL consolidates SetGCPercent into startDebugCallWorker where applicable.

Change-Id: Ifa12d6a911f1506e252d3ddf03004cf2ab3f4ee4
Reviewed-on: https://go-review.googlesource.com/c/go/+/369751
Trust: Michael Knyszek <[email protected]>
Run-TryBot: Michael Knyszek <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: David Chase <[email protected]>
Change https://golang.org/cl/369815 mentions this issue:
In investigating the root causes for #49680, #49695, and #45867, I discovered that the logic behind the fix here was erroneous. In fact, I suspect that the cause of these failures was that just one of the tests was missing a call to
Fixes for #49680, #49695, #45867, and #49370 all assumed that SetGCPercent(-1) doesn't block until the GC's mark phase is done, but it actually does.

The cause of 3 of those 4 failures comes from the fact that at the beginning of the sweep phase, the GC does try to preempt every P once, and this may run concurrently with test code. In the fourth case, the issue was likely that only *one* of the debug_test.go tests was missing a call to SetGCPercent(-1).

Just to be safe, leave a TODO there for now to remove the extraneous runtime.GC calls, but leave the calls in.

Updates #49680, #49695, #45867, and #49370.

Change-Id: Ibf4e64addfba18312526968bcf40f1f5d54eb3f1
Reviewed-on: https://go-review.googlesource.com/c/go/+/369815
Reviewed-by: Austin Clements <[email protected]>
Trust: Michael Knyszek <[email protected]>
Run-TryBot: Michael Knyszek <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
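The behavior described in this commit message can be observed from user code with a small experiment. The sketch below is illustrative only: `completedGCs` and `cyclesAfterDisable` are hypothetical names, and it assumes no memory limit (such as GOMEMLIMIT) is set and nothing else in the process calls runtime.GC. It churns the heap so that a GC cycle is likely in flight, calls SetGCPercent(-1), and then counts how many cycles complete while allocating afterwards.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

// sink keeps allocations reachable so the compiler cannot elide them.
var sink [][]byte

// completedGCs reports the number of GC cycles completed so far.
func completedGCs() uint32 {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	return ms.NumGC
}

// cyclesAfterDisable is a hypothetical helper for this experiment.
// It returns how many GC cycles completed after SetGCPercent(-1)
// returned, while allocMiB MiB of live data were being allocated.
func cyclesAfterDisable(allocMiB int) uint32 {
	// Churn the heap first so a GC cycle is likely to be in progress
	// at the moment GC is disabled.
	for i := 0; i < 256; i++ {
		sink = append(sink[:0], make([]byte, 1<<20))
	}

	prev := debug.SetGCPercent(-1)
	defer debug.SetGCPercent(prev)

	start := completedGCs()
	for i := 0; i < allocMiB; i++ {
		sink = append(sink, make([]byte, 1<<20))
	}
	end := completedGCs()

	sink = nil // release the memory again
	return end - start
}

func main() {
	fmt.Println("GC cycles completed after disabling:", cyclesAfterDisable(64))
}
```

This only counts completed mark phases. It says nothing about the sweep-phase preemption mentioned above, which can still run concurrently with test code after SetGCPercent(-1) returns.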
There's been a slew of "segmentation fault (core dumped)" failures (without additional context) on various linux/amd64 builders as of 961aab2 (AFAICT).
2021-11-04T21:52:51-1e0c3b2/linux-amd64
2021-11-04T21:52:51-1e0c3b2/linux-amd64-bullseye
2021-11-04T21:52:36-8ad0a7e/linux-amd64
2021-11-04T21:52:36-8ad0a7e/linux-amd64-fedora
2021-11-04T21:52:36-8ad0a7e/linux-amd64-staticlockranking
2021-11-04T21:52:06-37634ee/linux-amd64-fedora
2021-11-04T21:33:23-71fc881/linux-amd64-unified
2021-11-04T20:01:10-9b2dd1f/linux-amd64-sid
2021-11-04T20:00:54-961aab2/linux-amd64-unified
Notably, the failure always seems to pop up in the same place in the runtime tests (around 5.2 to 5.5 seconds), probably because it's the same single test failing. I've been unable to reproduce it on my VM, or on the gomotes, despite many all.bash runs.
My guess is that this is another failure related to a bad test that previously never had a GC run during the test, but now, because of 961aab2 and the smaller minimum heap size, does (I have fixed 4 failures like this so far; it's not proof, but it does appear to be part of the pattern).