runtime: macOS-only segfault on 1.14+ with "split stack overflow" #39079
It'll be hard to debug this if it's essentially linking in a big blob of non-Go code. But maybe looking at the stack traces can give some hint.
Looking at those failures, the first thing that jumps out at me is that the
Oh, some other aspects I should mention is that

So far the best I can surmise is that this is crashing when GC is happening. The call to

I'm currently trying to capture a crash in lldb, but it's not being too successful. I've tried capturing a few core dumps but they're not proving too useful. The core dumps show all threads blocked in

You mention there's print-debugging that can be added; do you mean in the runtime? If so, are there logs/etc. I could try to add to help debug this?
That makes sense to me. I imagine that, given that the file you have above doesn't do much in the Go world, the goroutine calling into cgo has no reason to be preempted, which is where the failures are popping up (
I think I take that back. I was going to suggest putting

It still may be worth putting in

Does the JIT'd code make use of TLS at all (or the Rust code, even)? The value being clobbered is accessible via an offset from a pointer in a TLS slot. Given that the offset is the same for Go 1.13, I'd be surprised if this is the case, but it is an idea.

Given that this involves the scheduler, one other thing to try is
RE: that last idea about TLS, @cherrymui just informed me that we use a fixed offset for the TLS on Darwin, so that's unlikely to have changed either.
Also the

This is assuming that my hypothesis is true that the value in
To make sure I understand, do you mean adding
The JIT'd code doesn't use TLS, but the Rust code does. We use the equivalent of
Interesting! Setting that environment variable, though, I still see a crash.
Ah, I meant the runtime. I'm currently running it in a loop, so I don't mind doing the instrumentation. I'll see if I find anything by printing info.
Yeah OK, then it's probably not that. You mentioned earlier that the JIT'd code needs to run to reproduce the issue, right?
Got it. Well, it doesn't rule out that the changes around it are involved (directly or indirectly), but it definitely means that asynchronous preemption itself isn't the issue here.
A small amount of JIT code is executed inside of

Also, to confirm, are you able to reproduce the crash on your end? I still feel bad about not being able to minimize a giant wad of foreign code, but if you can't even reproduce, that's even worse!
I can indeed reproduce! Sometimes it fails with a plain-old segfault, sometimes with an illegal instruction failure, and sometimes with one of the "stack split" failures you mentioned in the original post.
Just FYI, this seems to be happening on 1.14.4 as well. At least in the wasmtime-go repository; I haven't tested this specific repro.
This is almost certainly some kind of memory corruption, perhaps due to a race condition. I haven't seen anybody else reporting similar errors. Given the information we have so far, I would not guess that this is a bug in Go. |
This was spelled wrong. Try |
Oh thanks for the tip @aclements! Using that env var (

```go
package main

// #include <assert.h>
// #include <signal.h>
// #include <sys/mman.h>
//
// #define ALT_STACK_SIZE (16 * 4096)
//
// void my_run() {
//   void *stack = mmap(NULL, ALT_STACK_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
//   assert(stack != MAP_FAILED);
//
//   stack_t new_stack;
//   new_stack.ss_sp = stack;
//   new_stack.ss_flags = 0;
//   new_stack.ss_size = ALT_STACK_SIZE;
//   int r = sigaltstack(&new_stack, NULL);
//   assert(r == 0);
// }
import "C"

import "runtime"

func main() {
	C.my_run()
	runtime.GC()
}
```

This program will segfault like the original one when run as:

```shell
$ go build -o binary
$ while true; do ./binary || break; done
```

This one can take quite some time to reproduce, so I typically run that in a few windows to get it to fail quicker. I've also seen that mixing in code like that seen in preempt.go can cause it to reproduce more reliably. It does seem related to async scheduling in any case. Some example crash logs I've seen are here.

So at least on macOS this seems related to switching the sigaltstack. We do this in wasmtime currently because we execute our SIGSEGV handler on the sigaltstack, but we require the stack to be big enough to run the handler. Currently that stack size is set to 64kb, and it looks like the default Go one is 8kb, so the check to see if a previous sigaltstack is big enough fails and we allocate a new one. Is it expected that allocating a bigger sigaltstack causes issues here? Is this something that foreign code isn't allowed to do?
@alexcrichton Oops, very sorry about the wrong environment variable there. Glad that it narrowed down the issue, though (and thanks for narrowing down the reproducer even further!).
/cc @ianlancetaylor @bcmills, our signal experts. Thanks for narrowing it down to such a small repro! Go allocates a 32 KiB signal stack, which of course suggests that a 64 KiB signal stack should be fine. I could see there potentially being a problem if the
@mknyszek oh, no worries! I'm just glad I was able to reduce it to a much more bite-sized chunk. @aclements, after running a few million times, I wasn't able to reproduce without the call to
What version of macOS was this observed on?
Oh oops, sorry, I should have mentioned that in the original report as well. I'm on version 10.15.5 (19F101).
That said, this example in #39079 (comment) does not look safe in general to me. If there is already an alternate signal stack in place, then something somewhere (in this case, presumably the Go runtime) allocated that signal stack, and may expect to use it. So I would expect that a program that does this sort of thing would need something like:

```go
runtime.LockOSThread()
s := C.replaceAltStack()
defer func() {
	C.restoreAltStack(s)
	runtime.UnlockOSThread()
}()
…
```

So, another question: does the bug still reproduce if you execute
As predicted, adding

I also understand how modifying the sigaltstack in general is not exactly the safest thing to do, but it seems like "hygienic" usage of sigaltstack would be to grow it if necessary and otherwise not modify it. That's what wasmtime is trying to do (grow it to fit its needs), and it's expected that if anything else configured it, then it would take care of deallocating it on its own or would otherwise

Basically, at this point I don't understand why this would be invalid. The specific case wasmtime needs is to grow the sigaltstack to meet its needs. Is this incompatible with the Go runtime? If so, why? Comments like this seem to indicate that it's intended to be handled?
The runtime is supposed to do the right thing if a C thread changes the alternate signal stack. As @alexcrichton says,

Clearly it isn't working, though. The stack trace suggests that at least sometimes the stack guard is not being updated to correctly match the alternate signal stack installed by the C code. I didn't audit all the code, but it looks like the failure is in the first function called that checks the stack guard. Does it look like the code in
Change https://golang.org/cl/238020 mentions this issue:
@alexcrichton does CL https://golang.org/cl/238020 help? Thanks.
@cherrymui thanks for the cc! Locally that appears to fix the issue for me, yes. Using that toolchain against the entire https://github.com/bytecodealliance/wasmtime-go repository's tests (which originally segfaulted, and currently segfault on 1.14.4 periodically), the issue also appears fixed. I no longer get segfaults there either.
@alexcrichton thanks for confirming!
FYI, this CL failed on mips64le with

https://build.golang.org/log/cd5213f231841de332cad8fced10d6cbf6ca5ac9
@mengzhuo It doesn't seem related to that CL. It could be a flake. It seems a separate issue to me.
Any chance this gets backported to 1.14? It's a crash. @cherrymui
@gopherbot please consider this for backport in 1.14. The issue can cause runtime crashes on macOS, and there doesn't seem to be any reasonable workaround.
Backport issue(s) opened: #41991 (for 1.14). Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.
Change https://golang.org/cl/262557 mentions this issue:
…Stack

When a signal is received, the runtime probes whether an alternate signal stack is set; if so, it adjusts gsignal's stack to point to the alternate signal stack. This is done in adjustSignalStack, which calls the sigaltstack "syscall", which is a libc call on darwin through asmcgocall. asmcgocall decides whether to do a stack switch based on whether we're running on the g0 stack, the gsignal stack, or a regular g stack. If g is not set to gsignal, asmcgocall may make the wrong decision. Set g first.

adjustSignalStack is recursively nosplit, so it is okay that gsignal.stack temporarily doesn't match the stack we're running on.

Updates #39079.
Fixes #41991.

Change-Id: I59b2c5dc08c3c951f1098fff038bf2e06d7ca055
Reviewed-on: https://go-review.googlesource.com/c/go/+/238020
Run-TryBot: Cherry Zhang <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
(cherry picked from commit d286e61)
Reviewed-on: https://go-review.googlesource.com/c/go/+/262557
Trust: Cherry Zhang <[email protected]>
TryBot-Result: Go Bot <[email protected]>
Reviewed-by: Austin Clements <[email protected]>
The fix for golang/go#39079 has been backported to 1.14 as of version 1.14.11 so the macOS warning can be clarified and the build matrix can be updated.
Not sure if anyone else is experiencing this, but we are still seeing a Go 1.14+-only, macOS-only deadlock and high CPU usage caused by
@albertvaka Since this was about a segfault, which is fixed, please open a new issue about the deadlock you're seeing, so it can be tracked separately. In the description please also include a reference to this issue. Thanks.
What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

What did you do?

I created a local main.go like so:

Next I downloaded the latest wasmtime release and extracted it locally:

```shell
$ curl -L https://github.com/bytecodealliance/wasmtime/releases/download/dev/wasmtime-dev-x86_64-macos-c-api.tar.xz | tar xJf - --strip-components=1
```

Next I compiled the local module:

Finally I ran the binary in an infinite loop:

What did you expect to see?

No segfault. Or more specifically, for this to basically run infinitely producing no output.

What did you see instead?

Instead I see sporadic crashes. Some I've seen are:
fatal error: runtime: split stack overflow
fatal error: unexpected signal during runtime execution
fatal error: runtime: stack split at bad time
This was originally reported upstream in bytecodealliance/wasmtime-go#10, and we've been trying to narrow it down. With some investigation we found out that Go 1.13 runs this code successfully. We've also got the same code running successfully on other platforms.
I realize, though, that this isn't the best bug report, unfortunately. The native library, wasmtime, is a pretty large project and is a giant wad of compiled Rust code. I've tried replacing it with a trivial C implementation to remove the dependency, but then the crash goes away. It seems that the bug here is related to something that the native binary is doing. I'm pretty certain that the fault does not lie in the native binary (e.g. no segfault or out-of-bounds writes or anything like that), but as with all native code I can't entirely rule it out. I'm opening this because at this point we've at least narrowed it down to a regression between Go versions, and I'm hoping that folks more knowledgeable with the changes could help out.
Is there a way we could help to reduce this further to a bite-sized test case? Or would it be helpful to perhaps bisect the Go release to try to find a revision which caused the segfault to appear here? I'm happy to help out in reducing this further!