runtime: memory corruption crashes with os/exec on Linux kernel 4.4 #20427
I need to retract the paragraph on CGO_ENABLED=0 still resulting in crashes: The crashes observed after deploying the CGO_ENABLED=0 change were all on straggler processes that were still running the old version of the code. I have not yet seen an instance of the pure-Go program crashing on kernel 4.4. My updated understanding of the problem is as follows: Go programs compiled with default settings on Ubuntu 12.04 may experience memory corruption when run on Ubuntu 16.04. ("Default settings" meaning that cgo has not been explicitly disabled.) This could be due to differences in the C libraries present on those systems, so it may or may not be worth hunting down and understanding more fully. Setting CGO_ENABLED=0 may have resolved the issue, and updating the build environment to match the runtime environment might also resolve it. I'll be able to say more confidently next week.
Over the weekend we observed more crashes. All instances are now built with CGO_ENABLED=0, and the executable is bit-for-bit identical whether the build environment is based on Ubuntu 12.04 or 16.04. This rules out any involvement from differing C libraries: our pure-Go program built with go1.8.1 is crashing on Linux kernel version 4.4, and is not crashing on 3.2. It also means I'm running low on ideas for how to diagnose the problem. The race detector has not found anything, cgo is out of the picture, build vs runtime environment differences have been accounted for, and use of "unsafe" is limited to github.com/golang/protobuf/proto/pointer_unsafe.go. There's no use of runtime.SetFinalizer outside of std packages, and the crashes on go1.6.3 mean that either Go 1.8's changes to argument liveness are uninvolved, or there's more than one bug here. I have not yet run the program with GODEBUG=efence=1, which might help determine whether the GC is reclaiming memory prematurely. I'm hesitant to try that setting since I don't understand how it will affect performance, and the failure rate on each single host is relatively low: much less than once per day. (This program runs on a relatively large number of machines.) Since this program isn't particularly latency-sensitive (or even very throughput-sensitive), it looks like it would be safe to try gcrescanstacks=1 and gccheckmark=1. I plan to try those next.
Do you see the crashes on multiple machines? Is there a way that we could try to reproduce the problem ourselves? Since the change seems related to changing kernel versions, perhaps it would help to run the program under a system-call tracer to see how it interacts with the kernel. Does your program set the SysProcAttr field when using os/exec?
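For readers unfamiliar with the field: SysProcAttr is the optional field on exec.Cmd that customizes how the child process is started. A minimal, purely illustrative sketch (the program in this issue does not set it, as the reply below confirms):

package main

import (
	"log"
	"os/exec"
	"syscall"
)

func main() {
	cmd := exec.Command("true")
	// Hypothetical example of setting SysProcAttr: place the child in its
	// own process group. The program discussed in this issue does not set
	// this field at all.
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	if err := cmd.Run(); err != nil {
		log.Fatal(err)
	}
}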
The crashes happen on multiple machines. I don't know how to reproduce the problem myself, so I don't know what to suggest for reproducing. The average failure rate on a single machine is less than once per week, so we'd need to run any such tracing across a large number of machines to have a chance of catching a failure. We don't set SysProcAttr when using os/exec. I have core dumps for most of the crashes, but I'm not sure what to look for. Below is the stack trace for a crash in runtime.runfinq while calling Close on a network connection in a finalizer. The net.netFD struct looks like it's around 14 pointers long; the fault address here is the ASCII string "state", which is the tail end of a 29-byte (4 pointers long) output of path.Join. Not only was the memory reallocated, it was used for a different size class.
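As background on why a Close can run on runtime.runfinq at all: the net package registers finalizers on its file descriptors, so an abandoned connection is eventually closed by the GC's finalizer goroutine. A rough sketch of that general SetFinalizer pattern (an illustration with made-up names, not the net package's actual code):

package main

import (
	"log"
	"net"
	"runtime"
	"time"
)

// wrapper is a hypothetical type standing in for something like net's
// internal netFD: it owns a connection and registers a finalizer so the
// descriptor is eventually closed even if the owner forgets to Close it.
type wrapper struct {
	conn net.Conn
}

func open(addr string) (*wrapper, error) {
	c, err := net.Dial("tcp", addr)
	if err != nil {
		return nil, err
	}
	w := &wrapper{conn: c}
	// The finalizer runs later on the runtime's finalizer goroutine
	// (runtime.runfinq), which is where the crash above was observed.
	runtime.SetFinalizer(w, func(w *wrapper) { w.conn.Close() })
	return w, nil
}

func main() {
	if _, err := open("example.com:80"); err != nil {
		log.Fatal(err)
	}
	runtime.GC()                       // the wrapper is unreferenced, so its finalizer gets queued
	time.Sleep(100 * time.Millisecond) // give the finalizer goroutine a chance to run
}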
Here are two more example crashes:
It looks like the GC no longer scribbles over any part of reclaimed memory with a poison pattern when it is freed.
Three shots in the dark: (1) possibly signal handling? We had a nasty problem related to that; @bcmills knows more. (2) Another possibility: https://go-review.googlesource.com/c/43294/. (3) Are you using the GOEXPERIMENT=preemptibleloops experiment for the tight-loop-vs-GC-latency issue?
Since the crashes occur even with go1.6.3, I doubt that CL is the cause, but I could try patching it in to rule it out.
If you wanted to patch in a simpler/safer fix for this to try to rule it out, you probably want https://go-review.googlesource.com/c/43311/ instead; it fixes this specific problem rather than the general compiler case.
Thank you all for the ideas. I'll upgrade to go1.8.3 soon after it's out—it looks like it's slated to include cherry-picks of both CL 43294 and CL 43311. This program doesn't use the GOEXPERIMENT=preemptibleloops experiment. The program uses crypto/rand.Read, which tries to use getrandom(2) when available (on Linux 3.17+). Should we be suspicious of this feature? I got the following crash from an instance of the program that was running with GODEBUG=gcrescanstacks=1,gccheckmark=1.
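For reference, the randomness use in question is a plain crypto/rand.Read call, along these lines (illustrative only):

package main

import (
	"crypto/rand"
	"fmt"
	"log"
)

func main() {
	// crypto/rand.Read is serviced by getrandom(2) on Linux 3.17+ and
	// falls back to reading /dev/urandom on older kernels.
	buf := make([]byte, 16)
	if _, err := rand.Read(buf); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%x\n", buf)
}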
This program uses Unix domain sockets much more actively than most of our others. Some of the crashes are encountered near the netpoller (like the one below), but I don't know if that's because of memory corruption coming from the netpoller, or if the program is just as likely to crash on any active goroutine and a lot of the active goroutines are working with UDS.
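For a sense of the kind of UDS traffic involved, a hypothetical dial/request/close cycle might look like the sketch below; the socket path and payload are invented, not taken from the real daemon:

package main

import (
	"io/ioutil"
	"log"
	"net"
	"time"
)

// query is a hypothetical stand-in for the daemon's periodic use of a Unix
// domain socket: dial, write a request, read the reply, close.
func query(path string) ([]byte, error) {
	conn, err := net.DialTimeout("unix", path, time.Second)
	if err != nil {
		return nil, err
	}
	defer conn.Close()
	if _, err := conn.Write([]byte("stats\n")); err != nil {
		return nil, err
	}
	return ioutil.ReadAll(conn)
}

func main() {
	out, err := query("/var/run/example.sock")
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("read %d bytes", len(out))
}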
Upgrading to go1.8.3 did not resolve the issue, and we've narrowed it down to being related to os/exec. A coworker reminded me of the FreeBSD os/exec memory corruption bug—#15658—and indeed, the first reproducer there (from May 11, 2016) leads to crashes on a Linux 4.4 system.
Here are the beginnings of the three crashes I've seen with the May 11, 2016 reproducer so far (from spending around 100–250 process-hours of execution):
I tried the March 8, 2017 reproducer with seven variants, but none crashed within a few seconds. I haven't seen variant (1) crash after 60 process-hours of execution. My very slow and not terribly reliable reproducer (version c7141caadf in a private repo) is below:
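(The c7141caadf code itself isn't included in this excerpt. A rough, hypothetical stand-in with the shape described later in the thread, 10 worker goroutines each executing /bin/true 3000 times, could look like this:)

package main

import (
	"log"
	"os/exec"
	"sync"
)

func main() {
	const workers = 10          // hypothetical: matches the "10 workers" figure quoted below
	const execsPerWorker = 3000 // hypothetical: matches "executing /bin/true 3000 times"

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < execsPerWorker; j++ {
				if err := exec.Command("/bin/true").Run(); err != nil {
					log.Fatal(err)
				}
			}
		}()
	}
	wg.Wait()
}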
It crashed with the following error while running under stress. The "stress" command here is golang.org/x/tools/cmd/stress.
I don't see an obvious smoking gun in any of it.
@rhysh How hard would it be for you to test https://golang.org/cl/43713 to see if it changes anything for you?
Thanks @ianlancetaylor, I've cherry-picked CL 43713 PS 2 onto go1.8.3, and I'll try both my reproducer (version c7141caadf) and the 2016-05-11 FreeBSD reproducer with it. Overnight with go1.8.3, I saw 4 failures out of 478 attempts on my reproducer (with 10 workers each executing /bin/true 3000 times), and 3 failures out of 182 attempts at running the 2016-05-11 reproducer (aborted and restarted after an hour).
I cherry-picked CL 43713 PS 2 onto go1.8.3 and deleted ./VERSION, and ran the reproducers with the resulting toolchain over the weekend. The 2016-05-11 FreeBSD reproducer encountered 21 failures out of 690 attempts (timing out and retrying after 1 hour each), and my reproducer encountered only one failure out of 6870 attempts. My reproducer (version c7141caadf) got this error:
which I hadn't seen before. It was running on Ubuntu's 4.4 kernel.
I don't see how the Wait call would return this error without Start returning a more descriptive error. Below are some of the errors from the weekend's run of the 2016-05-11 FreeBSD reproducer. It was also running on Ubuntu's 4.4 kernel.
The kernel bug that CL 43713 is working around (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1672819) looks, based on comments 16 and 17 there and the filenames in 16's links, like it was fixed in a subsequent Ubuntu kernel update.
CL 43713 PS 2 seems to serialize some of the calls involved in spawning child processes, but it doesn't fix the problem, and I don't understand enough about its approach to see how to modify it to this issue's needs. Does this seem like it's in fact a kernel bug, and does it seem like one that the Go runtime could work around? What can I try next to either fix or work around the problem?
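One way to test the serialization idea from the application side, without patching the runtime, would be a guard like the sketch below. This is only an illustration of the experiment, not what CL 43713 does:

package main

import (
	"log"
	"os/exec"
	"sync"
)

// execMu serializes all Start calls so at most one fork/exec is in flight
// at a time. This is only a diagnostic experiment: if the crashes stop with
// this guard in place, the corruption is probably tied to concurrent
// fork/exec.
var execMu sync.Mutex

func runSerialized(name string, args ...string) error {
	cmd := exec.Command(name, args...)
	execMu.Lock()
	err := cmd.Start()
	execMu.Unlock()
	if err != nil {
		return err
	}
	return cmd.Wait()
}

func main() {
	for i := 0; i < 1000; i++ {
		if err := runSerialized("/bin/true"); err != nil {
			log.Fatal(err)
		}
	}
}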
Those are good questions but I don't have any good answers. It certainly looks like some sort of memory corruption is occurring, but I have no idea where it is coming from. Since I haven't seen any reports of anything similar, I suppose the first place to look is your program itself. It's at least conceivable that the change in kernel version is changing the memory allocation pattern in such a way that errors that were previously hidden are now exposed. But frankly it doesn't seem especially likely to me. It's just that everything else I can think of seems even less likely.
I don't know what to make of it either, but the reproducer is pretty small. I've looked for sources of corruption in my original program; the usual suspects of cgo, unsafe, and data races turned up nothing. It's now built with CGO_ENABLED=0, unsafe is only used in the stdlib and protobuf, and the program has run for hours on a real workload with the race detector finding nothing. The crashiest reproducer I've found is a 38-line program from the FreeBSD bug (#15658), written 2016-05-11 (included below).
@rhysh, I'm going to attempt to reproduce this in qemu using an Ubuntu 16.04.2 image. Apologies if you've already answered these questions and I missed them.
Thanks @aclements
During the last week, I often ran multiple copies of the reproducer at a time.
Thanks. I'm now running this on an 8-core KVM-enabled qemu with 4.4.0-78-generic. If that turns up something, I'll try running it under software-only mode. And if that turns up something, my plan is to tweak qemu to record a memory trace and see if I can make something of that.
I've been running this both on KVM/qemu and on a real 48-core machine running the 4.4.0-78-generic Ubuntu kernel for the past week with no failures. I'm not really sure where to go from here. If you can capture some core dumps from the 2016-05-11 reproducer, I might be able to find a pattern in them. Alternatively, I have a Pin-based memory tracer I could dust off and we could try running under that, but it pretty seriously perturbs the execution.
That does look similar. I have some Linux boxes; I'll see if my repro case triggers on them.
OK, the simple for-loop test running on Ubuntu's 4.4.0-79-generic kernel does repro, but it seems rare: once so far in 24 hours on an old L5520 with 24 cores.
@stevenh, thanks. Same question for you: can you capture some core dumps from the failures? @stevenh and @rhysh, is it possible to make the reproducer more deterministic? For example, does it need all of the goroutines or is it enough to run the exec in a loop (possibly from many separate processes)? I assume it does need the non-determinism, but if it were more deterministic, it would be much easier to compare failures (and successes).
I have some traces with GOTRACEBACK=crash but no core dumps. We usually use FreeBSD and didn't realise Linux had the dubious default of not creating cores, so I'm still waiting for it to crash again; I'll post as soon as I get them. I'm not sure if it's possible to crash with just one goroutine; it likely is, but it could well take much longer to reproduce, and it already takes quite some time. I've kicked off some tests under stress and will report back if I get anything.
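If setting GOTRACEBACK=crash in the environment is awkward, the reproducer can request the same behavior itself via runtime/debug, as in this sketch (core dumps still need to be enabled, e.g. with ulimit -c unlimited):

package main

import (
	"log"
	"os/exec"
	"runtime/debug"
)

func main() {
	// Equivalent to running with GOTRACEBACK=crash: on a fatal runtime
	// error, dump all goroutines and abort, so the kernel can write a
	// core file (assuming core dumps are enabled).
	debug.SetTraceback("crash")

	for {
		if err := exec.Command("true").Run(); err != nil {
			log.Fatal(err)
		}
	}
}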
Yeah. :( It seems to be an increasingly popular bandwagon to send core dumps to /dev/null even if the user asked for them (and it requires root privileges to change that, ugh). Thanks for testing. @rhysh or @stevenh, while you're testing, another thing to try is to remove the
OK, I got a crash under stress with only one goroutine (main). Unfortunately I don't have the crash output: the log file used by the shell script that stress was running was set up for a single instance, so it got overwritten. :( While I wait for it to crash again, here's the info.
package main
import (
"log"
"os/exec"
)
// run executes the external command "true" once: start it, then wait
// for it to exit. The Wait error is ignored.
func run() {
cmd := exec.Command("true")
if err := cmd.Start(); err != nil {
log.Fatal(err)
}
cmd.Wait()
}
func main() {
for {
run()
}
}

This is with Go 1.8.3. I'll wait for another concrete positive to capture the crash log before doing the next step.
Here are the results of the single-goroutine run: two crashes so far.
Just for reference, GCC 6.4.0 is now stable in Gentoo and some other infra I've just updated is hitting this too, even though I built its kernel with CONFIG_OPTIMIZE_INLINING=y. Neither value of that option is a guarantee that everything will be inlined and no stack probe will be emitted.
@marcan, just to clarify, you're hitting this with binaries built with 1.9.x? (Do you know if it's still a problem with binaries compiled from tip?)
@aclements correct, Go 1.9.x. Cherry-picking a158382 (backported, which just means dropping the x86 half, which is new, AIUI) fixes the issue, which is what I'm doing with my Go packages now. I guess the question is whether it will be backported into the 1.9 branch officially or whether this will only be fixed with the next major release. @fweimer I'm aware of that, but ultimately the issue here is that Go is (was) calling code that makes no promises about stack usage on a small stack.
Marking as a candidate for 1.9.x. I'm not sure whether we want to back-port this or not, but it's worth considering. My understanding (and please correct me if I'm wrong) is that it requires a pretty uncommon configuration to trigger. OTOH, if it is triggered, the effects are quite bad, and telling people to reconfigure their kernel doesn't seem like a viable work-around to me.
@aclements You can't really fix it by reconfiguring the kernel (not reliably, as I just discovered today). Really the only way to avoid the issue would be to patch the kernel Makefile to use -fno-stack-check or to switch to a non-hardened toolchain altogether. I'm not sure if any distros other than Gentoo Hardened use -fstack-check globally, but yeah, in this instance at least, there's no sensible workaround. Personally, I'm looking forward to this trickling downstream because I've got one package (Grafana) which is a binary package on Gentoo and has a Go server component. That will start crashing as soon as I update GCC/the kernel on my hosts that use it, and trying to get a self-compiled build in for that will be a lot more annoying than the source patch for source packages.
@fweimer Are you saying that GCC's -fstack-check inserts stack probes even in functions that don't need them?
@ianlancetaylor I do not fully understand the heuristics for -fstack-check. It probes beyond what is actually needed by the leaf function. However, such probes are not inserted proactively if the function does not have a large frame in the first place, so it's inconsistent.
CL 88495 OK for Go 1.9.3.
Change https://golang.org/cl/88495 mentions this issue: |
… stack

NOTE: This elides changes to src/runtime/sys_linux_386.s since that requires another change (golang.org/cl/69390) which we don’t want to backport.

If the Linux kernel was built with CONFIG_OPTIMIZE_INLINING=n and was built with hardening options turned on, GCC will insert a stack probe in the VDSO function that requires a full page of stack space. The stack probe can corrupt memory if another thread is using it. Avoid sporadic crashes by calling the VDSO on the g0 or gsignal stack. While we're at it, align the stack as C code expects. We've been getting away with a misaligned stack, but it's possible that the VDSO code will change in the future to break that assumption.

Benchmarks show a 11% hit on time.Now, but it's only 6ns.

name old time/op new time/op delta
AfterFunc-12 1.66ms ± 0% 1.66ms ± 1% ~ (p=0.905 n=9+10)
After-12 1.90ms ± 6% 1.86ms ± 0% -2.05% (p=0.012 n=10+8)
Stop-12 113µs ± 3% 115µs ± 2% +1.60% (p=0.017 n=9+10)
SimultaneousAfterFunc-12 145µs ± 1% 144µs ± 0% -0.68% (p=0.002 n=10+8)
StartStop-12 39.5µs ± 3% 40.4µs ± 5% +2.19% (p=0.023 n=10+10)
Reset-12 10.2µs ± 0% 10.4µs ± 0% +2.45% (p=0.000 n=10+9)
Sleep-12 190µs ± 1% 190µs ± 1% ~ (p=0.971 n=10+10)
Ticker-12 4.68ms ± 2% 4.64ms ± 2% -0.83% (p=0.043 n=9+10)
Now-12 48.4ns ±11% 54.0ns ±11% +11.42% (p=0.017 n=10+10)
NowUnixNano-12 48.5ns ±13% 56.9ns ± 8% +17.30% (p=0.000 n=10+10)
Format-12 489ns ±11% 504ns ± 6% ~ (p=0.289 n=10+10)
FormatNow-12 436ns ±23% 480ns ±13% +10.25% (p=0.026 n=9+10)
MarshalJSON-12 656ns ±14% 587ns ±24% ~ (p=0.063 n=10+10)
MarshalText-12 647ns ± 7% 638ns ± 9% ~ (p=0.516 n=10+10)
Parse-12 348ns ± 8% 328ns ± 9% -5.66% (p=0.030 n=10+10)
ParseDuration-12 136ns ± 9% 140ns ±11% ~ (p=0.425 n=10+10)
Hour-12 14.8ns ± 6% 15.6ns ±11% ~ (p=0.085 n=10+10)
Second-12 14.0ns ± 6% 14.3ns ±12% ~ (p=0.443 n=10+10)
Year-12 32.4ns ±11% 33.4ns ± 6% ~ (p=0.492 n=10+10)
Day-12 41.5ns ± 9% 42.3ns ±12% ~ (p=0.239 n=10+10)

Fixes #20427
Change-Id: Ia395cbb863215f4499b8e7ef95f4b99f51090911
Reviewed-on: https://go-review.googlesource.com/76990
Reviewed-by: Austin Clements <[email protected]>
Reviewed-on: https://go-review.googlesource.com/88495
Run-TryBot: Andrew Bonventre <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
go1.9.3 has been packaged and includes CL 88495, the backport of this fix.
The release is posted at golang.org/dl. — golang.org/x/build/cmd/releasebot, Jan 22 21:02:53 UTC
The kernel's main Makefile now sets -fno-stack-check.
Yes. Nonetheless, there are still no guarantees on vDSO stack usage, so Go is still doing the right thing by switching stacks before calling into it.
What version of Go are you using (go version)?

go version go1.8.1 linux/amd64

Some crashes were also observed with go version go1.6.3 linux/amd64, prompting an upgrade to go1.8.1.

What operating system and processor architecture are you using (go env)?

The affected servers are linux/amd64. We've seen the crashes when running Linux kernel version 4.4, and have not seen any crashes with Linux kernel version 3.2.

The ./go command here was cross-compiled from darwin/amd64 and dropped onto an affected server:

What did you do?
I have a monitoring daemon that periodically opens/uses/closes some unix domain sockets, reads files from virtual file systems e.g. under /sys/, uses os/exec to run short-lived child processes, sends UDP packets, makes HTTP requests, and provides some debug endpoints on an HTTP server.
It does not use its own cgo or unsafe (aside from the internals of github.com/golang/protobuf/proto), and the race detector has not detected any data races when run with a production workload. It is not specifically compiled with CGO_ENABLED=0 or with the netgo build tag, so the executables end up dynamically linked, probably due to the net package's use of cgo.
As a test of cgo's involvement we've built the program with CGO_ENABLED=0, resulting in a statically-linked executable. This version of the program still crashes occasionally on Linux kernel version 4.4.
The program is closed-source.
What did you expect to see?
Given the program's track record of not crashing on Linux kernel version 3.2, I expected that running an identical binary on Linux kernel version 4.4 would also not crash.
What did you see instead?
The program crashes with a variety of "fatal error" messages and an occasional "panic", each of which seems to point to memory corruption. Some examples follow, here's what I think they indicate:
One of the "fatal error" crashes complains about the GC finding a pointer to an unallocated span. Others complain about unexpected fault addresses of small-magnitude numbers, like 0x18 or 0xffffffffffffffff. Either something is corrupting the memory allocator's data structures—such as data races that are invisible to the race detector (e.g., within the runtime package)—or something is retaining memory in a way that's invisible to the GC, allowing that memory to be reallocated with a different pointer map, which would allow small numbers to be written to words that are expected to be pointers.
I don't think it's related to the Go 1.8 changes to argument liveness / runtime.KeepAlive, since there have been some crashes with go1.6.3.
I can't say it's related to crypto/rand's use of getrandom(2), since one of the crashes below shows a goroutine in the code that reads from /dev/urandom.
All of the partial crash logs shown below are from go1.8.1 on Linux kernel version 4.4. We have a couple examples of crashes with go1.6.3 on 4.4. We have not seen this code crash on Linux kernel version 3.2 with either go1.6.3 or go1.8.1.
/cc @aclements
Crash log excerpts follow: