runtime: running cmd.exec within goroutine sometimes leaves process with 100% CPU #53863
Just to be totally clear, does your reproducer still work if you get rid of any of these things? (Not init, not in a new goroutine at init, not exec.Command?) Can you get a core dump of the hanging processes? If they're actually hung, it should be possible to attach gdb (or not on macOS? not sure) or trigger a core dump somehow (I'm not sure how to do this on macOS). I think it's interesting that This reminds me of some subtleties around signal masking around
This is the smallest repro I could find (i.e. I haven't managed to reproduce by deleting any of these things, but it's super nondeterministic, so I am not 100% confident that this is actually the minimal repro). Haven't been able to get a core dump on lldb/gdb yet; @allanbreyes has been investigating as well.
Attaching to one of the hanging processes with
I'm not sure what module the We'll keep researching, but any advice on how to debug further would be welcomed! Thank you.
All of this sounds a lot like an issue in the kernel where zombie processes are getting stuck attempting to clean up / exit. However, it is not at all clear what is special about this Go program that would trigger this. My best guess would be some kind of race between the parent process exiting and the fork+exec of the child. Does this reproduce if you move the goroutine creation into main?
I couldn't get it to repro last night but just tried another 10 times and it does!
is the new minimal reproduction, so this doesn't have to do with
Interestingly, I haven't been able to reproduce this when running the program many times in a row serially. This doesn't mean it doesn't reproduce that way, but I haven't seen an example yet. This might indicate that there is some sort of race condition in the kernel that fails to clean up these zombie processes properly when it has to handle many at once. I.e. running this can sometimes reproduce:
But
never seems to (so far).
Ah, great. There is little difference between the last init function and the start of main, so it is good to know that it isn't related. A few more things to try:
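package main

import (
	"math/rand"
	"os/exec"
	"time"
)

func main() {
	c := make(chan struct{}, 1)
	go func() {
		// Run a short-lived child process from a goroutine.
		exec.Command("echo").Run()
		c <- struct{}{}
	}()
	// Random jitter of under 20 ms before waiting for the goroutine to finish.
	time.Sleep(time.Duration(rand.Intn(20)) * time.Millisecond)
	<-c
}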
We'll try both of those things when we get a chance. In the meantime, I've narrowed this further:
replicates. I'm feeling like it's unlikely that this is a Go bug, but as you worked around a macOS bug in #41702, it might be worth trying to work around this one as well once we get to the root of it.
This does indeed solve the problem, as expected.
So far, I haven't been able to replicate this in C, but will keep poking when I have time.
Confirmed reproduction in C:
So, this is clearly not a Go bug. Feel free to close this, but if you have ideas w.r.t. a workaround as above, I'm happy to try to help. Any idea where people have had success reporting macOS kernel bugs in the past? Doesn't seem like the other thread got much traction.
We do have a lock taken around exec and thread creation to protect against a Linux kernel bug (#19546). As a workaround for this, we could perhaps take this lock prior to exit. You could give this a try by calling |
What version of macOS are you running on? I haven't been able to reproduce it locally. For the C reproducer, how many iterations does it usually take to have a hanging process? Thanks.
12.4, with an M1 Max chip.
Anecdotally it depends on CPU load, but I can consistently get a Go repro in under a minute by running the scripts from here.
C repros are less predictable, but that's probably because I didn't include the random jitter; I'll see if adding the randomness helps.
takes at most a couple of minutes to leave a process at 100%.
Thanks @ostrowr! I can also reproduce on an Intel Mac with macOS 12.4, both the Go and the C version. I'll forward it to Apple.
Great, thanks! For posterity, I submitted this to Apple on the 14th; feedback ID FB10691471 if you want to reference it. Haven't taken a look at a workaround in Go since it'll likely take me a while to get a dev environment set up, but I'll try to make some time this weekend.
I couldn't reproduce now with macOS 12.6.4 and 13.1. Maybe it is fixed now?
I haven't been able to reproduce on 13.1 either, but only tried for about 10 runs of ./test_race_go. Seems like possibly fixed! Never did get a response on my Apple feedback.
Thanks. Let's call it fixed. If it is not we can reopen.
What version of Go are you using (go version)?
Reproducible on at least 1.18.1 and 1.18.4 on Apple Silicon (M1)
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?
What did you do?
See the repro repo here for a few scripts that might help you reproduce the issue.
Build: go build -o repro main.go
Run repro a bunch of times at once, e.g. for i in $(seq 500); do "./repro" &; done
Run ps aux | grep repro and look for processes named repro or (repro) that continue to run and hog 100% CPU.
This is the smallest reproduction I can find, but I haven't been able to diagnose what this process is actually doing. kill -ABRT is ignored; the only way I can get rid of these hanging processes is with a kill -9.
No idea if this has to do with running a goroutine in an init function and exec.Command, or whether this is an exec.Command bug independent of init semantics.
What did you expect to see?
The process to exit and not take up much CPU
What did you see instead?
A small percentage of the time, the process remains, either with its own name or its name in parentheses in the process table, hanging and taking up 100% CPU.