-
Notifications
You must be signed in to change notification settings - Fork 18k
syscall: Go can leak a forked process if main thread exits before spawn finished #33565
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
My guess is that when doing a fork+exec, after the fork, the child wants to "communicate" with the parent; perhaps by reading or writing to a pipe. If the parent is gone the child should exit and do nothing. But it appears that with the "right" timing, the child can get stuck in a loop instead. |
There are some lldb stack traces that might help here: https://gitlab.com/gitlab-org/gitaly/issues/1850#note_205304016 |
Can't reproduce on linux.
I don't believe that is necessarily correct. If a parent exits, the child becomes a zombie and attaches to PID 1. I understand this issue is about cleaning up the child before the spawn has happened completely. @randall77 for macOS. |
I also cannot reproduce on linux.
Yes. In my toy example, I'm trying to spawn I've learned some more about the real life bug that prompted this issue. It seems that there, even if the main goroutine does not exit, we get a stuck forked process. I was able to attach to some of these with lldb, which is what I linked to above. https://gitlab.com/gitlab-org/gitaly/issues/1850#note_205304016 I'm not sure if this is the exact same state as my reproducing example, but it might be related. To save some clicking I'm including stack traces below. mystery fork A, 0% CPU
mystery fork B, 100% CPU
When I lldb the crashing example I don't get nice symbols. Is it possible I messed up a |
I would also give 1.13rc1 a shot, just to be sure that it is not something already fixed. Keith can probably comment more on what exactly is happening. |
The stack traces make it appear that the problem is in the Darwin libc. The forked child is trying to get some response from the parent, but the parent has exited. If that is true, it should be possible to write a similar test case in C. Anybody want to try that? |
It's the sort of challenge I enjoy so I'll take a look. But a person familiar with Go internals would be much faster at it than me. I won't be offended if such a person swoops in and beats me to it. |
OK that was not hard. Just translated my Go example to C. https://gist.github.com/jacobvosmaer/69fae756e88d2f5f4ef5091ceeba1d88 As before, leave it running for a while in a restart loop, and you get a weird forked process left behind. |
The C reproducer highlights how odd this is even more: all it tries to do is |
Thanks! Does anybody want to try to report this to Apple? To fix this in Go I think it would suffice to have |
It really does feel like a macOS bug. With my C reproducer, if I leave it running it leaks way more processes than the Go one. Most of the leaked processes are at 0% CPU instead of 100% CPU.
What is particularly worrying here is that after I remove the leaked processes with I don't feel confident reporting this to Apple for a number of reasons:
If they respond to my C reproducer by saying "your code is wrong" or "you should use |
That's certainly true. But we are still making progress even if Apple proved we were doing wrong, isn't it? |
I tried various methods. Can we reslove?
|
…0 to version 3.0.1295.0 Darrell Kienzle (1): Add syscall.Sync on Darwin to workaround golang/go#33565 Faho Shubladze (2): Added validation checks to InstanceID and Region fields of CustomIdentity Updated release notes Frank Miao (1): Added error handling to a session utility function Ivan Aguilar (1): Separate logs files on windows Samiullah Mohammed (1): Add cross-account domain join Thor Bjorgvinsson (4): Added configurable custom identity and configurable identity consumption order Pass appconfig to functions instead of reading config Adding synchronization to RunCommand service stop Moving darwin proxy handling to proxyconfig package Vishnu Karthik Ravindran (3): correct default hibernation seelog config cleanup agent updater old versions Fix gofmt issue Yuting Fan (1): Remove delay in non-interactive session type
For the C reproducer in #33565 (comment) , I can reproduce on macOS. I can also reproduce it on Linux, with even a much higher rate of leaking children. It seems the child stuck at
If we |
Fix high cpu usage caused by golang/go#33565
This may be similar to #53863 ? (which is a macOS system bug and may be fixed in a future macOS version.) |
I am getting these unkillable processes in the os/exec test during all.bash pretty reliably. Here is a stack trace from one:
Is there anything we can do? |
try refactor code async logic to sync or before syscall wait async code finish. I guess maybe when after syscall forked. they will inherit parent process and parent process is in async state. the child process memory is dirty |
I have a fix for the Go stacks that were reported in this issue. I am not as sure about the C reproducer - it may be that it tickles a different fork+exec bug in Apple libc. |
Change https://go.dev/cl/451735 mentions this issue: |
Change https://go.dev/cl/459175 mentions this issue: |
Change https://go.dev/cl/459176 mentions this issue: |
Revert CL 451735 (1f4394a), which fixed #33565 and #56784 but also introduced #57263. I have a different fix to apply instead. Since the first fix was never backported, it will be easiest to backport the new fix if the new fix is done in a separate CL from the revert. Change-Id: I6c8ea3a46e542ee4702675bbc058e29ccd2723e0 Reviewed-on: https://go-review.googlesource.com/c/go/+/459175 Reviewed-by: Cherry Mui <[email protected]> TryBot-Result: Gopher Robot <[email protected]> Run-TryBot: Russ Cox <[email protected]>
Issues #33565 and #56784 were caused by hangs in the child process after fork, while it ran atfork handlers that ran into slow paths that didn't work in the child. CL 451735 worked around those two issues by calling a couple functions at startup to try to warm up those child paths. That mostly worked, but it broke programs using cgo with certain macOS frameworks (#57263). CL 459175 reverted CL 451735. This CL introduces a different fix: bypass the atfork child handlers entirely. For a general fork call where the child and parent are both meant to keep executing the original program, atfork handlers can be necessary to fix any state that would otherwise be tied to the parent process. But Go only uses fork as preparation for exec, and it takes care to limit what it attempts to do in the child between the fork and exec. In particular it doesn't use any of the things that the macOS atfork handlers are trying to fix up (malloc, xpc, others). So we can use the low-level fork system call (__fork) instead of the atfork-wrapped one. The full list of functions that can be called in a child after fork in exec_libc2.go is: - ptrace - setsid - setpgid - getpid - ioctl - chroot - setgroups - setgid - setuid - chdir - dup2 - fcntl - close - execve - write - exit I disassembled all of these while attached to a hung exec.test binary and confirmed that nearly all of them are making direct kernel calls, not using anything that the atfork handler needs to fix up. The exceptions are ioctl, fcntl, and exit. The ioctl and fcntl implementations do some extra work around the kernel call but don't call any other functions, so they should still be OK. (If not, we could use __ioctl and __fcntl instead, but without a good reason, we should keep using the standard entry points.) The exit implementation calls atexit handlers. That is almost certainly inappropriate in a failed fork child, so this CL changes that call to __exit on darwin. To avoid making unnecessary changes at this point in the release cycle, this CL leaves OpenBSD calling plain exit, even though that is probably a bug in the OpenBSD port (filed #57446). Fixes #33565. Fixes #56784. Fixes #57263. Change-Id: I26812c26a72bdd7fcf72ec41899ba11cf6b9c4ab Reviewed-on: https://go-review.googlesource.com/c/go/+/459176 Reviewed-by: David Chase <[email protected]> Reviewed-by: Cherry Mui <[email protected]> TryBot-Result: Gopher Robot <[email protected]> Run-TryBot: Russ Cox <[email protected]>
Change https://go.dev/cl/459178 mentions this issue: |
Change https://go.dev/cl/459179 mentions this issue: |
Issues #33565 and #56784 were caused by hangs in the child process after fork, while it ran atfork handlers that ran into slow paths that didn't work in the child. CL 451735 worked around those two issues by calling a couple functions at startup to try to warm up those child paths. That mostly worked, but it broke programs using cgo with certain macOS frameworks (#57263). CL 459175 reverted CL 451735. This CL introduces a different fix: bypass the atfork child handlers entirely. For a general fork call where the child and parent are both meant to keep executing the original program, atfork handlers can be necessary to fix any state that would otherwise be tied to the parent process. But Go only uses fork as preparation for exec, and it takes care to limit what it attempts to do in the child between the fork and exec. In particular it doesn't use any of the things that the macOS atfork handlers are trying to fix up (malloc, xpc, others). So we can use the low-level fork system call (__fork) instead of the atfork-wrapped one. The full list of functions that can be called in a child after fork in exec_libc2.go is: - ptrace - setsid - setpgid - getpid - ioctl - chroot - setgroups - setgid - setuid - chdir - dup2 - fcntl - close - execve - write - exit I disassembled all of these while attached to a hung exec.test binary and confirmed that nearly all of them are making direct kernel calls, not using anything that the atfork handler needs to fix up. The exceptions are ioctl, fcntl, and exit. The ioctl and fcntl implementations do some extra work around the kernel call but don't call any other functions, so they should still be OK. (If not, we could use __ioctl and __fcntl instead, but without a good reason, we should keep using the standard entry points.) The exit implementation calls atexit handlers. That is almost certainly inappropriate in a failed fork child, so this CL changes that call to __exit on darwin. To avoid making unnecessary changes at this point in the release cycle, this CL leaves OpenBSD calling plain exit, even though that is probably a bug in the OpenBSD port (filed #57446). Fixes #33565. Fixes #56784. Fixes #57263. Fixes #56837. Change-Id: I26812c26a72bdd7fcf72ec41899ba11cf6b9c4ab Reviewed-on: https://go-review.googlesource.com/c/go/+/459176 Reviewed-by: David Chase <[email protected]> Reviewed-by: Cherry Mui <[email protected]> TryBot-Result: Gopher Robot <[email protected]> Run-TryBot: Russ Cox <[email protected]> Reviewed-on: https://go-review.googlesource.com/c/go/+/459178
Issues #33565 and #56784 were caused by hangs in the child process after fork, while it ran atfork handlers that ran into slow paths that didn't work in the child. CL 451735 worked around those two issues by calling a couple functions at startup to try to warm up those child paths. That mostly worked, but it broke programs using cgo with certain macOS frameworks (#57263). CL 459175 reverted CL 451735. This CL introduces a different fix: bypass the atfork child handlers entirely. For a general fork call where the child and parent are both meant to keep executing the original program, atfork handlers can be necessary to fix any state that would otherwise be tied to the parent process. But Go only uses fork as preparation for exec, and it takes care to limit what it attempts to do in the child between the fork and exec. In particular it doesn't use any of the things that the macOS atfork handlers are trying to fix up (malloc, xpc, others). So we can use the low-level fork system call (__fork) instead of the atfork-wrapped one. The full list of functions that can be called in a child after fork in exec_libc2.go is: - ptrace - setsid - setpgid - getpid - ioctl - chroot - setgroups - setgid - setuid - chdir - dup2 - fcntl - close - execve - write - exit I disassembled all of these while attached to a hung exec.test binary and confirmed that nearly all of them are making direct kernel calls, not using anything that the atfork handler needs to fix up. The exceptions are ioctl, fcntl, and exit. The ioctl and fcntl implementations do some extra work around the kernel call but don't call any other functions, so they should still be OK. (If not, we could use __ioctl and __fcntl instead, but without a good reason, we should keep using the standard entry points.) The exit implementation calls atexit handlers. That is almost certainly inappropriate in a failed fork child, so this CL changes that call to __exit on darwin. To avoid making unnecessary changes at this point in the release cycle, this CL leaves OpenBSD calling plain exit, even though that is probably a bug in the OpenBSD port (filed #57446). Fixes #33565. Fixes #56784. Fixes #57263. Fixes #56836. Change-Id: I26812c26a72bdd7fcf72ec41899ba11cf6b9c4ab Reviewed-on: https://go-review.googlesource.com/c/go/+/459176 Reviewed-by: David Chase <[email protected]> Reviewed-by: Cherry Mui <[email protected]> TryBot-Result: Gopher Robot <[email protected]> Run-TryBot: Russ Cox <[email protected]> Reviewed-on: https://go-review.googlesource.com/c/go/+/459179
Change https://go.dev/cl/460476 mentions this issue: |
CL 451735 worked around bugs in Apple's atfork handlers by calling notify_is_valid_token and xpc_atfork_child at startup, so that init code that wouldn't be safe in the child process would be warmed up in the parent process instead, but xpc_atfork_child broke use of the xpc library in Go programs, and xpc is internally used by various macOS frameworks (#57263). CL 459175 reverted that change, and then CL 459176 tried a new approach: use __fork, which doesn't call any of the atfork handlers at all. That worked, but an Apple engineer reviewing the change in private email suggests that since __fork is not public API, it should be avoided. The same engineer (with access to the source code for the xpc library) suggests that the breakage in #57263 is caused by xpc_atfork_child marking the library as unusable, expecting an imminent call to exec, and that calling xpc_date_create_from_current instead would do the necessary initialization without marking xpc as unusable. CL 460475 reverted that change, to prepare for this one. This CL goes back to the original “call functions to warm things up” approach, replacing xpc_atfork_child with xpc_date_create_from_current. The CL also updates cmd/link to use OS and SDK version 10.13.0 for x86 macOS binaries, up from 10.9.0, also suggested by the Apple engineer. Combined with the two warmup calls, this makes the fork hangs go away. The minimum macOS version has been 10.13 High Sierra since Go 1.17, so there should be no problem with writing that in the binaries too. Fixes #33565. Fixes #56784. Fixes #57263. Fixes #57577. Change-Id: I20769d9daa1fe9ea930f8009481335f8a14dc21b Reviewed-on: https://go-review.googlesource.com/c/go/+/460476 Auto-Submit: Russ Cox <[email protected]> Run-TryBot: Russ Cox <[email protected]> TryBot-Result: Gopher Robot <[email protected]> Reviewed-by: Bryan Mills <[email protected]> Reviewed-by: Cherry Mui <[email protected]>
Please answer these questions before submitting your issue. Thanks!
What did you do?
If I spawn a process in a goroutine, and exit the main thread before the spawn is finished, I am sometimes left with a fork of my original process that does not go away. The fork does not respond to SIGTERM, and sometimes consumes 100% CPU.
https://play.golang.org/p/1HPscvoAiwj
What did you expect to see?
If the main thread exits Go should not leave behind a forked process.
What did you see instead?
A forked process, sometimes using 100% CPU
Does this issue reproduce with the latest release (go1.12.7)?
Yes.
System details
The text was updated successfully, but these errors were encountered: