os/exec: tests hang on macOS due to Apple libc fork bugs #56784
Comments
My Mac Pro (x86) running 13.0.1 has no trouble, with either Go 1.19.3 or tip.
Updating my laptop to 13.0 did not work.
Change https://go.dev/cl/451735 mentions this issue:
13.0.1 didn't help. Same problem on my M1 MacBook Pro with 12.0.1. go.dev/cl/451735 fixes them both.
We have since discovered that what is unique about my laptop compared to others is that I was running my tests from under a program written and linked against CoreFoundation, which put a magic We'll keep the fix, it's just even more mysterious now.
I filed the related issue #33565. I just checked, and I also have this This does not tell us anything new but it corroborates what Russ wrote above.
@gopherbot please backport
Backport issue(s) opened: #56836 (for 1.18), #56837 (for 1.19). Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.
Change https://go.dev/cl/459175 mentions this issue:
Change https://go.dev/cl/459176 mentions this issue:
Revert CL 451735 (1f4394a), which fixed #33565 and #56784 but also introduced #57263. I have a different fix to apply instead.

Since the first fix was never backported, it will be easiest to backport the new fix if the new fix is done in a separate CL from the revert.

Change-Id: I6c8ea3a46e542ee4702675bbc058e29ccd2723e0
Reviewed-on: https://go-review.googlesource.com/c/go/+/459175
Reviewed-by: Cherry Mui
TryBot-Result: Gopher Robot
Run-TryBot: Russ Cox
Issues #33565 and #56784 were caused by hangs in the child process after fork, while it ran atfork handlers that ran into slow paths that didn't work in the child.

CL 451735 worked around those two issues by calling a couple functions at startup to try to warm up those child paths. That mostly worked, but it broke programs using cgo with certain macOS frameworks (#57263).

CL 459175 reverted CL 451735.

This CL introduces a different fix: bypass the atfork child handlers entirely. For a general fork call where the child and parent are both meant to keep executing the original program, atfork handlers can be necessary to fix any state that would otherwise be tied to the parent process. But Go only uses fork as preparation for exec, and it takes care to limit what it attempts to do in the child between the fork and exec. In particular it doesn't use any of the things that the macOS atfork handlers are trying to fix up (malloc, xpc, others). So we can use the low-level fork system call (__fork) instead of the atfork-wrapped one.

The full list of functions that can be called in a child after fork in exec_libc2.go is:

- ptrace
- setsid
- setpgid
- getpid
- ioctl
- chroot
- setgroups
- setgid
- setuid
- chdir
- dup2
- fcntl
- close
- execve
- write
- exit

I disassembled all of these while attached to a hung exec.test binary and confirmed that nearly all of them are making direct kernel calls, not using anything that the atfork handler needs to fix up. The exceptions are ioctl, fcntl, and exit.

The ioctl and fcntl implementations do some extra work around the kernel call but don't call any other functions, so they should still be OK. (If not, we could use __ioctl and __fcntl instead, but without a good reason, we should keep using the standard entry points.)

The exit implementation calls atexit handlers. That is almost certainly inappropriate in a failed fork child, so this CL changes that call to __exit on darwin. To avoid making unnecessary changes at this point in the release cycle, this CL leaves OpenBSD calling plain exit, even though that is probably a bug in the OpenBSD port (filed #57446).

Fixes #33565.
Fixes #56784.
Fixes #57263.

Change-Id: I26812c26a72bdd7fcf72ec41899ba11cf6b9c4ab
Reviewed-on: https://go-review.googlesource.com/c/go/+/459176
Reviewed-by: David Chase
Reviewed-by: Cherry Mui
TryBot-Result: Gopher Robot
Run-TryBot: Russ Cox
Change https://go.dev/cl/459178 mentions this issue:
Change https://go.dev/cl/459179 mentions this issue:
(Commit message identical to that of CL 459176 above, with these additional trailers.)

Fixes #56837.
Reviewed-on: https://go-review.googlesource.com/c/go/+/459178
(Commit message identical to that of CL 459176 above, with these additional trailers.)

Fixes #56836.
Reviewed-on: https://go-review.googlesource.com/c/go/+/459179
Change https://go.dev/cl/460476 mentions this issue:
CL 451735 worked around bugs in Apple's atfork handlers by calling notify_is_valid_token and xpc_atfork_child at startup, so that init code that wouldn't be safe in the child process would be warmed up in the parent process instead, but xpc_atfork_child broke use of the xpc library in Go programs, and xpc is internally used by various macOS frameworks (#57263).

CL 459175 reverted that change, and then CL 459176 tried a new approach: use __fork, which doesn't call any of the atfork handlers at all. That worked, but an Apple engineer reviewing the change in private email suggests that since __fork is not public API, it should be avoided. The same engineer (with access to the source code for the xpc library) suggests that the breakage in #57263 is caused by xpc_atfork_child marking the library as unusable, expecting an imminent call to exec, and that calling xpc_date_create_from_current instead would do the necessary initialization without marking xpc as unusable.

CL 460475 reverted that change, to prepare for this one.

This CL goes back to the original “call functions to warm things up” approach, replacing xpc_atfork_child with xpc_date_create_from_current.

The CL also updates cmd/link to use OS and SDK version 10.13.0 for x86 macOS binaries, up from 10.9.0, also suggested by the Apple engineer. Combined with the two warmup calls, this makes the fork hangs go away. The minimum macOS version has been 10.13 High Sierra since Go 1.17, so there should be no problem with writing that in the binaries too.

Fixes #33565.
Fixes #56784.
Fixes #57263.
Fixes #57577.

Change-Id: I20769d9daa1fe9ea930f8009481335f8a14dc21b
Reviewed-on: https://go-review.googlesource.com/c/go/+/460476
Auto-Submit: Russ Cox
Run-TryBot: Russ Cox
TryBot-Result: Gopher Robot
Reviewed-by: Bryan Mills
Reviewed-by: Cherry Mui
There is source code available for the Objective-C runtime on opensource.apple.com, which redirects to https://github.com/apple-oss-distributions/objc4
On my x86 Mac laptop using macOS 12.6.1, all.bash often hangs in the os/exec test. In particular, this never finishes:
The chance of a hang in any given iteration is something like 50%. It's possible this is related to #33565, but I'm opening a separate bug just in case, and to focus the discussion on the fact that our own os/exec tests don't pass.
If I attach to the hung process in lldb, I was originally seeing backtraces like:
This specific hang seems to match dart-lang/sdk#29539, and inspection of the Apple libc code shows that the problem is a race with an os_alloc_once that is in progress in the parent when the address space is split, making the same call die in the child. I changed the Go runtime to do an early call to notify_is_valid_token(0) in osinit. That call is a no-op except that it guarantees the os_alloc_once has been done already, so it cannot race with any future forks.
With that fix, I get a different hang:
This one seems to match what @jacobvosmaer posted in #33565 (comment).
I can't find the libobjc source code so I'm not sure what a workaround for xpc_atfork_child might be.
It must be that C programs on macOS do not use fork. I looked into posix_spawn but it looks like we don't have any other ports that use that.
We need to figure something out for Go 1.20 though.