-
Notifications
You must be signed in to change notification settings - Fork 18k
runtime: fatal error: notewakeup - double wakeup #19305
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Thanks for the report. The usual questions for runtime bugs: Any cgo? Any unsafe? Clean bill of health from the race detector and msan? cc @aclements |
cgo: yes, but not in this the code path for these tests |
@bdarnell recently has; clean AFAIK. |
Does your program ever call CC @aclements @RLH |
We were msan-clean a month ago. My scripts for msan builds appear to have rotted so I don't have a fresh run. There are no calls to GOMAXPROCS in our code. |
Confirmed that we are msan-clean. |
Trybot failure at tip just now: https://storage.googleapis.com/go-build-log/a6e7d344/darwin-amd64-10_11_41ed53a8.log |
We hit this again in cockroachdb/cockroach#14060 (logs here). Is there anything we could instrument our code with to help get some info if we hit it again? |
Seen again in cockroachdb/cockroach#14407. |
Another cockroachdb/cockroach#14440. |
Another cockroachdb/cockroach#14726. |
This just happened in a trybot run while running heapsampling.go: http://storage.googleapis.com/dev-go-build-log/cbb91820/windows-amd64-gce_3070529c.log |
Sorry for the slow response on this. I've looked at this several times but today I finally came up with a strong theory. @a-robinson or @tamird, is this happening often enough that you'd be able to test and confirm a small fix? If not, that's fine too. The change really does fix a bug and it's zero risk, so we can put it in 1.8.2. I'm just not 100% positive it's the cause and it would be nice to confirm that if possible. |
@aclements unfortunately I don't think so. We've been linking every occurrence to this issue, so if you scroll through the "referenced this issue" list in this thread, you'll see that it last happened 15 days ago, and before that it was 26 days ago, and before that it was 32 days ago. Is the fix already in 1.9? Could you link to the CL/issue/commit? |
Not yet. It's trivial to work around in the runtime, but the bug is actually in the compiler (#20335), so this may be just one instance of it. |
CL https://golang.org/cl/43311 mentions this issue. |
Reopening: CL 43311 is not on release branch. |
CL https://golang.org/cl/43412 mentions this issue. |
Cherry-pick of CL 43311. runtime.gchelper depends on the non-atomic load of work.ndone happening strictly before the atomic add of work.nwait. Until very recently (commit 978af9c, fixing #20334), the compiler reordered these operations. This created a race since work.ndone can change as soon as work.nwait is equal to work.ndone. If that happened, more than one gchelper could attempt to wake up the work.alldone note, causing a "double wakeup" panic. This was fixed in the compiler, but to make this code less subtle, make the load of work.ndone atomic. This clearly forces the order of these operations, ensuring the race doesn't happen. Fixes #19305 (though really 978af9c fixed it). Change-Id: Ieb1a84e1e5044c33ac612c8a5ab6297e7db4c57d Reviewed-on: https://go-review.googlesource.com/43412 Run-TryBot: Austin Clements <[email protected]> TryBot-Result: Gobot Gobot <[email protected]> Reviewed-by: Keith Randall <[email protected]>
What version of Go are you using (
go version
)?What operating system and processor architecture are you using (
go env
)?What did you do?
Unfortunately, we have no idea yet whether this is reproducible and we don't have anything resembling a minimal test case for it. This triggered in one of CockroachDB's nightly test runs, which runs inside a docker image. To re-run the test that hit this, run:
What did you expect to see?
The tests to run successfully for up to 15 minutes.
What did you see instead?
A go runtime failure:
I couldn't find any reference to errors like this since #5139, so I'm opening this as a tracker in case others hit the same issue even though we don't have a great repro case.
The full output of the build log is here, also copied below:
The text was updated successfully, but these errors were encountered: