runtime/metrics: add a way to measure runtime-internal lock contention #57071
I think it would be great if we could get the runtime locks fully wired up to the existing sync.Mutex contention profiles.

Full wiring into …. With an intended end state of the data being part of that profile, I think 1/ the total wait time should be reported as part of …. Then, as we move towards reporting the actual call stacks that led to contended …, I'm inclined to add an …. It looks like I can call …. Maybe I can also capture the call stack onto an array that is part of the ….
Thanks!
Change https://go.dev/cl/528656 mentions this issue:
Change https://go.dev/cl/528657 mentions this issue:
Change https://go.dev/cl/534161 mentions this issue:
This CL adds four new time histogram metrics:

    /sched/pauses/stopping/gc:seconds
    /sched/pauses/stopping/other:seconds
    /sched/pauses/total/gc:seconds
    /sched/pauses/total/other:seconds

The "stopping" metrics measure the time taken to start a stop-the-world pause, i.e., how long it takes stopTheWorldWithSema to stop all Ps. This can be used to detect STW struggling to preempt Ps.

The "total" metrics measure the total duration of a stop-the-world pause, from starting to stop-the-world until the world is started again. This includes the time spent in the "start" phase.

The "gc" metrics are used for GC-related STW pauses. The "other" metrics are used for all other STW pauses.

All of these metrics start timing in stopTheWorldWithSema only after successfully acquiring sched.lock, thus excluding lock contention on sched.lock. The reasoning behind this is that while waiting on sched.lock the world is not stopped at all (all other Ps can run), so the impact of this contention is primarily limited to the goroutine attempting to stop the world. Additionally, we already have some visibility into sched.lock contention via contention profiles (#57071).

/sched/pauses/total/gc:seconds is conceptually equivalent to /gc/pauses:seconds, so the latter is marked as deprecated and returns the same histogram as the former. In the implementation, there are a few minor differences:

* For both mark and sweep termination stops, /gc/pauses:seconds started timing prior to calling startTheWorldWithSema, thus including lock contention.

These details are minor enough that I do not believe the slight change in reporting will matter. For mark termination stops, moving timing stop into startTheWorldWithSema does have the side effect of requiring moving other GC metric calculations outside of the STW, as they depend on the same end time.

Fixes #63340

Change-Id: Iacd0bab11bedab85d3dcfb982361413a7d9c0d05
Reviewed-on: https://go-review.googlesource.com/c/go/+/534161
Reviewed-by: Michael Knyszek <[email protected]>
Auto-Submit: Michael Pratt <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
@prattmic made a good point on https://go.dev/cl/528657 that the change includes an awkward mix of the semantics of the "mutex" and "block" profiles, by capturing the ….

@prattmic, @mknyszek and I discussed options at last week's "runtime diagnostics improvements" meeting: #57175 (comment). The conclusion we came to is that we shouldn't ship those strange semantics as on by default for Go 1.22, but that we'd like to start getting the benefits of this from users who understand the current caveats and who understand that those semantics will soon change (improve). We're planning to do this via a GODEBUG setting that's off by default, and which will go away once the contention reports for ….

I updated https://go.dev/cl/528657 with a proposed name of ….

If a temporary GODEBUG setting isn't appropriate here, we can disable the call stack collection in that same way for Go 1.22, attributing everything to the placeholder function so the ….
Change https://go.dev/cl/544195 mentions this issue:
@rhysh FYI, I will be sending a release note CL for this. I don't want to race with you. :)

I also propose that we change the name of ….

My thought is that even with ….
Change https://go.dev/cl/547058 mentions this issue:
Change https://go.dev/cl/547057 mentions this issue:
Change https://go.dev/cl/547055 mentions this issue:
Change https://go.dev/cl/547056 mentions this issue:
Change https://go.dev/cl/547059 mentions this issue:
profileruntimelocks is new in CL 544195, but the name is deceptive. Even with profileruntimelocks=0, runtime-internal locks are still profiled. The actual difference is that call stacks are not collected; instead, all contention is reported at runtime._LostContendedLock.

Rename this setting to runtimecontentionstacks to make its name more aligned with its behavior.

In addition, for this release the default is profileruntimelocks=0, meaning that users are fairly likely to encounter runtime._LostContendedLock. Rename it to runtime._LostContendedRuntimeLock in an attempt to make it more intuitive that these are runtime locks, not locks in application code.

For #57071.

Change-Id: I38aac28b2c0852db643d53b1eab3f3bc42a43393
Reviewed-on: https://go-review.googlesource.com/c/go/+/547055
Reviewed-by: Michael Knyszek <[email protected]>
Auto-Submit: Michael Pratt <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Rhys Hiltner <[email protected]>
…total:seconds For #57071. Change-Id: I7ce6c35bed95a6ea3cdc17007f861c5dd82404d2 Reviewed-on: https://go-review.googlesource.com/c/go/+/547056 Auto-Submit: Michael Pratt <[email protected]> Reviewed-by: Rhys Hiltner <[email protected]> Reviewed-by: Michael Knyszek <[email protected]> LUCI-TryBot-Result: Go LUCI <[email protected]>
Amazingly, we seem to have nearly no in-tree documentation on the semantics of block and mutex profiles. Add brief summaries, including the new behavior from CL 506415 and CL 544195. For #14689. For #44920. For #57071. For #61015. Change-Id: I1a6edce7c434fcb43f17c83eb362b1f9d1a32df1 Reviewed-on: https://go-review.googlesource.com/c/go/+/547057 LUCI-TryBot-Result: Go LUCI <[email protected]> Reviewed-by: Rhys Hiltner <[email protected]> Auto-Submit: Michael Pratt <[email protected]> Reviewed-by: Michael Knyszek <[email protected]>
For #57071. Change-Id: Ic1645af57aa589917c67154a5e4ad0b4edd7ba90 Reviewed-on: https://go-review.googlesource.com/c/go/+/547058 Reviewed-by: Michael Knyszek <[email protected]> Reviewed-by: Rhys Hiltner <[email protected]> LUCI-TryBot-Result: Go LUCI <[email protected]> Auto-Submit: Michael Pratt <[email protected]>
For #57071. For #61422. Change-Id: I5d546d8828be897cb087e85a1251213c582b3894 Reviewed-on: https://go-review.googlesource.com/c/go/+/547059 Reviewed-by: Rhys Hiltner <[email protected]> LUCI-TryBot-Result: Go LUCI <[email protected]> Auto-Submit: Michael Pratt <[email protected]> Reviewed-by: Michael Knyszek <[email protected]>
Go's diagnostic tools give only a minimal view into contention on runtime-internal locks. I filed #57069 and #57070 with the best data available to me now. The story in #45894 has evolved over time. I haven't kept close track of the paths others take to finding runtime-internal lock contention, but I bet it's hard for them too. (Or: please share your techniques!)
The first one, #57069, appeared in the last few weeks while changing how we operate one of our apps. Pretty quick turnaround from seeing a problem to being able to describe it as a bug.
But #57070 describes a problem I've seen for years and wasn't quite sure I understood. In that app, we saw an increase in our CPU profiles' reports of time in runtime.futex when moving from Go 1.17 to 1.18, without any associated regression in the app's performance. I took that to mean that the change was only in the reporting: that those samples arrived in bursts and so were hidden by shortcomings in setitimer-based profiles. That helped to make sense of the execution traces, but the story was still hard to prove (to myself, and to you here) without having CPU profile samples inside the execution trace data stream.

I struggled with #45894 for a while before coming to understand that a lot of that post-GC delay was related to long waits on the runtime-internal mprof lock.
Maybe Go 1.19's tools are enough to get the job done, but I wonder if we should have a more direct measurement of contention and wait time on runtime-internal locks, like an equivalent of #49881.
CC @golang/runtime and @mknyszek