proposal: runtime/debug: add GCStats field to track count of marked pointers during last GC #52692
CC @mknyszek @golang/runtime
As a general note, please consider that with every internal statistic we expose, we basically have to track that forever, which either places restrictions on the GC implementation or means the statistic's meaning changes over time. In general, we have to be careful with what we expose (even if it's exposed via `runtime/metrics`).

For this particular metric, however, there's a problem with tracking it: it's going to be pretty expensive. The scan loop in the GC is a very critical path, and is the only path with knowledge of how many pointers there actually are. It's written to be very tight.

I think that an already-approved proposal might be what you want, however, rather than measuring the end-to-end latency of `runtime.GC`.

It sounds like (and correct me if I'm wrong) what you want is just to identify when the GC is doing more work than before. I think one metric that would capture this without digging into implementation details is some estimate of total GC CPU time. On that note, there already is an accepted proposal to do that, #47216. Unfortunately it's a toss-up as to whether I'll actually get around to implementing that before the 1.19 development cycle ends (this Sunday).
Oh, I did not think about that case.

TIL something, thanks! Just out of curiosity: can a single increment of a variable (or maybe we can shard this variable) really cause that much overhead? Of course, my assumption is that we increment a single counter for every marked pointer; maybe that is wrong as well?

Nice to know. This means there is no way that GC latency grows while the pause time does not, right?
Yes. Like we have
If it's a single global counter, that needs to be atomic. Marking is parallel, so that's going to be a contended global counter, which (IIRC, as a rule of thumb) has an overhead of around 50 ns on modern hardware, due to contention in the cache coherency protocol. Compared to how quickly local, unsynchronized CPU instructions finish, that's very high. There are ways around it, like maintaining a local counter that is flushed occasionally. However, in a tight loop, even a few extra instructions can make a significant difference. While our scan loop probably isn't the best in the world, and maybe it would even be a tolerable performance hit to do this today, by maintaining this API we would have to live with this design choice forever. If one day (and it probably will happen) we make the scan loop extremely tight, this will be a significant hindrance.
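A minimal sketch of the "local counter that is flushed occasionally" pattern described above, outside the runtime and with made-up names (`worker` and `flushEvery` are illustrative, not anything the GC actually uses):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Global counter. Incrementing this atomically on every event from many
// goroutines causes cache-line ping-pong; the point is to touch it rarely.
var totalMarked atomic.Int64

const flushEvery = 1024 // flush the local count after this many increments

// worker keeps an unsynchronized local count and only performs the
// (expensive, contended) atomic add occasionally.
func worker(events int, wg *sync.WaitGroup) {
	defer wg.Done()
	local := 0
	for i := 0; i < events; i++ {
		local++ // cheap: a plain register/stack increment
		if local == flushEvery {
			totalMarked.Add(int64(local)) // rare synchronized update
			local = 0
		}
	}
	totalMarked.Add(int64(local)) // flush the remainder
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go worker(1_000_000, &wg)
	}
	wg.Wait()
	fmt.Println(totalMarked.Load()) // 4000000
}
```

Even with the flushing trick, the per-pointer branch and increment are exactly the few extra instructions being discussed for a loop as hot as the scanner.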
It depends on what you mean by "GC latency." While the concurrent mark phase is active, there are potential increases in latency due to write barriers and GC assists by goroutines. In aggregate this looks like whole-program latency, but the effects are mostly local (unlike pauses, which are global). For example, goroutines assist the GC proportionally to how much they allocate.
I think there are lots of good reasons to surface more GC information by default. See for example #52466. Exposing a GC CPU fraction over the course of the benchmark run might achieve what you want, and could also surface the same noise that #52466 is trying to root out. I think it's worth considering.
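For reference, the #47216 metrics did eventually land in a Go release after this thread, so a GC CPU fraction can be computed from `runtime/metrics`. A sketch, assuming a recent Go version where these metric names exist:

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	// Cumulative CPU-seconds spent in the GC vs. in total since process start.
	samples := []metrics.Sample{
		{Name: "/cpu/classes/gc/total:cpu-seconds"},
		{Name: "/cpu/classes/total:cpu-seconds"},
	}
	metrics.Read(samples)

	gc := samples[0].Value.Float64()
	total := samples[1].Value.Float64()
	fmt.Printf("GC CPU fraction: %.4f\n", gc/total)
}
```

Sampling this at the start and end of a benchmark run and differencing gives the fraction over just that window.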
This proposal has been added to the active column of the proposals project.
This proposal has been declined as retracted.
@mknyszek, I was just lurking through the GC code and found the following
Not quite;
Hi,
I would like to propose an idea (with a few questions). I am still learning the internals of Go (especially the GC subsystem), so I hope my mental model is correct and clear.
Consider the following code:
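The snippet itself did not survive the copy here; a minimal reconstruction consistent with the description below (a large heap of small structs with a field `x`, timing a forced GC) might look like this. The sizes and names are guesses, not the original code:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

type item struct {
	x int // the field changed to `x *int` in the comparison below
}

func main() {
	items := make([]item, 10_000_000) // keep a large heap alive

	start := time.Now()
	runtime.GC() // force a full GC cycle and time it end to end
	fmt.Println("GC took:", time.Since(start))

	runtime.KeepAlive(items)
}
```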
Now, if I change the `x int` to `x *int` in the above code, the total GC latency jumps from ~106usec to ~1.6ms, roughly ~10x more overhead. IIUC, this is because the GC now needs to mark the pointer field in the structure. I think I am not the only one who has faced this: https://blog.gopheracademy.com/advent-2018/avoid-gc-overhead-large-heaps/. When there are lots of pointers held in memory, GC latency increases.
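For context (this illustration is mine, not from the issue): the workaround the linked post arrives at is to keep large, long-lived data pointer-free, for example by replacing pointers with indices into a flat slice, so the GC has nothing to mark inside it:

```go
package main

import "fmt"

// node avoids pointers entirely: next is an index into the backing
// slice rather than a *node, so the GC never scans the slice contents.
type node struct {
	value int
	next  int // index of the next node, or -1 for the end of the list
}

func main() {
	nodes := []node{
		{value: 1, next: 1},
		{value: 2, next: -1},
	}
	for i := 0; i != -1; i = nodes[i].next {
		fmt.Println(nodes[i].value)
	}
}
```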
Now I understand that this kind of behavior is completely natural. I am more looking for ways to anticipate this kind of behavior during implementation, via some kind of counter, for observability reasons. This behavior becomes somewhat more subtle(?) if, for example, x is a string (a string itself holds an implicit pointer) or x is a struct that contains a pointer somewhere deep in its definition.
Let me propose an idea:
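The concrete proposal text was lost in this copy; judging from the issue title, the intended shape is roughly the following, where `NumPointers` is the proposed field and does not exist in the real `debug.GCStats`:

```go
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	var stats debug.GCStats
	debug.ReadGCStats(&stats)

	// Proposed (hypothetical) field: the number of pointers the GC
	// marked during the last cycle. It does not exist today.
	fmt.Println(stats.NumPointers)
}
```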
This would give the programmer a deterministic way to benchmark and optimize allocations where possible. The current approach I am using to debug this is to manually invoke the GC and measure the elapsed time. This is somewhat OK, but IMHO it would be better and more deterministic to have a simple counter. WDYT?
And one more unrelated question, just for learning: why don't we have a field for total GC latency (i.e. wall time)? IIUC we currently have `PauseTotal`, which shows the duration of the STW pauses, but wouldn't it be nice to see the whole running time (total latency)? Probably this is not possible given that the GC runs concurrently in different goroutines, or maybe it is just not necessary?
Thanks,