runtime: add per-G shadows of writeBarrier.enabled #20005

Open
@sorear

Description

  1. A large fraction of static instructions are used to implement the write barrier enabled check, which currently always uses an absolute memory reference.

  2. On supported RISC architectures, accessing data at a small offset from a pointer held in a register takes fewer instructions than accessing data at an absolute address. On all(?) supported architectures, it takes fewer bytes.

  3. Reserving a register to point at the runtime.writeBarrier struct is possible, but would be a difficult tradeoff. However, many architectures already reserve a register to point at the executing G. If we could check write barrier status relative to G, it would save instructions on all RISC architectures.

  4. There are many Gs. Since the enabled flag changes very rarely, it would be possible to update some or all Gs whenever the master flag is updated. There is a tradeoff in which Gs to update: all of them, those that have Ms, or those that have Ps. (A rough Go sketch of this propagation appears after this list.)

  5. I have a proof-of-concept patch against the riscv tree (https://review.gerrithub.io/#/c/357282/ or sorear/riscv-go@dab0f89) and can rebase it against master if there is interest. The patch takes the approach of keeping Gs updated if they are referenced by any M; the additional STW latency therefore scales with the number of Ms, but there is less potential for a race with exitsyscall.

  6. It is far from clear that I have accounted for all possible races, especially with regard to asynchronous cgo callbacks that could(?) create new Ms at any time.

  7. Initial results, measured as .text size of cmd/compile:

             before   after    %
    386      5730855  5731751 +0.016
    amd64    6764675  6765155 +0.007
    arm      6155060  6081080 -1.202
    arm64    5850320  5725184 -2.139
    mips64   7297336  7173880 -1.692
    mips     7159648  7097940 -0.862
    ppc64    6120800  6058392 -1.020
    riscv    3986656  3924656 -1.555
    s390x    8253200  8343808 +1.098
    

    ppc64 and s390x do not benefit from this patch alone, as the current backends for those architectures are unable to use the G register as a base register for memory accesses. For ppc64 I did the measurement with a one-line change that enables G as a base register; for s390x I tried to make a similar change but was not able to get it to work.

    It may make sense to exclude s390x from the code generation change since s390x can fetch from an absolute address in one instruction; currently the code generation is conditionalized exclusively on hasGReg.

    The demonstration patch keeps the per-G shadows updated even on 386 and amd64, where they are not used. Conditionals could be added to the runtime to avoid that overhead.

  8. I do not have any physical user-programmable hardware for the most affected architectures. While a 2.1% reduction in static instructions on arm64 looks nice on paper, it is moot if it turns out to make things slower for whatever reason.

  9. Is this strategically desirable? It makes moving away from STW at GC phase changes marginally more difficult, increases g size, and might cause other problems I'm not considering.
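
For concreteness, here is a minimal Go sketch of the mechanism described in points 3 through 5. It is not taken from the patch: the g.writeBarrier field, setWriteBarrierEnabled, and the stand-in writeBarrierEnabled and allm variables are invented names for illustration, and the real runtime would perform the update during the stop-the-world window of a GC phase change.

    // Illustrative sketch only; every name below is an invented stand-in for
    // the corresponding runtime structure, not the actual runtime code.
    package wbsketch

    // Today the compiled check is conceptually
    //     if writeBarrier.enabled { ... emit the write barrier ... }
    // and a RISC target must first materialize the absolute address of
    // writeBarrier. With a per-G shadow the check becomes a load at a small
    // fixed offset from the already-reserved G register, conceptually
    //     if getg().writeBarrier { ... emit the write barrier ... }

    type g struct {
        // writeBarrier shadows the global enabled flag; it lives at a fixed
        // offset in g so compiled code can test it G-relative.
        writeBarrier bool
    }

    type m struct {
        g0   *g // scheduling goroutine owned by this M
        curg *g // user goroutine currently attached to this M, if any
    }

    var (
        writeBarrierEnabled bool // stand-in for runtime.writeBarrier.enabled
        allm                []*m // stand-in for the runtime's list of Ms
    )

    // setWriteBarrierEnabled flips the master flag and refreshes the shadow on
    // every G referenced by an M (the "Gs that have Ms" choice from point 4).
    // Because it would run inside the stop-the-world window, no running G can
    // observe a stale shadow.
    func setWriteBarrierEnabled(enabled bool) {
        writeBarrierEnabled = enabled
        for _, mp := range allm {
            if mp.g0 != nil {
                mp.g0.writeBarrier = enabled
            }
            if mp.curg != nil {
                mp.curg.writeBarrier = enabled
            }
        }
    }

The cost of the flag flip then scales with the number of Ms, which is the STW-latency tradeoff mentioned in point 5.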

cc @josharian @aclements @randall77
