Commit 54e52aa

[X86] Reduce znver3/4 LoopMicroOpBufferSize to practical loop unrolling values (#91340)
The znver3/4 scheduler models previously tied LoopMicroOpBufferSize to the maximum size of their op caches. When this led to quadratic complexity issues, the value was reduced to 512 uops, a choice based mainly on compilation time rather than on its effect on runtime performance.

From a runtime performance point of view, a large LoopMicroOpBufferSize leads to more loop unrolling, meaning the CPU has to rely on the frontend decode rate (4 ins/cy max) for much longer to fill the op cache before looping begins and we can make use of the faster op cache rate (8/9 ops/cy).

This patch proposes that we instead cap LoopMicroOpBufferSize based on the maximum rate from the op cache (znver3 = 8 ops/cy, znver4 = 9 ops/cy) and the branch misprediction penalty from the op cache (~12cy), as an estimate of the useful number of ops we can unroll a loop by before mispredictions are likely to cause stalls.

This isn't a perfect metric, but it tries to be closer to the spirit of how we use LoopMicroOpBufferSize in the compiler than the size of a similarly named buffer in the CPU.
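The sizing rule in the message above reduces to one multiplication per CPU. A minimal sketch of that arithmetic, using only the figures quoted in the commit message (8 and 9 ops/cy from the op cache, a roughly 12 cycle mispredict penalty); the function and constant names are hypothetical and are not part of LLVM:

```python
def loop_uop_cap(op_cache_ops_per_cycle: int, mispredict_penalty_cycles: int) -> int:
    """Ops the op cache can deliver during one mispredict penalty window."""
    return op_cache_ops_per_cycle * mispredict_penalty_cycles


# Figures taken from the commit message; the 12 cycle penalty is approximate.
MISPREDICT_PENALTY_CYCLES = 12
ZNVER3_CAP = loop_uop_cap(8, MISPREDICT_PENALTY_CYCLES)
ZNVER4_CAP = loop_uop_cap(9, MISPREDICT_PENALTY_CYCLES)

print("znver3 LoopMicroOpBufferSize:", ZNVER3_CAP)
print("znver4 LoopMicroOpBufferSize:", ZNVER4_CAP)
```

These are the values the patch assigns in the two scheduler models (96 and 108).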
1 parent 4a5dffc commit 54e52aa

3 files changed: +44 -965 lines changed

llvm/lib/Target/X86/X86ScheduleZnver3.td

+4 -7
@@ -33,13 +33,10 @@ def Znver3Model : SchedMachineModel {
   // The op cache is organized as an associative cache with 64 sets and 8 ways.
   // At each set-way intersection is an entry containing up to 8 macro ops.
   // The maximum capacity of the op cache is 4K ops.
-  // Agner, 22.5 µop cache
-  // The size of the µop cache is big enough for holding most critical loops.
-  // FIXME: PR50584: MachineScheduler/PostRAScheduler have quadradic complexity,
-  //        with large values here the compilation of certain loops
-  //        ends up taking way too long.
-  // let LoopMicroOpBufferSize = 4096;
-  let LoopMicroOpBufferSize = 512;
+  // Assuming a maximum dispatch of 8 ops/cy and a mispredict cost of 12cy from
+  // the op-cache, we limit the loop buffer to 8*12 = 96 to avoid loop unrolling
+  // leading to excessive filling of the op-cache from frontend.
+  let LoopMicroOpBufferSize = 96;
   // AMD SOG 19h, 2.6.2 L1 Data Cache
   // The L1 data cache has a 4- or 5- cycle integer load-to-use latency.
   // AMD SOG 19h, 2.12 L1 Data Cache

llvm/lib/Target/X86/X86ScheduleZnver4.td

+5 -11
@@ -28,17 +28,11 @@ def Znver4Model : SchedMachineModel {
   // AMD SOG 19h, 2.9.1 Op Cache
   // The op cache is organized as an associative cache with 64 sets and 8 ways.
   // At each set-way intersection is an entry containing up to 8 macro ops.
-  // The maximum capacity of the op cache is 4K ops.
-  // Agner, 22.5 µop cache
-  // The size of the µop cache is big enough for holding most critical loops.
-  // FIXME: PR50584: MachineScheduler/PostRAScheduler have quadradic complexity,
-  //        with large values here the compilation of certain loops
-  //        ends up taking way too long.
-  // Ideally for znver4, we should have 6.75K. However we don't add that
-  // considerting the impact compile time and prefer using default values
-  // instead.
-  // Retaining minimal value to influence unrolling as we did for znver3.
-  let LoopMicroOpBufferSize = 512;
+  // The maximum capacity of the op cache is 6.75K ops.
+  // Assuming a maximum dispatch of 9 ops/cy and a mispredict cost of 12cy from
+  // the op-cache, we limit the loop buffer to 9*12 = 108 to avoid loop
+  // unrolling leading to excessive filling of the op-cache from frontend.
+  let LoopMicroOpBufferSize = 108;
   // AMD SOG 19h, 2.6.2 L1 Data Cache
   // The L1 data cache has a 4- or 5- cycle integer load-to-use latency.
   // AMD SOG 19h, 2.12 L1 Data Cache
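
To see why a smaller LoopMicroOpBufferSize shortens the frontend fill phase, here is a deliberately simplified model. It assumes the unroller picks the largest unroll count whose unrolled body still fits within the buffer size, and that every op comes from exactly one instruction decoded at the 4 ins/cy rate quoted in the commit message. Both are illustrative assumptions: LLVM's real unroller applies its own cost estimates and further heuristics, and the helper names and the 12-op loop body are hypothetical.

```python
def max_unroll_count(body_ops: int, loop_uop_buffer_size: int) -> int:
    """Largest unroll count whose unrolled body fits in the buffer (at least 1)."""
    if body_ops <= 0:
        raise ValueError("body_ops must be positive")
    return max(1, loop_uop_buffer_size // body_ops)


def decode_fill_cycles(body_ops: int, unroll_count: int, decode_rate: int = 4) -> int:
    """Cycles to decode the unrolled body once, assuming one op per instruction."""
    total_ops = body_ops * unroll_count
    return -(-total_ops // decode_rate)  # ceiling division


BODY_OPS = 12  # hypothetical loop body size, chosen only for illustration

for label, size in (("old value 512", 512), ("znver3 96", 96), ("znver4 108", 108)):
    count = max_unroll_count(BODY_OPS, size)
    cycles = decode_fill_cycles(BODY_OPS, count)
    print(f"{label}: unroll x{count}, about {cycles} decode cycles before the op cache takes over")
```

Under these assumptions, the old value of 512 permits a far larger unrolled body than 96 or 108, so the first pass through the loop spends several times longer at the slower decode rate.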
