[mlir][linalg] Add a test to demonstrate peeling + vectorisation #77590
Conversation
Following on from #75842, we can demonstrate that loop peeling combined with masked vectorisation and the existing canonicalizations for `vector.mask` operations leads to the following loop structure:

```
// M dimension
scf.for 1:M
  // N dimension (contains vector ops _without_ masking)
  scf.for 1:UB
    // K dimension
    scf.for 1:K
      vector.add

  // N dimension (contains vector ops _with_ masking)
  scf.for UB:N
    // K dimension
    scf.for 1:K
      vector.mask { vector.add }
```

This is particularly beneficial for scalable vectors, which normally require masking. This example demonstrates how to avoid masks in the main loop.
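For context, here is a minimal, self-contained sketch (not part of this patch, op and value names are made up) of the kind of existing `vector.mask` canonicalization the test relies on: when the mask is statically all-true, the `vector.mask` wrapper is elided and only the wrapped op remains.

```mlir
func.func @fold_all_true_mask(%src: tensor<?xf32>, %i: index) -> vector<4xf32> {
  %c4 = arith.constant 4 : index
  %pad = arith.constant 0.0 : f32
  // A create_mask whose bound equals the vector length is all-true ...
  %mask = vector.create_mask %c4 : vector<4xi1>
  // ... so -canonicalize drops the vector.mask and keeps a plain transfer_read.
  %r = vector.mask %mask {
    vector.transfer_read %src[%i], %pad : tensor<?xf32>, vector<4xf32>
  } : vector<4xi1> -> vector<4xf32>
  return %r : vector<4xf32>
}
```

After peeling, the main loop's mask bounds are provably equal to the (scalable) vector sizes, so this folding applies there, while the remainder loop keeps its masks.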
@llvm/pr-subscribers-mlir
@llvm/pr-subscribers-mlir-linalg

Author: Andrzej Warzyński (banach-space)

Changes: Following on from #75842, we can demonstrate that loop peeling combined with masked vectorisation and the existing canonicalizations for `vector.mask` operations leads to the loop structure shown above. This is particularly beneficial for scalable vectors, which normally require masking.

Full diff: https://github.com/llvm/llvm-project/pull/77590.diff

1 Files Affected:
diff --git a/mlir/test/Dialect/Linalg/transform-op-peel-and-vectorize.mlir b/mlir/test/Dialect/Linalg/transform-op-peel-and-vectorize.mlir
new file mode 100644
index 00000000000000..016749f81f6205
--- /dev/null
+++ b/mlir/test/Dialect/Linalg/transform-op-peel-and-vectorize.mlir
@@ -0,0 +1,86 @@
+// RUN: mlir-opt %s --transform-interpreter --split-input-file -canonicalize | FileCheck %s
+
+// Demonstrates what happens when peeling the middle loop (2nd parallel
+// dimension) followed by vectorization in the presence of _scalable_ vectors
+// (these are introduced through scalable tiling). The main goal is to verify
+// that canonicalizations fold away the masks in the main loop.
+
+func.func @matmul(%A: tensor<1024x512xf32>,
+ %B: tensor<512x2000xf32>,
+ %C:tensor<1024x2000xf32>) -> tensor<1024x2000xf32> {
+
+// CHECK: #[[MAP:.*]] = affine_map<()[s0] -> (-(2000 mod s0) + 2000)>
+// CHECK-DAG: %[[C1:.*]] = arith.constant 1 : index
+// CHECK-DAG: %[[C2000:.*]] = arith.constant 2000 : index
+// CHECK-DAG: %[[C8:.*]] = arith.constant 8 : index
+// CHECK-DAG: %[[C1024:.*]] = arith.constant 1024 : index
+// CHECK-DAG: %[[C512:.*]] = arith.constant 512 : index
+// CHECK-DAG: %[[C0:.*]] = arith.constant 0 : index
+// CHECK-DAG: %[[C16:.*]] = arith.constant 16 : index
+// CHECK: %[[VSCALE:.*]] = vector.vscale
+// CHECK: %[[STEP:.*]] = arith.muli %[[VSCALE]], %[[C16]] : index
+// CHECK: %2 = scf.for {{.*}} %[[C0]] to %[[C1024]] step %[[C8]] iter_args(%arg4 = %arg2) -> (tensor<1024x2000xf32>) {
+
+// Main loop after vectorisation (without masking)
+
+// CHECK: %[[UB_MAIN:.*]] = affine.apply #[[MAP]]()[%[[STEP]]]
+// CHECK: scf.for {{.*}} %[[C0]] to %[[UB_MAIN]] step %[[STEP]] {{.*}} -> (tensor<1024x2000xf32>) {
+// CHECK: scf.for %arg7 = %[[C0]] to %[[C512]] step %[[C1]] {{.*}} -> (tensor<1024x2000xf32>) {
+// CHECK-NOT: vector.mask
+// CHECK: arith.mulf {{.*}} : vector<8x[16]x1xf32>
+// CHECK-NEXT: vector.shape_cast {{.*}} : vector<8x[16]x1xf32> to vector<8x[16]xf32>
+// CHECK-NEXT: arith.addf {{.*}} : vector<8x[16]xf32>
+// CHECK-NOT: vector.mask
+// CHECK: scf.yield {{.*}} : tensor<1024x2000xf32>
+// CHECK-NEXT: }
+// CHECK-NEXT: scf.yield {{.*}} : tensor<1024x2000xf32>
+// CHECK-NEXT: }
+
+// Remainder loop after vectorisation (with masking)
+
+// CHECK: scf.for {{.*}} %[[UB_MAIN]] to %[[C2000]] step %[[STEP]] {{.*}} -> (tensor<1024x2000xf32>) {
+// CHECK: scf.for {{.*}} %[[C0]] to %[[C512]] step %[[C1]] {{.*}} -> (tensor<1024x2000xf32>) {
+// CHECK: %[[MASK_1:.*]] = vector.create_mask {{.*}} : vector<1x[16]xi1>
+// CHECK: %[[RHS:.*]] = vector.mask %[[MASK_1]] { vector.transfer_read {{.*}} } : vector<1x[16]xi1> -> vector<8x[16]x1xf32>
+// CHECK: %[[MASK_2:.*]] = vector.create_mask {{.*}} : vector<8x[16]xi1>
+// CHECK: %[[LHS:.*]] = vector.mask %[[MASK_2]] { vector.transfer_read {{.*}} } : vector<8x[16]xi1> -> vector<8x[16]xf32>
+// CHECK: %[[MUL:.*]] = arith.mulf %{{.*}}, %[[RHS]] : vector<8x[16]x1xf32>
+// CHECK: %[[MASK_3:.*]] = vector.create_mask {{.*}} : vector<8x[16]xi1>
+// CHECK: vector.shape_cast %[[MUL]] : vector<8x[16]x1xf32> to vector<8x[16]xf32>
+// CHECK: arith.addf %[[LHS]], %{{.*}} : vector<8x[16]xf32>
+// CHECK: arith.select %[[MASK_3]], {{.*}} : vector<8x[16]xi1>, vector<8x[16]xf32>
+// CHECK: vector.mask %[[MASK_2]] { vector.transfer_write {{.*}} } : vector<8x[16]xi1> -> tensor<8x?xf32>
+// CHECK: scf.yield %inserted_slice : tensor<1024x2000xf32>
+// CHECK: }
+// CHECK: scf.yield %7 : tensor<1024x2000xf32>
+// CHECK: }
+// CHECK: scf.yield %5 : tensor<1024x2000xf32>
+// CHECK-NEXT: }
+
+ %res = linalg.matmul ins(%A, %B: tensor<1024x512xf32>, tensor<512x2000xf32>)
+ outs(%C: tensor<1024x2000xf32>) -> tensor<1024x2000xf32>
+ return %res : tensor<1024x2000xf32>
+}
+
+module attributes {transform.with_named_sequence} {
+ transform.named_sequence @__transform_main(%root: !transform.any_op {transform.readonly}) {
+ %matmul = transform.structured.match ops{["linalg.matmul"]} in %root : (!transform.any_op) -> !transform.any_op
+ // 1. Scalable tiling
+ %_, %loop_1, %loop_2, %loop_3 =
+ transform.structured.tile_using_for %matmul [8, [16], 1] : (!transform.any_op)
+ -> (!transform.any_op, !transform.op<"scf.for">, !transform.op<"scf.for">,!transform.op<"scf.for">)
+
+ // 2. Loop peeling (only the middle dimension)
+ %main_loop, %remainder_loop = transform.loop.peel %loop_2 : (!transform.op<"scf.for">) -> (!transform.op<"scf.for">, !transform.op<"scf.for">)
+
+ // 3. Vectorize the main loop
+ %matmul_main = transform.structured.match ops{["linalg.matmul"]} in %main_loop : (!transform.op<"scf.for">) -> !transform.any_op
+ transform.structured.vectorize %matmul_main vector_sizes [8, [16], 1] : !transform.any_op
+
+ // 4. Vectorize the remainder loop
+ %matmul_remainder = transform.structured.match ops{["linalg.matmul"]} in %remainder_loop : (!transform.op<"scf.for">) -> !transform.any_op
+ transform.structured.vectorize %matmul_remainder vector_sizes [8, [16], 1] : !transform.any_op
+
+ transform.yield
+ }
+}
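For reference, a simplified, hypothetical sketch (loop bodies elided, not taken from the patch) of the structure that step 2 (`transform.loop.peel`) produces for the scalable N dimension. The main loop's upper bound is the largest multiple of the step that fits, which is why its masks can be folded away.

```mlir
// Same map as in the CHECK lines: largest multiple of s0 that is <= 2000.
#map = affine_map<()[s0] -> (-(2000 mod s0) + 2000)>

func.func @peeled_n_loop(%step: index, %init: tensor<1024x2000xf32>) -> tensor<1024x2000xf32> {
  %c0 = arith.constant 0 : index
  %c2000 = arith.constant 2000 : index
  %split = affine.apply #map()[%step]
  // Main loop: [0, %split), trip count is a multiple of %step, no masking needed.
  %main = scf.for %iv = %c0 to %split step %step iter_args(%acc = %init) -> (tensor<1024x2000xf32>) {
    scf.yield %acc : tensor<1024x2000xf32>
  }
  // Remainder loop: [%split, 2000), keeps the masked vector ops.
  %res = scf.for %iv = %split to %c2000 step %step iter_args(%acc = %main) -> (tensor<1024x2000xf32>) {
    scf.yield %acc : tensor<1024x2000xf32>
  }
  return %res : tensor<1024x2000xf32>
}
```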
Very nice! LGTM
cool stuff!