[RISC-V] Remove round-trip to memory when using compressstore
#113242
@llvm/issue-subscribers-backend-risc-v Author: Niles Salter (Validark)
This Zig code:
fn compressstore(vec: @Vector(64, u8), ptr: *@Vector(64, u8), bitstr: u64) void {
    return struct {
        extern fn @"llvm.masked.compressstore.v64i8"(@Vector(64, u8), *@Vector(64, u8), @Vector(64, u1)) callconv(.Unspecified) void;
    }.@"llvm.masked.compressstore.v64i8"(vec, ptr, @bitCast(bitstr));
}

export fn compress(vec: @Vector(64, u8), bitstr: u64, vec2: @Vector(64, u8)) @Vector(64, u8) {
    var buffer: [64]u8 align(64) = undefined;
    compressstore(vec, &buffer, bitstr);
    return buffer -% vec2;
}

Gives us this optimized LLVM code:

define dso_local void @compress(ptr noalias nocapture nonnull writeonly sret(<64 x i8>) %0, ptr nocapture noundef readonly %1, i64 %2, ptr nocapture noundef readonly %3) local_unnamed_addr {
Entry:
%4 = alloca [64 x i8], align 64
%5 = load <64 x i8>, ptr %1, align 64
%6 = load <64 x i8>, ptr %3, align 64
#dbg_value(<64 x i8> %5, !138, !DIExpression(), !139)
#dbg_value(i64 %2, !140, !DIExpression(), !139)
#dbg_value(<64 x i8> %6, !141, !DIExpression(), !139)
#dbg_declare(ptr %4, !142, !DIExpression(), !144)
#dbg_value(<64 x i8> %5, !145, !DIExpression(), !149)
#dbg_value(ptr %4, !151, !DIExpression(), !149)
#dbg_value(i64 %2, !152, !DIExpression(), !149)
%7 = bitcast i64 %2 to <64 x i1>
call fastcc void @llvm.masked.compressstore.v64i8(<64 x i8> %5, ptr nonnull align 64 %4, <64 x i1> %7)
%8 = load <64 x i8>, ptr %4, align 64
%9 = sub <64 x i8> %8, %6
store <64 x i8> %9, ptr %0, align 64
ret void
}
declare fastcc void @llvm.masked.compressstore.v64i8(<64 x i8>, ptr nocapture, <64 x i1>) #1

Which gets emitted like so for spacemit_x60:

compress:
addi sp, sp, -128
sd ra, 120(sp)
sd s0, 112(sp)
addi s0, sp, 128
andi sp, sp, -64
li a4, 64
vsetvli zero, a4, e8, m2, ta, ma
vle8.v v8, (a1)
vle8.v v10, (a3)
vsetivli zero, 1, e64, m1, ta, ma
vmv.s.x v12, a2
vsetvli zero, a4, e8, m2, ta, ma
vcompress.vm v14, v8, v12
vcpop.m a1, v12
mv a2, sp
vsetvli zero, a1, e8, m2, ta, ma
vse8.v v14, (a2)
vsetvli zero, a4, e8, m2, ta, ma
vle8.v v8, (a2)
vsub.vv v8, v8, v10
vse8.v v8, (a0)
addi sp, s0, -128
ld ra, 120(sp)
ld s0, 112(sp)
addi sp, sp, 128
ret

Is it necessary to have this section of the assembly?

vse8.v v14, (a2)
vsetvli zero, a4, e8, m2, ta, ma
vle8.v v8, (a2)

I haven't read that much RISC-V Vector assembly yet, but my hunch is this could be done better.
Same thing happens on Zen 4 as well:

compress:
push rbp
mov rbp, rsp
and rsp, -64
sub rsp, 128
kmovq k1, rdi
vpcompressb zmmword ptr [rsp] {k1}, zmm0
vmovdqa64 zmm0, zmmword ptr [rsp]
vpsubb zmm0, zmm0, zmm1
mov rsp, rbp
pop rbp
ret
I think it is not doing anything wrong? It just keeps the
Yes, it is not doing anything incorrect, but it would be nice to optimize this away. I don't really have another way to access this functionality in Zig, so it would be great if using this cross-platform LLVM intrinsic gave me optimized code on RISC-V targets. Otherwise, I would have to dip into inline assembly, since I don't think we have access to vscale types in Zig.
Maybe you can try the `llvm.experimental.vector.compress.*` intrinsics?
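For reference, a minimal sketch of how the example could be written with that intrinsic, assuming its documented value/mask/passthru signature (the function name and exact mangling here are illustrative). The compressed result stays in a register, so no stack buffer is needed:

declare <64 x i8> @llvm.experimental.vector.compress.v64i8(<64 x i8>, <64 x i1>, <64 x i8>)

define <64 x i8> @compress_reg(<64 x i8> %vec, i64 %bits, <64 x i8> %vec2) {
  ; Reinterpret the 64-bit integer as a 64-lane mask, as in the original IR.
  %mask = bitcast i64 %bits to <64 x i1>
  ; Compress the selected lanes to the front; the remaining lanes come from
  ; the passthru operand (undef here, matching the undefined bytes of the
  ; original stack buffer).
  %c = call <64 x i8> @llvm.experimental.vector.compress.v64i8(<64 x i8> %vec, <64 x i1> %mask, <64 x i8> undef)
  %r = sub <64 x i8> %c, %vec2
  ret <64 x i8> %r
}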
Looks like it's not hooked up to the compress instruction on RISC-V targets. Godbolt
Oh sorry, this intrinsic was introduced three months ago and the RISC-V target doesn't support it yet. I will add a custom lowering for it today.
Thank you!
I noticed we don't have a corresponding intrinsic for expand/decompress either. This Zig code:

fn expandload(ptr: *const @Vector(64, u8), bitstr: u64, fallback: @Vector(64, u8)) @Vector(64, u8) {
    return struct {
        extern fn @"llvm.masked.expandload.v64i8"(*const @Vector(64, u8), @Vector(64, u1), @Vector(64, u8)) callconv(.Unspecified) @Vector(64, u8);
    }.@"llvm.masked.expandload.v64i8"(ptr, @bitCast(bitstr), fallback);
}

export fn expand(vec: @Vector(64, u8), bitstr: u64) @Vector(64, u8) {
    return expandload(&vec, bitstr, @splat(0));
}

Compiled for Zen 5 (I won't show the RISC-V version for now since your PR hasn't been merged yet):

expand:
push rbp
mov rbp, rsp
and rsp, -64
sub rsp, 128
vmovaps zmmword ptr [rsp], zmm0
kmovq k1, rdi
vpexpandb zmm0 {k1} {z}, zmmword ptr [rsp]
mov rsp, rbp
pop rbp
ret
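For clarity, roughly the kind of IR the Zig wrapper above produces (a sketch, not copied from the actual compiler output; the function name is illustrative). The round trip through a stack slot is inherent here, because llvm.masked.expandload reads from a pointer rather than taking a vector operand:

declare <64 x i8> @llvm.masked.expandload.v64i8(ptr, <64 x i1>, <64 x i8>)

define <64 x i8> @expand_ir(<64 x i8> %vec, i64 %bits) {
  ; Spill the source vector so the intrinsic has a memory operand to read.
  %buf = alloca <64 x i8>, align 64
  store <64 x i8> %vec, ptr %buf, align 64
  %mask = bitcast i64 %bits to <64 x i1>
  ; Expand the contiguous bytes at %buf into the lanes selected by %mask;
  ; unselected lanes take the zero passthru.
  %r = call <64 x i8> @llvm.masked.expandload.v64i8(ptr %buf, <64 x i1> %mask, <64 x i8> zeroinitializer)
  ret <64 x i8> %r
}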
This intrinsic was introduced by llvm#92289, and currently we just expand it for RISC-V. This patch adds custom lowering for this intrinsic and simply maps it to the `vcompress` instruction. Fixes llvm#113242.
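As an illustration, the lowering targets calls of this shape (a hypothetical test case, not taken from the patch itself; the function name and element type are arbitrary):

declare <vscale x 8 x i8> @llvm.experimental.vector.compress.nxv8i8(<vscale x 8 x i8>, <vscale x 8 x i1>, <vscale x 8 x i8>)

define <vscale x 8 x i8> @compress_nxv8i8(<vscale x 8 x i8> %v, <vscale x 8 x i1> %m) {
  ; With the custom lowering this should select to vcompress.vm
  ; (plus whatever handling a non-undef passthru operand requires).
  %r = call <vscale x 8 x i8> @llvm.experimental.vector.compress.nxv8i8(<vscale x 8 x i8> %v, <vscale x 8 x i1> %m, <vscale x 8 x i8> undef)
  ret <vscale x 8 x i8> %r
}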
Yeah, we don't have an intrinsic for decompress, though it should be possible to synthesize it via …
@Validark Can you please file another issue to track this?
I am glad that your PR added support for …
Yeah, I agree that we may have some potential gains. For example, we can propagate the stored value to a later load if we can prove they are the same value, so that we can avoid actual memory accesses. But I think we can't remove the store entirely; the optimization should not change the semantics. For scalars, a load that follows a store to the same location can be rewritten to use the stored value directly, so we don't really need to load the value from memory. But for vectors, the precondition is that we store/load the same value. As for your example, the AVLs are different (the vse8.v uses the vcpop.m count in a1, while the vle8.v uses a4 = 64), so the stored and loaded values are not obviously the same.
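A minimal scalar illustration of that forwarding in LLVM IR (a sketch; the function and names are illustrative, not from the original comment):

define i64 @forward(ptr %p, i64 %x) {
  store i64 %x, ptr %p, align 8
  ; The load can be rewritten to use %x directly, but the store must stay,
  ; because other code may observe the memory at %p.
  %v = load i64, ptr %p, align 8
  %r = add i64 %v, 1
  ret i64 %r
}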
Is there an intrinsic for a runtime vector shuffle? We could probably make do with 16-, 32-, 64-, and 128-byte lookup tables (and the index vector size could vary independently). Even if the zeroing semantics were not made consistent by LLVM, it would be extremely convenient to have those so I don't have to interact with vscale vectors. Should I open an issue for such a thing?
Maybe the `shufflevector` instruction?
shufflevector only works with compile-time constant indices.
Then we may use …
I don't think …
Well, I was thinking we may be able to synthesize it via …
The X86 backend has code to detect patterns of extractelts with variable offsets and insertelts with constant offsets and turn them into the equivalent of vrgather.vv. See LowerBUILD_VECTORAsVariablePermute in X86ISelLowering.cpp and the var-permute-*.ll tests in llvm/test/CodeGen/X86.
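Roughly the kind of pattern that code recognizes (an illustrative sketch; the function name and vector width are arbitrary): extractelements at variable offsets feeding insertelements at constant offsets, i.e. a build vector that is really a runtime-indexed permute:

define <4 x i32> @var_permute(<4 x i32> %v, <4 x i32> %idx) {
  ; The variable offsets come from another vector.
  %i0 = extractelement <4 x i32> %idx, i32 0
  %i1 = extractelement <4 x i32> %idx, i32 1
  %i2 = extractelement <4 x i32> %idx, i32 2
  %i3 = extractelement <4 x i32> %idx, i32 3
  ; Elements are gathered at those variable offsets...
  %e0 = extractelement <4 x i32> %v, i32 %i0
  %e1 = extractelement <4 x i32> %v, i32 %i1
  %e2 = extractelement <4 x i32> %v, i32 %i2
  %e3 = extractelement <4 x i32> %v, i32 %i3
  ; ...and inserted at constant offsets, which the backend can turn into a
  ; single variable permute (the equivalent of vrgather.vv on RISC-V).
  %r0 = insertelement <4 x i32> poison, i32 %e0, i32 0
  %r1 = insertelement <4 x i32> %r0, i32 %e1, i32 1
  %r2 = insertelement <4 x i32> %r1, i32 %e2, i32 2
  %r3 = insertelement <4 x i32> %r2, i32 %e3, i32 3
  ret <4 x i32> %r3
}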
Thanks! So it is possible to expand future …