
Commit 177ce19

[LLVM] Add llvm.experimental.vector.compress intrinsic (#92289)
This PR adds a new vector intrinsic `@llvm.experimental.vector.compress` to "compress" data within a vector based on a selection mask, i.e., it moves all selected values (where `mask[i] == 1`) to consecutive lanes in the result vector. A `passthru` vector can be provided, from which the remaining lanes are filled. The main reason for this is that the existing `@llvm.masked.compressstore` has very strong constraints in that it can only write values that were selected, resulting in guard branches for all targets except AVX-512 (and even there the AMD implementation is _very_ slow). More instruction sets support "compress" logic, but only within registers, so an additional store is needed to write the values to memory. Even so, this combination is likely significantly faster on many targets, as it avoids branches. In follow-up PRs, my plan is to add target-specific lowerings for x86, SVE, and possibly RISC-V. I also want to combine this with a store instruction, as this is probably a common case and we can avoid some memory writes there. See the [discussion in the forum](https://discourse.llvm.org/t/new-intrinsic-for-masked-vector-compress-without-store/78663) for the initial design discussion.
1 parent 329e7c8 commit 177ce19

27 files changed (+1105 / -1 lines)
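
For illustration, here is the kind of compress-then-store sequence this intrinsic enables (hypothetical IR, not part of the patch; the signature matches the LangRef entry added below). Unlike `@llvm.masked.compressstore`, the trailing store writes all lanes unconditionally, which is exactly the weaker guarantee that avoids per-lane guard branches:

    ; Pack the selected elements of %vec into the low lanes, then write the
    ; whole vector back with a single unconditional store.
    %packed = call <8 x i32> @llvm.experimental.vector.compress.v8i32(<8 x i32> %vec, <8 x i1> %mask, <8 x i32> undef)
    store <8 x i32> %packed, ptr %dst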

llvm/docs/GlobalISel/GenericOpcode.rst

+7
@@ -726,6 +726,13 @@ The type of the operand must be equal to or larger than the vector element
 type. If the operand is larger than the vector element type, the scalar is
 implicitly truncated to the vector element type.
 
+G_VECTOR_COMPRESS
+^^^^^^^^^^^^^^^^^
+
+Given an input vector, a mask vector, and a passthru vector, contiguously place
+all selected (i.e., where mask[i] == true) input lanes in an output vector. All
+remaining lanes in the output are taken from passthru, which may be undef.
+
 Vector Reduction Operations
 ---------------------------

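As a schematic example (virtual register names are hypothetical), a generic MIR instance of the new opcode might look like:

    %dst:_(<4 x s32>) = G_VECTOR_COMPRESS %vec(<4 x s32>), %mask(<4 x s1>), %passthru(<4 x s32>)
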
llvm/docs/LangRef.rst

+87
@@ -19525,6 +19525,93 @@ the follow sequence of operations:
 
 The ``mask`` operand will apply to at least the gather and scatter operations.
 
+
+.. _int_vector_compress:
+
+'``llvm.experimental.vector.compress.*``' Intrinsics
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+LLVM provides an intrinsic for compressing data within a vector based on a selection mask.
+Semantically, this is similar to :ref:`llvm.masked.compressstore <int_compressstore>` but with weaker assumptions
+and without storing the results to memory, i.e., the data remains in the vector.
+
+Syntax:
+"""""""
+This is an overloaded intrinsic. A number of scalar values of integer, floating-point, or pointer data type are collected
+from an input vector and placed adjacently within the result vector. A mask defines which elements to collect from the vector.
+The remaining lanes are filled with values from ``passthru``.
+
+.. code-block:: llvm
+
+      declare <8 x i32> @llvm.experimental.vector.compress.v8i32(<8 x i32> <value>, <8 x i1> <mask>, <8 x i32> <passthru>)
+      declare <16 x float> @llvm.experimental.vector.compress.v16f32(<16 x float> <value>, <16 x i1> <mask>, <16 x float> undef)
+
+Overview:
+"""""""""
+
+Selects elements from input vector ``value`` according to the ``mask``.
+All selected elements are written into adjacent lanes in the result vector,
+from lower to higher.
+The mask holds an entry for each vector lane, and is used to select elements
+to be kept.
+If a ``passthru`` vector is given, all remaining lanes are filled with the
+corresponding lane's value from ``passthru``.
+The main difference to :ref:`llvm.masked.compressstore <int_compressstore>` is
+that we do not need to guard against memory accesses for unselected lanes.
+This allows for branchless code and better optimization for all targets that
+do not support, or only have inefficient
+instructions for, the explicit semantics of
+:ref:`llvm.masked.compressstore <int_compressstore>` but still have some form
+of compress operations.
+The result vector can be written with a similar effect, as all the selected
+values are at the lower positions of the vector, but without requiring
+branches to avoid writes where the mask is ``false``.
+
+Arguments:
+""""""""""
+
+The first operand is the input vector, from which elements are selected.
+The second operand is the mask, a vector of boolean values.
+The third operand is the passthru vector, from which elements are filled
+into remaining lanes.
+The mask and the input vector must have the same number of vector elements.
+The input and passthru vectors must have the same type.
+
+Semantics:
+""""""""""
+
+The ``llvm.experimental.vector.compress`` intrinsic compresses data within a vector.
+It collects elements from possibly non-adjacent lanes of a vector and places
+them contiguously in the result vector based on a selection mask, filling the
+remaining lanes with values from ``passthru``.
+This intrinsic performs the logic of the following C++ example.
+All values in ``out`` after the last selected one are undefined if
+``passthru`` is undefined.
+If all entries in the ``mask`` are 0, the ``out`` vector is ``passthru``.
+If any element of the mask is poison, all elements of the result are poison.
+Otherwise, if any element of the mask is undef, all elements of the result are undef.
+If ``passthru`` is undefined, the number of valid lanes is equal to the number
+of ``true`` entries in the mask, i.e., all lanes >= number-of-selected-values
+are undefined.
+
+.. code-block:: cpp
+
+    // Consecutively place selected values in a vector.
+    using VecT __attribute__((vector_size(N))) = int;
+    VecT compress(VecT vec, VecT mask, VecT passthru) {
+        VecT out;
+        int idx = 0;
+        for (int i = 0; i < N / sizeof(int); ++i) {
+            out[idx] = vec[i];
+            idx += static_cast<bool>(mask[i]);
+        }
+        for (; idx < N / sizeof(int); ++idx) {
+            out[idx] = passthru[idx];
+        }
+        return out;
+    }
+
+
 Matrix Intrinsics
 -----------------
 

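To make the documented semantics concrete, here is a small worked example (illustrative only, not part of the patch): with mask ``<0, 1, 1, 0>``, the selected values 20 and 30 are packed into the low lanes and the remaining lanes come from the passthru:

    %r = call <4 x i32> @llvm.experimental.vector.compress.v4i32(<4 x i32> <i32 10, i32 20, i32 30, i32 40>, <4 x i1> <i1 0, i1 1, i1 1, i1 0>, <4 x i32> <i32 -1, i32 -1, i32 -1, i32 -1>)
    ; %r is <i32 20, i32 30, i32 -1, i32 -1>
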
llvm/docs/ReleaseNotes.rst

+1
@@ -79,6 +79,7 @@ Changes to the LLVM IR
 * ``llvm.instprof.mcdc.tvbitmap.update``: 3rd argument has been
   removed. The next argument has been changed from byte index to bit
   index.
+* Added ``llvm.experimental.vector.compress`` intrinsic.
 
 Changes to LLVM infrastructure
 ------------------------------

llvm/include/llvm/CodeGen/GlobalISel/LegalizerHelper.h

+1
@@ -412,6 +412,7 @@ class LegalizerHelper {
   LegalizeResult lowerUnmergeValues(MachineInstr &MI);
   LegalizeResult lowerExtractInsertVectorElt(MachineInstr &MI);
   LegalizeResult lowerShuffleVector(MachineInstr &MI);
+  LegalizeResult lowerVECTOR_COMPRESS(MachineInstr &MI);
   Register getDynStackAllocTargetPtr(Register SPReg, Register AllocSize,
                                      Align Alignment, LLT PtrTy);
   LegalizeResult lowerDynStackAlloc(MachineInstr &MI);

llvm/include/llvm/CodeGen/ISDOpcodes.h

+8
@@ -659,6 +659,14 @@ enum NodeType {
   /// non-constant operands.
   STEP_VECTOR,
 
+  /// VECTOR_COMPRESS(Vec, Mask, Passthru)
+  /// consecutively place vector elements based on mask
+  /// e.g., vec = {A, B, C, D} and mask = {1, 0, 1, 0}
+  ///         --> {A, C, ?, ?} where ? is undefined
+  /// If passthru is defined, ?s are replaced with elements from passthru.
+  /// If passthru is undef, ?s remain undefined.
+  VECTOR_COMPRESS,
+
   /// MULHU/MULHS - Multiply high - Multiply two integers of type iN,
   /// producing an unsigned/signed value of type i[2*N], then return the top
   /// part.

llvm/include/llvm/CodeGen/TargetLowering.h

+4
@@ -5496,6 +5496,10 @@ class TargetLowering : public TargetLoweringBase {
   /// method accepts vectors as its arguments.
   SDValue expandVectorSplice(SDNode *Node, SelectionDAG &DAG) const;
 
+  /// Expand a VECTOR_COMPRESS into a sequence of extract-element, temporary
+  /// store, and store-position advance, before re-loading the final vector.
+  SDValue expandVECTOR_COMPRESS(SDNode *Node, SelectionDAG &DAG) const;
+
   /// Legalize a SETCC or VP_SETCC with given LHS and RHS and condition code CC
   /// on the current target. A VP_SETCC will additionally be given a Mask
   /// and/or EVL not equal to SDValue().

llvm/include/llvm/IR/Intrinsics.td

+5
@@ -2398,6 +2398,11 @@ def int_masked_compressstore:
                           [IntrWriteMem, IntrArgMemOnly, IntrWillReturn,
                            NoCapture<ArgIndex<1>>]>;
 
+def int_experimental_vector_compress:
+    DefaultAttrsIntrinsic<[llvm_anyvector_ty],
+                          [LLVMMatchType<0>, LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>, LLVMMatchType<0>],
+                          [IntrNoMem, IntrWillReturn]>;
+
 // Test whether a pointer is associated with a type metadata identifier.
 def int_type_test : DefaultAttrsIntrinsic<[llvm_i1_ty], [llvm_ptr_ty, llvm_metadata_ty],
                                           [IntrNoMem, IntrWillReturn, IntrSpeculatable]>;

llvm/include/llvm/Support/TargetOpcodes.def

+3
@@ -754,6 +754,9 @@ HANDLE_TARGET_OPCODE(G_SHUFFLE_VECTOR)
 /// Generic splatvector.
 HANDLE_TARGET_OPCODE(G_SPLAT_VECTOR)
 
+/// Generic masked compress.
+HANDLE_TARGET_OPCODE(G_VECTOR_COMPRESS)
+
 /// Generic count trailing zeroes.
 HANDLE_TARGET_OPCODE(G_CTTZ)

llvm/include/llvm/Target/GenericOpcodes.td

+7
@@ -1548,6 +1548,13 @@ def G_SPLAT_VECTOR: GenericInstruction {
   let hasSideEffects = false;
 }
 
+// Generic masked compress.
+def G_VECTOR_COMPRESS: GenericInstruction {
+  let OutOperandList = (outs type0:$dst);
+  let InOperandList = (ins type0:$vec, type1:$mask, type0:$passthru);
+  let hasSideEffects = false;
+}
+
 //------------------------------------------------------------------------------
 // Vector reductions
 //------------------------------------------------------------------------------

llvm/include/llvm/Target/GlobalISel/SelectionDAGCompat.td

+1
@@ -193,6 +193,7 @@ def : GINodeEquiv<G_VECREDUCE_UMAX, vecreduce_umax>;
 def : GINodeEquiv<G_VECREDUCE_SMIN, vecreduce_smin>;
 def : GINodeEquiv<G_VECREDUCE_SMAX, vecreduce_smax>;
 def : GINodeEquiv<G_VECREDUCE_ADD, vecreduce_add>;
+def : GINodeEquiv<G_VECTOR_COMPRESS, vector_compress>;
 
 def : GINodeEquiv<G_STRICT_FADD, strict_fadd>;
 def : GINodeEquiv<G_STRICT_FSUB, strict_fsub>;

llvm/include/llvm/Target/TargetSelectionDAG.td

+8
@@ -266,6 +266,12 @@ def SDTMaskedScatter : SDTypeProfile<0, 4, [
   SDTCisSameNumEltsAs<0, 1>, SDTCisSameNumEltsAs<0, 3>
 ]>;
 
+def SDTVectorCompress : SDTypeProfile<1, 3, [
+  SDTCisVec<0>, SDTCisSameAs<0, 1>,
+  SDTCisVec<2>, SDTCisSameNumEltsAs<1, 2>,
+  SDTCisSameAs<1, 3>
+]>;
+
 def SDTVecShuffle : SDTypeProfile<1, 2, [
   SDTCisSameAs<0, 1>, SDTCisSameAs<1, 2>
 ]>;
@@ -757,6 +763,8 @@ def masked_gather : SDNode<"ISD::MGATHER", SDTMaskedGather,
 def masked_scatter : SDNode<"ISD::MSCATTER", SDTMaskedScatter,
                             [SDNPHasChain, SDNPMayStore, SDNPMemOperand]>;
 
+def vector_compress : SDNode<"ISD::VECTOR_COMPRESS", SDTVectorCompress>;
+
 // Do not use ld, st directly. Use load, extload, sextload, zextload, store,
 // and truncst (see below).
 def ld : SDNode<"ISD::LOAD" , SDTLoad,

llvm/lib/CodeGen/GlobalISel/IRTranslator.cpp

+2
@@ -1994,6 +1994,8 @@ unsigned IRTranslator::getSimpleIntrinsicOpcode(Intrinsic::ID ID) {
     return TargetOpcode::G_VECREDUCE_UMAX;
   case Intrinsic::vector_reduce_umin:
     return TargetOpcode::G_VECREDUCE_UMIN;
+  case Intrinsic::experimental_vector_compress:
+    return TargetOpcode::G_VECTOR_COMPRESS;
   case Intrinsic::lround:
     return TargetOpcode::G_LROUND;
   case Intrinsic::llround:

llvm/lib/CodeGen/GlobalISel/LegalizerHelper.cpp

+89
@@ -4034,6 +4034,8 @@ LegalizerHelper::lower(MachineInstr &MI, unsigned TypeIdx, LLT LowerHintTy) {
     return lowerExtractInsertVectorElt(MI);
   case G_SHUFFLE_VECTOR:
     return lowerShuffleVector(MI);
+  case G_VECTOR_COMPRESS:
+    return lowerVECTOR_COMPRESS(MI);
   case G_DYN_STACKALLOC:
     return lowerDynStackAlloc(MI);
   case G_STACKSAVE:
@@ -7593,6 +7595,93 @@ LegalizerHelper::lowerShuffleVector(MachineInstr &MI) {
   return Legalized;
 }
 
+LegalizerHelper::LegalizeResult
+LegalizerHelper::lowerVECTOR_COMPRESS(llvm::MachineInstr &MI) {
+  auto [Dst, DstTy, Vec, VecTy, Mask, MaskTy, Passthru, PassthruTy] =
+      MI.getFirst4RegLLTs();
+
+  if (VecTy.isScalableVector())
+    report_fatal_error("Cannot expand vector_compress for scalable vectors.");
+
+  Align VecAlign = getStackTemporaryAlignment(VecTy);
+  MachinePointerInfo PtrInfo;
+  Register StackPtr =
+      createStackTemporary(TypeSize::getFixed(VecTy.getSizeInBytes()), VecAlign,
+                           PtrInfo)
+          .getReg(0);
+  MachinePointerInfo ValPtrInfo =
+      MachinePointerInfo::getUnknownStack(*MI.getMF());
+
+  LLT IdxTy = LLT::scalar(32);
+  LLT ValTy = VecTy.getElementType();
+  Align ValAlign = getStackTemporaryAlignment(ValTy);
+
+  auto OutPos = MIRBuilder.buildConstant(IdxTy, 0);
+
+  bool HasPassthru =
+      MRI.getVRegDef(Passthru)->getOpcode() != TargetOpcode::G_IMPLICIT_DEF;
+
+  // Pre-fill the stack temporary with passthru, so unselected lanes keep it.
+  if (HasPassthru)
+    MIRBuilder.buildStore(Passthru, StackPtr, PtrInfo, VecAlign);
+
+  // LastWriteVal is written over the one stale slot left after the loop:
+  // the passthru splat constant, or the passthru element at popcount(Mask).
+  Register LastWriteVal;
+  std::optional<APInt> PassthruSplatVal =
+      isConstantOrConstantSplatVector(*MRI.getVRegDef(Passthru), MRI);
+
+  if (PassthruSplatVal.has_value()) {
+    LastWriteVal =
+        MIRBuilder.buildConstant(ValTy, PassthruSplatVal.value()).getReg(0);
+  } else if (HasPassthru) {
+    auto Popcount = MIRBuilder.buildZExt(MaskTy.changeElementSize(32), Mask);
+    Popcount = MIRBuilder.buildInstr(TargetOpcode::G_VECREDUCE_ADD,
+                                     {LLT::scalar(32)}, {Popcount});
+
+    Register LastElmtPtr =
+        getVectorElementPointer(StackPtr, VecTy, Popcount.getReg(0));
+    LastWriteVal =
+        MIRBuilder.buildLoad(ValTy, LastElmtPtr, ValPtrInfo, ValAlign)
+            .getReg(0);
+  }
+
+  // Store every element unconditionally, but advance the write position only
+  // for selected lanes, so selected values land in consecutive slots.
+  unsigned NumElmts = VecTy.getNumElements();
+  for (unsigned I = 0; I < NumElmts; ++I) {
+    auto Idx = MIRBuilder.buildConstant(IdxTy, I);
+    auto Val = MIRBuilder.buildExtractVectorElement(ValTy, Vec, Idx);
+    Register ElmtPtr =
+        getVectorElementPointer(StackPtr, VecTy, OutPos.getReg(0));
+    MIRBuilder.buildStore(Val, ElmtPtr, ValPtrInfo, ValAlign);
+
+    LLT MaskITy = MaskTy.getElementType();
+    auto MaskI = MIRBuilder.buildExtractVectorElement(MaskITy, Mask, Idx);
+    if (MaskITy.getSizeInBits() > 1)
+      MaskI = MIRBuilder.buildTrunc(LLT::scalar(1), MaskI);
+
+    MaskI = MIRBuilder.buildZExt(IdxTy, MaskI);
+    OutPos = MIRBuilder.buildAdd(IdxTy, OutPos, MaskI);
+
+    if (HasPassthru && I == NumElmts - 1) {
+      // If all lanes were selected, OutPos now points past the vector; clamp
+      // it and keep the just-written element instead of the passthru value.
+      auto EndOfVector =
+          MIRBuilder.buildConstant(IdxTy, VecTy.getNumElements() - 1);
+      auto AllLanesSelected = MIRBuilder.buildICmp(
+          CmpInst::ICMP_UGT, LLT::scalar(1), OutPos, EndOfVector);
+      OutPos = MIRBuilder.buildInstr(TargetOpcode::G_UMIN, {IdxTy},
+                                     {OutPos, EndOfVector});
+      ElmtPtr = getVectorElementPointer(StackPtr, VecTy, OutPos.getReg(0));
+
+      LastWriteVal =
+          MIRBuilder.buildSelect(ValTy, AllLanesSelected, Val, LastWriteVal)
+              .getReg(0);
+      MIRBuilder.buildStore(LastWriteVal, ElmtPtr, ValPtrInfo, ValAlign);
+    }
+  }
+
+  // TODO: Use StackPtr's FrameIndex alignment.
+  MIRBuilder.buildLoad(Dst, StackPtr, PtrInfo, VecAlign);
+
+  MI.eraseFromParent();
+  return Legalized;
+}
+
 Register LegalizerHelper::getDynStackAllocTargetPtr(Register SPReg,
                                                     Register AllocSize,
                                                     Align Alignment,

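To summarize the lowering above in one place: the legalizer builds the result through a stack temporary, storing every element but advancing the write position only for selected lanes. The following C++ sketch models that logic for a fixed 4-lane vector with a passthru (a conceptual model with hypothetical names, not the emitted MIR):

    #include <cstring>

    // Models lowerVECTOR_COMPRESS for a 4-lane integer vector: plain arrays
    // stand in for the stack temporary and the vector registers.
    void compress_model(const int vec[4], const bool mask[4],
                        const int passthru[4], int out[4]) {
      int slot[4];                               // stack temporary
      std::memcpy(slot, passthru, sizeof slot);  // pre-fill with passthru
      int pos = 0;
      for (int i = 0; i < 4; ++i) {
        slot[pos] = vec[i];                      // store every element...
        pos += mask[i] ? 1 : 0;                  // ...advance only if selected
      }
      // slot[pos] holds a stale element; restore the matching passthru value.
      // If every lane was selected, pos == 4 is clamped and the last vector
      // element (already correct) is re-written instead.
      bool allSelected = pos > 3;
      pos = allSelected ? 3 : pos;
      slot[pos] = allSelected ? vec[3] : passthru[pos];
      std::memcpy(out, slot, sizeof slot);       // re-load the result vector
    }

The last-write fixup exists because the loop's final unconditional store can leave one stale element at the write position, and because a fully-true mask would otherwise advance the position out of bounds.
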