[AMDGPU] Fix test failures when expensive checks are enabled #130644

shiltian · 2025-03-10T17:42:32Z

This PR fixes test failures introduced in #127353 when expensive checks are enabled.

For llvm/test/CodeGen/AMDGPU/materialize-frame-index-sgpr.ll and llvm/test/CodeGen/AMDGPU/materialize-frame-index-sgpr.gfx10.ll, s59 is no longer in live-ins because it is caller saved. Switch to s55 in this PR.

shiltian · 2025-03-10T17:42:51Z

This stack of pull requests is managed by Graphite.

llvmbot · 2025-03-10T17:43:12Z

@llvm/pr-subscribers-backend-amdgpu

Author: Shilei Tian (shiltian)

Changes

[MLIR][Affine] Fix crash in loop unswitching/hoistAffineIfOp (#130401)

Fix obvious crash as a result of missing affine.parallel handling. Also,
fix bug exposed in a helper method used by hoistAffineIfOp.

Fixes: #62323

[clang][bytecode] Implement __builtin_{memchr,strchr,char_memchr} (#130420)

LLVM has recently started to use __builtin_memchr at compile time, so
implement this. It still needs some work, but the basics are done.

[mlir]Add a check to ensure bailing out when reducing to a scalar (#129694)

Fixes issue #64075
See this comment for a more detailed view:
#64075 (comment)

Minimal crashing example:

func.func @multi_reduction(%0: vector<4x2xf32>, %acc1: f32) -> f32 {
  %2 = vector.multi_reduction <add>, %0, %acc1 [0, 1] : vector<4x2xf32> to f32
  return %2 : f32
}

[X86] combineINSERT_SUBVECTOR - attempt to recursively shuffle combine if both base/sub-vectors are already shuffles (#130304)

[lldb] Remove progress report coalescing (#130329)

Remove support for coalescing progress reports in LLDB. This
functionality was motivated by Xcode, which wanted to listen for less
frequent, aggregated progress events at the cost of losing some detail.
See the original RFC [1] for more details. Since then, they've
reevaluated this trade-off and opted to listen for the regular, full
fidelity progress events and do any post processing on their end.

rdar://146425487

[gn build] Port a1b14db

[clang][bytecode][NFC] Check conditional op condition for ConstantExprs (#130425)

Same thing we now do in if statements. Check the condition of a
conditional operator for a statically known true/false value.

[RISCV] Added test for dag spill fix

[TableGen] Use Register in FastISel output. NFC

[clang][bytecode] Surround bcp condition with Start/EndSpeculation (#130427)

This is similar to what the current interpreter is doing - the
FoldConstant RAII object surrounds the entire HandleConditionalOperator
call, which means the condition and both TrueExpr or FalseExpr.

[X86] combineConcatVectorOps - convert (V)SHUFPS concatenation to use combineConcatVectorOps recursion (#130426)

Only concatenate X86ISD::SHUFP nodes if at least one operand is
beneficial to concatenate. This helps prevent a lot of unnecessary AVX1
concatenations.

[SandboxVec] Add region-from-bbs helper pass (#130153)

RegionFromBBs is a helper Sandbox IR function pass that builds a region
for each BB. This helps test region passes in isolation without relying
on other parts of the vectorizer, which is especially useful for
stress-testing.

[llvm-profdata] Fix typo in llvm-profdata (#114675)

Signed-off-by: Peter Jung

[llvm][NFC]Fix a few typos (#110844)

[gn build] Port 2fb1f03

[X86] Fix typo in X86ISD::SHUFP concatenation

[Clang] Fix typo 'dereferencable' to 'dereferenceable' (#116761)

This patch corrects the typo 'dereferencable' to 'dereferenceable' in
CGCall.cpp.
The typo is located within a comment inside the void CodeGenModule::ConstructAttributeList function.

[NFC][YAML][IR] Output CfiFunction sorted (#130379)

As-is it's NFC, as internally CfiFunction* are std::set<>.

We are changing internals of CfiFunctionDefs and
CfiFunctionDecls so they will be ordered by GUID.

Sorting by name is unnecessary but good for
readability and tests.

[libc++][NFC] Fixed bad link in 21.rst (#130428)

Re-land "[mlir][ODS] Add a generated builder that takes the Properties struct" (#130117) (#130373)

This reverts commit 32f5437.

Investigations showed that the unit test utilities were calling erase(),
causing a use-after-free. Fixed by rearranging checks in the test

Revert "[gold] Fix compilation (#130334)"

This reverts commit b0baa1d.

Reverting follow-up commit to ce9e1d3 since the original commit test is flaky.

Revert "[clangd] fix warning by adding missing parens"

This reverts commit df79000.

Reverting follow-up commit to ce9e1d3 since the original commit test is flaky.

Revert "Modify the localCache API to require an explicit commit on CachedFile… (#115331)"

This reverts commit ce9e1d3.

The unittest added in this commit seems to be flaky causing random failure on buildbots:

[NFC][YAML] Switch std::sort to llvm::sort (#130448)

Follow up to #130379.

[gn build] Port 1d763f3

[clangd] Add BuiltinHeaders config option (#129459)

This option, under CompileFlags, governs whether clangd uses its own
built-in headers (Clangd option value) or the built-in headers of the driver
in the file's compile command (QueryDriver option value, applicable to
cases where --query-driver is used to instruct clangd to ask the driver
for its system include paths).

The default value is Clangd, preserving clangd's current default behaviour.

Fixes clangd/clangd#2074

[AArch64] Use Register in AArch64FastISel.cpp. NFC

[NFC][IR] De-duplicate CFI related code (#130450)

[msan] Handle Arm NEON pairwise min/max instructions (#129824)

Change the handling of:

llvm.aarch64.neon.fmaxp
llvm.aarch64.neon.fminp
llvm.aarch64.neon.fmaxnmp
llvm.aarch64.neon.fminnmp
llvm.aarch64.neon.smaxp
llvm.aarch64.neon.sminp
llvm.aarch64.neon.umaxp
llvm.aarch64.neon.uminp
from the incorrect heuristic handler (maybeHandleSimpleNomemIntrinsic)
to handlePairwiseShadowOrIntrinsic.

Updates the tests from #129760

Adds a note that maybeHandleSimpleNomemIntrinsic may incorrectly match
horizontal/pairwise intrinsics.
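
As a rough illustration of why pairwise intrinsics need their own handler, the sketch below models shadow propagation for a 4-lane pairwise operation. It assumes the usual NEON pairwise semantics (the two inputs are concatenated and each output lane is computed from two adjacent input lanes) and a shadow convention in which set bits mean "uninitialized". Shadow4 and pairwiseShadowOr are hypothetical names for this example; this is not the MemorySanitizer implementation, which works on LLVM IR.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical 4-lane shadow vector: non-zero bits mark poisoned bits.
using Shadow4 = std::array<std::uint8_t, 4>;

// Model of pairwise shadow propagation: output lane i depends on lanes 2*i
// and 2*i+1 of the concatenation (a, b), so its shadow is the OR of those two.
constexpr Shadow4 pairwiseShadowOr(const Shadow4 &a, const Shadow4 &b) {
  std::array<std::uint8_t, 8> cat{a[0], a[1], a[2], a[3],
                                  b[0], b[1], b[2], b[3]};
  Shadow4 out{};
  for (std::size_t i = 0; i < out.size(); ++i)
    out[i] = static_cast<std::uint8_t>(cat[2 * i] | cat[2 * i + 1]);
  return out;
}
```

A lane-wise rule would combine a[i] with b[i] instead, which would attribute the poison to the wrong output lanes for this kind of operation.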

[NFC][YAML] Replace iterators with simple getter (#130449)

To simplify #130382.

[libc++][NFC] Comment cleanup for <type_traits> (#130422)

Aligns all version comments to line 65.
Consistently uses // since C++ instead of // C++, in lowercase as
used in most version comments.
Consistently uses class to introduce type template parameters.
Consistently uses inline constexpr for variable templates.
Corrects the comment for bool_constant to // since C++17 as it's a
C++17 feature.
Changes the class-key of result_of to struct, which follows N4659
[depr.meta.types] and the actual usage in libc++.
Adds missed // since C++17 for is_(nothrow_)invocable(_r).
Moves the comments for is_nothrow_convertible_v to the part for
variable templates.
Removes duplicated comments for true_type and false_type.

[msan] Apply handleVectorReduceIntrinsic to max/min vector instructions (#129819)

Changes the handling of:

llvm.aarch64.neon.smaxv
llvm.aarch64.neon.sminv
llvm.aarch64.neon.umaxv
llvm.aarch64.neon.uminv
llvm.vector.reduce.smax
llvm.vector.reduce.smin
llvm.vector.reduce.umax
llvm.vector.reduce.umin
llvm.vector.reduce.fmax
llvm.vector.reduce.fmin
from the default strict handling (visitInstruction) to
handleVectorReduceIntrinsic.

Also adds a parameter to handleVectorReduceIntrinsic to specify whether
the return type must match the elements of the vector.

Updates the tests from #129741,
#129810,
#129768

[libc] Added type-generic macros for fixed-point functions (#129371)

Adds the macros absfx, countlsfx and roundfx.
ref: #129111

[clang][bytecode][NFC] Bail out on non constant evaluated builtins (#130431)

If the ASTContext says so, don't bother trying to constant evaluate the
given builtin.

[ARM] Use Register in FastISel. NFC

[ARM] Remove unused argument. NFC

[AArch64] Remove unused DenseMap variable. NFC

[ARM] Change FastISel Address from a struct to a class. NFC

This allows us to use Register in the interface, but store an
unsigned internally in a union.

[clang][analyzer][NFC] Fix typos in comments (#130456)

[ExecutionEngine] Avoid repeated map lookups (NFC) (#130461)

[IPO] Avoid repeated hash lookups (NFC) (#130462)

[Scalar] Avoid repeated hash lookups (NFC) (#130463)

[Utils] Avoid repeated hash lookups (NFC) (#130464)

[llvm-jitlink] Avoid repeated hash lookups (NFC) (#130465)

[llvm-profgen] Avoid repeated hash lookups (NFC) (#130466)

[lld][LoongArch] Relax call36/tail36: R_LARCH_CALL36 (#123576)

Instructions with relocation R_LARCH_CALL36 may be relaxed as follows:

From:
   pcaddu18i $dest, %call36(foo)
     R_LARCH_CALL36, R_LARCH_RELAX
   jirl $r, $dest, 0
To:
   b/bl foo  # bl if r=$ra, b if r=$zero
     R_LARCH_B26
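
The commit message does not spell out when the relaxation is legal. The sketch below shows one plausible precondition check. The range is an assumption taken from the LoongArch ISA rather than from this PR: b/bl carry a 26-bit signed immediate shifted left by 2. The names fitsInB26, LinkReg and canRelaxCall36 are hypothetical and do not come from lld.

```cpp
#include <cstdint>

// Assumption (from the LoongArch ISA, not from this PR): b/bl encode a 26-bit
// signed immediate that is shifted left by 2, giving a signed 28-bit,
// 4-byte-aligned byte displacement.
constexpr bool fitsInB26(std::int64_t displacement) {
  constexpr std::int64_t limit = std::int64_t{1} << 27; // 128 MiB
  return (displacement & 3) == 0 && displacement >= -limit &&
         displacement < limit;
}

// Hypothetical model of the jirl link register: the pair collapses into
// b when r is $zero and into bl when r is $ra; any other register blocks it.
enum class LinkReg { Zero, Ra, Other };

constexpr bool canRelaxCall36(std::int64_t displacement, LinkReg r) {
  return r != LinkReg::Other && fitsInB26(displacement);
}
```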

[InstCombine] Add handling for (or (zext x), (shl (zext (ashr x, bw/2-1))), bw/2) -> (sext x) fold (#130316)

Minor tweak to #129363, which handled all the cases where there was a sext of the original source value, but not the cases where the source is already half the size of the destination type.

Another regression noticed in #76524
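
To make the identity behind this fold concrete, here is a small sketch that checks it exhaustively for an i8 source and an i16 result (bw = 16, so the shift amounts are 7 and 8). The function names are made up for the example; it only demonstrates the arithmetic, not the InstCombine matcher.

```cpp
#include <cstdint>

// Left-hand side of the fold for an i8 source and i16 result:
//   or (zext x), (shl (zext (ashr x, 7)), 8)
constexpr std::uint16_t orOfZextAndShiftedSign(std::int8_t x) {
  // ashr x, bw/2-1: every bit becomes a copy of the sign bit. Right shift of a
  // negative value is arithmetic on mainstream compilers (guaranteed in C++20).
  const std::int8_t signSplat = static_cast<std::int8_t>(x >> 7);
  const std::uint16_t zextX = static_cast<std::uint8_t>(x);
  const std::uint16_t zextSign = static_cast<std::uint8_t>(signSplat);
  return static_cast<std::uint16_t>(zextX | (zextSign << 8));
}

// Right-hand side of the fold: sext x to i16.
constexpr std::uint16_t sext(std::int8_t x) {
  return static_cast<std::uint16_t>(static_cast<std::int16_t>(x));
}

// Compare both sides for all 256 possible i8 inputs.
constexpr bool foldHoldsForAllI8() {
  for (int v = -128; v <= 127; ++v) {
    const auto x = static_cast<std::int8_t>(v);
    if (orOfZextAndShiftedSign(x) != sext(x))
      return false;
  }
  return true;
}
```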

[clang][bytecode] Fix getting pointer element type in __builtin_memcmp (#130485)

When such a pointer is heap allocated, the type we get is a pointer
type. Take the pointee type in that case.

[X86] combineConcatVectorOps - use all_of to check for matching PSHUFD/PSHUFLW/PSHUFHW shuffle mask.

Prep work before adding 512-bit support.

[clang-tidy] Fix invalid fixit from modernize-use-ranges for nullptr used with std::unique_ptr (#127162)

This PR fixes issue #124815 by correcting the handling of nullptr with
std::unique_ptr in the modernize-use-ranges check.

Updated the logic to suppress warnings for nullptr in std::find.

[X86] Combine bitcast(v1Ty insert_vector_elt(X, Y, 0)) to Y (#130475)

Though it only happens in v1i1 when we generate llvm.masked.load/store
intrinsics for APX cload/cstore.

https://godbolt.org/z/vjsrofsqx

[ValueTracking] Bail out on x86_fp80 when computing fpclass with knownbits (#130477)

In #97762, we assumed that if the
minimum possible value of X is NaN, then X is NaN. But this doesn't hold
for the x86_fp80 format. If the knownbits of X are
?'011111111111110'????????????????????????????????????????????????????????????????,
the minimum possible value of X is NaN/unnormal. However, it can be a
normal value.

Closes #130408.

Revert "[lld][LoongArch] Relax call36/tail36: R_LARCH_CALL36 (#123576)"

This reverts commit 6fbe491.
Broke check-lld, see the many bot comments on
#123576

[VPlan] Refactor VPlan creation, add transform introducing region (NFC). (#128419)

Create an empty VPlan first, then let the HCFG builder create a plain
CFG for the top-level loop (w/o a top-level region). The top-level
region is introduced by a separate VPlan-transform. This is instead of
creating the vector loop region before building the VPlan CFG for the
input loop.

This simplifies the HCFG builder (which should probably be renamed) and
moves along the roadmap ('buildLoop') outlined in [1].

As follow-up, I plan to also preserve the exit branches in the initial
VPlan out of the CFG builder, including connections to the exit blocks.

The conversion from plain CFG with potentially multiple exits to a
single entry/exit region will be done as VPlan transform in a follow-up.

This is needed to enable VPlan-based predication. Currently early exit
support relies on building the block-in masks on the original CFG,
because exiting branches and conditions aren't preserved in the VPlan.
So in order to switch to VPlan-based predication, we will have to
preserve them in the initial plain CFG, so the exit conditions are
available explicitly when we convert to single entry/exit regions.

Another follow-up is updating the outer loop handling to also introduce
VPRegionBlocks for nested loops as transform. Currently the existing
logic in the builder will take care of creating VPRegionBlocks for
nested loops, but not the top-level loop.

[1]
https://llvm.org/devmtg/2023-10/slides/techtalks/Hahn-VPlan-StatusUpdateAndRoadmap.pdf

PR: #128419

[gn build] Port fd26708

[NFC][Cloning] Make ClonedModule case more obvious in CollectDebugInfoForCloning (#129143)

Summary:
The code's behavior is unchanged, but it's more obvious right now.

Test Plan:
ninja check-llvm-unit check-llvm

[libc++] Protect more code against -Wdeprecated. (#130419)

This seems needed when updating the CI Docker image.

[libc++][CI] Update action runner base image. (#130433)

Updates to the latest release. The side effect of this change is
updating all compilers to the latest upstream version.

[HLSL] Disallow virtual inheritance and functions (#127346)

This PR disallows virtual inheritance and virtual functions in HLSL.

[NFC][Cloning] Simplify the flow in FindDebugInfoToIdentityMap (#129144)

Summary:
The new flow should make it clearer what is happening in the cases of
Different or Cloned modules.

Test Plan:
ninja check-llvm-unit check-llvm

[Sanitizers][Darwin] Correct iterating of MachO load commands (#130161)

Until now, the condition to stop iterating was a load command whose cmd
field == 0. The iteration would continue past the commands area and, if
lucky, would eventually find lc->cmd == 0. If unlucky, it would crash with
a bus error.

Correct this by limiting the number of iterations to the count
specified in the mach_header(_64) ncmds field.
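
A minimal sketch of the bounded walk, using a simplified stand-in for the load command area rather than the real Mach-O headers. The word-based layout, the names countCommands and kImage, and the two extra sanity checks are assumptions made for this example and are not taken from the sanitizer sources.

```cpp
#include <cstddef>
#include <cstdint>

// Simplified stand-in for a Mach-O load command area, viewed as 32-bit words.
// Each command starts with two words, {cmd, cmdsize}, where cmdsize is the
// command's total size in bytes. This layout is an assumption for the sketch.
constexpr std::size_t countCommands(const std::uint32_t *words,
                                    std::size_t numWords, std::uint32_t ncmds,
                                    std::uint32_t wantedCmd) {
  std::size_t matches = 0;
  std::size_t pos = 0;
  // Bound the walk by ncmds from the header instead of scanning for cmd == 0.
  for (std::uint32_t i = 0; i < ncmds; ++i) {
    if (pos + 2 > numWords)
      break; // Extra guard (not from the PR): command header would not fit.
    const std::uint32_t cmd = words[pos];
    const std::uint32_t sizeInWords = words[pos + 1] / 4;
    if (sizeInWords < 2 || sizeInWords > numWords - pos)
      break; // Extra guard (not from the PR): malformed cmdsize.
    if (cmd == wantedCmd)
      ++matches;
    pos += sizeInWords;
  }
  return matches;
}

// Two commands followed by non-zero data that is not a load command.
inline constexpr std::uint32_t kImage[] = {
    0x19, 16, 0, 0,          // command id 0x19, 16 bytes
    0x02, 8,                 // command id 0x02, 8 bytes
    0xDEADBEEF, 0xDEADBEEF}; // unrelated data past the commands area
```

With ncmds = 2 the walk stops after the second command, so the trailing words are never interpreted as a command, whereas a loop that waits for cmd == 0 would keep reading past them.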

rdar://143903403

Co-authored-by: Mariusz Borsa

[AArch64] Improve vector funnel shift by constant costs. (#130044)

We now have better codegen, and can have better costs to match. The
generated code should now produce a shl+usra and can be seen in
testcases such as:

llvm/test/CodeGen/AArch64/fsh.ll (line 3941 in 7e5821b):

define <16 x i8> @fshl_v16i8_c(<16 x i8> %a, <16 x i8> %b) {

[LV] Add outer loop test with different successor orders in inner latch.

[NFC][Cloning] Add a helper to collect debug info from instructions (#129145)

Summary:
Just moving around. This helper will be used for further refactoring.

Test Plan:
ninja check-llvm-unit check-llvm

Revert "[ARM] Change FastISel Address from a struct to a class. NFC"

This reverts commit d47bc6f.

I forgot to commit clang-format cleanup before I pushed this.

Recommit "[ARM] Change FastISel Address from a struct to a class. NFC"

With clang-format this time.

Original message:
This allows us to use Register in the interface, but store an
unsigned internally in a union.

[lldb] Add missing conversion to optional

[X86] Use Register in FastISel. NFC

Replace 'Reg == 0' with '!Reg'

[HLSL] select scalar overloads for vector conditions (#129396)

This PR adds scalar/vector overloads for vector conditions to the
select builtin, and updates the sema checking and codegen to allow
scalars to extend to vectors.

Fixes #126570

[ADT] Use adl_begin/end in hasSingleElement (#130506)

This is to make sure that ADT helpers consistently use argument
dependent lookup when dealing with input ranges.

This was a part of #87936 but reverted due to buildbot failures. Now
that I have a threadripper system, I'm landing this piece-by-piece.
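
For readers unfamiliar with the idiom, the sketch below shows what ADL-based begin/end lookup buys for a range helper. It uses the standard "using std::begin" pattern instead of LLVM's adl_begin/adl_end, and demo::OneInt and hasSingleElementSketch are hypothetical names, so this should be read as an illustration of the technique rather than the ADT code.

```cpp
#include <array>
#include <iterator>

namespace demo {
// Hypothetical range type whose begin/end are free functions in its own
// namespace, so they can only be found through argument-dependent lookup.
struct OneInt {
  int value;
};
constexpr const int *begin(const OneInt &r) { return &r.value; }
constexpr const int *end(const OneInt &r) { return &r.value + 1; }
} // namespace demo

// Sketch of a single-element check that uses ADL for begin/end.
template <typename Range>
constexpr bool hasSingleElementSketch(const Range &range) {
  using std::begin; // Fallback for standard containers and arrays.
  using std::end;
  auto b = begin(range); // Unqualified call: ADL can find demo::begin.
  auto e = end(range);
  return b != e && std::next(b) == e;
}

inline constexpr demo::OneInt kOne{42};
inline constexpr std::array<int, 2> kTwo{1, 2};
```

A helper that called range.begin() directly would not compile for demo::OneInt, which is the kind of input range the ADL-based version is meant to accept.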

[gn build] Port e85e29c

[alpha.webkit.UnretainedLambdaCapturesChecker] Add a WebKit checker for lambda capturing NS or CF types. (#128651)

Add a new WebKit checker for checking that lambda captures of CF types
use RetainPtr either when ARC is disabled or enabled, and those of NS
types use RetainPtr when ARC is disabled.

[Clang] use constant evaluation context for constexpr if conditions (#123667)

Fixes #123524

This PR addresses the issue of immediate function expressions not
properly evaluated in constexpr if conditions. Adding the
ConstantEvaluated context for expressions in constexpr if statements
ensures that these expressions are treated as manifestly
constant-evaluated and parsed correctly.

[Xtensa] Implement Xtensa MAC16 Option. (#130004)

[RISCV] Fix incorrect mask of shuffle vector in the test. (NFC) (#130244)

The mask of shuffle vector should be <u, u, 4, 6, 8, 10, 12, 14>, not
<u, u, 4, 6, *6, 10, 12, 14> for steps of 2.

And the mask of shuffle vector with an undef initial element has been
supported by #118509.

[clang] Fix typos in options text. (#130129)

[clang] Reject constexpr-unknown values as constant expressions more consistently (#129952)

Perform the check for constexpr-unknown values in the same place we
perform checks for other values which don't count as constant
expressions.

While I'm here, also fix a rejects-valid with a reference that doesn't
have an initializer. This diagnostic was also covering up some of the
bugs here.

The existing behavior with -fexperimental-new-constant-interpreter seems
to be correct, but the diagnostics are slightly different; it would be
helpful if someone could check on that as a followup.

Followup to #128409.

Fixes #129844. Fixes #129845.

[llvm-objdump][ELF]Fix crash when reading strings from .dynstr (#125679)

This change introduces a check for the strtab offset to prevent
llvm-objdump from crashing when processing malformed ELF files.
It provides a minimal reproducer test for
#86612 (comment).
Additionally, it modifies how llvm-objdump handles and outputs malformed
ELF files with invalid string offsets. (More info:
https://discourse.llvm.org/t/should-llvm-objdump-objdump-display-actual-corrupted-values-in-malformed-elf-files/84391)

Fixes: #86612

Co-authored-by: James Henderson

[APFloat] Fix IEEEFloat::addOrSubtractSignificand and IEEEFloat::normalize (#98721)

Fixes #63895
Fixes #104984

Before this PR, addOrSubtractSignificand presumed that the loss came
from the side being subtracted, and didn't handle the case where lhs ==
rhs and there was loss. This can occur during FMA. This PR fixes the
situation by correctly determining where the loss came from and handling
it appropriately.

Additionally, normalize failed to adjust the exponent when the
significand is zero but lost_fraction != lfExactlyZero. This meant
that the test case from #63895 was rounded incorrectly as the loss
wasn't adjusted to account for the exponent being below the minimum
exponent. This PR fixes this by only skipping the exponent adjustment if
the significand is zero and there was no lost fraction.

(Note to reviewer: I don't have commit access)

[Clang][CodeGen] Fix demangler invariant comment assertion (#130522)

This patch makes the assertion (that is currently in a comment) that
validates that names mangled by clang can be demangled by LLVM actually
compile/work. There were some minor issues that needed to be fixed (like
starts_with not being available on std::string and needing to call
getDecl() on GD), and a logic issue that should be fixed in this patch.
This enables just uncommenting the assertion to enable it within the
compiler (minus needing to add the header file).

Reland [lld][LoongArch] Relax call36/tail36: R_LARCH_CALL36

Instructions with relocation R_LARCH_CALL36 may be relaxed as follows:

From:
   pcaddu18i $dest, %call36(foo)
     R_LARCH_CALL36, R_LARCH_RELAX
   jirl $r, $dest, 0
To:
   b/bl foo  # bl if r=$ra, b if r=$zero
     R_LARCH_B26

This patch fixes the buildbot failures in the lld tests.
Changes: Modify test files: from sym@plt to %plt(sym).

InstCombine: Fix a crash in PointerReplacer when constructing a new PHI (#130256)

When constructing a PHI node in PointerReplacer::replace, the incoming
operands are expected to have already been replaced and in the
replacement map. However, when one of the incoming operands is a load,
the search of the map is unsuccessful, and a nullptr is returned from
getReplacement. The reason is that, when a load is replaced, all the
uses of the load have already been replaced by the new load, so it is
useless to insert the original load into the map. Instead, we should
place the new load into the map to meet the expectation of the later map
search.

Fixes: SWDEV-516420

[AMDGPU] Add GFX12 S_ALLOC_VGPR instruction (#130018)

This patch only adds the instruction for disassembly support.

We have neither an intrinsic nor codegen support, and it is unclear
whether we actually want to ever have an intrinsic, given the fragile
semantics.

For now, it will be generated only by the backend in very specific
circumstances.

Co-authored-by: Jannik Silvanus

[RISCV] Remove Predicates from classes in RISCVInstrInfoXTHead.td. NFC

All instantiations of these classes also specify Predicates,
making those on the base class redundant. The Predicates on the
instantiations aren't always the same as those on the base class, so those
are needed.

Also move the DecoderNamespace to the instantiations for consistency
with the Predicates.

[AArch64][CostModel] Alter sdiv/srem cost where the divisor is constant (#123552)

This patch revises the cost model for sdiv/srem and draws its inspiration from the udiv/urem patch #122236

The typical codegen for the different scenarios is described in notes/comments in the code itself (there are too many scenarios to list them all in the patch description).

[AArch64] Avoid repeated hash lookups (NFC) (#130542)

[CodeGen] Avoid repeated hash lookups (NFC) (#130543)

[alpha.webkit.NoUnretainedMemberChecker] Add a new WebKit checker for unretained member variables and ivars. (#128641)

Add a new WebKit checker for member variables and instance variables of
NS and CF types. A member variable or instance variable to a CF type
should be RetainPtr regardless of whether ARC is enabled or disabled,
and that of a NS type should be RetainPtr when ARC is disabled.

[mlir] Apply ClangTidy finding (NFC)

loop variable is copied but only used as const reference; consider making it a const reference

[clang][NFC] Clean up Expr::EvaluateAsConstantExpr (#130498)

The Info.EnableNewConstInterp case is already handled above.

[Clang] Fix segmentation fault caused by VarBypassDetector stack overflow on deeply nested expressions (#124128)

This happens when using -O2.

Similarly to #111701 (test), this PR does not add a reproducing test,
since such a test is slow and likely to be hard to maintain, as
discussed here and in a previous discussion.
The test that was reverted:
d6b5576

[mlir][CAPI][python] bind CallSiteLoc, FileLineColRange, FusedLoc, NameLoc (#129351)

This PR extends the python bindings for CallSiteLoc, FileLineColRange,
FusedLoc, NameLoc with field accessors. It also adds the missing
value.location accessor.

I also did some "spring cleaning" here (cast -> dyn_cast) after
running into some of my own illegal casts.

[libunwind][RISCV] Make asm statement volatile (#130286)

When compiling with -O3, the early-machinelicm pass hoisted the asm
statement to a path that is executed unconditionally during stack
unwinding. On hardware without vector extension support, this resulted
in reading a nonexistent register.

[ADT/Support] Add includes to fix module build

Current Clang complains that 'size_t' / 'reference_wrapper' "must be
declared before it is used."

[Clang][AArch64] Add support for SHF_AARCH64_PURECODE ELF section flag (2/3) (#125688)

Add support for the new SHF_AARCH64_PURECODE ELF section flag:
ARM-software/abi-aa#304

The general implementation follows the existing one for ARM targets.
Similarly to ARM targets, generating object files with the
SHF_AARCH64_PURECODE flag set is enabled by the
-mexecute-only/-mpure-code driver flag.

Related PRs:

Revert "[clang] Implement instantiation context note for checking template parameters (#126088)"

This reverts commit a24523a.

This is causing significant compile-time regressions for C++ code, see:
#126088 (comment)

[X86] checkBitcastSrcVectorSize - early return when reach to MaxRecursionDepth. (#130226)

[readobj][Arm][AArch64] Refactor Build Attributes parsing under ELFAttributeParser and add support for AArch64 Build Attributes (#128727)

Refactor readobj to integrate AArch64 Build Attributes under
ELFAttributeParser. ELFAttributeParser now serves as a base class for:

ELFCompactAttrParser, handling Arm-style attributes with a single
build attribute subsection.
ELFExtendedAttrParser, handling AArch64-style attributes with multiple
build attribute subsections. This improves code organization and better
aligns with the attribute parsing model.

Add support for parsing AArch64 Build Attributes.

[MCA] Adding missing instructions in AArch64 Neoverse V1 tests (#128892)

Added missing instructions for LLVM Opcodes coverage. It will help to
maintain TableGen scheduling information of AArch64 Neoverse V1.

Follow-up to #126703.
This is a dispatch of new instructions of the big test:
V1-scheduling-info.s
I have created a new test for special instructions without scheduling
info in Software Optimization Guide: V1-misc-instructions.s

No more asm instruction comments to maintain.

[gn build] Port b1ebfac

[X86] Add test case showing it's not always beneficial to fold concat(palignr(),palignr()) -> palignr(concat(),concat())

[lldb] Add more ARM checks in TestLldbGdbServer.py (#130277)

When #130034 enabled RISC-V
here I noticed that these should run for ARM as well.

ARM only has 4 argument registers, which matches Arm's ABI for it:
https://github.com/ARM-software/abi-aa/blob/main/aapcs32/aapcs32.rst#core-registers

The ABI defines a link register LR, and I assume that's what becomes
'ra' in LLDB.

Tested on ARM and AArch64 Linux.

[lldb] Clean up UnwindAssemblyInstEmulation (#129030)

My main motivation was trying to understand how the function works and whether
the rows need to be (shared) pointers. I noticed that the function
essentially constructs two copies of the unwind plan in parallel (the second
being the saved_unwind_states).

If we delay the construction of the unwind plan to the end of the
function, then we never need two copies of a single row (we can just
move it into the final result), so we can just use them as value types.
This makes the overall logic of the function stand out better as it
avoids the laborious deep copies of the Row shared pointer.

I've also noticed that a large portion of the function is devoted to
recomputing certain properties of the unwind state (e.g. the
m_fp_is_cfa field). Instead of doing that, this patch just
saves/restores them together with the rest of the state.

[MLIR][py] Add PyThreadPool as wrapper around MlirLlvmThreadPool in MLIR python bindings (#130109)

In some projects, such as JAX, ir.Context is used with multi-threading disabled to avoid
caching multiple thread pools:

https://github.com/jax-ml/jax/blob/623865fe9538100d877ba9d36f788d0f95a11ed2/jax/_src/interpreters/mlir.py#L606-L611

However, when a context has multi-threading enabled, it also uses locks on
the StorageUniquers, and this can help avoid data races in
multi-threaded execution (for example with free-threaded CPython,
jax-ml/jax#26272).
With this PR, a user can enable multi-threading and thereby 1) enable the
additional locking and 2) set a shared thread pool so that cached contexts
can share one global pool.

[X86] Add test case showing it's not always beneficial to fold concat(pack(),pack()) -> pack(concat(),concat())

[mlir] Refactor ConvertVectorToLLVMPass options (#128219)

The VectorTransformsOptions on the ConvertVectorToLLVMPass is
currently represented as a struct, which makes it not serialisable. This
means a pass pipeline that contains this pass cannot be represented as
textual form, which breaks reproducer generation and options such as
--dump-pass-pipeline.

This PR expands the VectorTransformsOptions struct into the two
options that are actually used by the Pass' patterns:
vector-contract-lowering and vector-transpose-lowering . The other
options present in VectorTransformOptions are not used by any patterns
in this pass.

Additionally, I have changed some interfaces to only take these specific
options over the full options struct as, again, the vector contract and
transpose lowering patterns only need one of their respective options.

Finally, I have added a simple lit test that just prints the pass
pipeline using --dump-pass-pipeline to ensure the options on this pass
remain serialisable.

Fixes #129046

[IR] Fix assertion error in User new/delete edge case (#129914)

Fixes #129900

If operator delete was called after an unsuccessful constructor call
following operator new, we ran into undefined behaviour.
This was discovered while preparing an upgrade to LLVM 20 by our
malfunction tests, which explicitly check for this kind of bug.
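
The language rule behind this edge case can be sketched with a small stand-alone class. Tracked below is a hypothetical type, not llvm::User, and the counter exists only to make the behaviour observable: when a constructor throws inside a new-expression, the matching class-specific operator delete runs on memory whose constructor never completed.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <stdexcept>

// Hypothetical class (not llvm::User) with class-specific allocation functions.
struct Tracked {
  static inline int liveAllocations = 0;

  explicit Tracked(bool fail) {
    if (fail)
      throw std::runtime_error("constructor failed");
  }

  static void *operator new(std::size_t size) {
    void *p = ::operator new(size);
    ++liveAllocations;
    return p;
  }

  // Also reached when the constructor throws inside a new-expression, so it
  // must not rely on state that only a completed constructor sets up.
  static void operator delete(void *p) noexcept {
    assert(liveAllocations > 0);
    --liveAllocations;
    ::operator delete(p);
  }
};

// Returns true if the failed construction released its allocation.
bool constructionFailureReleasesMemory() {
  try {
    delete new Tracked(/*fail=*/true);
  } catch (const std::runtime_error &) {
    return Tracked::liveAllocations == 0;
  }
  return false;
}
```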

[MergeFunc] Check full IR and comdat keys in comdat.ll.

Spelling in lit.cfg.py

[X86] combineConcatVectorOps - convert X86ISD::PALIGNR concatenation to use combineConcatVectorOps recursion (#130572)

Only concatenate X86ISD::PALIGNR nodes if at least one operand is beneficial to concatenate

[flang] Move parser invocations into ParserActions (#130309)

FrontendActions.cpp is currently one of the biggest compilation units in
all of flang. Measuring its compilation gives the following metrics:

User time (seconds): 139.21
System time (seconds): 4.65
Maximum resident set size (kbytes): 5891440 (5.61 GB)

This commit separates out explicit invocations of the parser into a
separate compilation unit - ParserActions.cpp - through helper functions
in order to decrease the maximum compilation time and memory usage of a
single unit.
After the split, the measurements of FrontendActions.cpp are as follows:

User time (seconds): 70.08
System time (seconds): 3.16
Maximum resident set size (kbytes): 3961492 (3.7 GB)

The ones for the newly created ParserActions.cpp are as follows:

User time (seconds): 104.33
System time (seconds): 3.37
Maximum resident set size (kbytes): 4185600 (3.99 GB)

Signed-off-by: Kajetan Puchalski

[TailDuplicator] Do not restrict the computed gotos (#114990)

Fixes #106846.

This is what I learned from GCC. I found that GCC does not duplicate the
BB that has indirect jumps with the jump table. I believe GCC has
provided a clear explanation here:

> Duplicate the blocks containing computed gotos. This basically
unfactors computed gotos that were factored early on in the compilation
process to speed up edge based data flow. We used to not unfactor them
again, which can seriously pessimize code with many computed jumps in
the source code, such as interpreters.

Revert "[lldb][asan] Add temporary logging to ReportRetriever"

This reverts commit 39a4da2.

We skipped the failing tests in 6cc8b0bef07f4270303bec0fc203f251a2fde262.

[lldb] Remove an extraneous printf statement. (#130453)

This was missed in review but is showing up in lldb-dap output.

[Clang][AArch64] Fix typo with colon-separated syntax for system registers (#105608)

The range for Op0 was set to 1 instead of 3.

The description of e493f17 visually
explains the encoding of implementation-defined system registers.

llvm/lib/Target/AArch64/AArch64SystemOperands.td (lines 658 to 674 in 796787d):

    class SysReg<string name, bits<2> op0, bits<3> op1, bits<4> crn, bits<4> crm,
                 bits<3> op2> : SearchableTable {
      let SearchableFields = ["Name", "Encoding"];
      let EnumValueField = "Encoding";
      string Name = name;
      string AltName = name;
      bits<16> Encoding;
      let Encoding{15-14} = op0;
      let Encoding{13-11} = op1;
      let Encoding{10-7} = crn;
      let Encoding{6-3} = crm;
      let Encoding{2-0} = op2;
      bit Readable = ?;
      bit Writeable = ?;
      code Requires = [{ {} }];
    }
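
The bit layout in the TableGen class above can be mirrored in a few lines of C++, which also shows why op0 must be allowed to reach 3: it occupies two bits. The function name encodeSysReg and the masking of out-of-range inputs are choices made for this sketch, not part of the patch.

```cpp
#include <cstdint>

// Packs the five fields the same way as the Encoding bits in the TableGen
// class: op0 in [15:14], op1 in [13:11], crn in [10:7], crm in [6:3],
// op2 in [2:0]. Inputs are masked to their field width (a sketch choice).
constexpr std::uint16_t encodeSysReg(unsigned op0, unsigned op1, unsigned crn,
                                     unsigned crm, unsigned op2) {
  return static_cast<std::uint16_t>(((op0 & 0x3) << 14) | ((op1 & 0x7) << 11) |
                                    ((crn & 0xF) << 7) | ((crm & 0xF) << 3) |
                                    (op2 & 0x7));
}
```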

Godbolt: https://godbolt.org/z/WK9PqPvGE

Co-authored-by: v01dxyz

[AMDGPU][NewPM] Port AMDGPUReserveWWMRegs to NPM (#123722)

[X86] Add test case showing it's not always beneficial to fold concat(pshufb(),pshufb()) -> pshufb(concat(),concat())

[X86] Improve test coverage for concat(pmaddubsw(),pmaddubsw()) -> pmaddubsw(concat(),concat())

Ensure we have tests for both beneficial/non-beneficial concatenation cases

[AArch64][ELF Parser] Fix out-of-scope variable usage (#130576)

Return a reference to a persistent variable instead of a temporary copy.
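
The general shape of this kind of fix can be sketched as follows. The class and member names are invented for illustration and do not come from the LLVM source; the point is only that the returned reference must refer to storage that outlives the call.

```cpp
#include <string>

// Hypothetical type; the names are illustrative, not taken from LLVM.
class AttributeParserSketch {
  std::string Vendor = "aeabi"; // persistent: lives as long as the object

public:
  // Buggy shape (shown only as a comment): returning a reference to a local
  //   const std::string &vendor() const { std::string Tmp = Vendor; return Tmp; }
  // leaves the caller with a dangling reference once the function returns.

  // Fixed shape: the reference refers to the member, which outlives the call.
  const std::string &vendor() const { return Vendor; }
};
```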

[LLVM][SVE] Add isel for scalable vector bfloat copysign operations. (#130098)

[clang] NNS: don't print trailing scope resolution operator in diagnostics (#130529)

This clears up the printing of a NestedNameSpecifier so a trailing '::'
is not printed, unless it refers into the global scope.

This fixes a bunch of diagnostics where the trailing :: was awkward.
This also prints the NNS quoted consistently.

There is a drive-by improvement to error recovery, where now we print
the actual type instead of <dependent type>.

This will clear up further uses of NNS printing in further patches.

AMDGPU: Move enqueued block handling into clang (#128519)

The previous implementation wasn't maintaining a faithful IR
representation of how this really works. The value returned by
createEnqueuedBlockKernel wasn't actually used as a function, and
hacked up later to be a pointer to the runtime handle global
variable. In reality, the enqueued block is a struct where the first
field is a pointer to the kernel descriptor, not the kernel itself. We
were also relying on passing around a reference to a global using a
string attribute containing its name. It's better to base this on a
proper IR symbol reference during final emission.

This now avoids using a function attribute on kernels and avoids using
the additional "runtime-handle" attribute to populate the final
metadata. Instead, associate the runtime handle reference to the
kernel with the !associated global metadata. We can then get a final,
correctly mangled name at the end.

I couldn't figure out how to get rename-with-external-symbol behavior
using a combination of comdats and aliases, so leaves an IR pass to
externalize the runtime handles for codegen. If anything breaks, it's
most likely this, so leave avoiding this for a later step. Use a
special section name to enable this behavior. This also means it's
possible to declare enqueuable kernels in source without going through
the dedicated block syntax or other dedicated compiler support.

We could move towards initializing the runtime handle in the
compiler/linker. I have a working patch where the linker sets up the
first field of the handle, avoiding the need to export the block
kernel symbol for the runtime. We would need new relocations to get
the private and group sizes, but that would avoid the runtime's
special case handling that requires the device_enqueue_symbol metadata
field.

https://reviews.llvm.org/D141700

[RISCV][test] Add test case showing case where machine copy propagation leaves behind a no-op reg move

Pre-commit for #129889.

[AArch64][ELF Parser] Fix out-of-scope variable usage (#130594)

Return a reference to a persistent variable instead of a temporary copy.

[DAG] fold AVGFLOORS to AVGFLOORU for non-negative operand (#84746) (#129678)

Fold ISD::AVGFLOORS to ISD::AVGFLOORU for non-negative operand. Cover test is modified for uhadd with zero extension.
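
A small sketch of why the fold is sound, assuming both operands are non-negative: on 8-bit values, the signed and unsigned floor averages then produce the same bit pattern. The function names below are hypothetical models of the two node semantics, not LLVM code.

```cpp
#include <cstdint>

// Signed floor average, computed in a wider type so the sum cannot overflow.
inline std::int8_t avgFloorS(std::int8_t a, std::int8_t b) {
  int sum = int(a) + int(b);
  int q = sum / 2;                  // truncates toward zero
  if (sum % 2 != 0 && sum < 0) --q; // adjust to floor for negative odd sums
  return static_cast<std::int8_t>(q);
}

// Unsigned floor average, also computed in a wider type.
inline std::uint8_t avgFloorU(std::uint8_t a, std::uint8_t b) {
  return static_cast<std::uint8_t>((unsigned(a) + unsigned(b)) / 2);
}

// Returns true if both forms agree for every pair of non-negative inputs.
inline bool foldAgreesForNonNegative() {
  for (int a = 0; a <= 127; ++a)
    for (int b = 0; b <= 127; ++b) {
      std::int8_t s = avgFloorS(static_cast<std::int8_t>(a),
                                static_cast<std::int8_t>(b));
      std::uint8_t u = avgFloorU(static_cast<std::uint8_t>(a),
                                 static_cast<std::uint8_t>(b));
      if (static_cast<std::uint8_t>(s) != u)
        return false;
    }
  return true;
}
```

For a negative operand the two forms diverge, which is why the fold is restricted to values known to be non-negative.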

Fixes #84746

Revert "[clang] Fix missing diagnostic of declaration use when accessing TypeDecls through typename access (#129681)"

This caused incorrect -Wunguarded-availability warnings. See comment on
the pull request.

> We were missing a call to DiagnoseUseOfDecl when performing typename
> access.
>
> This refactors the code so that TypeDecl lookups funnel through a helper
> which performs all the necessary checks, removing some related
> duplication on the way.
>
> Fixes #58547
>
> Differential Revision: https://reviews.llvm.org/D136533

This reverts commit 4c4fd6b.

[X86] combineConcatVectorOps - convert X86ISD::HADD/SUB concatenation to use combineConcatVectorOps recursion (#130579)

Only concatenate X86ISD::HADD/SUB nodes if at least one operand is beneficial to concatenate

[X86][APX] Try to replace non-NF with NF instructions when optimizeCompareInstr (#130488)

https://godbolt.org/z/rWYdqnjjx

[clang] fix matching of nested template template parameters (#130447)

When checking the template template parameters of template template
parameters, the PartialOrdering context was not correctly propagated.

This also has a few drive-by fixes, such as checking the template
parameter lists of template template parameters, which was previously
missing and would have been its own bug, but we need to fix it in order
to prevent crashes in error recovery in a simple way.

Fixes #130362

[Clang] Force expressions with UO_Not to not be non-negative (#126846)

This PR fixes a bug where no warning was emitted for the following
code:

int test13(unsigned a, int *b) {
        return a > ~(95 != *b); // expected-warning {{comparison of integers of different signs}}
}

However, in the original issue, a comment mentioned that negation,
pre-increment, and pre-decrement operators are also incorrect in this
case.

Fixes #18878
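
To see why the right-hand side in the test above cannot be treated as non-negative, here is a minimal sketch. The function names are hypothetical; the arithmetic follows from the usual integer promotion and conversion rules.

```cpp
// The comparison yields 0 or 1; bitwise NOT of that int is -1 or -2,
// so the expression is always negative.
inline int notOfComparison(int b) { return ~static_cast<int>(95 != b); }

// Mirrors the comparison in test13: the int operand is converted to
// unsigned before comparing, so -1 or -2 becomes a very large value.
inline bool compareLikeTest13(unsigned a, int b) {
  return a > static_cast<unsigned>(notOfComparison(b));
}
```

Because the converted value is close to the maximum of unsigned, the comparison is false for almost every a, which is the kind of surprise the sign-compare warning is meant to flag.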

[flang][OpenMP] Implement HAS_DEVICE_ADDR clause (#128568)

The HAS_DEVICE_ADDR clause indicates that the listed object(s) exist at an
address that is a valid device address. Specifically,
has_device_addr(x) means that (in C/C++ terms) &x is a device
address.

When entering a target region, x does not need to be allocated on the
device, or have its contents copied over (in the absence of additional
mapping clauses). Passing its address verbatim to the region for use is
sufficient, and is the intended goal of the clause.

Some Fortran objects use descriptors in their in-memory representation.
If x had a descriptor, both the descriptor and the contents of x
would be located in the device memory. However, the descriptors are
managed by the compiler, and can be regenerated at various points as
needed. The address of the effective descriptor may change, hence it's
not safe to pass the address of the descriptor to the target region.
Instead, the descriptor itself is always copied, but for objects like
x, no further mapping takes place (as this keeps the storage pointer
in the descriptor unchanged).

Co-authored-by: Sergio Afonso <[email protected]>

[flang][OpenMP] Accept old FLUSH syntax in METADIRECTIVE (#130122)

Accommodate it in OmpDirectiveSpecification, which may become the
primary component of the actual FLUSH construct in the future.

[MachineCopyPropagation] Recognise and delete no-op moves produced after forwarded uses (#129889)

This change removes 189 static instances of no-op reg-reg moves (i.e.
where src == dest) across llvm-test-suite when compiled for RISC-V
rv64gc and with SPEC included.

[gn build] Port 0d2c55c

[X86] combineConcatVectorOps - convert PSHUFB/PSADBW/VPMADDUBSW/VPMADDUBSW concatenation to use combineConcatVectorOps recursion (#130592)

Only concatenate nodes if at least one operand is beneficial to concatenate

[clang][test] Don't require specific alignment in test case (#130589)

#129952 / 42d49a7 added this test, which is
failing on 32-bit ARM because the alignment chosen is 4, not 8. That
would make sense if this is a 32/64-bit difference.

https://lab.llvm.org/buildbot/#/builders/154/builds/13059

<stdin>:34:30: note: scanning from here
define dso_local void @_Z1fv(ptr dead_on_unwind noalias writable sret(%struct.B) align 4 %agg.result) #0 {
                             ^
<stdin>:38:2: note: possible intended match here
 %0 = load ptr, ptr @x, align 4
 ^

The other test does not check alignment, so I'm assuming that it is not
important here.

[SLP]Reduce number of alternate instruction, where possible

The previous version was reviewed in #123360.
It is mostly the same, adjusted after the graph-to-tree transformation.

Patch tries to remove wide alternate operations.
Currently SLP vectorizer emits something like this:

%0 = add i32
%1 = sub i32
%2 = add i32
%3 = sub i32
%4 = add i32
%5 = sub i32
%6 = add i32
%7 = sub i32

is transformed to

%v1 = add <8 x i32>
%v2 = sub <8 x i32>
%res = shuffle %v1, %v2, <0, 9, 2, 11, 4, 13, 6, 15>

i.e. half of the results are just unused. This leads to increased
register pressure and potentially doubles the number of operations.

Patch introduces SplitVectorize mode, where it splits the operations by
opcodes and produces instead something like this:

%v1 = add <4 x i32>
%v2 = sub <4 x i32>
%res = shuffle %v1, %v2, <0, 4, 1, 5, 2, 6, 3, 7>

This improves performance by reducing the number of ops. It also
enables some other improvements, like improved graph reordering.
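
The equivalence of the two forms can be checked with a small model. This is only a sketch: the shuffle helper imitates shufflevector indexing on plain arrays, and the way the 4-wide operands are gathered (even lanes for add, odd lanes for sub) is an assumption made for illustration, not something stated in the patch.

```cpp
#include <array>
#include <cstddef>

using Vec4 = std::array<int, 4>;
using Vec8 = std::array<int, 8>;

// Models shufflevector on two N-wide inputs: a mask index below N selects
// from lhs, otherwise from rhs at (index - N).
template <std::size_t N, std::size_t M>
std::array<int, M> shuffle(const std::array<int, N> &lhs,
                           const std::array<int, N> &rhs,
                           const std::array<std::size_t, M> &mask) {
  std::array<int, M> out{};
  for (std::size_t i = 0; i < M; ++i)
    out[i] = mask[i] < N ? lhs[mask[i]] : rhs[mask[i] - N];
  return out;
}

// Wide form: 8-wide add and 8-wide sub, then keep alternating lanes.
inline Vec8 wideAlternate(const Vec8 &a, const Vec8 &b) {
  Vec8 add{}, sub{};
  for (std::size_t i = 0; i < 8; ++i) {
    add[i] = a[i] + b[i];
    sub[i] = a[i] - b[i];
  }
  const std::array<std::size_t, 8> mask = {0, 9, 2, 11, 4, 13, 6, 15};
  return shuffle(add, sub, mask);
}

// Split form: 4-wide add on the even lanes, 4-wide sub on the odd lanes,
// then interleave the two results.
inline Vec8 splitAlternate(const Vec8 &a, const Vec8 &b) {
  Vec4 add{}, sub{};
  for (std::size_t i = 0; i < 4; ++i) {
    add[i] = a[2 * i] + b[2 * i];
    sub[i] = a[2 * i + 1] - b[2 * i + 1];
  }
  const std::array<std::size_t, 8> mask = {0, 4, 1, 5, 2, 6, 3, 7};
  return shuffle(add, sub, mask);
}
```

Both functions produce the same eight results, but the split form computes 8 scalar-equivalent operations instead of 16.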

-O3+LTO, AVX512
Metric: size..text

Program results results0 diff
test-suite :: MultiSource/Benchmarks/Olden/tsp/tsp.test 2788.00 2820.00 1.1%
test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 278168.00 280904.00 1.0%
test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test 82682.00 83258.00 0.7%
test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test 139344.00 139712.00 0.3%
test-suite :: MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow.test 27149.00 27197.00 0.2%
test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test 1008188.00 1009948.00 0.2%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 39226.00 39290.00 0.2%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 39229.00 39293.00 0.2%
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 2074533.00 2076549.00 0.1%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 2074533.00 2076549.00 0.1%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 798440.00 798952.00 0.1%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test 44123.00 44139.00 0.0%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 318942.00 319038.00 0.0%
test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 1159880.00 1160152.00 0.0%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniAMR/miniAMR.test 73595.00 73611.00 0.0%
test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test 1146124.00 1146348.00 0.0%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/CLAMR/CLAMR.test 203831.00 203847.00 0.0%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 207662.00 207678.00 0.0%
test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test 589851.00 589883.00 0.0%
test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 1398543.00 1398559.00 0.0%
test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 1398543.00 1398559.00 0.0%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 2050990.00 2051006.00 0.0%

  test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12559687.00 12559591.00 -0.0%
                 test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test  3074157.00  3074125.00 -0.0%
     test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test  1092252.00  1092188.00 -0.0%
        test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test   779763.00   779715.00 -0.0%
     test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test   253517.00   253485.00 -0.0%
              test-suite :: MultiSource/Applications/JM/lencod/lencod.test   848259.00   848035.00 -0.0%
 test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test    93064.00    93016.00 -0.1%
              test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test   383747.00   383475.00 -0.1%
      test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   673051.00   662907.00 -1.5%
       test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   673051.00   662907.00 -1.5%

Olden/tsp - small variations
Prolangs-C/TimberWolfMC - small variations, some code not inlined
FreeBench/pifft - extra store <8 x double> vectorized, some other extra
vectorizations
CFP2006/433.milc - better vector code
FreeBench/fourinarow - better vector code
Benchmarks/tramp3d-v4 - extra vector code, small variations
mediabench/gsm/toast - small variations
MiBench/telecomm-gsm - small variations
CINT2017rate/500.perlbench_r
CINT2017speed/600.perlbench_s - better vector code, small variations
CINT2006/464.h264ref - some smaller code + changes similar to x264
DOE-ProxyApps-C/miniGMG - small variations
Benchmarks/Bullet - small variations
CFP2017rate/511.povray_r - small variations
DOE-ProxyApps-C/miniAMR - small variations
CFP2006/453.povray - small variations
DOE-ProxyApps-C++/CLAMR - small variations
MiBench/consumer-lame - small variations
CFP2006/447.dealII - small variations
CFP2017rate/538.imagick_r
CFP2017speed/638.imagick_s - small variations
CFP2017rate/510.parest_r - better vector code, small variations
CFP2017rate/526.blender_r - small variations
CINT2006/403.gcc - small variations
CINT2006/400.perlbench - small variations
CFP2017rate/508.namd_r - small variations
ASCI_Purple/SMG2000 - small variations
JM/lencod - extra store <16 x i32>, small variations
DOE-ProxyApps-C++/miniFE - small variations
JM/ldecod - extra vector code, small variations, fewer shuffles
CINT2017speed/625.x264_s
CINT2017rate/525.x264_r - the number of instructions increased, but
they appear to be more performant. E.g., for function
x264_pixel_satd_8x8, llvm-mca reports better throughput: 84 for the
current version and 59 for the new version.

-O3+LTO, mcpu=sifive-p470

Metric: size..text

                                                                           results    results0   diff
                             test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test  580768.00  581118.00   0.1%
                                    test-suite :: MultiSource/Applications/d/make_dparser.test   78854.00   78894.00   0.1%
                                  test-suite :: MultiSource/Applications/JM/lencod/lencod.test  633448.00  633750.00   0.0%
                                       test-suite :: MultiSource/Benchmarks/Bullet/bullet.test  277002.00  277080.00   0.0%
                         test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test  931938.00  931960.00   0.0%
                                     test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test 2512806.00 2512822.00   0.0%
                            test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 7659880.00 7659876.00  -0.0%
                             test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 7659880.00 7659876.00  -0.0%
                        test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 1602448.00 1602434.00  -0.0%
                      test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 9496664.00 9496542.00  -0.0%
                 test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test  147424.00  147422.00  -0.0%
                test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 1764608.00 1764578.00  -0.0%
                 test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 1764608.00 1764578.00  -0.0%
                                 test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test  841656.00  841632.00  -0.0%
                                test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test  949026.00  948962.00  -0.0%
                        test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test  946348.00  946284.00  -0.0%
                                  test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test  279794.00  279764.00  -0.0%
                   test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test    4776.00    4772.00  -0.1%
                          test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test   25074.00   25028.00  -0.2%
                   test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test   25074.00   25028.00  -0.2%
                     test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test   29336.00   29184.00  -0.5%
                           test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test  535390.00  510124.00  -4.7%
                          test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test  535390.00  510124.00  -4.7%

test-suite :: SingleSource/Regression/C/gcc-c-torture/execute/ieee/GCC-C-execute-ieee-pr50310.test 886.00 608.00 -31.4%

CINT2006/464.h264ref - extra v16i32 reduction
d/make_dparser - better vector code
JM/lencod - extra v16i32 reduction
Benchmarks/Bullet - smaller vector code
CINT2006/400.perlbench - better vector code
CINT2006/403.gcc - small variations
CINT2017speed/602.gcc_s
CINT2017rate/502.gcc_r - small variations
CFP2017rate/510.parest_r - small variations
CFP2017rate/526.blender_r - small variations
MiBench/consumer-lame - small variations
CINT2017speed/600.perlbench_s
CINT2017rate/500.perlbench_r - small variations
Benchmarks/7zip - small variations
CFP2017rate/511.povray_r - small variations
JM/ldecod - extra vector code
mediabench/g721/g721encode - extra vector code
mediabench/gsm - extra vector code
MiBench/telecomm-gsm - extra vector code
DOE-ProxyApps-C/miniGMG - extra vector code
CINT2017rate/525.x264_r
CINT2017speed/625.x264_s - reduced number of wide operations and
shuffles, saving the registers, similar to X86, extra code in
pixel_hadamard_ac vectorized
ieee/GCC-C-execute-ieee-pr50310 - extra code vectorized

CINT2006/464.h264ref - extra vector code in find_sad_16x16
JM/lencod - extra vector code in find_sad_16x16
d/make_dparser - smaller vector code
Benchmarks/Bullet - small variations
CINT2006/400.perlbench - smaller vector code
CFP2017rate/526.blender_r - small variations, extra store <8 x float> in
the loop, extra store <8 x i8> in loop
CINT2017rate/500.perlbench_r
CINT2017speed/600.perlbench_s - small variations
MiBench/consumer-lame - small variations
JM/ldecod - extra vector code
mediabench/g721/g721encode - small variations

Reviewers: hiraditya

Reviewed By: hiraditya

Pull Request: #128907

Revert "[libc++] Don't try to wait on a thread that hasn't started in std::async (#125433)"

This reverts commit 11766a4.

[ARM] Fix HW thread pointer functionality (#130027)

Separate check for hardware support of TLS register from check for support of Thumb2 encoding
Base decision to auto enable TLS on both hardware support and Thumb2 encoding support
Fix HW support check to correctly exclude M-Profile and include ARMV6K variants

Reference: https://reviews.llvm.org/D114116

[X86] combineConcatVectorOps - add missing VT/Subtarget checks for MOV*DUP concatenation folds.

[clang][SPIR-V] Use the SPIR-V backend by default (#129545)

The SPIR-V backend is now a supported backend, and we believe it is
ready to be used by default in Clang over the SPIR-V translator.

Some IR generated by Clang today, such as IR requiring SPIR-V target
address spaces, cannot be compiled by the translator for reasons given in
this RFC, so we expect even more programs to work as well.

Enable it by default, but keep some of the code as it is still called by
the HIP toolchain directly.

Signed-off-by: Sarnie, Nick <[email protected]>

[OpenMP] Mark Failing OpenMP Tests as XFAIL on Windows (#129040)

This patch marks specific OpenMP runtime tests as XFAIL on Windows due
to failures reported in #129023

[LLD][COFF] Add /noexp for link.exe compatibility (#128814)

See #107346

[mlir] Fix bazel build after f3dcc0f

[mlir][TOSA] Fix linalg lowering of depthwise conv2d (#130293)

The current lowering for tosa.depthwise_conv2d assumes that if both zero
points are zero then the operation is floating-point, and hardcodes the use
of an arith.addf in the lowered code. Fix the code to check the element type
to decide which add operation to use.

[OpenACC] Implement 'bind' ast/sema for 'routine' directive

The 'bind' clause allows the renaming of a function during code
generation. There are a few rules about when this can/cannot happen,
and it takes either a string or identifier (previously mis-implemented
as ID-expression) argument.

Note there are additional rules to this in the implicit-function routine
case, but that isn't implemented in this patch, as implicit-function
routine is not yet implemented either.

[clang][bytecode] Fix builtin_memcmp buffer sizes for pointers (#130570)

Don't use the pointer size, but the number of elements multiplied by the
element size.

[libc++][docs] Remove mis-added entry for P2513R4 (#130581)

P2513R4 neither touched library wording nor required library
implementation to change. So it was probably a mistake to list it in
libc++'s implementation status table.

[ARM][Thumb] Save FPSCR + FPEXC for save-vfp attribute

FPSCR and FPEXC will be stored in FPStatusRegs, after GPRCS2 has been
saved.

GPRCS1
GPRCS2
FPStatusRegs (new)
DPRCS
GPRCS3
DPRCS2

FPSCR is present on all targets with a VFP, but the FPEXC register is
not present on Cortex-M devices, so different amounts of bytes are
being pushed onto the stack depending on our target, which would
affect alignment for subsequent saves.

DPRCS1 will sum up all previous bytes that were saved, and will emit
extra instructions to ensure that its alignment is correct. My
assumption is that if DPRCS1 is able to correct its own alignment,
then all subsequent saves will also have correct alignment.

Avoid annotating the saving of FPSCR and FPEXC for functions marked
with the interrupt_save_fp attribute, even though this is done as part
of frame setup. Since these are status registers, there really is no
viable way of annotating this. Since these aren't GPRs or DPRs, they
can't be used with .save or .vsave directives. Instead, just record
that the intermediate registers r4 and r5 are saved to the stack
again.

Co-authored-by: Jake Vossen <[email protected]>
Co-authored-by: Alan Phipps <[email protected]>

Revert "[ARM][Thumb] Save FPSCR + FPEXC for save-vfp attribute"

This reverts commit 1f05703.

[HLSL][Driver] Use temporary files correctly (#130436)

This updates the DXV and Metal Converter actions to properly use
temporary files created by the driver. I've abstracted away a check to
determine if an action is the last in the sequence because we may have
between 1 and 3 actions depending on the arguments and environment.

AMDGPU: Rename variable from undef to poison (#130460)

StructurizeCFG: Use poison instead of undef (#130459)

There are a surprising number of codegen changes from this.

Revert "Reland "[clang] Lower modf builtin using llvm.modf intrinsic" (#129885)"

This broke modff calls on 32-bit x86 Windows. See comment on the PR.

> This updates the existing modf[f|l] builtin to be lowered via the
> llvm.modf.* intrinsic (rather than directly to a library call).
>
> The legalization issues exposed by the original PR (#126750) should have
> been fixed in #128055 and #129264.

This reverts commit cd1d9a8.

[libc] Add -Wno-sign-conversion & re-attempt -Wconversion (#129811)

Relates to
#119281 (comment)

Forbid co_await and co_yield in invalid expr contexts (#130455)

Fix #78426

C++26 introduced braced initializer lists as template arguments.
However, such contexts should be considered invalid for co_await and
co_yield. This commit explicitly rules out the possibility of using
these exprs in template arguments.

Co-authored-by: cor3ntin <[email protected]>

[X86] Add test cases showing its not always beneficial to fold concat(add/mul(),add/mul()) -> add/mul(concat(),concat())

[lldb] fix set SBLineEntryColumn (#130435)

Calling the public API SBLineEntry::SetColumn() sets the row instead
of the column.

This should probably be backported, as the bug has been present since version 3.4.

Co-authored-by: Jonas Devlieghere <[email protected]>

[ADT] Use adl_begin/adl_end in make_filter_range (#130512)

This is to make sure that ADT helpers consistently use argument
dependent lookup when dealing with input ranges.

This was a part of #87936 but was
reverted due to buildbot failures.

Also fix potential issue with double-move on the input range.
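
For readers unfamiliar with the idiom, the sketch below shows what argument-dependent lookup of begin/end buys. The helpers adlBegin and adlEnd are simplified stand-ins written for this example, not the actual LLVM adl_begin/adl_end implementations, and the mylib namespace is hypothetical.

```cpp
#include <iterator>
#include <utility>

namespace mylib {
// A range type whose begin/end are free functions in its own namespace.
struct Countdown {
  int values[3] = {3, 2, 1};
};
inline const int *begin(const Countdown &c) { return c.values; }
inline const int *end(const Countdown &c) { return c.values + 3; }
} // namespace mylib

namespace adl_sketch {
using std::begin;
using std::end;

// Unqualified calls: std::begin/std::end are visible through the using
// declarations, and ADL adds overloads from the argument's namespace.
template <typename Range>
auto adlBegin(Range &&r) -> decltype(begin(std::forward<Range>(r))) {
  return begin(std::forward<Range>(r));
}
template <typename Range>
auto adlEnd(Range &&r) -> decltype(end(std::forward<Range>(r))) {
  return end(std::forward<Range>(r));
}
} // namespace adl_sketch

// Works for built-in arrays (via std::begin) and for mylib::Countdown
// (via ADL) without the caller naming either namespace.
template <typename Range> int sumRange(const Range &r) {
  int total = 0;
  for (auto it = adl_sketch::adlBegin(r), e = adl_sketch::adlEnd(r); it != e;
       ++it)
    total += *it;
  return total;
}
```

A helper that called std::begin directly would fail to compile for mylib::Countdown, since that type has no member begin().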

[Libc] Turn implicit to explicit conversion (#130615)

This fixes a build issue on the AMDGPU libc bot after
#126846 landed that introduced
a warning.

Co-authored-by: Joseph Huber <[email protected]>

[X86] combineConcatVectorOps - convert ADD/SUB/MUL concatenation to use combineConcatVectorOps recursion

Only concatenate ADD/SUB/MUL nodes if at least one operand is beneficial to concatenate

[mlir] Fix bazel build after f3dcc0f TD files

[mlir][SparseTensor][NFC] Migrate to OpAsmAttrInterface for ASM alias generation (#130483)

After the introduction of OpAsmAttrInterface, it is favorable to
migrate code using OpAsmDialectInterface for ASM alias generation,
which lives in Dialect.cpp, to use OpAsmAttrInterface, which lives
in Attrs.td. In this way, attribute behavior is placed near its
tablegen definition and people won't need to go through other files to
know what other (unexpected) hooks come into play.

[libc] Fix implicit conversion warnings. (#130635)

[flang][OpenMP] Parse cancel-directive-name as clause (#130146)

The cancellable construct names on CANCEL or CANCELLATION POINT
directives are actually clauses (with the same names as the
corresponding constructs).

Instead of parsing them into a custom structure, parse them as a clause,
which will make CANCEL/CANCELLATION POINT follow the same uniform scheme
as other constructs (<directive> [(<arguments>)] [clauses]).

[lldb-dap] Migrating terminated statistics to the event body. (#130454)

Per the DAP spec, the event 'body' field should contain any additional
data related to the event. I moved the lldb-dap 'statistics' extension
into the terminated event's body, like:

{
  "type": "event",
  "seq": 0,
  "event": "terminated",
  "body": {
    "$__lldb_statistics": {...}
  }
}

This allows us to more uniformly handle event messages.

[AMDGPU] Fix test failures when expensive checks are enabled

This PR fixes test failures introduced in #127353 when expensive checks are
enabled.

Patch is 122.17 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/130644.diff

3 Files Affected:

(modified) llvm/test/CodeGen/AMDGPU/materialize-frame-index-sgpr.gfx10.ll (+915-209)
(modified) llvm/test/CodeGen/AMDGPU/materialize-frame-index-sgpr.ll (+177-137)
(modified) llvm/test/CodeGen/AMDGPU/schedule-amdgpu-tracker-physreg-crash.ll (+1-1)

diff --git a/llvm/test/CodeGen/AMDGPU/materialize-frame-index-sgpr.gfx10.ll b/llvm/test/CodeGen/AMDGPU/materialize-frame-index-sgpr.gfx10.ll
index 4ca00f2daf97a..4b5a7c207055a 100644
--- a/llvm/test/CodeGen/AMDGPU/materialize-frame-index-sgpr.gfx10.ll
+++ b/llvm/test/CodeGen/AMDGPU/materialize-frame-index-sgpr.gfx10.ll
@@ -12,7 +12,13 @@ define void @scalar_mov_materializes_frame_index_unavailable_scc() #0 {
 ; GFX10_1-LABEL: scalar_mov_materializes_frame_index_unavailable_scc:
 ; GFX10_1:       ; %bb.0:
 ; GFX10_1-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX10_1-NEXT:    s_xor_saveexec_b32 s4, -1
+; GFX10_1-NEXT:    s_add_i32 s5, s32, 0x80880
+; GFX10_1-NEXT:    buffer_store_dword v1, off, s[0:3], s5 ; 4-byte Folded Spill
+; GFX10_1-NEXT:    s_waitcnt_depctr 0xffe3
+; GFX10_1-NEXT:    s_mov_b32 exec_lo, s4
 ; GFX10_1-NEXT:    v_lshrrev_b32_e64 v0, 5, s32
+; GFX10_1-NEXT:    v_writelane_b32 v1, s55, 0
 ; GFX10_1-NEXT:    s_and_b32 s4, 0, exec_lo
 ; GFX10_1-NEXT:    v_add_nc_u32_e32 v0, 64, v0
 ; GFX10_1-NEXT:    ;;#ASMSTART
@@ -20,16 +26,28 @@ define void @scalar_mov_materializes_frame_index_unavailable_scc() #0 {
 ; GFX10_1-NEXT:    ;;#ASMEND
 ; GFX10_1-NEXT:    v_lshrrev_b32_e64 v0, 5, s32
 ; GFX10_1-NEXT:    v_add_nc_u32_e32 v0, 0x4040, v0
-; GFX10_1-NEXT:    v_readfirstlane_b32 s59, v0
+; GFX10_1-NEXT:    v_readfirstlane_b32 s55, v0
 ; GFX10_1-NEXT:    ;;#ASMSTART
-; GFX10_1-NEXT:    ; use s59, scc
+; GFX10_1-NEXT:    ; use s55, scc
 ; GFX10_1-NEXT:    ;;#ASMEND
+; GFX10_1-NEXT:    v_readlane_b32 s55, v1, 0
+; GFX10_1-NEXT:    s_xor_saveexec_b32 s4, -1
+; GFX10_1-NEXT:    s_add_i32 s5, s32, 0x80880
+; GFX10_1-NEXT:    buffer_load_dword v1, off, s[0:3], s5 ; 4-byte Folded Reload
+; GFX10_1-NEXT:    s_waitcnt_depctr 0xffe3
+; GFX10_1-NEXT:    s_mov_b32 exec_lo, s4
+; GFX10_1-NEXT:    s_waitcnt vmcnt(0)
 ; GFX10_1-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX10_3-LABEL: scalar_mov_materializes_frame_index_unavailable_scc:
 ; GFX10_3:       ; %bb.0:
 ; GFX10_3-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX10_3-NEXT:    s_xor_saveexec_b32 s4, -1
+; GFX10_3-NEXT:    s_add_i32 s5, s32, 0x80880
+; GFX10_3-NEXT:    buffer_store_dword v1, off, s[0:3], s5 ; 4-byte Folded Spill
+; GFX10_3-NEXT:    s_mov_b32 exec_lo, s4
 ; GFX10_3-NEXT:    v_lshrrev_b32_e64 v0, 5, s32
+; GFX10_3-NEXT:    v_writelane_b32 v1, s55, 0
 ; GFX10_3-NEXT:    s_and_b32 s4, 0, exec_lo
 ; GFX10_3-NEXT:    v_add_nc_u32_e32 v0, 64, v0
 ; GFX10_3-NEXT:    ;;#ASMSTART
@@ -37,17 +55,27 @@ define void @scalar_mov_materializes_frame_index_unavailable_scc() #0 {
 ; GFX10_3-NEXT:    ;;#ASMEND
 ; GFX10_3-NEXT:    v_lshrrev_b32_e64 v0, 5, s32
 ; GFX10_3-NEXT:    v_add_nc_u32_e32 v0, 0x4040, v0
-; GFX10_3-NEXT:    v_readfirstlane_b32 s59, v0
+; GFX10_3-NEXT:    v_readfirstlane_b32 s55, v0
 ; GFX10_3-NEXT:    ;;#ASMSTART
-; GFX10_3-NEXT:    ; use s59, scc
+; GFX10_3-NEXT:    ; use s55, scc
 ; GFX10_3-NEXT:    ;;#ASMEND
+; GFX10_3-NEXT:    v_readlane_b32 s55, v1, 0
+; GFX10_3-NEXT:    s_xor_saveexec_b32 s4, -1
+; GFX10_3-NEXT:    s_add_i32 s5, s32, 0x80880
+; GFX10_3-NEXT:    buffer_load_dword v1, off, s[0:3], s5 ; 4-byte Folded Reload
+; GFX10_3-NEXT:    s_mov_b32 exec_lo, s4
+; GFX10_3-NEXT:    s_waitcnt vmcnt(0)
 ; GFX10_3-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX11-LABEL: scalar_mov_materializes_frame_index_unavailable_scc:
 ; GFX11:       ; %bb.0:
 ; GFX11-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX11-NEXT:    s_xor_saveexec_b32 s0, -1
+; GFX11-NEXT:    s_add_i32 s1, s32, 0x4044
+; GFX11-NEXT:    scratch_store_b32 off, v1, s1 ; 4-byte Folded Spill
+; GFX11-NEXT:    s_mov_b32 exec_lo, s0
 ; GFX11-NEXT:    s_add_i32 s0, s32, 64
-; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; GFX11-NEXT:    v_writelane_b32 v1, s55, 0
 ; GFX11-NEXT:    v_mov_b32_e32 v0, s0
 ; GFX11-NEXT:    s_and_b32 s0, 0, exec_lo
 ; GFX11-NEXT:    s_addc_u32 s0, s32, 0x4040
@@ -57,10 +85,16 @@ define void @scalar_mov_materializes_frame_index_unavailable_scc() #0 {
 ; GFX11-NEXT:    s_bitcmp1_b32 s0, 0
 ; GFX11-NEXT:    s_bitset0_b32 s0, 0
 ; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
-; GFX11-NEXT:    s_mov_b32 s59, s0
+; GFX11-NEXT:    s_mov_b32 s55, s0
 ; GFX11-NEXT:    ;;#ASMSTART
-; GFX11-NEXT:    ; use s59, scc
+; GFX11-NEXT:    ; use s55, scc
 ; GFX11-NEXT:    ;;#ASMEND
+; GFX11-NEXT:    v_readlane_b32 s55, v1, 0
+; GFX11-NEXT:    s_xor_saveexec_b32 s0, -1
+; GFX11-NEXT:    s_add_i32 s1, s32, 0x4044
+; GFX11-NEXT:    scratch_load_b32 v1, off, s1 ; 4-byte Folded Reload
+; GFX11-NEXT:    s_mov_b32 exec_lo, s0
+; GFX11-NEXT:    s_waitcnt vmcnt(0)
 ; GFX11-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX12-LABEL: scalar_mov_materializes_frame_index_unavailable_scc:
@@ -70,7 +104,13 @@ define void @scalar_mov_materializes_frame_index_unavailable_scc() #0 {
 ; GFX12-NEXT:    s_wait_samplecnt 0x0
 ; GFX12-NEXT:    s_wait_bvhcnt 0x0
 ; GFX12-NEXT:    s_wait_kmcnt 0x0
+; GFX12-NEXT:    s_xor_saveexec_b32 s0, -1
+; GFX12-NEXT:    scratch_store_b32 off, v1, s32 offset:16388 ; 4-byte Folded Spill
+; GFX12-NEXT:    s_wait_alu 0xfffe
+; GFX12-NEXT:    s_mov_b32 exec_lo, s0
+; GFX12-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
 ; GFX12-NEXT:    s_and_b32 s0, 0, exec_lo
+; GFX12-NEXT:    v_writelane_b32 v1, s55, 0
 ; GFX12-NEXT:    s_add_co_ci_u32 s0, s32, 0x4000
 ; GFX12-NEXT:    v_mov_b32_e32 v0, s32
 ; GFX12-NEXT:    s_wait_alu 0xfffe
@@ -80,34 +120,54 @@ define void @scalar_mov_materializes_frame_index_unavailable_scc() #0 {
 ; GFX12-NEXT:    ; use alloca0 v0
 ; GFX12-NEXT:    ;;#ASMEND
 ; GFX12-NEXT:    s_wait_alu 0xfffe
-; GFX12-NEXT:    s_mov_b32 s59, s0
+; GFX12-NEXT:    s_mov_b32 s55, s0
 ; GFX12-NEXT:    ;;#ASMSTART
-; GFX12-NEXT:    ; use s59, scc
+; GFX12-NEXT:    ; use s55, scc
 ; GFX12-NEXT:    ;;#ASMEND
+; GFX12-NEXT:    v_readlane_b32 s55, v1, 0
+; GFX12-NEXT:    s_xor_saveexec_b32 s0, -1
+; GFX12-NEXT:    scratch_load_b32 v1, off, s32 offset:16388 ; 4-byte Folded Reload
 ; GFX12-NEXT:    s_wait_alu 0xfffe
+; GFX12-NEXT:    s_mov_b32 exec_lo, s0
+; GFX12-NEXT:    s_wait_loadcnt 0x0
 ; GFX12-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX8-LABEL: scalar_mov_materializes_frame_index_unavailable_scc:
 ; GFX8:       ; %bb.0:
 ; GFX8-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX8-NEXT:    s_xor_saveexec_b64 s[4:5], -1
+; GFX8-NEXT:    s_add_i32 s6, s32, 0x101100
+; GFX8-NEXT:    buffer_store_dword v1, off, s[0:3], s6 ; 4-byte Folded Spill
+; GFX8-NEXT:    s_mov_b64 exec, s[4:5]
 ; GFX8-NEXT:    v_lshrrev_b32_e64 v0, 6, s32
 ; GFX8-NEXT:    v_add_u32_e32 v0, vcc, 64, v0
+; GFX8-NEXT:    v_writelane_b32 v1, s55, 0
 ; GFX8-NEXT:    ;;#ASMSTART
 ; GFX8-NEXT:    ; use alloca0 v0
 ; GFX8-NEXT:    ;;#ASMEND
 ; GFX8-NEXT:    v_lshrrev_b32_e64 v0, 6, s32
-; GFX8-NEXT:    s_movk_i32 s59, 0x4040
-; GFX8-NEXT:    v_add_u32_e32 v0, vcc, s59, v0
+; GFX8-NEXT:    s_movk_i32 s55, 0x4040
+; GFX8-NEXT:    v_add_u32_e32 v0, vcc, s55, v0
+; GFX8-NEXT:    v_readfirstlane_b32 s55, v0
 ; GFX8-NEXT:    s_and_b64 s[4:5], 0, exec
-; GFX8-NEXT:    v_readfirstlane_b32 s59, v0
 ; GFX8-NEXT:    ;;#ASMSTART
-; GFX8-NEXT:    ; use s59, scc
+; GFX8-NEXT:    ; use s55, scc
 ; GFX8-NEXT:    ;;#ASMEND
+; GFX8-NEXT:    v_readlane_b32 s55, v1, 0
+; GFX8-NEXT:    s_xor_saveexec_b64 s[4:5], -1
+; GFX8-NEXT:    s_add_i32 s6, s32, 0x101100
+; GFX8-NEXT:    buffer_load_dword v1, off, s[0:3], s6 ; 4-byte Folded Reload
+; GFX8-NEXT:    s_mov_b64 exec, s[4:5]
+; GFX8-NEXT:    s_waitcnt vmcnt(0)
 ; GFX8-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX900-LABEL: scalar_mov_materializes_frame_index_unavailable_scc:
 ; GFX900:       ; %bb.0:
 ; GFX900-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX900-NEXT:    s_xor_saveexec_b64 s[4:5], -1
+; GFX900-NEXT:    s_add_i32 s6, s32, 0x101100
+; GFX900-NEXT:    buffer_store_dword v1, off, s[0:3], s6 ; 4-byte Folded Spill
+; GFX900-NEXT:    s_mov_b64 exec, s[4:5]
 ; GFX900-NEXT:    v_lshrrev_b32_e64 v0, 6, s32
 ; GFX900-NEXT:    v_add_u32_e32 v0, 64, v0
 ; GFX900-NEXT:    ;;#ASMSTART
@@ -115,34 +175,52 @@ define void @scalar_mov_materializes_frame_index_unavailable_scc() #0 {
 ; GFX900-NEXT:    ;;#ASMEND
 ; GFX900-NEXT:    v_lshrrev_b32_e64 v0, 6, s32
 ; GFX900-NEXT:    v_add_u32_e32 v0, 0x4040, v0
+; GFX900-NEXT:    v_writelane_b32 v1, s55, 0
+; GFX900-NEXT:    v_readfirstlane_b32 s55, v0
 ; GFX900-NEXT:    s_and_b64 s[4:5], 0, exec
-; GFX900-NEXT:    v_readfirstlane_b32 s59, v0
 ; GFX900-NEXT:    ;;#ASMSTART
-; GFX900-NEXT:    ; use s59, scc
+; GFX900-NEXT:    ; use s55, scc
 ; GFX900-NEXT:    ;;#ASMEND
+; GFX900-NEXT:    v_readlane_b32 s55, v1, 0
+; GFX900-NEXT:    s_xor_saveexec_b64 s[4:5], -1
+; GFX900-NEXT:    s_add_i32 s6, s32, 0x101100
+; GFX900-NEXT:    buffer_load_dword v1, off, s[0:3], s6 ; 4-byte Folded Reload
+; GFX900-NEXT:    s_mov_b64 exec, s[4:5]
+; GFX900-NEXT:    s_waitcnt vmcnt(0)
 ; GFX900-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX942-LABEL: scalar_mov_materializes_frame_index_unavailable_scc:
 ; GFX942:       ; %bb.0:
 ; GFX942-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX942-NEXT:    s_xor_saveexec_b64 s[0:1], -1
+; GFX942-NEXT:    s_add_i32 s2, s32, 0x4044
+; GFX942-NEXT:    scratch_store_dword off, v1, s2 ; 4-byte Folded Spill
+; GFX942-NEXT:    s_mov_b64 exec, s[0:1]
 ; GFX942-NEXT:    s_add_i32 s0, s32, 64
 ; GFX942-NEXT:    v_mov_b32_e32 v0, s0
 ; GFX942-NEXT:    s_and_b64 s[0:1], 0, exec
 ; GFX942-NEXT:    s_addc_u32 s0, s32, 0x4040
 ; GFX942-NEXT:    s_bitcmp1_b32 s0, 0
 ; GFX942-NEXT:    s_bitset0_b32 s0, 0
+; GFX942-NEXT:    v_writelane_b32 v1, s55, 0
+; GFX942-NEXT:    s_mov_b32 s55, s0
 ; GFX942-NEXT:    ;;#ASMSTART
 ; GFX942-NEXT:    ; use alloca0 v0
 ; GFX942-NEXT:    ;;#ASMEND
-; GFX942-NEXT:    s_mov_b32 s59, s0
 ; GFX942-NEXT:    ;;#ASMSTART
-; GFX942-NEXT:    ; use s59, scc
+; GFX942-NEXT:    ; use s55, scc
 ; GFX942-NEXT:    ;;#ASMEND
+; GFX942-NEXT:    v_readlane_b32 s55, v1, 0
+; GFX942-NEXT:    s_xor_saveexec_b64 s[0:1], -1
+; GFX942-NEXT:    s_add_i32 s2, s32, 0x4044
+; GFX942-NEXT:    scratch_load_dword v1, off, s2 ; 4-byte Folded Reload
+; GFX942-NEXT:    s_mov_b64 exec, s[0:1]
+; GFX942-NEXT:    s_waitcnt vmcnt(0)
 ; GFX942-NEXT:    s_setpc_b64 s[30:31]
   %alloca0 = alloca [4096 x i32], align 64, addrspace(5)
   %alloca1 = alloca i32, align 4, addrspace(5)
   call void asm sideeffect "; use alloca0 $0", "v"(ptr addrspace(5) %alloca0)
-  call void asm sideeffect "; use $0, $1", "{s59},{scc}"(ptr addrspace(5) %alloca1, i32 0)
+  call void asm sideeffect "; use $0, $1", "{s55},{scc}"(ptr addrspace(5) %alloca1, i32 0)
   ret void
 }
 
@@ -152,36 +230,65 @@ define void @scalar_mov_materializes_frame_index_dead_scc() #0 {
 ; GFX10_1-LABEL: scalar_mov_materializes_frame_index_dead_scc:
 ; GFX10_1:       ; %bb.0:
 ; GFX10_1-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX10_1-NEXT:    s_xor_saveexec_b32 s4, -1
+; GFX10_1-NEXT:    s_add_i32 s5, s32, 0x80880
+; GFX10_1-NEXT:    buffer_store_dword v1, off, s[0:3], s5 ; 4-byte Folded Spill
+; GFX10_1-NEXT:    s_waitcnt_depctr 0xffe3
+; GFX10_1-NEXT:    s_mov_b32 exec_lo, s4
+; GFX10_1-NEXT:    v_writelane_b32 v1, s55, 0
 ; GFX10_1-NEXT:    v_lshrrev_b32_e64 v0, 5, s32
-; GFX10_1-NEXT:    s_lshr_b32 s59, s32, 5
-; GFX10_1-NEXT:    s_addk_i32 s59, 0x4040
+; GFX10_1-NEXT:    s_lshr_b32 s55, s32, 5
+; GFX10_1-NEXT:    s_addk_i32 s55, 0x4040
 ; GFX10_1-NEXT:    v_add_nc_u32_e32 v0, 64, v0
 ; GFX10_1-NEXT:    ;;#ASMSTART
 ; GFX10_1-NEXT:    ; use alloca0 v0
 ; GFX10_1-NEXT:    ;;#ASMEND
 ; GFX10_1-NEXT:    ;;#ASMSTART
-; GFX10_1-NEXT:    ; use s59
+; GFX10_1-NEXT:    ; use s55
 ; GFX10_1-NEXT:    ;;#ASMEND
+; GFX10_1-NEXT:    v_readlane_b32 s55, v1, 0
+; GFX10_1-NEXT:    s_xor_saveexec_b32 s4, -1
+; GFX10_1-NEXT:    s_add_i32 s5, s32, 0x80880
+; GFX10_1-NEXT:    buffer_load_dword v1, off, s[0:3], s5 ; 4-byte Folded Reload
+; GFX10_1-NEXT:    s_waitcnt_depctr 0xffe3
+; GFX10_1-NEXT:    s_mov_b32 exec_lo, s4
+; GFX10_1-NEXT:    s_waitcnt vmcnt(0)
 ; GFX10_1-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX10_3-LABEL: scalar_mov_materializes_frame_index_dead_scc:
 ; GFX10_3:       ; %bb.0:
 ; GFX10_3-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX10_3-NEXT:    s_xor_saveexec_b32 s4, -1
+; GFX10_3-NEXT:    s_add_i32 s5, s32, 0x80880
+; GFX10_3-NEXT:    buffer_store_dword v1, off, s[0:3], s5 ; 4-byte Folded Spill
+; GFX10_3-NEXT:    s_mov_b32 exec_lo, s4
+; GFX10_3-NEXT:    v_writelane_b32 v1, s55, 0
 ; GFX10_3-NEXT:    v_lshrrev_b32_e64 v0, 5, s32
-; GFX10_3-NEXT:    s_lshr_b32 s59, s32, 5
-; GFX10_3-NEXT:    s_addk_i32 s59, 0x4040
+; GFX10_3-NEXT:    s_lshr_b32 s55, s32, 5
+; GFX10_3-NEXT:    s_addk_i32 s55, 0x4040
 ; GFX10_3-NEXT:    v_add_nc_u32_e32 v0, 64, v0
 ; GFX10_3-NEXT:    ;;#ASMSTART
 ; GFX10_3-NEXT:    ; use alloca0 v0
 ; GFX10_3-NEXT:    ;;#ASMEND
 ; GFX10_3-NEXT:    ;;#ASMSTART
-; GFX10_3-NEXT:    ; use s59
+; GFX10_3-NEXT:    ; use s55
 ; GFX10_3-NEXT:    ;;#ASMEND
+; GFX10_3-NEXT:    v_readlane_b32 s55, v1, 0
+; GFX10_3-NEXT:    s_xor_saveexec_b32 s4, -1
+; GFX10_3-NEXT:    s_add_i32 s5, s32, 0x80880
+; GFX10_3-NEXT:    buffer_load_dword v1, off, s[0:3], s5 ; 4-byte Folded Reload
+; GFX10_3-NEXT:    s_mov_b32 exec_lo, s4
+; GFX10_3-NEXT:    s_waitcnt vmcnt(0)
 ; GFX10_3-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX11-LABEL: scalar_mov_materializes_frame_index_dead_scc:
 ; GFX11:       ; %bb.0:
 ; GFX11-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX11-NEXT:    s_xor_saveexec_b32 s0, -1
+; GFX11-NEXT:    s_add_i32 s1, s32, 0x4044
+; GFX11-NEXT:    scratch_store_b32 off, v1, s1 ; 4-byte Folded Spill
+; GFX11-NEXT:    s_mov_b32 exec_lo, s0
+; GFX11-NEXT:    v_writelane_b32 v1, s55, 0
 ; GFX11-NEXT:    s_add_i32 s0, s32, 64
 ; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
 ; GFX11-NEXT:    v_mov_b32_e32 v0, s0
@@ -189,10 +296,16 @@ define void @scalar_mov_materializes_frame_index_dead_scc() #0 {
 ; GFX11-NEXT:    ;;#ASMSTART
 ; GFX11-NEXT:    ; use alloca0 v0
 ; GFX11-NEXT:    ;;#ASMEND
-; GFX11-NEXT:    s_mov_b32 s59, s0
+; GFX11-NEXT:    s_mov_b32 s55, s0
 ; GFX11-NEXT:    ;;#ASMSTART
-; GFX11-NEXT:    ; use s59
+; GFX11-NEXT:    ; use s55
 ; GFX11-NEXT:    ;;#ASMEND
+; GFX11-NEXT:    v_readlane_b32 s55, v1, 0
+; GFX11-NEXT:    s_xor_saveexec_b32 s0, -1
+; GFX11-NEXT:    s_add_i32 s1, s32, 0x4044
+; GFX11-NEXT:    scratch_load_b32 v1, off, s1 ; 4-byte Folded Reload
+; GFX11-NEXT:    s_mov_b32 exec_lo, s0
+; GFX11-NEXT:    s_waitcnt vmcnt(0)
 ; GFX11-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX12-LABEL: scalar_mov_materializes_frame_index_dead_scc:
@@ -202,67 +315,110 @@ define void @scalar_mov_materializes_frame_index_dead_scc() #0 {
 ; GFX12-NEXT:    s_wait_samplecnt 0x0
 ; GFX12-NEXT:    s_wait_bvhcnt 0x0
 ; GFX12-NEXT:    s_wait_kmcnt 0x0
+; GFX12-NEXT:    s_xor_saveexec_b32 s0, -1
+; GFX12-NEXT:    scratch_store_b32 off, v1, s32 offset:16388 ; 4-byte Folded Spill
+; GFX12-NEXT:    s_wait_alu 0xfffe
+; GFX12-NEXT:    s_mov_b32 exec_lo, s0
+; GFX12-NEXT:    v_writelane_b32 v1, s55, 0
 ; GFX12-NEXT:    s_add_co_i32 s0, s32, 0x4000
 ; GFX12-NEXT:    v_mov_b32_e32 v0, s32
+; GFX12-NEXT:    s_wait_alu 0xfffe
+; GFX12-NEXT:    s_mov_b32 s55, s0
 ; GFX12-NEXT:    ;;#ASMSTART
 ; GFX12-NEXT:    ; use alloca0 v0
 ; GFX12-NEXT:    ;;#ASMEND
-; GFX12-NEXT:    s_wait_alu 0xfffe
-; GFX12-NEXT:    s_mov_b32 s59, s0
 ; GFX12-NEXT:    ;;#ASMSTART
-; GFX12-NEXT:    ; use s59
+; GFX12-NEXT:    ; use s55
 ; GFX12-NEXT:    ;;#ASMEND
+; GFX12-NEXT:    v_readlane_b32 s55, v1, 0
+; GFX12-NEXT:    s_xor_saveexec_b32 s0, -1
+; GFX12-NEXT:    scratch_load_b32 v1, off, s32 offset:16388 ; 4-byte Folded Reload
 ; GFX12-NEXT:    s_wait_alu 0xfffe
+; GFX12-NEXT:    s_mov_b32 exec_lo, s0
+; GFX12-NEXT:    s_wait_loadcnt 0x0
 ; GFX12-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX8-LABEL: scalar_mov_materializes_frame_index_dead_scc:
 ; GFX8:       ; %bb.0:
 ; GFX8-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX8-NEXT:    s_xor_saveexec_b64 s[4:5], -1
+; GFX8-NEXT:    s_add_i32 s6, s32, 0x101100
+; GFX8-NEXT:    buffer_store_dword v1, off, s[0:3], s6 ; 4-byte Folded Spill
+; GFX8-NEXT:    s_mov_b64 exec, s[4:5]
+; GFX8-NEXT:    v_writelane_b32 v1, s55, 0
+; GFX8-NEXT:    s_lshr_b32 s55, s32, 6
 ; GFX8-NEXT:    v_lshrrev_b32_e64 v0, 6, s32
-; GFX8-NEXT:    s_lshr_b32 s59, s32, 6
+; GFX8-NEXT:    s_addk_i32 s55, 0x4040
 ; GFX8-NEXT:    v_add_u32_e32 v0, vcc, 64, v0
 ; GFX8-NEXT:    ;;#ASMSTART
 ; GFX8-NEXT:    ; use alloca0 v0
 ; GFX8-NEXT:    ;;#ASMEND
-; GFX8-NEXT:    s_addk_i32 s59, 0x4040
 ; GFX8-NEXT:    ;;#ASMSTART
-; GFX8-NEXT:    ; use s59
+; GFX8-NEXT:    ; use s55
 ; GFX8-NEXT:    ;;#ASMEND
+; GFX8-NEXT:    v_readlane_b32 s55, v1, 0
+; GFX8-NEXT:    s_xor_saveexec_b64 s[4:5], -1
+; GFX8-NEXT:    s_add_i32 s6, s32, 0x101100
+; GFX8-NEXT:    buffer_load_dword v1, off, s[0:3], s6 ; 4-byte Folded Reload
+; GFX8-NEXT:    s_mov_b64 exec, s[4:5]
+; GFX8-NEXT:    s_waitcnt vmcnt(0)
 ; GFX8-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX900-LABEL: scalar_mov_materializes_frame_index_dead_scc:
 ; GFX900:       ; %bb.0:
 ; GFX900-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX900-NEXT:    s_xor_saveexec_b64 s[4:5], -1
+; GFX900-NEXT:    s_add_i32 s6, s32, 0x101100
+; GFX900-NEXT:    buffer_store_dword v1, off, s[0:3], s6 ; 4-byte Folded Spill
+; GFX900-NEXT:    s_mov_b64 exec, s[4:5]
+; GFX900-NEXT:    v_writelane_b32 v1, s55, 0
+; GFX900-NEXT:    s_lshr_b32 s55, s32, 6
 ; GFX900-NEXT:    v_lshrrev_b32_e64 v0, 6, s32
-; GFX900-NEXT:    s_lshr_b32 s59, s32, 6
+; GFX900-NEXT:    s_addk_i32 s55, 0x4040
 ; GFX900-NEXT:    v_add_u32_e32 v0, 64, v0
 ; GFX900-NEXT:    ;;#ASMSTART
 ; GFX900-NEXT:    ; use alloca0 v0
 ; GFX900-NEXT:    ;;#ASMEND
-; GFX900-NEXT:    s_addk_i32 s59, 0x4040
 ; GFX900-NEXT:    ;;#ASMSTART
-; GFX900-NEXT:    ; use s59
+; GFX900-NEXT:    ; use s55
 ; GFX900-NEXT:    ;;#ASMEND
+; GFX900-NEXT:    v_readlane_b32 s55, v1, 0
+; GFX900-NEXT:    s_xor_saveexec_b64 s[4:5], -1
+; GFX900-NEXT:    s_add_i32 s6, s32, 0x101100
+; GFX900-NEXT:    buffer_load_dword v1, off, s[0:3], s6 ; 4-byte Folded Reload
+; GFX900-NEXT:    s_mov_b64 exec, s[4:5]
+; GFX900-NEXT:    s_waitcnt vmcnt(0)
 ; GFX900-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX942-LABEL: scalar_mov_materializes_frame_index_dead_scc:
 ; GFX942:       ; %bb.0:
 ; GFX942-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX942-NEXT:    s_xor_saveexec_b64 s[0:1], -1
+; GFX942-NEXT:    s_add_i32 s2, s32, 0x4044
+; GFX942-NEXT:    scratch_store_dword off, v1, s2 ; 4-byte Folded Spill
+; GFX942-NEXT:    s_mov_b64 exec, s[0:1]
 ; GFX942-NEXT:    s_add_i32 s0, s32, 64
 ; GFX942-NEXT:    v_mov_b32_e32 v0, s0
 ; GFX942-NEXT:    s_add_i32 s0, s32, 0x4040
+; GFX942-NEXT:    v_writelane_b32 v1, s55, 0
+; GFX942-NEXT:    s_mov_b32 s55, s0
 ; GFX942-NEXT:    ;;#ASMSTART
 ; GFX942-NEXT:    ; use alloca0 v0
 ; GFX942-NEXT:    ;;#ASMEND
-; GFX942-NEXT:    s_mov_b32 s59, s0
 ; GFX942-NEXT:    ;;#ASMSTART
-; GFX942-NEXT:    ; use s59
+; GFX942-NEXT:    ; use s55
 ; GFX942-NEXT:    ;;#ASMEND
+; GFX942-NEXT:    v_readlane_b32 s55, v1, 0
+; GFX942-NEXT:    s_xor_saveexec_b64 s[0:1], -1
+; GFX942-NEXT:    s_add_i32 s2, s32, 0x4044
+; GFX942-NEXT:    scratch_load_dword v1, off, s2 ; 4-byte Folded Reload
+; GFX942-NEXT:    s_mov_b64 exec, s[0:1]
+; GFX942-NEXT:    s_waitcnt vmcnt(0)
 ; GFX942-NEXT:    s_setpc_b64 s[30:31]
   %alloca0 = alloca [4096 x i32], align 64, addrspace(5)
   %alloca1 = alloca i32, align 4, addrspace(5)
   call void asm sideeffect "; use alloca0 $0", "v"(ptr addrspace(5) %alloca0)
-  call void asm sideeffect "; use $0", "{s59}"(ptr addrspace(5) %alloca1)
+  call void asm sideeffect "; use $0", "{s55}"(ptr addrspace(5) %alloca1)
   ret void
 }
 
@@ -272,8 +428,14 @@ define void @scalar_mov_materializes_frame_index_unavailable_scc_fp() #1 {
 ; GFX10_1-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX10_1-NEXT:    s_mov_b32 s5, s33
 ; GFX10_1-NEXT:    s_mov_b32 s33, s32
-; GFX10_1-NEXT:    s_add_i32 s32, s32, 0x81000
+; GFX10_1-NEXT:    s_xor_saveexec_b32 s4, -1
+; GFX10_1-NEXT:    s_add_i32 s6, s33, 0x80880
+; GFX10_1-NEXT:    buffer_store_dword v1, off, s[0:3], s6 ; 4-byte Folded Spill
+; GFX10_1-NEXT:    s_waitcnt_depctr 0xffe3
+; GFX10_1-NEXT:    s_mov_b32 exec_lo, s4
 ; GFX10_1-NEXT:    v_lshrrev_b32_e64 v0, 5, s33
+; GFX10_1-NEXT:    v_writelane_b32 v1, s55, 0
+; GFX10_1-NEXT:    s_add_i32 s32, s32, 0x81000
 ; GFX10_1-NEXT:    s_and_b32 s4, 0, exec_lo
 ; GFX10_1-NEXT:    s_mov_b32 s32, s33
 ; GFX10_1-NEXT:    v_add_nc_u32_e32 v0, 64, v0
@@ -281,12 +443,19 @@ define void @scalar_mov_materializes_frame_index_u...
[truncated]

This PR fixes test failures introduced in #127353 when expensive checks are enabled.

shiltian · 2025-03-10T17:46:16Z

llvm/test/CodeGen/AMDGPU/materialize-frame-index-sgpr.gfx10.ll

 ; GFX942-NEXT:    s_setpc_b64 s[30:31]
  %alloca0 = alloca [4096 x i32], align 64, addrspace(5)
  %alloca1 = alloca i32, align 4, addrspace(5)
  call void asm sideeffect "; use alloca0 $0", "v"(ptr addrspace(5) %alloca0)
-  call void asm sideeffect "; use $0, $1", "{s59},{scc}"(ptr addrspace(5) %alloca1, i32 0)
+  call void asm sideeffect "; use $0, $1", "{s55},{scc}"(ptr addrspace(5) %alloca1, i32 0)


s59 is no longer in live-ins because it is caller saved. Switch to s55 here, but I'm not sure why so much spill code is generated.

Can you mention this in the commit message? I wasn't sure why this only touched tests.

Done.

They only fail with expensive checks enabled.
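
One plausible reading of the extra spill code, as a rough sketch rather than a statement of what the backend does: the new CHECK lines save `s55` into a lane of `v1` with `v_writelane_b32` and restore it with `v_readlane_b32`, and `v1` is itself stored to and reloaded from scratch around an `s_xor_saveexec`. That is the pattern one would expect if `s55` is callee saved under the striped layout. The register sets and the `save_restore` helper below are hypothetical and exist only to illustrate the idea; the real layout is defined by #127353.

```python
# Illustrative model only. The register sets are assumptions made for this
# sketch; the real striped layout is defined by #127353 and is not reproduced.
CALLER_SAVED = {"s59"}  # stated in this PR: s59 is caller saved
CALLEE_SAVED = {"s55"}  # assumption inferred from the new CHECK lines


def save_restore(sgpr, vgpr="v1", lane=0):
    """Return (prologue, epilogue) pseudo-ops for a function that clobbers sgpr."""
    if sgpr not in CALLEE_SAVED:
        # Caller-saved registers may be overwritten without saving them.
        return [], []
    prologue = [
        "s_xor_saveexec <tmp>, -1",                 # save exec and invert it
        f"store {vgpr} to scratch",                 # preserve the carrier VGPR
        "restore exec from <tmp>",
        f"v_writelane_b32 {vgpr}, {sgpr}, {lane}",  # park the SGPR in one lane
    ]
    epilogue = [
        f"v_readlane_b32 {sgpr}, {vgpr}, {lane}",   # recover the SGPR
        "s_xor_saveexec <tmp>, -1",
        f"load {vgpr} from scratch",
        "restore exec from <tmp>",
    ]
    return prologue, epilogue


if __name__ == "__main__":
    for reg in ("s59", "s55"):
        pro, epi = save_restore(reg)
        print(reg, "prologue:", pro, "epilogue:", epi)
```

Under these assumptions, pinning the inline asm operand to `s59` needs no save or restore, while pinning it to `s55` adds a prologue and an epilogue to every function in the test, which would account for the size of the diff.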

#130644)" As suggested on 5ec884e#commitcomment-153707488 this seems to fix the following tests when building with -DLLVM_ENABLE_EXPENSIVE_CHECKS=ON:

LLVM :: CodeGen/AMDGPU/materialize-frame-index-sgpr.gfx10.ll
LLVM :: CodeGen/AMDGPU/materialize-frame-index-sgpr.ll
LLVM :: CodeGen/AMDGPU/schedule-amdgpu-tracker-physreg-crash.ll

> This PR fixes test failures introduced in #127353 when expensive checks
> are enabled.
>
> For `llvm/test/CodeGen/AMDGPU/materialize-frame-index-sgpr.ll` and
> `llvm/test/CodeGen/AMDGPU/materialize-frame-index-sgpr.gfx10.ll`, `s59`
> is no longer in live-ins because it is caller saved. Switch to `s55` in
> this PR.

llvmbot added the backend:AMDGPU label Mar 10, 2025

shiltian mentioned this pull request Mar 10, 2025

[AMDGPU] Change SGPR layout to striped caller/callee saved #127353

Merged

[AMDGPU] Fix test failures when expensive checks are enabled

af30b17

This PR fixes test failures introduced in #127353 when expensive checks are enabled.

shiltian force-pushed the users/shiltian/fix-expensive-checks-after-sgpr-layout-change branch from dd09b8d to af30b17 Compare March 10, 2025 17:44

shiltian changed the title ~~[MLIR][Affine] Fix crash in loop unswitching/hoistAffineIfOp (#130401)~~ [AMDGPU] Fix test failures when expensive checks are enabled Mar 10, 2025

shiltian requested review from jrbyrnes and arsenm March 10, 2025 17:44

arsenm approved these changes Mar 11, 2025

shiltian merged commit 72bb0a9 into main Mar 11, 2025
11 checks passed

shiltian deleted the users/shiltian/fix-expensive-checks-after-sgpr-layout-change branch March 11, 2025 04:35
