8353266: C2: Wrong execution with Integer.bitCount(int) intrinsic on AArch64 #25551

Open
wants to merge 4 commits into
base: master

Conversation

marc-chevalier
Member

@marc-chevalier marc-chevalier commented May 30, 2025

Problem

On AArch64, using Integer.bitCount can modify its argument. The problem comes from the implementation of popCountI on AArch64. For instance, this is what we get with the reproducer Reduced.java from the related issue:

; Load lFld into local x
ldr  x11,      [x10, #120]
; popCountI
mov  w11,      w11
mov  v16.d[0], x11
cnt  v16.8b,   v16.8b
addv b16,      v16.8b
mov  x13,      v16.d[0]
; [...]
; store local x (which is believed to still contain lFld) into result
str  x11,      [x10, #128]

The instruction mov w11, w11 is used to clear the 32 upper bits of x11, since we use popCountI (from Integer.bitCount): on AArch64 (as on other architectures), writing the 32 lower bits of a register resets the 32 upper bits. In short: the input is modified, but the implementation of popCountI doesn't declare it:

instruct popCountI(iRegINoSp dst, iRegIorL2I src, vRegF tmp) %{
  match(Set dst (PopCountI src));
  effect(TEMP tmp);
  [...]
%}
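
To make the symptom concrete from the Java side, here is a minimal sketch of the kind of code that hits this pattern. It is not the Reduced.java from the issue; the class name, the count field, and the iteration count are assumptions for illustration. Only the lFld value and the result check come from the test discussed in this PR.

```java
// Hypothetical sketch, not the reproducer attached to the issue.
final class BitCountClobberSketch {
    static long lFld = 0xfedc_ba98_7654_3210L;
    static long result;
    static int count;

    // Loads the field into a local, counts the bits of its low word, then
    // stores the local back. The local must still hold all 64 bits afterwards.
    static void test() {
        long x = lFld;
        count = Integer.bitCount((int) x); // long-to-int conversion feeding bitCount
        result = x;
    }

    public static void main(String[] args) {
        // Assumption: enough iterations for C2 to compile test(); the actual
        // threshold depends on the JVM and its flags.
        for (int i = 0; i < 50_000; i++) {
            test();
        }
        if (result != lFld) {
            throw new RuntimeException("Wrong result. Expected result = " + lFld
                    + "; Actual result = " + result);
        }
        System.out.println("count=" + count + ", result=0x" + Long.toHexString(result));
    }
}
```

On an affected AArch64 build, the store to result would see the truncated register once test() is C2-compiled; on a correct JVM the check passes.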

But then, why reset the upper word of x11 at all? It all starts with vector instructions:

cnt  v16.8b,   v16.8b
addv b16,      v16.8b

The 8b specifies that the instruction operates on the 8 lower bytes of v16. It would be nice to simply use 4b, but that doesn't exist: vector instructions can only work on either the whole 128-bit register or the 64 lower bits (by blocks of 1, 2, 4, 8 or 16 bytes). There is no suffix (or encoding) for a vector instruction to work only on the 32 lower bits, so, in order not to pollute the bit count, we need to reset the 32 upper bits of v16.d[0] (aka d16), that is v16.s[1], that is v16[32:63] in a more bit-explicit notation. Moreover, unlike with general-purpose registers, doing

mov  v16.s[0], w11

would set v16[0:31] to w11, but not reset v16[32:63]. That makes sense: using vector registers would be impractical if writing any piece reset the rest. So we indeed need to set all of v16[0:63], which

mov  w11,      w11
mov  v16.d[0], x11

does, but by destroying x11.
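
The difference between the two kinds of writes can be modelled in plain Java. This is only a sketch of the register semantics described above, not HotSpot code; the class and method names are made up, and the "stale" value stands for whatever happened to be left in v16.

```java
// Hypothetical model of the register semantics; names are illustrative only.
final class LaneWriteModel {
    static final long LOW_WORD = 0xFFFF_FFFFL;

    // Models `mov wN, wM`: writing a W register zero-extends into the X register.
    static long writeW(int w) {
        return w & LOW_WORD;
    }

    // Models `mov v.s[0], w`: only bits 0..31 change, bits 32..63 are kept.
    static long insertS0(long d, int w) {
        return (d & ~LOW_WORD) | (w & LOW_WORD);
    }

    // Models `cnt v.8b, v.8b` followed by `addv b, v.8b`:
    // one population count per byte, then the eight counts summed.
    static int cntAddv8b(long d) {
        int sum = 0;
        for (int i = 0; i < 8; i++) {
            sum += Integer.bitCount((int) (d >>> (8 * i)) & 0xFF);
        }
        return sum;
    }

    public static void main(String[] args) {
        int src = 0x7654_3210;
        long stale = 0xdead_beef_0000_0000L; // arbitrary leftover in v16[32:63]

        // Writing only lane s[0] keeps the stale upper word, so the count is polluted.
        System.out.println(cntAddv8b(insertS0(stale, src)) == Integer.bitCount(src));
        // Zero-extending first gives the expected count, at the cost of rewriting the GPR.
        System.out.println(cntAddv8b(writeW(src)) == Integer.bitCount(src));
    }
}
```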

Solution

Simply adding USE_KILL src to the effects would be nice, but it is unfortunately not possible: iRegIorL2I is an operand class (either a 32-bit register or an L2I of a 64-bit register), and operand classes cannot be used in effect lists.

The approach I went for is not to modify the source, but rather to write the two lower words of v16 that we are interested in separately:

mov  v16.s[1], wzr      ; Reset the 1-indexed word of v16, that is v16[32:63] <- 0
mov  v16.s[0], w11      ; Set the 0-indexed word of v16 to w11, that is v16[0:31] <- w11
cnt  v16.8b,   v16.8b
addv b16,      v16.8b
mov  x13,      v16.d[0]

Unlike other solutions, this is relatively straightforward, as it doesn't write the same bits twice, as this would, for instance:

mov  v16.d[0], xzr      ; Reset the 0-indexed double word of v16, that is v16[0:63] <- 0
mov  v16.s[0], w11      ; Set the 0-indexed word of v16 to w11, that is v16[0:31] <- w11

and it doesn't use additional temporaries, like this would:

mov  w12,      w11      ; Using a fresh register x12
mov  v16.d[0], x12

Using the zero register rather than an immediate is convenient, as it allows setting 32 bits at once, while a 32-bit immediate would not fit in a single instruction.
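
As a sanity check of the reasoning, the old and new sequences can be compared in a small Java model. This is a sketch under simplifying assumptions: only the low 64 bits of v16 are modelled, cnt followed by addv is collapsed into Long.bitCount (the sum of the eight per-byte counts), and the class, field, and method names are hypothetical.

```java
// Hypothetical model of the two instruction sequences; not HotSpot code.
final class PopCountSequenceModel {
    long x11; // source general-purpose register
    long d16; // low 64 bits of the temporary vector register v16
    long x13; // destination register

    PopCountSequenceModel(long x11, long d16) {
        this.x11 = x11;
        this.d16 = d16;
    }

    // cnt v16.8b, v16.8b ; addv b16, v16.8b ; mov x13, v16.d[0]
    private void countAndMoveOut() {
        d16 = Long.bitCount(d16); // equals the sum of the eight per-byte counts
        x13 = d16;
    }

    // Old sequence: zero-extends the source in place, which clobbers x11.
    void oldSequence() {
        x11 = x11 & 0xFFFF_FFFFL; // mov w11, w11
        d16 = x11;                // mov v16.d[0], x11
        countAndMoveOut();
    }

    // New sequence: writes the two low words of v16 separately, leaving x11 alone.
    void newSequence() {
        d16 = d16 & 0xFFFF_FFFFL;                                    // mov v16.s[1], wzr
        d16 = (d16 & 0xFFFF_FFFF_0000_0000L) | (x11 & 0xFFFF_FFFFL); // mov v16.s[0], w11
        countAndMoveOut();
    }

    public static void main(String[] args) {
        long input = 0xfedc_ba98_7654_3210L;
        PopCountSequenceModel before = new PopCountSequenceModel(input, -1L);
        before.oldSequence();
        PopCountSequenceModel after = new PopCountSequenceModel(input, -1L);
        after.newSequence();
        System.out.println("old: x13=" + before.x13 + ", x11=0x" + Long.toHexString(before.x11));
        System.out.println("new: x13=" + after.x13 + ", x11=0x" + Long.toHexString(after.x11));
    }
}
```

In this model both sequences produce the same count, but only the new one leaves x11 holding the full 64-bit input.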

Format

The printing of this instruction is not very satisfactory. We used to have something that renders in OptoAssembly as:

movw l2i(R29), l2i(R29)
mov  V16, l2i(R29) # vector (1D)
cnt  V16, V16      # vector (8B)
addv V16, V16      # vector (8B)
mov  R13, V16      # vector (1D)

This is... somewhat arguable. With context, I can understand or guess what movw l2i(R29), l2i(R29) means, but I don't think it's a very nice printout. Also, it's not clear that the second instruction works on the lower word of V16. Alas, my new version is not much better:

mov  V16, zr       # vector (1S)
mov  V16, l2i(R29) # vector (1S)
cnt  V16, V16      # vector (8B)
addv V16, V16      # vector (8B)
mov  R13, V16      # vector (1D)

It's not clear that the first instruction operates on the 1-indexed word of V16 while the second operates on the 0-indexed word. I couldn't find a nicer example in a similar situation, so I'm open to suggestions! Maybe simply hardcode it in the format, as such:

format %{ "mov    $tmp.s[1], zr\t# vector (1S)\n\t"
          "mov    $tmp.s[0], $src\t# vector (1S)\n\t"
          "cnt    $tmp, $tmp\t# vector (8B)\n\t"
          "addv   $tmp, $tmp\t# vector (8B)\n\t"
          "mov    $dst, $tmp\t# vector (1D)" %}

I'm not sure what the best practice is here.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8353266: C2: Wrong execution with Integer.bitCount(int) intrinsic on AArch64 (Bug - P3)(⚠️ The fixVersion in this issue is [26] but the fixVersion in .jcheck/conf is 25, a new backport will be created when this pr is integrated.)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/25551/head:pull/25551
$ git checkout pull/25551

Update a local copy of the PR:
$ git checkout pull/25551
$ git pull https://git.openjdk.org/jdk.git pull/25551/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 25551

View PR using the GUI difftool:
$ git pr show -t 25551

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/25551.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented May 30, 2025

👋 Welcome back mchevalier! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented May 30, 2025

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@openjdk

openjdk bot commented May 30, 2025

@marc-chevalier The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@marc-chevalier
Member Author

Opinion for people in charge: should I fix the fixVersion in the JBS issue, or wait a bit to integrate?

@marc-chevalier marc-chevalier marked this pull request as ready for review May 30, 2025 15:38
@openjdk openjdk bot added the rfr Pull request is ready for review label May 30, 2025
@mlbridge

mlbridge bot commented May 30, 2025

Webrevs

@dean-long
Member

dean-long commented May 31, 2025

> Opinion for people in charge: should I fix the fixVersion in the JBS issue, or wait a bit to integrate?

I would say yes, change the fixVersion to 25 and try to get this into 25, resulting in one less backport needed.

@sendaoYan
Member

Hi, how was this bug found? It seems the original test case was generated by a fuzzing tool.

test();
if (result != 0xfedc_ba98_7654_3210L) {
// Wrongly outputs the cut input 0x7654_3210 == 1985229328
throw new RuntimeException("Wrong result. lFld=" + lFld + "; result=" + result);
Member

How about:

throw new RuntimeException("Wrong result. Expected result = " + lFld + "; Actual result = " + result);

Comment on lines +7770 to +7771
__ mov($tmp$$FloatRegister, __ S, 1, zr); // tmp[32:63] <- 0
__ mov($tmp$$FloatRegister, __ S, 0, $src$$Register); // tmp[ 0:31] <- src
Contributor

"Where the entire 128-bit wide register is not fully utilized, the vector or scalar quantity is held in the least significant bits of the register, with the most significant bits being cleared to zero on a write."

Suggested change
- __ mov($tmp$$FloatRegister, __ S, 1, zr); // tmp[32:63] <- 0
- __ mov($tmp$$FloatRegister, __ S, 0, $src$$Register); // tmp[ 0:31] <- src
+ __ fmovs($tmp$$FloatRegister, $src$$Register);

should do it.

@theRealAph
Contributor

> Opinion for people in charge: should I fix the fixVersion in the JBS issue, or wait a bit to integrate?

Get it in 25. Low risk, significant Java compatibility bug.

Labels
hotspot-compiler [email protected] rfr Pull request is ready for review
4 participants