Faster decode and find_max_char implementations. #120196


Closed
rhpvorderman opened this issue Jun 7, 2024 · 4 comments
Labels
performance Performance or resource usage type-feature A feature request or enhancement

Comments

@rhpvorderman
Contributor

rhpvorderman commented Jun 7, 2024

Feature or enhancement

Proposal:

I have spotted a few inefficiencies in the stringlib implementations that hinder the compiler's ability to optimize the code. These could be fixed.

  • find_max_char, the 1-byte version. The current code checks 4- or 8-byte chunks, and handles alignment (which does not matter on x86-64 but may be important on other platforms) by checking one character at a time. This can be sped up by simply bitwise OR-ing all the characters together and performing a single check on the accumulated value. Furthermore, the loop can be unrolled into 32-byte chunks (4 size_t integers). The compiler then needs only a few extra OR instructions and can use 16-byte vectors, which are available on both x86-64 and ARM64, so it optimizes easily. The remainder of fewer than 32 bytes can likewise be handled by OR-ing those characters together and performing one check.
  • find_max_char, the 2-byte and 4-byte versions. These currently work with unrolls of 4; for the 2-byte version that means an 8-byte load. Increasing the unroll to 8 means 16-byte and 32-byte loads, which the compiler can vectorize.
  • Stringlib codecs.h utf8_decode, which on line 47 states "fast unrolled copy". These statements can be replaced by memcpy(*_p, *_s, SIZEOF_SIZE_T);. With `restrict`-qualified pointers the compiler understands that a read does not need to be performed twice, and a memcpy with a fixed size is always optimized out.
  • ascii_decode: same as find_max_char; this can be optimized using larger chunks.

Has this already been discussed elsewhere?

This is a minor feature which does not need prior discussion elsewhere.

Links to previous discussion of this feature:

No response

@rhpvorderman rhpvorderman added the type-feature A feature request or enhancement label Jun 7, 2024
@Eclips4 Eclips4 added the performance Performance or resource usage label Jun 7, 2024
@erlend-aasland
Contributor

Thanks for the report, Ruben; this looks interesting. Would you like to propose a PR?

@rhpvorderman
Contributor Author

I'd love to. I have the code ready, except I can't benchmark it because the build time is so long due to the module freezing step. See: https://discuss.python.org/t/building-python-takes-longer-than-an-hour-due-to-freezing-objects-and-still-counting/55144/1

@rhpvorderman
Contributor Author

I submitted the code in a PR. I hope I will be able to benchmark it at some point in the coming weeks, but given the slow build time this seems unlikely.

@rhpvorderman
Contributor Author

The PR was closed because the performance improvement did not justify the added code complexity.
