Faster decode and find_max_char implementations. #120196


Closed
rhpvorderman opened this issue Jun 7, 2024 · 4 comments
Labels
performance Performance or resource usage type-feature A feature request or enhancement

Comments

@rhpvorderman
Contributor

rhpvorderman commented Jun 7, 2024

Feature or enhancement

Proposal:

I have spotted a few inefficiencies in the stringlib implementations that hinder the compiler's ability to optimize the code. These could be fixed.

  • find_max_char, the 1-byte version. The current code checks 4- or 8-byte chunks, and handles alignment (which does not matter on x86-64 but may be important on other platforms) by checking one character at a time. This can be sped up by simply bitwise OR-ing all the characters together and performing a single check on the accumulated value. Furthermore, the loop can be unrolled into 32-byte chunks (4 size_t integers). The compiler then needs only a few extra OR instructions and can use 16-byte vectors, which are available on both x86-64 and ARM64, so it optimizes easily. The remainder of fewer than 32 bytes can likewise be handled by OR-ing those characters together and performing one check.
  • find_max_char, the 2-byte and 4-byte versions. These currently work with unrolls of 4; for the 2-byte version that means an 8-byte load. Increasing the unroll to 8 means 16-byte and 32-byte loads, which the compiler can vectorize.
  • Stringlib codecs.h utf8_decode, which on line 47 states "fast unrolled copy". These statements can be replaced by memcpy(*_p, *_s, SIZEOF_SIZE_T);. With `restrict`-qualified pointers the compiler understands that a read does not need to be performed twice, and a memcpy with a fixed size is always optimized out.
  • ascii_decode: same as find_max_char; this can be optimized using larger chunks.

Has this already been discussed elsewhere?

This is a minor feature which does not need prior discussion elsewhere.

Links to previous discussion of this feature:

No response

@rhpvorderman rhpvorderman added the type-feature A feature request or enhancement label Jun 7, 2024
@Eclips4 Eclips4 added the performance Performance or resource usage label Jun 7, 2024
@erlend-aasland
Contributor

Thanks for the report, Ruben; this looks interesting. Would you like to propose a PR?

@rhpvorderman
Contributor Author

I'd love to. I have the code ready, except I can't benchmark it because the build time is so long due to the module freezing step. See: https://discuss.python.org/t/building-python-takes-longer-than-an-hour-due-to-freezing-objects-and-still-counting/55144/1

@rhpvorderman
Contributor Author

I submitted the code in a PR. I hope I will be able to benchmark it at some point in the coming weeks, but given the slow build time this seems unlikely.

@rhpvorderman
Contributor Author

The PR was closed because the performance improvement did not justify the added code complexity.
