-
-
Notifications
You must be signed in to change notification settings - Fork 31.9k
gh-120397: improve the speed of str.count, bytes.count et al. for single characters by about 2x. #120398
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- What are the performances where you don't have the character in the sequence?
- What are the performances with larger inputs? can you generate inputs of size 10k and with a lot of occurrences of the character, no occurrence at all, and sparse occurrences?
- The timings that you report are for 4 statements. It would be better if we had single-case benchmarkings (where you perform only one
count
call).
Objects/stringlib/fastsearch.h
Outdated
while (cursor < end_ptr) { | ||
if (*cursor == p0) { | ||
count += 1; | ||
} | ||
cursor += 1; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need a while loop or can you live with a for-loop here?
Objects/stringlib/fastsearch.h
Outdated
/* By unrolling in chunks of 32, the compiler can auto vectorize, resulting | ||
in much better performance. */ | ||
while (cursor < unroll_end_ptr) { | ||
for(size_t i=0; i<32; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for(size_t i=0; i<32; i++) { | |
for(size_t i = 0; i < 32; i++) { |
Let us keep the same style as before.
Objects/stringlib/fastsearch.h
Outdated
const STRINGLIB_CHAR *restrict cursor = s; | ||
const STRINGLIB_CHAR *end_ptr = s + n; | ||
const STRINGLIB_CHAR *unroll_end_ptr = end_ptr - 31; | ||
/* By unrolling in chunks of 32, the compiler can auto vectorize, resulting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is 32 optimal for any supported architecture? or is it possible to use 64-bit chunks for 64-bit architecture?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are byte-chunks, not bitchunks. ARM64 and x86-64 have 16-byte (128-bit) vectors by default. Clang and GCC are able to use these properly. On other architectures the loop is simply unrolled.
MSVC does not unroll the loop however I see. That might have an impact on performance. I should test that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are byte-chunks, not bitchunks
Oh yes, sorry (well, my question remains the same actually).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On other architectures the loop is unrolled. Meaning it is going to be 32 compare and add instructions in a row.
This should be more optimal than looping, as the CPU can go on for a while until it hits a jump. Although the assembly does not look very elegant. Of course the performance impact of this can only be evaluated on these architectures, but I suspect it will be minimal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I was wondering whether we could have a macro defining the correct constant to use depending on the architecture. That way, it could be more or less optimized per architecture. But if we do not already have that information, let's keep your 32.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good idea. Let's see what happens when I get to the windows benchmarking. MSVC does not unroll the inner loop at all, so the performance is potentially going to be very poor. I was also thinking of enclosing the unrolled loop in compile guards and only allow it for architectures that are known to perform better this way. Anyway, I will get to that after some benchmarks in the coming days.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, for MSVC, you could unroll the loop manually actually. While I may understand that it's maybe an overkill, it might be worthwhile I'd say.
Objects/stringlib/fastsearch.h
Outdated
while (cursor < unroll_end_ptr) { | ||
for(size_t i=0; i<32; i++) { | ||
if (cursor[i] == p0) { | ||
count += 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you put the count >= maxcount
here, does the runtime increases a lot or not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Then this PR makes no sense any more.
Because the current code tells the compiler that it:
- It can read the next 32 bytes as these are all valid memory
- Only counting of the character needs to be performed
If it needs to abort reading when the count == maxcount
is made, it cannot use vectors to do the reading as 32 byte reads are not guaranteed.
So instead count is allowed to overshoot and a count >= maxcount
check is placed outside the loop. This will mean that the function will read at most 31 bytes too much.
Thank you for your very insightful comments @picnixz ! You are right this needs extensive benchmarks for all possible use cases. I will get back to this another day as I have also other tasks to attend to. |
Hmm. I did some further investigation of the code. Compilers can also optimize without all the hints, provided the maxcount check is not done. |
There we go. Before:
After:
@picnixz So sorry for letting you review all the unrolled code. Turns out a simple copy paste and special casing was enough 😅 . I hope I did not waste your time. EDIT: On the upside, an evaluation of all possible platforms is not needed! This code should never run slower on any platform. |
Objects/stringlib/fastsearch.h
Outdated
@@ -753,6 +753,22 @@ STRINGLIB(count_char)(const STRINGLIB_CHAR *s, Py_ssize_t n, | |||
} | |||
|
|||
|
|||
static inline Py_ssize_t | |||
STRINGLIB(count_char_no_maximum)(const STRINGLIB_CHAR *s, Py_ssize_t n, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe make that function private (I don't think it should be exposed except in this module).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is private, as it is static. The STRINGLIB macro is to prevent name clobbering. This function will be generated for STRINGLIB_CHAR==Py_UCS1, Py_UCS2 and PyUCS4. I think keeping it this way is correct. But I may be wrong of course. How do you suggest making it private?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh just by adding an underscore before its name (I should have been clearer when I said "private", I meant it in the naming but I think we don't care about underscores in C files).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, we don't use underscore prefix in Python for static functions. Moreover, the macro adds a prefix such as ucs1lib_
.
Misc/NEWS.d/next/Core and Builtins/2024-06-12-13-47-25.gh-issue-120397.n-I_cc.rst
Outdated
Show resolved
Hide resolved
…e-120397.n-I_cc.rst Co-authored-by: Bénédikt Tran <[email protected]>
Objects/stringlib/fastsearch.h
Outdated
@@ -753,6 +753,22 @@ STRINGLIB(count_char)(const STRINGLIB_CHAR *s, Py_ssize_t n, | |||
} | |||
|
|||
|
|||
static inline Py_ssize_t | |||
STRINGLIB(count_char_no_maximum)(const STRINGLIB_CHAR *s, Py_ssize_t n, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, we don't use underscore prefix in Python for static functions. Moreover, the macro adds a prefix such as ucs1lib_
.
Misc/NEWS.d/next/Core and Builtins/2024-06-12-13-47-25.gh-issue-120397.n-I_cc.rst
Outdated
Show resolved
Hide resolved
cc @serhiy-storchaka @pitrou: This change looks very promising. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Misc/NEWS.d/next/Core and Builtins/2024-06-12-13-47-25.gh-issue-120397.n-I_cc.rst
Outdated
Show resolved
Hide resolved
251f77b
to
ce9ab9b
Compare
Co-authored-by: Nice Zombies <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Head branch was pushed to by a user without write access
@@ -773,6 +789,9 @@ FASTSEARCH(const STRINGLIB_CHAR* s, Py_ssize_t n, | |||
else if (mode == FAST_RSEARCH) | |||
return STRINGLIB(rfind_char)(s, n, p[0]); | |||
else { | |||
if (maxcount == PY_SSIZE_T_MAX) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (maxcount == PY_SSIZE_T_MAX) { | |
if (maxcount >= n) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maxcount is only used in the replace function, it is very unlikely that this condition will ever be triggered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, but there's no reason to check for PY_SSIZE_T_MAX
specifically when this works as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes you are correct. However, this function needs some refactoring, as this maxcount provision is only there for replace. Replace for single characters is special.cased elsewhere, so maxcount is actually always Pyssize_t_max I think. I want to revisit this at a later point.
@vstinner, so I had to botch the automerge. I made a few mistakes when implementing the suggestions and just after the push my attention was required elsewhere. All tests pass now. |
Merged, thank you. |
Thank you for the review and the merging. It was a pleasant process. I think I will make more of these "making CPython faster, one function at a time" PRs. If it is preferred that I bundle these, please let me know. |
I prefer to do it one function per change, as you can see it's already complicated to change a single function. |
Benchmarks using:
This is testing a real use case where the GC content of a DNA sequence is calculated. Other possible usages are counting newlines.
Before:
After: