gh-120397: improve the speed of str.count, bytes.count et al. for single characters by about 2x. #120398

Merged 16 commits on Jun 13, 2024

Contributor

@rhpvorderman rhpvorderman commented Jun 12, 2024

Benchmarks using:

./python -m timeit -s "seq='TTTATGGTTATTTATATTTATTTATTTTTGAGATGGAGTTTTGCTCTTGCTGCCTAGGCTGGAGTGCAATGGCACGATCTCGGCTCACTGCAACCTCCGCCTCCCAGGTTCAAGCGATTCTCCTGCCTCAGCCTCCTGAGTAGCTGGGATT'" "seq.count('A'); seq.count('C'); seq.count('G');  seq.count('T')"

This tests a real use case: calculating the GC content of a DNA sequence. Another possible use is counting newlines.

Before:

500000 loops, best of 5: 461 nsec per loop

After:

1000000 loops, best of 5: 216 nsec per loop
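
For readers unfamiliar with the use case, GC content is simply the fraction of bases that are G or C, which is why the benchmark issues several single-character counts on the same sequence. A minimal sketch in C, to match the code under review; `count_byte` and `gc_content` are hypothetical helper names, not part of CPython:

```c
#include <stddef.h>

/* Count how many of the n bytes in s equal c. */
static size_t
count_byte(const char *s, size_t n, char c)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (s[i] == c) {
            count++;
        }
    }
    return count;
}

/* GC content: the fraction of bases that are G or C.
   Returns 0.0 for an empty sequence. */
static double
gc_content(const char *seq, size_t n)
{
    if (n == 0) {
        return 0.0;
    }
    size_t gc = count_byte(seq, n, 'G') + count_byte(seq, n, 'C');
    return (double)gc / (double)n;
}
```

Each base requires its own pass over the sequence, so the cost of a single-character count dominates this kind of workload.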

@rhpvorderman rhpvorderman changed the title gh-120397: gh-120397: improve the speed of str.count, bytes.count et al. for single characters. Jun 12, 2024
Member

@picnixz picnixz left a comment

  • What is the performance when the character does not occur in the sequence?
  • What is the performance with larger inputs? Can you generate inputs of size 10k with many occurrences of the character, with no occurrences at all, and with sparse occurrences?
  • The timings you report are for 4 statements. It would be better to have single-case benchmarks (where you perform only one count call).

Comment on lines 760 to 765
while (cursor < end_ptr) {
    if (*cursor == p0) {
        count += 1;
    }
    cursor += 1;
}
Member

Do you need a while loop or can you live with a for-loop here?

/* By unrolling in chunks of 32, the compiler can auto vectorize, resulting
   in much better performance. */
while (cursor < unroll_end_ptr) {
    for(size_t i=0; i<32; i++) {
Member

Suggested change
- for(size_t i=0; i<32; i++) {
+ for(size_t i = 0; i < 32; i++) {

Let us keep the same style as before.

const STRINGLIB_CHAR *restrict cursor = s;
const STRINGLIB_CHAR *end_ptr = s + n;
const STRINGLIB_CHAR *unroll_end_ptr = end_ptr - 31;
/* By unrolling in chunks of 32, the compiler can auto vectorize, resulting
Member

Is 32 optimal for every supported architecture? Or is it possible to use 64-bit chunks on 64-bit architectures?

Contributor Author

These are byte-chunks, not bitchunks. ARM64 and x86-64 have 16-byte (128-bit) vectors by default. Clang and GCC are able to use these properly. On other architectures the loop is simply unrolled.

However, I see that MSVC does not unroll the loop. That might have an impact on performance. I should test that.

Member

> These are byte-chunks, not bitchunks

Oh yes, sorry (well, my question remains the same actually).

Contributor Author

On other architectures the loop is unrolled, meaning there will be 32 compare-and-add instructions in a row.
This should be faster than looping, as the CPU can keep going for a while before it hits a jump, although the assembly does not look very elegant. Of course, the performance impact can only be evaluated on those architectures, but I suspect it will be minimal.

Member

@picnixz picnixz Jun 12, 2024

Actually, I was wondering whether we could have a macro defining the correct constant to use depending on the architecture. That way, it could be more or less optimized per architecture. But if we do not already have that information, let's keep your 32.

Contributor Author

That's a good idea. Let's see what happens when I get to the Windows benchmarking. MSVC does not unroll the inner loop at all, so the performance is potentially going to be very poor. I was also thinking of enclosing the unrolled loop in compile guards and only allowing it on architectures that are known to perform better this way. Anyway, I will get to that after some benchmarks in the coming days.
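
The compile-guard idea discussed here could look roughly like the sketch below. Everything in it is an assumption made for illustration: the macro name `COUNT_CHAR_UNROLL`, the function name `count_char_guarded`, the list of architectures, and the chunk sizes are hypothetical and not backed by benchmarks. `unsigned char` stands in for `STRINGLIB_CHAR` and `ptrdiff_t` for `Py_ssize_t`.

```c
#include <stddef.h>

/* Hypothetical guard: the architecture list and the chunk sizes are
   illustrative assumptions, not measured results. */
#if defined(__x86_64__) || defined(_M_X64) || \
    defined(__aarch64__) || defined(_M_ARM64)
#  define COUNT_CHAR_UNROLL 32
#else
#  define COUNT_CHAR_UNROLL 1   /* plain loop everywhere else */
#endif

static ptrdiff_t
count_char_guarded(const unsigned char *s, ptrdiff_t n, unsigned char p0)
{
    ptrdiff_t count = 0;
    ptrdiff_t i = 0;
#if COUNT_CHAR_UNROLL > 1
    /* Fixed-size chunks, only compiled in on the listed architectures. */
    for (; n - i >= COUNT_CHAR_UNROLL; i += COUNT_CHAR_UNROLL) {
        for (ptrdiff_t j = 0; j < COUNT_CHAR_UNROLL; j++) {
            count += (s[i + j] == p0);
        }
    }
#endif
    /* Remaining bytes (or the whole input when unrolling is disabled). */
    for (; i < n; i++) {
        count += (s[i] == p0);
    }
    return count;
}
```

Both branches of the guard compute the same result, so the choice only affects speed, which would still have to be measured per platform.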

Member

Alternatively, for MSVC, you could unroll the loop manually. I understand that it may be overkill, but it might be worthwhile.
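
Manual unrolling, as suggested above, means writing the repeated comparisons out by hand instead of relying on the compiler to expand an inner loop. A minimal sketch follows; the unroll factor of 4 is chosen only to keep the example short (the PR under review used chunks of 32), `count_char_unrolled4` is a hypothetical name, and no claim is made here about how MSVC would actually optimize it.

```c
#include <stddef.h>

static ptrdiff_t
count_char_unrolled4(const unsigned char *s, ptrdiff_t n, unsigned char p0)
{
    ptrdiff_t count = 0;
    ptrdiff_t i = 0;

    /* Four comparisons per iteration, written out explicitly. */
    for (; n - i >= 4; i += 4) {
        count += (s[i] == p0);
        count += (s[i + 1] == p0);
        count += (s[i + 2] == p0);
        count += (s[i + 3] == p0);
    }
    /* Up to three leftover elements. */
    for (; i < n; i++) {
        count += (s[i] == p0);
    }
    return count;
}
```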

while (cursor < unroll_end_ptr) {
    for(size_t i=0; i<32; i++) {
        if (cursor[i] == p0) {
            count += 1;
Member

If you put the count >= maxcount check here, does the runtime increase a lot?

Contributor Author

Yes. Then this PR makes no sense any more.

Because the current code tells the compiler that:

  • It can read the next 32 bytes, as these are all valid memory
  • Only counting of the character needs to be performed

If it had to stop reading as soon as count == maxcount, it could not use vectors for the reading, as 32-byte reads would not be guaranteed.
So instead count is allowed to overshoot, and a count >= maxcount check is placed outside the loop. This means the function reads at most 31 bytes more than strictly necessary.
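
The overshoot-and-clamp idea described above can be sketched as follows. This is a minimal illustration rather than the PR's actual code: `count_char_chunked` is a hypothetical name, `unsigned char` stands in for `STRINGLIB_CHAR`, `ptrdiff_t` stands in for `Py_ssize_t`, and indexing is used instead of the cursor pointers from the diff.

```c
#include <stddef.h>

static ptrdiff_t
count_char_chunked(const unsigned char *s, ptrdiff_t n,
                   unsigned char p0, ptrdiff_t maxcount)
{
    ptrdiff_t count = 0;
    ptrdiff_t i = 0;

    /* Full 32-byte chunks: the inner loop has a fixed trip count and no
       early exit, so the compiler is free to vectorize it. */
    while (n - i >= 32) {
        for (ptrdiff_t j = 0; j < 32; j++) {
            if (s[i + j] == p0) {
                count += 1;
            }
        }
        i += 32;
        /* The limit is checked once per chunk, so count may overshoot
           before it is clamped here. */
        if (count >= maxcount) {
            return maxcount;
        }
    }
    /* Tail: fewer than 32 bytes remain, so check the limit per match. */
    for (; i < n; i++) {
        if (s[i] == p0) {
            count += 1;
            if (count >= maxcount) {
                return maxcount;
            }
        }
    }
    return count;
}
```

Moving the limit check into the inner loop would add a data-dependent exit to every iteration, which is the situation the comment above describes as defeating vectorization.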

@rhpvorderman
Contributor Author

Thank you for your very insightful comments, @picnixz! You are right, this needs extensive benchmarks for all possible use cases. I will get back to this another day, as I also have other tasks to attend to.

@rhpvorderman
Contributor Author

Hmm. I investigated the code further. Compilers can also optimize without all the hints, provided the maxcount check is not done.
It turns out that the check is only needed when replacing characters. So the code can also be optimized by special-casing the case without a maximum. That does not require an extensive performance evaluation, as the code remains mostly the same.
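
A limit-free counting loop of the kind described here might look like the sketch below. It is an assumption-laden illustration, not the merged function body: `count_char_no_maximum_sketch` is a hypothetical name, and `unsigned char` and `ptrdiff_t` stand in for `STRINGLIB_CHAR` and `Py_ssize_t`. Whether a given compiler vectorizes it depends on the compiler and its optimization flags.

```c
#include <stddef.h>

static ptrdiff_t
count_char_no_maximum_sketch(const unsigned char *s, ptrdiff_t n,
                             unsigned char p0)
{
    ptrdiff_t count = 0;
    /* No early exit: the trip count is known up front and every element
       is read unconditionally, which leaves the compiler free to
       vectorize without manual unrolling. */
    for (ptrdiff_t i = 0; i < n; i++) {
        if (s[i] == p0) {
            count++;
        }
    }
    return count;
}
```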

@rhpvorderman
Contributor Author

rhpvorderman commented Jun 12, 2024

There we go. Before:

./python -m timeit -s "seq='TTTATGGTTATTTATATTTATTTATTTTTGAGATGGAGTTTTGCTCTTGCTGCCTAGGCTGGAGTGCAATGGCACGATCTCGGCTCACTGCAACCTCCGCCTCCCAGGTTCAAGCGATTCTCCTGCCTCAGCCTCCTGAGTAGCTGGGATT'" "seq.count('A')"
2000000 loops, best of 5: 103 nsec per loop

After:

./python -m timeit -s "seq='TTTATGGTTATTTATATTTATTTATTTTTGAGATGGAGTTTTGCTCTTGCTGCCTAGGCTGGAGTGCAATGGCACGATCTCGGCTCACTGCAACCTCCGCCTCCCAGGTTCAAGCGATTCTCCTGCCTCAGCCTCCTGAGTAGCTGGGATT'" "seq.count('A')"
5000000 loops, best of 5: 55.2 nsec per loop

@picnixz So sorry for letting you review all the unrolled code. Turns out a simple copy paste and special casing was enough 😅 . I hope I did not waste your time.

EDIT: On the upside, an evaluation of all possible platforms is not needed! This code should never run slower on any platform.

@@ -753,6 +753,22 @@ STRINGLIB(count_char)(const STRINGLIB_CHAR *s, Py_ssize_t n,
}


static inline Py_ssize_t
STRINGLIB(count_char_no_maximum)(const STRINGLIB_CHAR *s, Py_ssize_t n,
Member

Maybe make that function private (I don't think it should be exposed except in this module).

Contributor Author

It is private, as it is static. The STRINGLIB macro is there to prevent name clobbering. This function will be generated for STRINGLIB_CHAR == Py_UCS1, Py_UCS2 and Py_UCS4. I think keeping it this way is correct. But I may be wrong of course. How do you suggest making it private?

Member

Oh just by adding an underscore before its name (I should have been clearer when I said "private", I meant it in the naming but I think we don't care about underscores in C files).

Member

No, we don't use underscore prefix in Python for static functions. Moreover, the macro adds a prefix such as ucs1lib_.
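
The name-prefixing mechanism mentioned in this thread can be illustrated with a simplified imitation. In the sketch below the same function text is emitted twice under different macro definitions, giving two distinct `static` functions. The `ucs1lib_` prefix comes from the comment above; the `ucs2lib_` prefix, the use of `uint8_t`/`uint16_t` in place of `Py_UCS1`/`Py_UCS2`, `ptrdiff_t` in place of `Py_ssize_t`, and the function body are assumptions made for this example. In the real code base the text lives in a header that is included several times rather than being written out twice.

```c
#include <stddef.h>
#include <stdint.h>

/* First "instantiation": 1-byte characters. */
#define STRINGLIB(F) ucs1lib_##F
#define STRINGLIB_CHAR uint8_t

static inline ptrdiff_t
STRINGLIB(count_char_no_maximum)(const STRINGLIB_CHAR *s, ptrdiff_t n,
                                 const STRINGLIB_CHAR p0)
{
    ptrdiff_t count = 0;
    for (ptrdiff_t i = 0; i < n; i++) {
        count += (s[i] == p0);
    }
    return count;
}

#undef STRINGLIB
#undef STRINGLIB_CHAR

/* Second "instantiation": 2-byte characters, same source text. */
#define STRINGLIB(F) ucs2lib_##F
#define STRINGLIB_CHAR uint16_t

static inline ptrdiff_t
STRINGLIB(count_char_no_maximum)(const STRINGLIB_CHAR *s, ptrdiff_t n,
                                 const STRINGLIB_CHAR p0)
{
    ptrdiff_t count = 0;
    for (ptrdiff_t i = 0; i < n; i++) {
        count += (s[i] == p0);
    }
    return count;
}

#undef STRINGLIB
#undef STRINGLIB_CHAR
```

After preprocessing, the two definitions are named `ucs1lib_count_char_no_maximum` and `ucs2lib_count_char_no_maximum`, so they do not clash, and `static` keeps both local to the translation unit.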

@vstinner
Member

cc @serhiy-storchaka @pitrou: This change looks very promising.

Member

@serhiy-storchaka serhiy-storchaka left a comment

LGTM.

@rhpvorderman rhpvorderman changed the title gh-120397: improve the speed of str.count, bytes.count et al. for single characters. gh-120397: improve the speed of str.count, bytes.count et al. for single characters by about 2x. Jun 13, 2024
Member

@vstinner vstinner left a comment

LGTM

@vstinner vstinner enabled auto-merge (squash) June 13, 2024 09:40
auto-merge was automatically disabled June 13, 2024 12:06

Head branch was pushed to by a user without write access

@@ -773,6 +789,9 @@ FASTSEARCH(const STRINGLIB_CHAR* s, Py_ssize_t n,
        else if (mode == FAST_RSEARCH)
            return STRINGLIB(rfind_char)(s, n, p[0]);
        else {
            if (maxcount == PY_SSIZE_T_MAX) {
Contributor

Suggested change
- if (maxcount == PY_SSIZE_T_MAX) {
+ if (maxcount >= n) {

Contributor Author

maxcount is only used in the replace function, so it is very unlikely that this condition will ever be triggered.

Contributor

OK, but there's no reason to check for PY_SSIZE_T_MAX specifically when this works as well.

Contributor Author

Yes, you are correct. However, this function needs some refactoring, as the maxcount provision is only there for replace. Replace for single characters is special-cased elsewhere, so maxcount is actually always PY_SSIZE_T_MAX, I think. I want to revisit this at a later point.
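
The reasoning behind the suggested `maxcount >= n` condition is that at most n characters can match, so a limit of n or more can never cut the count short. The sketch below illustrates that dispatch under stated assumptions: `count_limited`, `count_unlimited` and `count_dispatch` are hypothetical names, `unsigned char` and `ptrdiff_t` replace `STRINGLIB_CHAR` and `Py_ssize_t`, and it makes no claim about which condition the merged code uses.

```c
#include <stddef.h>

/* Counting with a limit: stops as soon as maxcount matches are seen. */
static ptrdiff_t
count_limited(const unsigned char *s, ptrdiff_t n,
              unsigned char p0, ptrdiff_t maxcount)
{
    ptrdiff_t count = 0;
    for (ptrdiff_t i = 0; i < n; i++) {
        if (s[i] == p0) {
            count++;
            if (count >= maxcount) {
                return maxcount;
            }
        }
    }
    return count;
}

/* Counting without a limit: no early exit anywhere in the loop. */
static ptrdiff_t
count_unlimited(const unsigned char *s, ptrdiff_t n, unsigned char p0)
{
    ptrdiff_t count = 0;
    for (ptrdiff_t i = 0; i < n; i++) {
        count += (s[i] == p0);
    }
    return count;
}

static ptrdiff_t
count_dispatch(const unsigned char *s, ptrdiff_t n,
               unsigned char p0, ptrdiff_t maxcount)
{
    /* A count can never exceed n, so any maxcount >= n behaves exactly
       like "no limit" and the limit-free loop gives the same answer. */
    if (maxcount >= n) {
        return count_unlimited(s, n, p0);
    }
    return count_limited(s, n, p0, maxcount);
}
```

The `maxcount >= n` test covers the `PY_SSIZE_T_MAX` case as well, since that value is never smaller than n.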

@rhpvorderman
Contributor Author

@vstinner, so I had to botch the automerge. I made a few mistakes when implementing the suggestions and just after the push my attention was required elsewhere. All tests pass now.

@vstinner vstinner merged commit 2078eb4 into python:main Jun 13, 2024
36 checks passed
@vstinner
Member

Merged, thank you.

@rhpvorderman
Contributor Author

Thank you for the review and the merging. It was a pleasant process. I think I will make more of these "making CPython faster, one function at a time" PRs. If it is preferred that I bundle these, please let me know.

@vstinner
Member

I prefer to do it one function per change, as you can see it's already complicated to change a single function.


5 participants