gh-120196: Faster ascii_decode and find_max_char implementations #120212

rhpvorderman · 2024-06-07T10:22:34Z

One byte find_max_char. This now follows the following algorithm.

Find the start where the first aligned load can take place.
Bitwise OR all characters together up to the aligned start and check with ASCII_MASK.
At aligned start, start loading size_t units in groups of 4. Bitwise OR together and check with ASCII_MASK.
When there are less than 4 * size_t units left, get the remaining size_t units and the remainder of the single char units and bitwise OR them all together.

This methodology has the following advantages:

The number of checks is significantly reduced. (Both ASCII_MASK check and loop iterator check).
Due to the extremely low cost of a OR operation handling 4x size_t chunks of data is barely more expensive than single size_t s. As a result the penalty for not exiting earlier if a byte larger than 127 is found is minimal.
The compiler is able to vectorize the bitwise OR on the 32 KB chunks. Please note that this only works for GCC 14.1 and higher.

Two and four byte find_max_char.

Increase the unroll size from 4 to 8. This means loading 16 and 32 bytes simultaneously for Py_UCS2 and Py_UCS4. GCC 14.1 can now vectorize the load and the bitwise OR, which should result in faster speed.

ascii_decode:

The same advantages as decribed in find_max_char. This function also performs some copying. Using the restrict qualifier on the pointers allows GCC to do a vector store operation. Unfortunately GCC < 14.1 loads the value once for the bitwise OR and once for the load store. Only the load store is vectorized in GCC < 14.1. Luckily, since both loads are from the same cache line and no other loads are performed in between, the latency this introduces should be near zero.

So from GCC 14.1 onwards the performance should improve. My debian system has GCC 12.2 but that also should give some improvements.

If only I could benchmark: https://discuss.python.org/t/building-python-takes-longer-than-an-hour-due-to-freezing-objects-and-still-counting/55144

Issue: Faster decode and find_max_char implementations. #120196

…x_char

erlend-aasland · 2024-06-07T10:46:15Z

cc. @eendebakpt: you might be interested in this PR.

eendebakpt · 2024-06-07T12:08:47Z

@rhpvorderman Thanks for making the PR. I will look at it the coming days. Which python operations do you expect (or hope) that will benefit from the PR? A (micro) benchmark for this would be nice

rhpvorderman · 2024-06-07T12:17:57Z

I did some benchmarking. As expected short decodings do not have much benefit. I tested on reading individual lines from a fairly large ASCII file. (FASTQ format, 1.5 GB)
The result before

Benchmark 1: ./python decode_test_lines_ascii.py ~/test/5millionreads_R1.fastq
  Time (mean ± σ):      1.567 s ±  0.019 s    [User: 1.339 s, System: 0.228 s]
  Range (min … max):    1.543 s …  1.597 s    10 runs
Benchmark 1: ./python decode_test_blocks_ascii.py ~/test/5millionreads_R1.fastq
  Time (mean ± σ):     789.3 ms ±   7.4 ms    [User: 535.3 ms, System: 253.9 ms]
  Range (min … max):   776.0 ms … 803.8 ms    10 runs

after

Benchmark 1: ./python decode_test_lines_ascii.py ~/test/5millionreads_R1.fastq
  Time (mean ± σ):      1.519 s ±  0.014 s    [User: 1.271 s, System: 0.247 s]
  Range (min … max):    1.504 s …  1.550 s    10 runs
Benchmark 1: ./python decode_test_blocks_ascii.py ~/test/5millionreads_R1.fastq
  Time (mean ± σ):     730.3 ms ±   5.8 ms    [User: 490.0 ms, System: 240.1 ms]
  Range (min … max):   723.4 ms … 739.9 ms    10 runs

The file had a pattern of line lengths of 30, 152, 2 152. As is visible, the PR barely has an impact on run time decoding individual lines.

When decoding large blocks of text (4KB per iteration) the performance benefit is clearly visible.

This would mostly benefit users therefore that

Know how to make python perform well and operate on blocks of text
Have no clue or don't care for their use case and read the entire contents of the file in one go.

I have only benchmarked ascii_decode here, but the benefits for find_max_char should be similar. Negligible for small strings, more substantial for bigger data.

Objects/unicodeobject.c

eendebakpt · 2024-06-07T12:17:53Z

Objects/unicodeobject.c

@@ -4710,7 +4712,25 @@ ascii_decode(const char *start, const char *end, Py_UCS1 *dest)
        /* Help allocation */
        const char *_p = p;
        Py_UCS1 * q = dest;
-        while (_p + SIZEOF_SIZE_T <= end) {
+        while (_p <= unroll_end) {
+            const size_t *restrict __p = (const size_t *)_p;


Is the restrict keyword and the assignment to value0, ..., value3 required for the compiler? If not, then this code can be written more compact.

In general I think the compiler can judge from the function that no writes happen in either __p or _q so pointer aliasing is not problematic. But I'd rather make this explicit rather than relying on the compiler to deduce this. MSVC and Clang do this scalarly despite the hint. GCC correctly vectorizes the load store and does not need the hint. I cannot judge all possible compilers out there. I think it is more likely that clang and MSVC will do the correct thing in the future with the hint than without it.

It would only save the restrict keyword as the (size_t *) cast is still needed, as that makes the rest of the code more readable. Are the 8 characters extra width that problematic?

eendebakpt · 2024-06-10T19:13:43Z

Objects/stringlib/find_max_char.h

+    size_t value = 0;
+    while (_p < aligned_end) {
+        value |= *_p;
+        _p += 1;
+    }
+    p = (const unsigned char *)_p;
+    while (p < _end) {
+        value |= *p;
+        p += 1;
+    }


We could combine these two loops into a single loop right? Worst case we do a single loop over 32-1=31 bytes instead of two while loops (the loop over _p (7 steps) and the loop over p (3 steps)).

The more optimal solution is to do an unaligned load of a size_t at (end - SIZEOF_SIZE_T). That would remove the entire last while loop. I do not know if all architectures support unaligned loads however. I guess some could theoretically abort?

I single loop over 31 bytes taking 31 steps is less optimal than 2 while loops taking 10 steps in total. (Or even 7 in the case of 64-bit size_t, that would be the majority of platforms).

You can use the memcpy trick to do an unaligned load, it will be optimized by any reasonable compiler, for example:

size_t value; memcpy(&value, end - sizeof(value), sizeof(value));

That's a very nice suggestion! That would simplify the code a lot.

Objects/stringlib/find_max_char.h

eendebakpt · 2024-06-10T19:28:10Z

Objects/stringlib/find_max_char.h

@@ -20,23 +20,45 @@ Py_LOCAL_INLINE(Py_UCS4)
 STRINGLIB(find_max_char)(const STRINGLIB_CHAR *begin, const STRINGLIB_CHAR *end)
 {
    const unsigned char *p = (const unsigned char *) begin;
+    const unsigned char *_end = (const unsigned char *)end;
+    const size_t *aligned_end = (const size_t *)(_end - SIZEOF_SIZE_T);


Just checking whether I understand correctly: aligned_end is not aligned, but good enough to serve as the end of the loop over aligned values. To make it really aligned we would need to do something like aligned_end = _Py_SIZE_ROUND_DOWN(_end, SIZEOF_SIZE_T)?

Yes that is the gist of it. I tried something similar to _Py_SIZE_ROUND_DOWN and got segfaults so I opted for this simpler tried and true solution. The name aligned_end is probably not the best, but I can't think of a better name now.

Should be something like

Py_ssize_t n = end - begin; const STRINGLIB_CHAR *p = begin; const STRINGLIB_CHAR *aligned_end = begin + _Py_SIZE_ROUND_DOWN(n, ALIGNOF_SIZE_T);

Would be interested in understanding why you got a segfault though.

Misc/NEWS.d/next/Core and Builtins/2024-06-07-13-29-57.gh-issue-120196.uf2pIh.rst

Co-authored-by: Pieter Eendebak <[email protected]>

eendebakpt

The PR itself looks good. I am not sure the performance gain is enough to make this worthwhile, but I will let a core developer decide on this.

@pitrou As original author of this part of the code, could you review?

rhpvorderman · 2024-06-11T09:58:45Z

I am not sure the performance gain is enough to make this worthwhile,

Multiplied by the times that python decodes utf-8 text I think the total power saving worldwide would amount to something significant.

pitrou · 2024-06-11T11:52:15Z

Objects/unicodeobject.c

@@ -4700,6 +4700,8 @@ static Py_ssize_t
 ascii_decode(const char *start, const char *end, Py_UCS1 *dest)
 {
    const char *p = start;
+    const char *size_t_end  = end - SIZEOF_SIZE_T;
+    const char *unrolled_end = end - (4 * SIZEOF_SIZE_T);


@encukou Should we still use macros such as SIZEOF_SIZE_T and ALIGNOF_SIZE_T, or should we prefer sizeof and _Alignof?

IMO, we should prefer sizeof and _Alignof here.
Things are a bit different in public headers (where sizeof is OK but _Alignof isn't) or the preprocessor (in #if, neither work).

(There is no C API WG formal guideline yet; this is a personal recommendation. If anyone wants a wider discussion, do that in Discourse.)

pitrou · 2024-06-11T11:55:47Z

Objects/unicodeobject.c

@@ -4700,6 +4700,8 @@ static Py_ssize_t
 ascii_decode(const char *start, const char *end, Py_UCS1 *dest)
 {
    const char *p = start;
+    const char *size_t_end  = end - SIZEOF_SIZE_T;
+    const char *unrolled_end = end - (4 * SIZEOF_SIZE_T);

 #if SIZEOF_SIZE_T <= SIZEOF_VOID_P


Is CPython supported on some platform where size_t is strictly greater than void*? I know this PR doesn't touch this conditional, but I'm curious. @encukou

I'm not aware of any platform where sizeof(size_t) != sizeof(void*).

This is not really about supported platforms (i.e. where we run the tests), or currently supported platforms.
We should aim for standard C whenever it's reasonable; if a speedup needs an extra assumption it should document that with this kind of #if.

pitrou · 2024-06-11T12:02:40Z

When decoding large blocks of text (4KB per iteration) the performance benefit is clearly visible.

But it's still quite minor. In one instance (git main), you read your ASCII file at 1.9 GB/s. In the other instance (this PR), you read it at 2 GB/s. It's already unlikely that your processing pipeline is bottlenecked by the file read phase, so this difference will not really be noticeable in an actual use case.

Thoughts? @Fidget-Spinner @corona10 @markshannon

Fidget-Spinner · 2024-06-11T12:36:55Z

Sorry I'm not an expert on low-level OS or architecture internals, so I can't help here.

rhpvorderman · 2024-06-11T12:39:59Z

But it's still quite minor. In one instance (git main), you read your ASCII file at 1.9 GB/s. In the other instance (this PR), you read it at 2 GB/s.

I should note that performance should be even better at GCC 14, due to better autovectorization. That compiler is not widely shipped yet, and benchmarks are at GCC12.

pitrou · 2024-06-11T12:51:47Z

For another perspective, on a pure micro-benchmark, Python 3.12 can already decode ASCII bytes at 17 GB/s.

$ python3.12 -m timeit -s "s = b'x' * (1000**2)" "s.decode('ascii')"
5000 loops, best of 5: 56.7 usec per loop

I find it unlikely that going faster than 17 GB/s will make a meaningful difference on real-world workloads.

rhpvorderman · 2024-06-11T13:03:04Z

On my PC before:

$ ./python -m timeit -s "s = b'x' * (1000**2)" "s.decode('ascii')"
5000 loops, best of 5: 67.8 usec per loop

after:

./python -m timeit -s "s = b'x' * (1000**2)" "s.decode('ascii')"
5000 loops, best of 5: 49.3 usec per loop

On your not real-world microbenchmark it is 28% less runtime.
I work with 300GB+ ASCII files regularly, in bioinformatics. That is quite niche, sure, but it is a real-world work load.

EDIT: I managed to compile with GCC14:

./python -m timeit -s "s = b'x' * (1000**2)" "s.decode('ascii')"
5000 loops, best of 5: 45.2 usec per loop

rhpvorderman · 2024-06-11T17:39:25Z

import io
import sys

if __name__ == "__main__":
    file = sys.argv[1]
    encoding = sys.argv[2]
    block_size = io.DEFAULT_BUFFER_SIZE
    total = 0
    with open(file, "rt", encoding=encoding) as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            total += block.count("\n") 
    print(total)

I compiled python with GCC 14 this for both the main branch and this branch.
before:

Benchmark 1: ./python ../count_newlines.py ~/test/5millionreads_R1.fastq utf-8
  Time (mean ± σ):      1.335 s ±  0.008 s    [User: 0.974 s, System: 0.361 s]
  Range (min … max):    1.324 s …  1.346 s    5 runs
Benchmark 1: ./python ../count_newlines.py ~/test/5millionreads_R1.fastq latin-1
  Time (mean ± σ):      1.333 s ±  0.004 s    [User: 0.989 s, System: 0.344 s]
  Range (min … max):    1.328 s …  1.338 s    5 runs
 Benchmark 1: ./python ../count_newlines.py ~/test/5millionreads_R1.fastq ascii
  Time (mean ± σ):      1.331 s ±  0.003 s    [User: 0.974 s, System: 0.357 s]
  Range (min … max):    1.327 s …  1.333 s    5 runs

After:

Benchmark 1: ./python ../count_newlines.py ~/test/5millionreads_R1.fastq utf-8
  Time (mean ± σ):      1.317 s ±  0.005 s    [User: 0.971 s, System: 0.346 s]
  Range (min … max):    1.313 s …  1.324 s    5 runs

Benchmark 1: ./python ../count_newlines.py ~/test/5millionreads_R1.fastq latin-1
  Time (mean ± σ):      1.290 s ±  0.004 s    [User: 0.935 s, System: 0.353 s]
  Range (min … max):    1.284 s …  1.293 s    5 runs
  
Benchmark 1: ./python ../count_newlines.py ~/test/5millionreads_R1.fastq ascii
  Time (mean ± σ):      1.284 s ±  0.004 s    [User: 0.938 s, System: 0.346 s]
  Range (min … max):    1.280 s …  1.290 s    5 runs

The performance improvement is minor, but measurable about 3%.

picnixz

Here are some comments (since you are making some micro-optimizations, I took the liberty of having perhaps micro-optimization suggestions but I'm not sure whether they're significant or not).

picnixz · 2024-06-13T09:56:14Z

Objects/stringlib/find_max_char.h

@@ -69,13 +91,15 @@ STRINGLIB(find_max_char)(const STRINGLIB_CHAR *begin, const STRINGLIB_CHAR *end)
    Py_UCS4 mask;
    Py_ssize_t n = end - begin;
    const STRINGLIB_CHAR *p = begin;
-    const STRINGLIB_CHAR *unrolled_end = begin + _Py_SIZE_ROUND_DOWN(n, 4);
+    const STRINGLIB_CHAR *unrolled_end = begin + _Py_SIZE_ROUND_DOWN(n, 8);


As in the other PR, maybe the 4/8 choice can be chosen at compile time depending on the architecture.

99% of production work were performance matters will be done on ARM64 (with 16-byte neon vectors) and X86-64 (with SSE2 vectors) other platforms will not be hurt by this decision. I think there is no reason to complicate the build with choices like this.

picnixz · 2024-06-13T09:56:49Z

Objects/stringlib/find_max_char.h

-        STRINGLIB_CHAR bits = p[0] | p[1] | p[2] | p[3];
+        /* Loading 8 values at once allows platforms that have 16-byte vectors
+           to do a vector load and vector bitwise OR. */
+        STRINGLIB_CHAR bits = p[0] | p[1] | p[2] | p[3] | p[4] | p[5] | p[6] | p[7];


and here, you would have some #if arch == ... to choose between 4 or 8 values.

picnixz · 2024-06-13T10:39:25Z

Objects/unicodeobject.c

+            _p += (4 * SIZEOF_SIZE_T);
+            q += (4 * SIZEOF_SIZE_T);
+        }
+        while (_p <= size_t_end) {


You technically have 3 blocks of SIZEOF_SIZE_T until you reach size_t_end. In this case, I'd suggest creating a macro which is the unrolling of that (I'm not sure the compiler knows this or not, but maybe it does).

Good suggestion to check if the compiler unrolls this. After all, I removed the if statement and only check at the end.

Unfortunately this leads to very suboptimal assembly. So the extra while loop is useful here.

picnixz · 2024-06-13T10:41:12Z

Objects/unicodeobject.c

+                }
+                _p += (4 * SIZEOF_SIZE_T);
+            }
+            while (_p <= size_t_end) {


Again, here you have only 3 chunks so maybe check whether you can unroll the loop once again (actually, just check whether the assembly with -O2 or -O3 unrolls stuff).

picnixz · 2024-06-13T10:43:02Z

Objects/stringlib/find_max_char.h

-        STRINGLIB_CHAR bits = p[0] | p[1] | p[2] | p[3];
+        /* Loading 8 values at once allows platforms that have 16-byte vectors
+           to do a vector load and vector bitwise OR. */
+        STRINGLIB_CHAR bits = p[0] | p[1] | p[2] | p[3] | p[4] | p[5] | p[6] | p[7];


I feel that those vector load could be macros, for clarity purposes. It would then be optimized by the compiler but having them as macros might be helpful for future work.

I would to not have a macro, the code is good as it is.

picnixz · 2024-06-13T10:43:33Z

Objects/stringlib/find_max_char.h

+        _p += 1;
+    }
+    p = (const unsigned char *)_p;
+    while (p < _end) {


This can be done in a for-loop instead.

The original code is a while loop. Any particular reason why a for loop is preferable?

I would just say "for clarity" but let's leave it like this if it was the original code (less changes in this sense).

picnixz · 2024-06-13T10:46:16Z

Objects/stringlib/find_max_char.h

@@ -20,23 +20,45 @@ Py_LOCAL_INLINE(Py_UCS4)
 STRINGLIB(find_max_char)(const STRINGLIB_CHAR *begin, const STRINGLIB_CHAR *end)
 {
    const unsigned char *p = (const unsigned char *) begin;
+    const unsigned char *_end = (const unsigned char *)end;
+    const size_t *aligned_end = (const size_t *)(_end - SIZEOF_SIZE_T);


Should be something like

Py_ssize_t n = end - begin; const STRINGLIB_CHAR *p = begin; const STRINGLIB_CHAR *aligned_end = begin + _Py_SIZE_ROUND_DOWN(n, ALIGNOF_SIZE_T);

Would be interested in understanding why you got a segfault though.

vstinner · 2024-06-14T11:47:29Z

Objects/bytes_methods.c

+{
+    const char *p = cptr;
+    const char *end = p + len;
+    Py_ssize_t max_char = stringlib_find_max_char(cptr, end);


Can you please revert this change to make the PR easier to review? You can open a separated PR for that.

Done. #120497

rhpvorderman · 2024-06-14T11:58:07Z

@pitrou, thanks to your comment I simplified the code by only using unaligned loads with memcpy. In general on x86-64 and ARM64 unaligned loads only occur a penalty when it is across a cacheline boundary, but that is 1 cycle of latency. On other platforms I do not worry so much. If unaligned loads are illegal, memcpy won't translate into an unaligned load.

I found that bytes.isascii was not using find_max_char, which is a pity, so I remedied that too.

ascii_decode becomes much simpler when simply using memcpy and not worrying about unaligned loads and stores.

After these changes data volume is not an issue any more, also line by line reading and decoding small amounts of data is faster.

before:

./python -m timeit -s 'data=b"1234567890123456789012345678901"' 'data.decode("latin-1")'
5000000 loops, best of 5: 70.5 nsec per loop

after:

./python -m timeit -s 'data=b"1234567890123456789012345678901"' 'data.decode("latin-1")'
5000000 loops, best of 5: 62.4 nsec per loop

That really is the effect of the unaligned load at the end. Unlike my initial attempt, this also makes line by line reading about 3% faster.

I also notice that for larger volumes latin-1 now tends to outstrip ascii decoding on my PC (very much n=1, I know).

Since latin-1 decoding and ascii decoding have about the same speed, and latin-1 decoding actually being faster for larger volumes , I do wonder if the default latin-1 strategy, which is copying+finding max char, is better than the ascii decode strategy , copying while looking for max char.
In theory there could be a single function: find_non_ascii_char which returns the exact offset of a non_ascii_char. This can then be used by both ascii_decode and find_max_char. That would further reduce the amount of code in the codebase.

This reverts commit 1ce308e.

vstinner · 2024-06-17T10:27:07Z

Objects/stringlib/find_max_char.h

+    const unsigned char *size_t_end = _end - SIZEOF_SIZE_T;
+    const unsigned char *unrolled_end = _end - (4 * SIZEOF_SIZE_T - 1);
+    while (p < unrolled_end) {
+        /* Test chunks of 32 as more granularity limits compiler optimization */


32 what? the number of bytes depend on size_t size in bytes.

Whoops, will write a better comment. This is a "programmer talking to self" one.

Objects/stringlib/find_max_char.h

rhpvorderman · 2024-06-17T11:36:57Z

I have been reflecting a bit on the code and I understand this code change is controversial, it is adding a lot of lines for a limited speed gain. Especially the find_max_char code, while faster, ~~doesn't look much better~~ looks worse.
However I feel that the ascii_decode function that uses unaligned loads is an improvement over the current version. It does not require alignment of both src and dest pointers to use the optimal code path, it is less nested and shorter. I could break that out in a separate PR.

vstinner · 2024-09-10T13:04:41Z

@pitrou:

I find it unlikely that going faster than 17 GB/s will make a meaningful difference on real-world workloads.

I concur with Antoine: I'm not convinced that this PR is needed. It makes the code more complex, so harder to maintain, and I'm not impressed by the benchmark results. It's up to 1.37x faster on a micro-benchmark, ok, but when the workload is only "decode very long ASCII strings".

vstinner · 2024-09-10T13:07:46Z

Objects/stringlib/find_max_char.h

-        if (*p++ & 0x80)
+    const unsigned char *size_t_end = _end - SIZEOF_SIZE_T;
+    const unsigned char *unrolled_end = _end - (4 * SIZEOF_SIZE_T - 1);
+    while (p < unrolled_end) {


I would prefer to start processing bytes until we reach an address aligned of SIZEOF_SIZE_T, and then use size_t* pointers instead of memcpy().

Ok, good to know thanks. I will close another PR that simplifies the code by using this technique to allow unaligned loads:
#123895

rhpvorderman · 2024-09-11T07:02:26Z

I concur with Antoine: I'm not convinced that this PR is needed. It makes the code more complex, so harder to maintain, and I'm not impressed by the benchmark results. It's up to 1.37x faster on #120212 (comment), ok, but when the workload is only "decode very long ASCII strings".

Thanks for elaborating on this. Indeed 3% faster in typical use cases (reading line by line) only becomes notable when processing files that span multiple gigabytes. (I regularly work with those, but that contorts my view on normal workloads).

I will close this then.

vstinner · 2024-09-11T09:51:09Z

I will close this then.

I would suggest you to write an optimized ASCII decoder on PyPI which uses SIMD or something like that to control exactly the instructions and really get the best from the CPU. But using explicitly SIMD is too low-level for CPython :-(

rhpvorderman · 2024-09-11T10:29:19Z

Have already done that 😅 ... I am just backporting some ideas to cpython 😉 . But I will reserve that in the future for things that make a really noticable difference. Like #120398. Rather than using your precious review time for small incremental gains. Sorry for that.

vstinner · 2024-09-11T10:36:07Z

Have already done that 😅

What is the project name?

rhpvorderman · 2024-09-11T10:46:26Z

The project dnaio can be used to parse a format called FASTQ, which is by definition all ASCII.

This project reads 128 KiB buffers and copies ASCII strings from then. (Typical length is 150bp for DNA sequences). Rather than invoking PyUnicode_DecodeASCII which does an individual check per string, the 128KiB buffer is checked in its entirety with this code: https://github.com/marcelm/dnaio/blob/main/src/dnaio/ascii_check.h.
As a result there is a guarantee that any string copied from the buffer is ascii and PyUnicode_New can be used with a memcpy.

The ASCII checking code used SIMD at first, but compilers are clever enough and it is much more portable to write code that can be optimized by the compiler. In this case GCC can optimize the 4 size_t OR calls into 2 __m128i OR calls. Since most CPUs can execute multiple OR operations in one clock tick, the unrolled code is also a lot faster on compilers that do not autovectorize as well.

EDIT: Because everything should be ASCII, there is also no need to provide an early exit within the for loop. When a bigger than ASCII character is found an error should be raised, but that error does not have to be raised within nanoseconds.

rhpvorderman added 6 commits June 7, 2024 11:13

Write an unrolled version of find_max_char

a54bf86

Unroll 8 characters rather than 4 to improve vectorization in find_ma…

71ff457

…x_char

Bigger chunk ascii decoding

98a6449

Fix whitespace issues

10527c6

Fix compiler warnings

f0f4139

Prevent re-read of same value

849a068

bedevere-app bot added the awaiting review label Jun 7, 2024

bedevere-app bot mentioned this pull request Jun 7, 2024

Faster decode and find_max_char implementations. #120196

Closed

rhpvorderman added 2 commits June 7, 2024 12:33

Simplify find_max_char initialization

fad19a0

Add missing increments

f04bb2c

erlend-aasland added the performance Performance or resource usage label Jun 7, 2024

Add blurb

cd0fc5e

eendebakpt reviewed Jun 7, 2024

View reviewed changes

Replace unroll_end with unrolled_end

8f0fd56

eendebakpt reviewed Jun 10, 2024

View reviewed changes

Misc/NEWS.d/next/Core and Builtins/2024-06-07-13-29-57.gh-issue-120196.uf2pIh.rst Outdated Show resolved Hide resolved

rhpvorderman and others added 2 commits June 11, 2024 08:58

Reword burb

37aee7a

Co-authored-by: Pieter Eendebak <[email protected]>

Update Objects/stringlib/find_max_char.h

a6fc417

Co-authored-by: Pieter Eendebak <[email protected]>

eendebakpt reviewed Jun 11, 2024

View reviewed changes

pitrou reviewed Jun 11, 2024

View reviewed changes

Merge branch 'main' into MICROOPTIMIZATIONS

104ca62

picnixz reviewed Jun 13, 2024

View reviewed changes

rhpvorderman added 7 commits June 14, 2024 09:57

Merge branch 'main' into MICROOPTIMIZATIONS

d465517

Reuse find_max_char for bytes objects

1ce308e

Simplify the find_max_char function by loading unaligned

48f1e84

Allow optimized unaligned loads and simplify ascii_decode

21de804

Add loop for more optimal assembly

1ec2113

Also perform an unalgined load at the end

0258ae0

Fix compiler warning

ec76b74

vstinner reviewed Jun 14, 2024

View reviewed changes

Revert "Reuse find_max_char for bytes objects"

845eb4e

This reverts commit 1ce308e.

vstinner reviewed Jun 17, 2024

View reviewed changes

rhpvorderman added 2 commits June 17, 2024 13:22

Merge branch 'main' into MICROOPTIMIZATIONS

f8cc68d

Improve comments

89ab2c9

picnixz mentioned this pull request Aug 13, 2024

Improve performance of find_max_char #122901

Open

Merge branch 'main' into MICROOPTIMIZATIONS

002b7ec

This was referenced Sep 10, 2024

Simplifying ascii-decode #123894

Closed

GH-123894: Simplify ascii decode using unaligned loads #123895

Closed

vstinner reviewed Sep 10, 2024

View reviewed changes

rhpvorderman closed this Sep 11, 2024

rhpvorderman mentioned this pull request Sep 16, 2024

Faster ASCII and possibly UTF-8 decoding of text files with TextIOWrapper #101289

Closed

gh-120196: Faster ascii_decode and find_max_char implementations #120212

gh-120196: Faster ascii_decode and find_max_char implementations #120212

Conversation

rhpvorderman commented Jun 7, 2024 • edited by bedevere-app bot Loading

erlend-aasland commented Jun 7, 2024

eendebakpt commented Jun 7, 2024

rhpvorderman commented Jun 7, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

eendebakpt left a comment

Choose a reason for hiding this comment

rhpvorderman commented Jun 11, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pitrou commented Jun 11, 2024

Fidget-Spinner commented Jun 11, 2024

rhpvorderman commented Jun 11, 2024

pitrou commented Jun 11, 2024

rhpvorderman commented Jun 11, 2024 • edited Loading

rhpvorderman commented Jun 11, 2024

picnixz left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rhpvorderman commented Jun 14, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rhpvorderman commented Jun 17, 2024 • edited Loading

vstinner commented Sep 10, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rhpvorderman commented Sep 11, 2024

vstinner commented Sep 11, 2024

rhpvorderman commented Sep 11, 2024

vstinner commented Sep 11, 2024

rhpvorderman commented Sep 11, 2024 • edited Loading

rhpvorderman commented Jun 7, 2024 •

edited by bedevere-app bot

Loading

rhpvorderman commented Jun 7, 2024 •

edited

Loading

rhpvorderman commented Jun 11, 2024 •

edited

Loading

rhpvorderman commented Jun 17, 2024 •

edited

Loading

rhpvorderman commented Sep 11, 2024 •

edited

Loading