improve case conversion happy path #97046
Conversation
Hey! It looks like you've submitted a new PR for the library teams!
(rust-highfive has picked a reviewer for you, use r? to override)
Force-pushed from 6e85b6a to 49657b1
I have benchmarked this on Unicode-heavy texts too (a 3 MB Russian text file) and have found marginally better performance with this change.
So I am not worried about the initial ASCII checks being a negative for performance in strictly non-ASCII contexts.
Could you maybe try German or something like that, and put umlauts somewhere at the end? (Or just use random garbled text where the start is only ASCII and the end contains Unicode.)
This is basically what my original benchmarks do. They feature 2 copies of Macbeth: one untouched, only ASCII; the other with 16 bytes of non-ASCII at the end. This was to test how well it handles pure ASCII with the seamless break-out.
Performance is expected to be somewhere in the middle if the text contains a Unicode character halfway through, which is exactly what we see in the following results:
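The mixed benchmark input described above can be reconstructed roughly like this (the text and sizes are illustrative, not the benchmark's actual data):

```rust
// Illustrative construction of the mixed input: a long ASCII body with
// 16 bytes of non-ASCII appended at the end, so a byte-wise fast path
// would run almost to completion before falling back.
fn main() {
    let ascii_body = "To be, or not to be, that is the question. ".repeat(100);
    let suffix = "éééééééé"; // 8 two-byte chars = 16 bytes of non-ASCII
    let input = format!("{}{}", ascii_body, suffix);

    assert!(input[..ascii_body.len()].is_ascii());
    assert_eq!(suffix.len(), 16); // byte length, not char count
    assert!(!input.is_ascii());
    println!("ok");
}
```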
r? @thomcc
Nice benches, but what about short strings?
Fair point. While still maybe unnatural, I ran the following benchmark, `ascii.split('.').map(str::to_lowercase).collect::<Vec<_>>()`, over my large text files; this way it's tested more against shorter, random-length strings. I got the following results:
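For reference, a rough sketch of that short-string benchmark's shape (`lowercase_sentences` is an illustrative name, not the benchmark's actual code):

```rust
// Split a large text on '.' so the case conversion runs over many
// shorter, random-length strings instead of one long buffer.
fn lowercase_sentences(text: &str) -> Vec<String> {
    text.split('.').map(str::to_lowercase).collect()
}

fn main() {
    let out = lowercase_sentences("First SENTENCE. Second ONE");
    assert_eq!(out, vec!["first sentence", " second one"]);
    println!("ok");
}
```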
This most likely would improve by reducing the magic number. Reducing it from 16 to 8, I got minimal improvements (-25%) on the small-string bench, but a significant regression (+90%) on the large-string bench.
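A hypothetical sketch of what such a chunked ASCII scan looks like, with the chunk size as the magic number in question (illustrative only, not the PR's actual implementation):

```rust
// Scan CHUNK bytes at a time: a bigger chunk amortizes loop overhead on
// long ASCII inputs, a smaller one breaks out sooner on short/mixed ones.
const CHUNK: usize = 16; // the "magic number" under discussion

fn ascii_prefix_len(bytes: &[u8]) -> usize {
    let mut i = 0;
    while i + CHUNK <= bytes.len() {
        // Whole-chunk check: easy for the compiler to vectorize.
        if !bytes[i..i + CHUNK].iter().all(|b| b.is_ascii()) {
            break;
        }
        i += CHUNK;
    }
    // Finish byte-by-byte: the tail, or the chunk containing non-ASCII.
    while i < bytes.len() && bytes[i].is_ascii() {
        i += 1;
    }
    i
}

fn main() {
    assert_eq!(ascii_prefix_len(b"plain ascii"), 11);
    assert_eq!(ascii_prefix_len("abcé".as_bytes()), 3);
    println!("ok");
}
```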
This mostly looks okay to me, and I expect it to be an improvement, but I'm not sure that uppercasing strings over 100 bytes long is actually the common case.
Do you have benchmarks of the current version?
Long ASCII string
Long string with Unicode halfway through
Short ASCII strings
This looks good to me, then. No rollup because of possible perf impact if things change cases. @bors r+ rollup=never
Wait, maybe squash the 10 commits?
bors sleep. |
Force-pushed from 43d1d20 to d0f9930
@bors r+ rollup=never |
📌 Commit d0f9930 has been approved by |
☀️ Test successful - checks-actions |
Finished benchmarking commit (1851f08): comparison url.
Instruction count / Max RSS (memory usage) / Cycles: this benchmark run did not return any relevant results for these metrics. If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf. @rustbot label: -perf-regression
Someone shared the source code for Go's string case conversion.
It features a hot path for ASCII-only strings (although I assume for reasons specific to Go, they've opted for a read-safe hot loop).
I've borrowed these ideas and also kept our existing code to provide a fast path + a seamless, UTF-8-correct fallback path.
(Naive) benchmarks can be found here: https://github.com/conradludgate/case-conv
For the cases where non-ASCII is found near the start, the performance of this algorithm falls back to the original speeds, with no measurable speed loss.
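The overall idea can be sketched as follows: lowercase bytes while the input stays ASCII, then hand the remainder to the char-by-char, Unicode-correct path (a simplified illustration, not the PR's actual code):

```rust
// Minimal sketch of the "ASCII happy path" idea: convert the ASCII
// prefix byte-wise, then fall back to full Unicode handling.
fn to_lowercase_sketch(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    // Find where the ASCII prefix ends (whole string if it's all ASCII).
    let split = s.bytes().position(|b| !b.is_ascii()).unwrap_or(s.len());
    let (head, tail) = s.split_at(split);
    // Fast path: ASCII bytes can be lowercased byte-for-byte.
    out.push_str(&head.to_ascii_lowercase());
    // Seamless fallback: Unicode-correct conversion for the rest.
    for c in tail.chars() {
        out.extend(c.to_lowercase());
    }
    out
}

fn main() {
    assert_eq!(to_lowercase_sketch("Hello, WORLD"), "hello, world");
    assert_eq!(to_lowercase_sketch("Grüße, WELT"), "grüße, welt");
    println!("ok");
}
```

The real fast path additionally has to deal with the rare ASCII-input cases where `char::to_lowercase` and byte-wise lowercasing could diverge in length, which this sketch glosses over.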