gh-130167: Optimise textwrap.dedent()
#131919
Conversation
Co-authored-by: Marius Juston <[email protected]>
Open question: do we want to restrict the removed whitespace to just tabs and spaces, and update the documentation as appropriate, or is this change acceptable? Previously, there was inconsistent behaviour around Windows-style newlines, which this PR resolves.
Misc/NEWS.d/next/Library/2025-03-30-19-55-10.gh-issue-131792.NNjzFA.rst
Done, and I reran the benchmarks, now showing a 2.7x improvement.
This behaviour change is also present in the PR by @Marius-Juston. I have no strong opinion on whether the behaviour change is acceptable, but it should be in the news entry (and perhaps even in What's New?).
Lib/textwrap.py
```python
if not text:
    return text

# If the input is entirely whitespace, return normalized lines.
if text.isspace():
    return '\n' * text.count('\n')
```
The paths are here because otherwise `non_blank_lines` is empty (not for performance, I would say; the cases are rare).
It is possible, though, to get rid of these cases by writing:

```python
def dedent(text):
    lines = text.split('\n')
    # Get length of leading whitespace, inspired by ``os.path.commonprefix()``.
    non_blank_lines = [l for l in lines if l and not l.isspace()]
    l1 = min(non_blank_lines, default='hello')
    l2 = max(non_blank_lines, default='world')
    for margin, c in enumerate(l1):
        if c != l2[margin] or c not in ' \t':
            break
    return '\n'.join([l[margin:] if not l.isspace() else ''
                      for l in lines])
```
Not sure whether this makes the code nice and small, or part of some obfuscated code contest :-)
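The trick relies on a property of lexicographic ordering: any prefix shared by the lexicographically smallest and largest lines is shared by every line in between, so comparing just the `min` and `max` is enough. A quick illustration (the sample lines are made up):

```python
import os.path

# Three lines with a common two-space margin but different deeper indents.
lines = ["    alpha", "  beta", "      gamma"]

# The lexicographic min and max bound all the other lines, so their
# common prefix equals the common prefix of the whole list.
lo, hi = min(lines), max(lines)
assert os.path.commonprefix([lo, hi]) == os.path.commonprefix(lines) == "  "
```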
Using:

```python
def dedent(text):
    lines = text.split('\n')
    non_blank_lines = [l for l in lines if l and not l.isspace()]
    l1 = min(non_blank_lines, default='')
    l2 = max(non_blank_lines, default='')
    margin = 0
    for margin, c in enumerate(l1):
        if c != l2[margin] or c not in ' \t':
            break
    return '\n'.join([l[margin:] if not l.isspace() else '' for l in lines])
```
I get the following benchmark results:
| Benchmark | textwrap | with whitespace checks | proposed (no branches) |
|---|---|---|---|
| raw_text: no prefix | 12.1 ms | 4.81 ms: 2.52x faster | 4.71 ms: 2.58x faster |
| raw_text: "abc \t" | 12.7 ms | 4.81 ms: 2.63x faster | 4.50 ms: 2.81x faster |
| raw_text: " " | 19.4 ms | 5.77 ms: 3.37x faster | 5.67 ms: 3.43x faster |
| raw_text: "\t" | 18.4 ms | 5.66 ms: 3.25x faster | 5.54 ms: 3.32x faster |
| raw_text: " \t abc" | 21.0 ms | 5.75 ms: 3.65x faster | 5.42 ms: 3.88x faster |
| raw_text: " \t abc \t " | 24.1 ms | 6.12 ms: 3.94x faster | 5.76 ms: 4.18x faster |
| raw_text: 1000 spaces | 338 ms | 43.5 ms: 7.76x faster | 42.7 ms: 7.91x faster |
| Basic indented text with empty lines | 4.66 us | 2.92 us: 1.60x faster | 3.08 us: 1.52x faster |
| Text with mixed indentation and blank lines | 4.28 us | 2.99 us: 1.43x faster | 3.07 us: 1.39x faster |
| No indentation (edge case) | 2.66 us | 1.95 us: 1.37x faster | 2.05 us: 1.30x faster |
| Only blank lines | 1.09 us | 299 ns: 3.64x faster | 1.56 us: 1.43x slower |
| Edge case: No common prefix to remove | 1.79 us | 1.71 us: 1.05x faster | not significant |
| Edge case: Single indented line | 2.90 us | 1.67 us: 1.74x faster | 1.68 us: 1.73x faster |
| Edge case: Single indented line only | 704 ns | 260 ns: 2.71x faster | 1.00 us: 1.42x slower |
| Edge case: Empty text | 478 ns | 130 ns: 3.67x faster | 1.01 us: 2.11x slower |
| no_indent | 10.4 us | 3.68 us: 2.81x faster | 3.97 us: 2.61x faster |
| spaces | 18.2 us | 4.76 us: 3.81x faster | 4.57 us: 3.98x faster |
| mixed | 27.5 us | 18.0 us: 1.53x faster | 16.9 us: 1.63x faster |
| large_text | 379 us | 109 us: 3.47x faster | 117 us: 3.23x faster |
| whitespace_only | 1.23 us | 299 ns: 4.12x faster | 1.40 us: 1.14x slower |
| Geometric mean | (ref) | 2.69x faster | 1.94x faster |
Notably, the synthetic benchmarks contain quite a lot of whitespace-only cases: `Only blank lines`, `Edge case: Single indented line only`, `Edge case: Empty text`, and `whitespace_only`. The text-heavy benchmarks don't seem to change, and some show slight improvements.
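As a sanity check (an editor's sketch, not part of the thread's benchmarks), the branch-free version above can be compared against the current `textwrap.dedent` on inputs where the two should agree (no `\r` characters, to sidestep the Windows-newline question raised earlier):

```python
import textwrap

def dedent_nobranch(text):
    # The branch-free version proposed above.
    lines = text.split('\n')
    non_blank_lines = [l for l in lines if l and not l.isspace()]
    l1 = min(non_blank_lines, default='')
    l2 = max(non_blank_lines, default='')
    margin = 0
    for margin, c in enumerate(l1):
        if c != l2[margin] or c not in ' \t':
            break
    return '\n'.join([l[margin:] if not l.isspace() else '' for l in lines])

samples = [
    "",
    "no indent\n",
    "    alpha\n\n        beta\n    gamma\n",
    "\tabc\n\t\tdef\n",
    "  a\n \t \n  b",
]
for s in samples:
    assert dedent_nobranch(s) == textwrap.dedent(s)
```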
This is looking really good! I am having a lot of difficulty improving this further. The only other optimisations I can think of are performing the `min`/`max` computation while building `non_blank_lines`, and caching the result of `not l.isspace()`, since it is computed twice. I was trying to implement something like that but did not have much success, since the list comprehension is C-optimised, as are the `min` and `max` operations.
Caching would only make sense for lines that are very long and that we need to iterate over multiple times. For caching, we could either use a custom str subclass to cache the call, or cache the results themselves in another list and use it as a mask (as with NumPy arrays), but I don't know whether we'd gain anything (we would need to test that).
```python
isspace = [l.isspace() for l in lines]
non_blank_lines = [lines[i] for i, v in enumerate(isspace) if not v]
...
return '\n'.join([l[margin:] if not isspace[i] else '' for i, l in enumerate(lines)])
```
I'm almost certain we can compute `isspace` and `non_blank_lines` simultaneously using some `itertools` recipes, but I don't know how much we would gain. In the worst case, we need to iterate over all the lines at least 3 + 3 times (min/max + computing the margin). Currently, we already iterate over the lines at least 1 + 2 + 1 + 1 times (1 for `non_blank_lines`, 2 for `min`/`max`, 1 for the margin, and 1 for the final merge), so we're already doing lots of passes.
I benchmarked a version using `itertools.compress`, and found it was slower overall (1.1x):
```python
from itertools import compress

# Get length of leading whitespace, inspired by ``os.path.commonprefix()``.
non_blank_idx = [l and not l.isspace() for l in lines]
non_blank_lines = list(compress(lines, non_blank_idx))
l1 = min(non_blank_lines, default='')
l2 = max(non_blank_lines, default='')
margin = 0
for margin, c in enumerate(l1):
    if c != l2[margin] or c not in ' \t':
        break
return '\n'.join([l[margin:] if non_blank_idx[i] else '' for i, l in enumerate(lines)])
```
I suspect that the `isspace()` calls are quite cheap, as I assume they exit on the first non-space character. I also prefer the simplicity of the current implementation. What would help is a C-level min-max function that returns both the min and max from one iteration, but that's out of scope for this PR.
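For reference, the one-iteration min-max idea could be prototyped in pure Python before attempting a C implementation (a hypothetical helper, not an existing stdlib function):

```python
def minmax(iterable, *, default=None):
    """Return (min, max) of iterable in a single pass (sketch)."""
    it = iter(iterable)
    try:
        lo = hi = next(it)
    except StopIteration:
        # Empty input: mirror min()/max() 'default' behaviour.
        return default, default
    for x in it:
        if x < lo:
            lo = x
        elif x > hi:
            hi = x
    return lo, hi

assert minmax([3, 1, 4, 1, 5]) == (1, 5)
assert minmax([], default='') == ('', '')
```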
> they exit on the first non-space character

Yes.
> What would help is a C-level min-max function that returns both the min and max from one iteration, but that's out of scope for this PR.

Agreed! I was just thinking about potentially creating this function. If I were to try this, do you happen to know where the `min` and `max` implementations are in this repo? It should be pretty simple to implement the min-max version from there.
@AA-Turner, would it be possible to get your benchmarking script, please, so that I can use it as a comparison?
```shell
$ python bm_dedent.py --fast --quiet --warmups 3 --output <filename>.json
$ python -m pyperf compare_to <old>.json <new>.json --table --table-format md
```
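The `bm_dedent.py` script itself is not shown in the thread; a rough stand-in for a quick local timing (illustrative only, unlike the `pyperf`-based runs above, and with a made-up sample input) could use `timeit`:

```python
import timeit
import textwrap

# Hypothetical sample workload; the thread's real benchmark inputs differ.
sample = "    alpha\n\n        beta\n    gamma\n" * 100

elapsed = timeit.timeit(lambda: textwrap.dedent(sample), number=1_000)
print(f"dedent x1000: {elapsed:.4f}s")
```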
Thanks all!

A
This changes the exception type when a bad type is passed.

Before:

```pycon
>>> import textwrap
>>> textwrap.dedent(5)
Traceback (most recent call last):
  File "<python-input-1>", line 1, in <module>
    textwrap.dedent(5)
    ~~~~~~~~~~~~~~~^^^
  File "/usr/lib64/python3.14/textwrap.py", line 435, in dedent
    text = _whitespace_only_re.sub('', text)
TypeError: expected string or bytes-like object, got 'int'
```

After:

```pycon
>>> import textwrap
>>> textwrap.dedent(5)
Traceback (most recent call last):
  File "<python-input-1>", line 1, in <module>
    textwrap.dedent(5)
    ~~~~~~~~~~~~~~~^^^
  File "/usr/lib64/python3.14/textwrap.py", line 432, in dedent
    lines = text.split('\n')
            ^^^^^^^^^^
AttributeError: 'int' object has no attribute 'split'
```

The exception context itself could be better in both cases, but arguably `TypeError` is better here. This breaks a test assumption in https://github.com/laurent-laporte-pro/deprecated/blob/3f67ca71143479ef7ce71786538b2bc0ef9ee4b4/tests/test_deprecated.py#L201 (originally from #107369 (comment)).
Co-authored-by: Marius Juston <[email protected]>
Co-authored-by: Pieter Eendebak <[email protected]>
Co-authored-by: Bénédikt Tran <[email protected]>
Oh, that's indeed an oversight. `TypeError` is a better error, so let's wrap the call and re-raise a `TypeError` instead if we catch an `AttributeError`. cc @AA-Turner
I'll prepare a PR.
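The wrapping mentioned above might look roughly like this (an editor's sketch combining it with the branch-free version from earlier in the thread, not the actual follow-up PR):

```python
def dedent(text):
    # Re-raise the AttributeError from non-str inputs as a TypeError,
    # matching the exception type of the old regex-based implementation.
    try:
        lines = text.split('\n')
    except AttributeError:
        msg = f'expected str object, not {type(text).__qualname__!r}'
        raise TypeError(msg) from None
    non_blank_lines = [l for l in lines if l and not l.isspace()]
    l1 = min(non_blank_lines, default='')
    l2 = max(non_blank_lines, default='')
    margin = 0
    for margin, c in enumerate(l1):
        if c != l2[margin] or c not in ' \t':
            break
    return '\n'.join([l[margin:] if not l.isspace() else '' for l in lines])
```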
Alternative to #131792, created as a new PR as I don't want to push commits to @Marius-Juston's main branch, which that PR is created from.