
gh-130167: Optimise textwrap.dedent() #131919


Merged · 11 commits · Mar 31, 2025

Conversation

@AA-Turner (Member) commented Mar 30, 2025

Alternative to #131792, created as a new PR as I don't want to push commits to @Marius-Juston's main branch, which that PR is created from.


@AA-Turner (Member, Author) commented Mar 30, 2025

Open question: by switching to str.isspace() over the regular expression approach, we now strip '\v', '\f', and '\r'. The documentation notes that the function "Remove[s] any common leading whitespace from every line in text", so this could be seen as a bugfix, but it is an observable change in behaviour.

Do we want to restrict the removed whitespace to just tabs and spaces, and update the documentation as appropriate, or is this change acceptable? Previously, there was inconsistent behaviour around Windows-style newlines, which this PR does resolve.
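
To make the observable difference concrete, here is one input where the two approaches disagree. This is a sketch based on my reading of the two implementations; the reprs in the comments are what I would expect, not captured output:

import textwrap

sample = "    foo\n\f\n    bar\n"
print(repr(textwrap.dedent(sample)))
# Previous regex-based dedent(): the '\f' line has no [ \t] prefix, so it is
# treated as a content line with zero indentation, no common margin is found,
# and the text comes back unchanged: '    foo\n\f\n    bar\n'
# isspace()-based dedent() (this PR): the '\f' line counts as blank, is
# normalised to '', and the common four-space margin is removed: 'foo\n\nbar\n'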

| Benchmark | dedent_textwrap | dedent_newer |
|---|---|---|
| raw_text: no prefix | 12.1 ms | 4.81 ms: 2.52x faster |
| raw_text: "abc \t" | 12.7 ms | 4.81 ms: 2.63x faster |
| raw_text: " " | 19.4 ms | 5.77 ms: 3.37x faster |
| raw_text: "\t" | 18.4 ms | 5.66 ms: 3.25x faster |
| raw_text: "\t abc" | 21.0 ms | 5.75 ms: 3.65x faster |
| raw_text: "\t abc \t " | 24.1 ms | 6.12 ms: 3.94x faster |
| raw_text: 1000 spaces | 338 ms | 43.5 ms: 7.76x faster |
| Basic indented text with empty lines | 4.66 us | 2.92 us: 1.60x faster |
| Text with mixed indentation and blank lines | 4.28 us | 2.99 us: 1.43x faster |
| No indentation (edge case) | 2.66 us | 1.95 us: 1.37x faster |
| Only blank lines | 1.09 us | 299 ns: 3.64x faster |
| Edge case: No common prefix to remove | 1.79 us | 1.71 us: 1.05x faster |
| Edge case: Single indented line | 2.90 us | 1.67 us: 1.74x faster |
| Edge case: Single indented line only | 704 ns | 260 ns: 2.71x faster |
| Edge case: Empty text | 478 ns | 130 ns: 3.67x faster |
| no_indent | 10.4 us | 3.68 us: 2.81x faster |
| spaces | 18.2 us | 4.76 us: 3.81x faster |
| mixed | 27.5 us | 18.0 us: 1.53x faster |
| large_text | 379 us | 109 us: 3.47x faster |
| whitespace_only | 1.23 us | 299 ns: 4.12x faster |
| Geometric mean | (ref) | 2.69x faster |


@picnixz (Member) commented Mar 30, 2025

Can you use python -m pyperf compare_to dedent_textwrap.json dedent_new.json --table so that we have a markdown table? (I think that's the option.)

@AA-Turner (Member, Author)

> Can you use python -m pyperf compare_to dedent_textwrap.json dedent_new.json --table so that we have a markdown table (I think that's the option)

Done, and I reran the benchmarks, now showing a 2.7x improvement.

@eendebakpt (Contributor)

> Open question: by switching to str.isspace() over the regular expression approach, we now strip '\v', '\f', and '\r'. The documentation notes that the function "Remove[s] any common leading whitespace from every line in text", so this could be seen as a bugfix, but it is an observable change in behaviour.

This behaviour change is also in the PR by @Marius-Juston. I have no strong opinion on whether the behaviour change is acceptable or not, but it should be in the news entry (and perhaps even in What's New?).

Lib/textwrap.py (Outdated), review comment on lines 429 to 434:
if not text:
    return text

# If the input is entirely whitespace, return normalized lines.
if text.isspace():
    return '\n' * text.count('\n')
Contributor:

These paths are here because otherwise non_blank_lines is empty (not for performance, I would say; the cases are rare).

It is possible, though, to get rid of these cases by writing:

def dedent(text):
    lines = text.split('\n')

    # Get length of leading whitespace, inspired by ``os.path.commonprefix()``.
    non_blank_lines = [l for l in lines if l and not l.isspace()]
    l1 = min(non_blank_lines, default='hello')
    l2 = max(non_blank_lines, default='world')
    for margin, c in enumerate(l1):
        if c != l2[margin] or c not in ' \t':
            break

    return '\n'.join([l[margin:] if not l.isspace() else ''
                      for l in lines])

Not sure whether this makes the code nice and small, or part of some obfuscated code contest :-)
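
As an aside, the reason comparing only the min() and max() of the non-blank lines is enough (the os.path.commonprefix() trick) is that the lexicographically smallest and largest lines bound all the others, so any prefix shared by those two is shared by every line. A small illustration with made-up strings:

lines = ['    spam', '      eggs', '    ham']
lo, hi = min(lines), max(lines)          # '      eggs', '    spam'
# lo and hi first differ at index 4, so the common prefix of *all* lines
# is lo[:4], i.e. four spaces.
assert all(l.startswith(lo[:4]) for l in lines)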

@AA-Turner (Member, Author) commented Mar 30, 2025

Using:

def dedent(text):
    lines = text.split('\n')
    non_blank_lines = [l for l in lines if l and not l.isspace()]
    l1 = min(non_blank_lines, default='')
    l2 = max(non_blank_lines, default='')
    margin = 0
    for margin, c in enumerate(l1):
        if c != l2[margin] or c not in ' \t':
            break

    return '\n'.join([l[margin:] if not l.isspace() else '' for l in lines])

I get the following benchmark results:

| Benchmark | textwrap | with whitespace checks | proposed (no branches) |
|---|---|---|---|
| raw_text: no prefix | 12.1 ms | 4.81 ms: 2.52x faster | 4.71 ms: 2.58x faster |
| raw_text: "abc \t" | 12.7 ms | 4.81 ms: 2.63x faster | 4.50 ms: 2.81x faster |
| raw_text: " " | 19.4 ms | 5.77 ms: 3.37x faster | 5.67 ms: 3.43x faster |
| raw_text: "\t" | 18.4 ms | 5.66 ms: 3.25x faster | 5.54 ms: 3.32x faster |
| raw_text: " \t abc" | 21.0 ms | 5.75 ms: 3.65x faster | 5.42 ms: 3.88x faster |
| raw_text: " \t abc \t " | 24.1 ms | 6.12 ms: 3.94x faster | 5.76 ms: 4.18x faster |
| raw_text: 1000 spaces | 338 ms | 43.5 ms: 7.76x faster | 42.7 ms: 7.91x faster |
| Basic indented text with empty lines | 4.66 us | 2.92 us: 1.60x faster | 3.08 us: 1.52x faster |
| Text with mixed indentation and blank lines | 4.28 us | 2.99 us: 1.43x faster | 3.07 us: 1.39x faster |
| No indentation (edge case) | 2.66 us | 1.95 us: 1.37x faster | 2.05 us: 1.30x faster |
| Only blank lines | 1.09 us | 299 ns: 3.64x faster | 1.56 us: 1.43x slower |
| Edge case: No common prefix to remove | 1.79 us | 1.71 us: 1.05x faster | not significant |
| Edge case: Single indented line | 2.90 us | 1.67 us: 1.74x faster | 1.68 us: 1.73x faster |
| Edge case: Single indented line only | 704 ns | 260 ns: 2.71x faster | 1.00 us: 1.42x slower |
| Edge case: Empty text | 478 ns | 130 ns: 3.67x faster | 1.01 us: 2.11x slower |
| no_indent | 10.4 us | 3.68 us: 2.81x faster | 3.97 us: 2.61x faster |
| spaces | 18.2 us | 4.76 us: 3.81x faster | 4.57 us: 3.98x faster |
| mixed | 27.5 us | 18.0 us: 1.53x faster | 16.9 us: 1.63x faster |
| large_text | 379 us | 109 us: 3.47x faster | 117 us: 3.23x faster |
| whitespace_only | 1.23 us | 299 ns: 4.12x faster | 1.40 us: 1.14x slower |
| Geometric mean | (ref) | 2.69x faster | 1.94x faster |

Notably, the synthetic benchmarks contain quite a lot of whitespace-only cases (Only blank lines, Edge case: Single indented line only, Edge case: Empty text, and whitespace_only), which are exactly the rows that regress without the early-return branches. The text-heavy benchmarks don't change meaningfully, and some show slight improvements.

Contributor:

This is looking really good! I am having a lot of difficulty improving this further. The only other optimisations I can think of are performing the min/max computation while building non_blank_lines, and caching the result of not l.isspace(), since it is computed twice. I tried to implement something like that, but did not have much success, since the list comprehension is C-optimised, as are the min and max operations.

Member:

Caching would only make sense for lines that are very long and that we need to iterate over multiple times. For caching, we could either use a custom str subclass to cache the call, or cache the results in another list and use it as a mask (as with NumPy arrays), but I don't know whether we would gain anything (we would need to test that).

isspace = [l.isspace() for l in lines]
non_blank_lines = [lines[i] for i, v in enumerate(isspace) if not v]
...
return '\n'.join([l[margin:] if not isspace[i] else '' for i, l in enumerate(lines)])

I'm almost certain we can compute isspace and non_blank_lines simultaneously using some itertools recipes, but I don't know how much we would gain. In the worst case, we need to iterate over all the lines at least 3 + 3 times (min/max plus computing the margin). Currently, we already iterate over the lines at least 1 + 2 + 1 + 1 times (1 for non_blank_lines, 2 for min/max, 1 for the margin, and 1 for the final merge), so we're already doing a lot of passes.
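
For what it's worth, a plain one-pass version of that idea (no itertools; the names are illustrative only) could look like this fragment of the dedent() body:

blank = []
non_blank_lines = []
for l in lines:
    is_blank = not l or l.isspace()   # computed once per line
    blank.append(is_blank)
    if not is_blank:
        non_blank_lines.append(l)
# ... min/max and margin computation as above ...
return '\n'.join([l[margin:] if not blank[i] else '' for i, l in enumerate(lines)])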

@AA-Turner (Member, Author):

I benchmarked a version using itertools.compress, and found it was slower overall (1.1x):

    # Fragment of the dedent() body; needs ``from itertools import compress``.
    # Get length of leading whitespace, inspired by ``os.path.commonprefix()``.
    non_blank_idx = [l and not l.isspace() for l in lines]
    non_blank_lines = list(compress(lines, non_blank_idx))
    l1 = min(non_blank_lines, default='')
    l2 = max(non_blank_lines, default='')
    margin = 0
    for margin, c in enumerate(l1):
        if c != l2[margin] or c not in ' \t':
            break
    return '\n'.join([l[margin:] if non_blank_idx[i] else '' for i, l in enumerate(lines)])

I suspect that the isspace() calls are quite cheap, as I assume they exit on the first non-space character. I also prefer the simplicity of the current implementation. What would help is a C-level min-max function that returns both the min and max from one iteration, but that's out of scope for this PR.
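
For illustration, a pure-Python sketch of such a single-pass helper (minmax is a hypothetical name; any real gain would need a C implementation):

def minmax(iterable, *, default=None):
    """Return (smallest, largest) from a single pass over *iterable*."""
    it = iter(iterable)
    try:
        lo = hi = next(it)
    except StopIteration:
        return default, default
    for item in it:
        if item < lo:
            lo = item
        elif item > hi:
            hi = item
    return lo, hi

# Usage in dedent() would then be: l1, l2 = minmax(non_blank_lines, default='')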

Member:

> they exit on the first non-space character

Yes.

Contributor:

> What would help is a C-level min-max function that returns both the min and max from one iteration, but that's out of scope for this PR.

Agreed! I was just thinking about potentially creating this function. If I were to try this, do you happen to know where the min and max implementations live in this repo? It should be pretty simple to implement the min-max version from that.

@Marius-Juston (Contributor)

@AA-Turner, would it be possible to get your benchmarking script, please, so that I can use that as a comparison?

@AA-Turner (Member, Author) commented Mar 30, 2025

$ python bm_dedent.py --fast --quiet --warmups 3 --output <filename>.json
$ python -m pyperf compare_to <old>.json <new>.json --table --table-format md
bm_dedent.py
from textwrap import indent, dedent as textwrap_dedent

import pyperf

def setup_cases():
    with open("Objects/unicodeobject.c") as f:
        raw_text = f.read()

    return [
        ('raw_text: no prefix', raw_text),
        (r'raw_text: "abc  \t"', indent(raw_text, 'abc  \t')),
        ('raw_text: "    "', indent(raw_text, '    ')),
        (r'raw_text: "\t"', indent(raw_text, '\t')),
        (r'raw_text: " \t  abc"', indent(raw_text, ' \t  abc')),
        (r'raw_text: " \t  abc  \t  "', indent(raw_text, ' \t  abc  \t  ')),
        ("raw_text: 1000 spaces", indent(raw_text, ' ' * 1000)),
        (
            'Basic indented text with empty lines',
            """
        def hello():
            print("Hello, world!")


        """,
        ),
        (
            'Text with mixed indentation and blank lines',
            """
        This is a test.

        Another line.

    """,
        ),
        (
            'No indentation (edge case)',
            """No indents here.
    Just normal text.

    With empty lines.""",
        ),
        (
            'Only blank lines',
            """


    """,
        ),
        (
            'Edge case: No common prefix to remove',
            """hello
        world
    """,
        ),
        (
            'Edge case: Single indented line',
            """    Only one indented line""",
        ),
        (
            'Edge case: Single indented line only',
            """    """,
        ),
        (
            'Edge case: Empty text',
            """""",
        ),
        ('no_indent',textwrap_dedent(textwrap_dedent.__doc__)),
        ('spaces','\n'.join(['   ' + l for l in textwrap_dedent(textwrap_dedent.__doc__).split('\n')])),
        ('mixed','\n'.join(['   Hello space', '\tHello tab', 'Hello'] * 20)),
        ('large_text','\n'.join([textwrap_dedent.__doc__] * 40)),
        ('whitespace_only','\n'.join(['  ', '\t', ''])),
    ]


if __name__ == '__main__':
    runner = pyperf.Runner()
    args = runner.parse_args()

    for name, text in setup_cases():
        runner.bench_func(name or 'default', textwrap_dedent, text)

@AA-Turner (Member, Author)

Thanks all!


@AA-Turner AA-Turner enabled auto-merge (squash) March 31, 2025 00:12
@AA-Turner AA-Turner changed the title gh-131791: Optimise textwrap.dedent() gh-130167: Optimise textwrap.dedent() Mar 31, 2025
@AA-Turner AA-Turner disabled auto-merge March 31, 2025 00:13
@AA-Turner AA-Turner enabled auto-merge (squash) March 31, 2025 00:14
@AA-Turner AA-Turner merged commit 6aa88a2 into python:main Mar 31, 2025
39 checks passed
@AA-Turner AA-Turner deleted the tw-dedent branch March 31, 2025 00:41
@hroncok (Contributor) commented Apr 15, 2025

This changes the exception type when a bad type is passed:

Before:

>>> import textwrap
>>> textwrap.dedent(5)
Traceback (most recent call last):
  File "<python-input-1>", line 1, in <module>
    textwrap.dedent(5)
    ~~~~~~~~~~~~~~~^^^
  File "/usr/lib64/python3.14/textwrap.py", line 435, in dedent
    text = _whitespace_only_re.sub('', text)
TypeError: expected string or bytes-like object, got 'int'

After:

>>> import textwrap
>>> textwrap.dedent(5)
Traceback (most recent call last):
  File "<python-input-1>", line 1, in <module>
    textwrap.dedent(5)
    ~~~~~~~~~~~~~~~^^^
  File "/usr/lib64/python3.14/textwrap.py", line 432, in dedent
    lines = text.split('\n')
            ^^^^^^^^^^
AttributeError: 'int' object has no attribute 'split'

The exception context itself could be better in both cases, but arguably TypeError is better here. This breaks a test assumption in https://github.com/laurent-laporte-pro/deprecated/blob/3f67ca71143479ef7ce71786538b2bc0ef9ee4b4/tests/test_deprecated.py#L201

(Originally from #107369 (comment))

seehwan pushed a commit to seehwan/cpython that referenced this pull request Apr 16, 2025
Co-authored-by: Marius Juston <[email protected]>
Co-authored-by: Pieter Eendebak <[email protected]>
Co-authored-by: Bénédikt Tran <[email protected]>
@picnixz (Member) commented Apr 17, 2025

Oh, that's indeed an oversight. TypeError is the better error, so let's wrap the call and re-raise a TypeError if we catch an AttributeError.
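
A minimal sketch of that shape (the exact message wording is just an assumption, not the actual follow-up PR):

def dedent(text):
    try:
        lines = text.split('\n')
    except AttributeError:
        # Non-string input: report a TypeError as before, hiding the AttributeError.
        msg = f'expected a str object, not {type(text).__name__!r}'
        raise TypeError(msg) from None
    # ... rest of the implementation unchanged ...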

cc @AA-Turner

@AA-Turner (Member, Author)

I'll prepare a PR.
