gh-107369: optimize textwrap.indent() #107374

methane · 2023-07-28T05:58:20Z

Running indent() on Objects/unicodeobject.c (15332 lines) is about 25% faster.

Issue: Optimize textwrap.indent() #107369

eendebakpt

Looks good! Using str.strip for the predicate instead of line.strip() might change behavior for input that is not a str, but I think this is OK.

serhiy-storchaka

lstrip is faster for non-indented lines.

I wonder whether the following variants could be faster for some inputs, and for how wide a category of inputs.

def predicate(line):
    return line and (not line[0].isspace() or line.lstrip())

or

predicate = re.compile(r'\S').search

methane · 2023-07-28T16:51:29Z

re.compile(r'\S').search (bound to a module-level global _has_nonspace, then predicate = _has_nonspace) = 3.5ms
str.rstrip = 1.95ms
str.lstrip = 2.03ms
lambda x: not x.isspace() = 2.07ms

Since we use splitlines(keepends=True), we can use just not x.isspace(), because an empty line is guaranteed never to appear: "".splitlines(keepends=True) == [] and "foo\n".splitlines(True) == ['foo\n'].
But it is a bit tricky and carries a relatively high cognitive load.
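
The guarantee described above is easy to check by hand. The following is a minimal sketch, not code from the PR; the sample strings are arbitrary and chosen only for illustration.

```python
# Illustrative check (not part of the PR): splitlines(keepends=True)
# never yields an empty string, so `not line.isspace()` is safe on its output.
samples = ["", "\n", "foo\n", "foo\n\n  \nbar", "  \t\n"]

for text in samples:
    lines = text.splitlines(keepends=True)
    # Every element holds at least a line terminator or some text.
    assert all(line != "" for line in lines)

# An empty input produces no lines at all, not a single empty line.
empty_result = "".splitlines(keepends=True)

# Blank and whitespace-only lines keep their terminator, so isspace() is True.
blank_lines = "foo\n\n  \nbar".splitlines(keepends=True)
flags = [not line.isspace() for line in blank_lines]

print(empty_result)
print(blank_lines)
print(flags)
```

The subtle part is that the safety of the shorter predicate depends on where the lines come from, which is probably the cognitive load being referred to.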

In the case of unicodeobject.c, rstrip is a bit faster. But that may be because most lines are already indented.

So I chose str.lstrip here, as Serhiy suggested.

serhiy-storchaka · 2023-07-28T18:52:30Z

Now that you mention it, I can see that using isspace() is the most obvious way to do this. Why did I not see it earlier?

We want to test whether the line has any non-space character. bool(line.strip()) is actually a tricky way to do it: we strip the spaces from the line, and if the rest is not an empty string, then the original line has non-space characters too. not line.isspace() is the straightforward way: it asks the opposite question (does the line contain only space characters?) and negates the result.

Algorithmically, isspace() looks preferable because it does not create a new string. But in practice it may not matter in common cases. Did you compare the variants with different inputs? For example, Misc/NEWS.d/3.8.0a1.rst may show a very different result.

Lib/textwrap.py

methane · 2023-07-29T02:18:13Z

> Now that you mention it, I can see that using isspace() is the most obvious way to do this. Why did I not see it earlier?

Because "".isspace() is False. We need to guarantee that "" is never passed here.
x and not x.isspace() would be a bit more obvious, but a little slower.
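
To make the empty-string pitfall concrete, here is a small sketch that compares the three predicates discussed in this thread. The helper names (by_strip, by_isspace, by_guarded_isspace) are made up for this illustration and do not appear in textwrap.

```python
# Hypothetical helper names, used only to compare the predicates side by side.
def by_strip(line):
    # Truthy when anything is left after stripping whitespace.
    return bool(line.strip())

def by_isspace(line):
    # Shorter, but wrong for "": "".isspace() is False, so this returns True.
    return not line.isspace()

def by_guarded_isspace(line):
    # Guards against the empty string before asking isspace().
    return bool(line and not line.isspace())

cases = ["", " ", "\n", " x\n", "x"]
table = {c: (by_strip(c), by_isspace(c), by_guarded_isspace(c)) for c in cases}

for case, row in table.items():
    print(repr(case), row)
```

Only the empty string separates the unguarded isspace() predicate from the other two, which is why the guarantee about splitlines(True) matters.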

> Algorithmically, isspace() looks preferable because it does not create a new string. But in practice it may not matter in common cases. Did you compare the variants with different inputs? For example, Misc/NEWS.d/3.8.0a1.rst may show a very different result.

lstrip() is slow when every line has a long indent. But Misc/NEWS.d/3.8.0a1.rst has almost no indentation.

With 4c6a46a and https://gist.github.com/methane/5c6153c564d9508199a81c48d33161eb

> ./python.exe bench_indent.py Misc/NEWS.d/3.8.0a1.rst
filename='Misc/NEWS.d/3.8.0a1.rst' 8978 lines.
                   lstrip: 0.736msec
          not x.isspace(): 0.877msec
    x and not x.isspace(): 0.929msec

> ./python.exe bench_indent.py Objects/unicodeobject.c
filename='Objects/unicodeobject.c' 15332 lines.
                   lstrip: 1.812msec
          not x.isspace(): 1.877msec
    x and not x.isspace(): 1.970msec

If I add text = textwrap.indent(text, " "*32) before the benchmark:

> ./python.exe bench_indent.py Objects/unicodeobject.c
filename='Objects/unicodeobject.c' 15332 lines.
                   lstrip: 2.259msec
          not x.isspace(): 2.356msec
    x and not x.isspace(): 2.437msec
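
The bench_indent.py script lives in the gist linked above and is not reproduced here. For readers who want to try a similar comparison, the following is a rough sketch using timeit. It is an assumption about how such a benchmark could be written, not the author's script: make_text, PREDICATES, and bench are hypothetical names, and the synthetic input only stands in for a real source file, so absolute numbers will differ from those reported above.

```python
import textwrap
import timeit

def make_text(n_lines=2000, indent_width=0):
    """Build synthetic input (a stand-in for a real file, not the PR's data)."""
    body = ["int x = 0;\n", "\n", "    return x;\n", "   \n"]
    lines = [" " * indent_width + body[i % len(body)] for i in range(n_lines)]
    return "".join(lines)

# The three predicates compared in the thread.
PREDICATES = {
    "lstrip": str.lstrip,
    "not x.isspace()": lambda x: not x.isspace(),
    "x and not x.isspace()": lambda x: x and not x.isspace(),
}

def bench(text, number=20):
    """Return the best time per indent() call in milliseconds, per predicate."""
    results = {}
    for name, pred in PREDICATES.items():
        timer = timeit.Timer(lambda: textwrap.indent(text, "    ", pred))
        best = min(timer.repeat(repeat=3, number=number))
        results[name] = best / number * 1000
    return results

if __name__ == "__main__":
    # Width 32 mimics the "long indent on every line" case described above.
    for width in (0, 32):
        print(f"indent_width={width}")
        for name, msec in bench(make_text(indent_width=width)).items():
            print(f"{name:>25}: {msec:.3f}msec")
```

Taking the minimum over several repeats is a common way to reduce noise in micro-benchmarks like this one.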

methane · 2023-07-29T02:46:45Z

To maximize performance, we can stop using a lambda altogether:

    if predicate is None:
        for line in text.splitlines(True):
            if not line.isspace():
                prefixed_lines.append(prefix)
            prefixed_lines.append(line)
    else:
        for line in text.splitlines(True):
            if predicate(line):
                prefixed_lines.append(prefix)
            prefixed_lines.append(line)

filename='Objects/unicodeobject.c' 15332 lines.
                     None: 1.604msec
                   lstrip: 1.826msec
          not x.isspace(): 1.883msec
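
For context, the fragment above can be wrapped into a complete function. This is a sketch built from that fragment, not the code merged in this PR; indent_sketch is a hypothetical name chosen so it is not confused with textwrap.indent.

```python
import textwrap

def indent_sketch(text, prefix, predicate=None):
    """Prefix selected lines of text, with the default test inlined (no lambda)."""
    prefixed_lines = []
    if predicate is None:
        # splitlines(True) never yields "", so `not line.isspace()` is enough.
        for line in text.splitlines(True):
            if not line.isspace():
                prefixed_lines.append(prefix)
            prefixed_lines.append(line)
    else:
        # A user-supplied predicate decides which lines get the prefix.
        for line in text.splitlines(True):
            if predicate(line):
                prefixed_lines.append(prefix)
            prefixed_lines.append(line)
    return "".join(prefixed_lines)

sample = "a\n\n  \n b\n"
print(repr(indent_sketch(sample, "> ")))
print(indent_sketch(sample, "> ") == textwrap.indent(sample, "> "))
```

Duplicating the loop trades a little repetition for one fewer function call per line in the default case.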

serhiy-storchaka

Thank you for your research, Inada-san. I leave it up to you which to use here, lstrip or isspace. It does not really matter in most cases.

picnixz · 2023-07-29T12:01:19Z

For very long texts, I think changing

prefixed_lines = []
for line in text.splitlines(True):
    if not line.isspace():
        prefixed_lines.append(prefix)
    prefixed_lines.append(line)

into the following may improve overall performance:

prefixed_lines = []
append_line = prefixed_lines.append
for line in text.splitlines(True):
    if not line.isspace():
        append_line(prefix)
    append_line(line)

EDIT: After more careful benchmarking, this does not seem to bring any further improvement. However, not using a lambda function does seem to be better.
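
A comparison along these lines can be reproduced with a sketch such as the one below. The function names (join_plain, join_cached, compare) and the SAMPLE text are assumptions made for illustration; results will depend on the interpreter version and the input.

```python
import timeit

def join_plain(text, prefix):
    # Looks up prefixed_lines.append on every call.
    prefixed_lines = []
    for line in text.splitlines(True):
        if not line.isspace():
            prefixed_lines.append(prefix)
        prefixed_lines.append(line)
    return "".join(prefixed_lines)

def join_cached(text, prefix):
    # Caches the bound method once, as proposed in the comment above.
    prefixed_lines = []
    append_line = prefixed_lines.append
    for line in text.splitlines(True):
        if not line.isspace():
            append_line(prefix)
        append_line(line)
    return "".join(prefixed_lines)

# Synthetic input; a stand-in for a "very long text".
SAMPLE = "def f():\n    return 1\n\n" * 5000

def compare(number=20):
    """Return the best total time in seconds for each variant."""
    return {
        "plain": min(timeit.repeat(lambda: join_plain(SAMPLE, "  "),
                                   repeat=3, number=number)),
        "cached": min(timeit.repeat(lambda: join_cached(SAMPLE, "  "),
                                    repeat=3, number=number)),
    }

if __name__ == "__main__":
    for name, seconds in compare().items():
        print(f"{name:>6}: {seconds:.4f}s")
```

Both variants produce identical output, so any difference is purely in attribute-lookup overhead.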

methane added 2 commits July 28, 2023 13:36

optimize textwrap.indent()

94ab051

Add NEWS

8c5896c

bedevere-bot mentioned this pull request Jul 28, 2023

Optimize textwrap.indent() #107369

Closed

bedevere-bot added the awaiting core review label Jul 28, 2023

methane added the performance (Performance or resource usage) and stdlib (Python modules in the Lib dir) labels Jul 28, 2023

Add what's new entry

6ee731c

eendebakpt approved these changes Jul 28, 2023

serhiy-storchaka reviewed Jul 28, 2023

Use lstrip instead of strip

fad98a2

eendebakpt reviewed Jul 28, 2023

avoid temporary tuple.

4c6a46a

methane added 2 commits July 29, 2023 12:34

use str.isspace instead of lstrip

5e60878

add comment about splitlines(True)

16e3dbd

serhiy-storchaka approved these changes Jul 29, 2023

bedevere-bot added awaiting merge and removed awaiting core review labels Jul 29, 2023

25% -> 30%

734fd01

methane enabled auto-merge (squash) July 29, 2023 06:03

methane merged commit 37551c9 into python:main Jul 29, 2023

methane deleted the opt-textwrap-indent branch July 29, 2023 06:37

bedevere-bot removed the awaiting merge label Jul 29, 2023

This was referenced Jul 29, 2023

Optimize textwrap.indent a bit more. #107424

Closed

gh-107424: avoid using lambda functions in textwrap.indent() #107426

Closed
