gh-131791: Improve speed of textwrap.dedent by replacing re #131792

Status: Closed (18 commits in the pull request; this view shows changes from 7 commits)
82 changes: 44 additions & 38 deletions Lib/textwrap.py
@@ -6,6 +6,7 @@
 # Written by Greg Ward <[email protected]>

 import re
+import os

 __all__ = ['TextWrapper', 'wrap', 'fill', 'dedent', 'indent', 'shorten']

@@ -413,9 +414,6 @@ def shorten(text, width, **kwargs):

 # -- Loosely related functionality -------------------------------------

-_whitespace_only_re = re.compile('^[ \t]+$', re.MULTILINE)
-_leading_whitespace_re = re.compile('(^[ \t]*)(?:[^ \t\n])', re.MULTILINE)
-
 def dedent(text):
     """Remove any common leading whitespace from every line in `text`.

@@ -429,42 +427,50 @@ def dedent(text):

     Entirely blank lines are normalized to a newline character.
     """
-    # Look for the longest leading string of spaces and tabs common to
-    # all lines.
-    margin = None
-    text = _whitespace_only_re.sub('', text)
-    indents = _leading_whitespace_re.findall(text)
-    for indent in indents:
-        if margin is None:
-            margin = indent
-
-        # Current line more deeply indented than previous winner:
-        # no change (previous winner is still on top).
-        elif indent.startswith(margin):
-            pass
-
-        # Current line consistent with and no deeper than previous winner:
-        # it's the new winner.
-        elif margin.startswith(indent):
-            margin = indent
-
-        # Find the largest common whitespace between current line and previous
-        # winner.
-        else:
-            for i, (x, y) in enumerate(zip(margin, indent)):
-                if x != y:
-                    margin = margin[:i]
-                    break
-
-    # sanity check (testing/debugging only)
-    if 0 and margin:
-        for line in text.split("\n"):
-            assert not line or line.startswith(margin), \
-                   "line = %r, margin = %r" % (line, margin)
+    # Fast paths for empty or simple text
+    if not text:
+        return text

-    if margin:
-        text = re.sub(r'(?m)^' + margin, '', text)
-    return text
+    if "\n" not in text:
+        return text  # Single line has no dedent
+
+    # Split text into lines, preserving line endings
+    lines = text.splitlines(keepends=True)
+
+    # Process in a single pass to find:
+    # 1. Leading whitespace of non-blank lines
+    # 2. Whether a line has zero leading whitespace (optimization)
+    non_blank_whites = []
+    has_zero_margin = False
+
+    for line in lines:
+        stripped = line.strip()
+        if stripped:  # Non-blank line
+            leading = line[:len(line) - len(line.lstrip())]
+            non_blank_whites.append(leading)
+            # Early detection of zero margin case
+            if not leading:
+                has_zero_margin = True
+                break  # No need to check more lines
+
+    # If all lines are blank, normalize them
+    if not non_blank_whites:
+        # Preallocate result list
+        return "".join(["\n" if line.endswith("\n") else "" for line in lines])
Review comment (Contributor):
How often does this occur? You could leave it out, since os.path.commonprefix([]) is '', which gives margin_len = 0, so the case is already handled correctly. That would make the code smaller.
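The reviewer's point can be checked directly. The following is a minimal sketch that uses only the standard library; the variable names mirror the patch, and the empty list stands for input in which every line is blank.

```python
import os.path

# No non-blank lines were collected, e.g. the input consisted only of blank lines.
non_blank_whites = []

# commonprefix of an empty list is the empty string, so the margin is zero
# without any dedicated branch for the all-blank case.
common = os.path.commonprefix(non_blank_whites)
margin_len = len(common)

print(repr(common), margin_len)
```

With `margin_len` equal to 0, the later "no common margin" branch already normalizes the blank lines, which is why the dedicated early return could be dropped.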


+    # Skip commonprefix calculation if we already know there's no margin
+    if has_zero_margin:
+        margin_len = 0
+    else:
+        common = os.path.commonprefix(non_blank_whites)
+        margin_len = len(common)
+
+    # No common margin case - just normalize blank lines
+    if margin_len == 0:
+        return "".join([line if line.strip() else "\n" if line.endswith("\n") else "" for line in lines])
+
+    # Apply margin removal (most common case) with minimal operations
+    return "".join([line[margin_len:] if line.strip() else "\n" if line.endswith("\n") else "" for line in lines])


 def indent(text, prefix, predicate=None):
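For reference, the behaviour that any rewrite of `dedent` has to preserve can be illustrated with the standard-library function itself. This is a small sketch, not part of the patch; the input strings are made up for illustration.

```python
import textwrap

# A common margin of four spaces is removed; deeper indentation is kept.
margin_removed = textwrap.dedent("    a\n      b\n")
print(repr(margin_removed))

# Whitespace-only lines become a bare newline and do not affect the margin.
blank_normalized = textwrap.dedent("    a\n  \n    b\n")
print(repr(blank_normalized))

# Tabs and spaces are not considered equal, so these lines share no margin.
mixed_unchanged = textwrap.dedent("    a\n\tb\n")
print(repr(mixed_unchanged))
```

These three cases (common margin, blank-line normalization, mixed tabs and spaces) are the ones a replacement implementation is most likely to get subtly wrong, so they make reasonable regression checks.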
@@ -0,0 +1,2 @@
+Optimized :func:`textwrap.dedent`. It is now 4x faster than before for large inputs. The function also gains an ``only_whitespace`` argument, which makes it possible to remove any common prefix instead of just whitespace.
+Patch by Marius Juston.
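A speed claim like this can be checked locally with `timeit`. The sketch below is an assumption-laden illustration, not the PR's benchmark: `dedent_regex` is a copy of the regex-based logic removed by the diff (debug-only check omitted), `dedent_split` is a hypothetical condensed form of the `splitlines`/`commonprefix` approach, and the synthetic input is invented. Timings depend on the machine and Python version, so no ratio is asserted here.

```python
import os.path
import re
import timeit

_whitespace_only_re = re.compile('^[ \t]+$', re.MULTILINE)
_leading_whitespace_re = re.compile('(^[ \t]*)(?:[^ \t\n])', re.MULTILINE)


def dedent_regex(text):
    """Regex-based logic as removed by the diff (debug-only check omitted)."""
    margin = None
    text = _whitespace_only_re.sub('', text)
    for indent in _leading_whitespace_re.findall(text):
        if margin is None:
            margin = indent
        elif indent.startswith(margin):
            pass
        elif margin.startswith(indent):
            margin = indent
        else:
            for i, (x, y) in enumerate(zip(margin, indent)):
                if x != y:
                    margin = margin[:i]
                    break
    if margin:
        text = re.sub(r'(?m)^' + margin, '', text)
    return text


def dedent_split(text):
    """Hypothetical condensed form of the splitlines/commonprefix approach."""
    lines = text.splitlines(keepends=True)
    indents = [line[:len(line) - len(line.lstrip())]
               for line in lines if line.strip()]
    margin_len = len(os.path.commonprefix(indents))
    return "".join(
        line[margin_len:] if line.strip()
        else "\n" if line.endswith("\n") else ""
        for line in lines
    )


# Synthetic large input: 20,000 lines, every tenth one whitespace-only.
large = "".join(
    "        line %d\n" % i if i % 10 else "   \n" for i in range(20_000)
)

t_regex = timeit.timeit(lambda: dedent_regex(large), number=5)
t_split = timeit.timeit(lambda: dedent_split(large), number=5)
print(f"regex: {t_regex:.3f}s  splitlines/commonprefix: {t_split:.3f}s")
```

Note that `str.strip()` and `str.splitlines()` recognize more characters than the space and tab handled by the regexes, so the two functions are only expected to agree on space- or tab-indented text with `\n` line endings, as in this input.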