difflib._check_types allows string inputs instead of sequences of strings as documented #115801

lampwins · 2024-02-22T03:26:39Z

Bug report

Bug description:

Both difflib.unified_diff and difflib.context_diff document that the a and b arguments are lists of strings. These functions check their argument types via difflib._check_types, but that function also accepts a plain str for a or b. This does not cause a failure in difflib.unified_diff (I assume the same is true for context_diff, but I have not tested it). However, for very large strings a and/or b, difflib.unified_diff is dramatically slower to calculate the diff, because the underlying SequenceMatcher then compares two sequences of single characters rather than two lists of lines, so it has far more elements to match.
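To make the difference concrete, here is a small sketch that calls SequenceMatcher directly (rather than unified_diff) on the same text, once as raw strings and once as lists of lines. The sample text and variable names are made up for illustration; the point is only that a str is treated as a sequence of characters.

```python
import difflib

# Illustrative sample data, not taken from the report.
old = "alpha\nbeta\ngamma\n"
new = "alpha\nBETA\ngamma\n"

# A str is itself a sequence, so SequenceMatcher compares it element by
# element, and each element is a single character.
char_matcher = difflib.SequenceMatcher(None, old, new)
char_opcodes = char_matcher.get_opcodes()

# The documented shape: one element per line.
old_lines = old.splitlines(keepends=True)
new_lines = new.splitlines(keepends=True)
line_matcher = difflib.SequenceMatcher(None, old_lines, new_lines)
line_opcodes = line_matcher.get_opcodes()

# Number of elements the matcher has to consider in each case.
print(len(old), "characters versus", len(old_lines), "lines")
print(line_opcodes)
```

With real files the gap between the character count and the line count grows with the average line length, which is where the slowdown comes from.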

The implementation of _check_types only checks the type of the first element of a and b. That check passes for the documented input, a list of strings, but it also passes for any non-empty string, because indexing a str yields a str.

def _check_types(a, b, *args):
    # Checking types is weird, but the alternative is garbled output when
    # someone passes mixed bytes and str to {unified,context}_diff(). E.g.
    # without this check, passing filenames as bytes results in output like
    #   --- b'oldfile.txt'
    #   +++ b'newfile.txt'
    # because of how str.format() incorporates bytes objects.
    if a and not isinstance(a[0], str):
        raise TypeError('lines to compare must be str, not %s (%r)' %
                        (type(a[0]).__name__, a[0]))
    if b and not isinstance(b[0], str):
        raise TypeError('lines to compare must be str, not %s (%r)' %
                        (type(b[0]).__name__, b[0]))
    for arg in args:
        if not isinstance(arg, str):
            raise TypeError('all arguments must be str, not: %r' % (arg,))
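The following sketch isolates the first-element condition from the function quoted above to show which inputs get through. The helper name passes_first_element_check is hypothetical and is not part of difflib.

```python
def passes_first_element_check(seq):
    # Mirrors only the `a and not isinstance(a[0], str)` condition from
    # _check_types; returns True when that condition would NOT raise.
    return not (seq and not isinstance(seq[0], str))

results = {
    # Documented input: first element is a str, so the check passes.
    "list of str": passes_first_element_check(["one\n", "two\n"]),
    # A plain str also passes, because "one\ntwo\n"[0] is the str "o".
    "plain str": passes_first_element_check("one\ntwo\n"),
    # An empty str is falsy, so the element check is skipped entirely.
    "empty str": passes_first_element_check(""),
    # A list of bytes is the case the check was written to reject.
    "list of bytes": passes_first_element_check([b"one\n"]),
}

for label, ok in results.items():
    print(f"{label}: {'accepted' if ok else 'rejected'}")
```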

This would be fairly trivial to fix in _check_types, but I first want to confirm: is the documented input, a list of strings, the only intended behavior?
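As a rough idea of what a stricter check could look like, here is a minimal sketch. It is an assumption for discussion, not the change that was eventually merged; the function name check_line_sequences and the error messages are hypothetical.

```python
def check_line_sequences(a, b):
    """Hypothetical stricter variant of the first-element check.

    Rejects a bare str (or bytes) up front, then applies the same
    first-element test that _check_types already performs.
    """
    for name, seq in (("a", a), ("b", b)):
        if isinstance(seq, (str, bytes)):
            raise TypeError(
                "%s must be a sequence of str lines, not %s"
                % (name, type(seq).__name__)
            )
        if seq and not isinstance(seq[0], str):
            raise TypeError(
                "lines to compare must be str, not %s (%r)"
                % (type(seq[0]).__name__, seq[0])
            )

# Documented input is accepted silently.
check_line_sequences(["one\n"], ["two\n"])

# A bare string is rejected instead of being diffed per character.
try:
    check_line_sequences("one\n", ["two\n"])
except TypeError as exc:
    print("rejected:", exc)
```

Rejecting str before indexing keeps the existing error for mixed bytes and str unchanged, and only adds a new error for the undocumented case.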

CPython versions tested on:

3.8, 3.9, 3.10, 3.11, 3.12, 3.13, CPython main branch

Operating systems tested on:

Linux, macOS

Linked PRs

gh-115801: Only allow sequence of strings as input for difflib.unified_diff #118333

serhiy-storchaka · 2024-06-10T11:12:18Z

I agree that passing strings instead of sequences of strings is most likely a user bug. But there is a tiny chance that this is used intentionally, so it is safer to not change this in maintained releases.
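For code that has to run on releases where the behavior is unchanged, a caller-side guard is one option. The sketch below is only an illustration: the helper as_lines and the sample file names are hypothetical. It normalizes a str into a list of lines before calling difflib.unified_diff.

```python
import difflib

def as_lines(text_or_lines):
    # Hypothetical helper: turn a whole-file string into the documented
    # list-of-lines shape; pass other sequences through as a list.
    if isinstance(text_or_lines, str):
        return text_or_lines.splitlines(keepends=True)
    return list(text_or_lines)

old = "alpha\nbeta\ngamma\n"              # whole file as one string
new = ["alpha\n", "BETA\n", "gamma\n"]    # already a list of lines

diff = list(
    difflib.unified_diff(
        as_lines(old),
        as_lines(new),
        fromfile="old.txt",
        tofile="new.txt",
    )
)
print("".join(diff), end="")
```

Because both inputs reach unified_diff as lists of lines, the call does not depend on whether a given Python version accepts or rejects plain strings.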

lampwins added the type-bug An unexpected behavior, bug, or error label Feb 22, 2024

bedevere-app bot mentioned this issue Apr 26, 2024

gh-115801: Only allow sequence of strings as input for difflib.unified_diff #118333

Merged

serhiy-storchaka pushed a commit that referenced this issue Jun 10, 2024

gh-115801: Only allow sequence of strings as input for difflib.unified_diff (GH-118333)

c3b6dbf

serhiy-storchaka closed this as completed Jun 10, 2024

mrahtz pushed a commit to mrahtz/cpython that referenced this issue Jun 30, 2024

pythongh-115801: Only allow sequence of strings as input for difflib.unified_diff (pythonGH-118333)

c793adb

noahbkim pushed a commit to hudson-trading/cpython that referenced this issue Jul 11, 2024

pythongh-115801: Only allow sequence of strings as input for difflib.unified_diff (pythonGH-118333)

1918756

estyxx pushed a commit to estyxx/cpython that referenced this issue Jul 17, 2024

pythongh-115801: Only allow sequence of strings as input for difflib.unified_diff (pythonGH-118333)

896d135
