Both difflib.unified_diff and difflib.context_diff document that the a and b arguments must be lists of strings. These functions check argument types via difflib._check_types, but that function also accepts a and b passed directly as str. Technically this does not cause a failure in difflib.unified_diff (I assume the same is true for context_diff, but I have not tested it). However, for very large strings a and/or b, difflib.unified_diff is dramatically slower to compute the diff, because the underlying SequenceMatcher then compares two sequences of individual characters instead of two much shorter lists of lines.
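To make the difference concrete, here is a minimal sketch that feeds the same text to unified_diff both ways. The sample strings and variable names are made up for illustration; only difflib.unified_diff and str.splitlines come from the standard library.

```python
import difflib

# Hypothetical sample inputs, chosen only for illustration.
old = "alpha\nbeta\ngamma\n"
new = "alpha\nBETA\ngamma\n"

# Documented usage: a and b are lists of lines.
by_line = list(difflib.unified_diff(old.splitlines(keepends=True),
                                    new.splitlines(keepends=True)))

# Accidental usage: bare strings are accepted and compared
# one character at a time, with no exception raised.
by_char = list(difflib.unified_diff(old, new))

print(len(old.splitlines()), "lines versus", len(old), "characters")
print("".join(by_line))
print(by_char[3:8])
```

Neither call raises, but the second one treats every character as a "line", so the sequences handed to SequenceMatcher grow from the number of lines to the number of characters.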
We can see that _check_types only checks the type of the first element of a and b. That check passes for the documented input, a list of strings, but it also passes for any non-empty string, because indexing a str yields another str.
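The reason the first-element check cannot tell the two cases apart can be shown in a few lines. This is only an illustration: the helper name first_element_is_str is hypothetical and simply mirrors the isinstance test applied to the first element.

```python
def first_element_is_str(seq):
    # Hypothetical helper mirroring the first-element test:
    # an empty input is let through, otherwise seq[0] must be a str.
    return not seq or isinstance(seq[0], str)

lines = ["first line\n", "second line\n"]   # documented input
text = "first line\nsecond line\n"          # a bare string
raw = [b"first line\n"]                     # bytes lines

print(first_element_is_str(lines))  # True
print(first_element_is_str(text))   # True, because text[0] == "f" is a str
print(first_element_is_str(raw))    # False
```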
    def _check_types(a, b, *args):
        # Checking types is weird, but the alternative is garbled output when
        # someone passes mixed bytes and str to {unified,context}_diff(). E.g.
        # without this check, passing filenames as bytes results in output like
        #   --- b'oldfile.txt'
        #   +++ b'newfile.txt'
        # because of how str.format() incorporates bytes objects.
        if a and not isinstance(a[0], str):
            raise TypeError('lines to compare must be str, not %s (%r)' %
                            (type(a[0]).__name__, a[0]))
        if b and not isinstance(b[0], str):
            raise TypeError('lines to compare must be str, not %s (%r)' %
                            (type(b[0]).__name__, b[0]))
        for arg in args:
            if not isinstance(arg, str):
                raise TypeError('all arguments must be str, not: %r' % (arg,))
This would be rather trivial to fix in _check_types, but I first want to confirm that the documented contract (a and b are lists of strings) is the intended behavior.
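For discussion, one possible shape of a stricter check is sketched below. This is an assumption on my part, not a proposed patch: the function name check_types_strict and the wording of the new error message are hypothetical, and whether rejecting bare strings is acceptable at all is exactly the open question.

```python
def check_types_strict(a, b, *args):
    # Hypothetical variant of the check: reject a bare str for a or b
    # before looking at the first element.
    for name, seq in (("a", a), ("b", b)):
        if isinstance(seq, str):
            raise TypeError(
                "%s must be a sequence of str lines, not a single str" % name)
        if seq and not isinstance(seq[0], str):
            raise TypeError('lines to compare must be str, not %s (%r)' %
                            (type(seq[0]).__name__, seq[0]))
    # Remaining arguments (file names, dates, lineterm) must still be str.
    for arg in args:
        if not isinstance(arg, str):
            raise TypeError('all arguments must be str, not: %r' % (arg,))


check_types_strict(["x\n"], ["y\n"], "old.txt", "new.txt")  # accepted

try:
    check_types_strict("x\n", ["y\n"])
except TypeError as exc:
    print(exc)
```

Callers that pass strings on purpose would start getting a TypeError under this variant, so it changes behavior rather than just tightening validation.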
CPython versions tested on:
3.8, 3.9, 3.10, 3.11, 3.12, 3.13, CPython main branch
I agree that passing strings instead of sequences of strings is most likely a user bug. But there is a tiny chance that this is used intentionally, so it is safer to not change this in maintained releases.
Operating systems tested on:
Linux, macOS