gh-88500: Reduce memory use of urllib.unquote
#96763
Merged
`urllib.parse.unquote_to_bytes` and `urllib.parse.unquote` could both potentially generate `O(len(string))` intermediate `bytes` or `str` objects while computing the unquoted final result, depending on the input provided. As Python objects are relatively large, this could consume a lot of RAM.

This switches the implementation to using an expanding `bytearray` and a generator internally instead of precomputed `split()` style operations.

Microbenchmarks with some antagonistic inputs like `mess = "\u0141%%%20a%fe"*1000` show this is 10-20% slower for `unquote` and `unquote_to_bytes`, and no different for typical inputs that are short or lack much unicode or % escaping. The functions are already quite fast, so this is not a big deal. The slowdown scales linearly with input size, as expected.

Memory usage was observed manually using `/usr/bin/time -v` on `python -m timeit` runs of larger inputs. Unit testing memory consumption is difficult and does not seem worthwhile.

Observed memory usage is roughly 1/2 of the previous amount for `unquote()` and under 1/3 for `unquote_to_bytes()`, using `python -m timeit -s 'from urllib.parse import unquote, unquote_to_bytes; v="\u0141%01\u0161%20"*500_000' 'unquote_to_bytes(v)'` as a test.

Closes #88500.
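For readers who want to see the general shape of the "expanding `bytearray` plus generator" idea, here is a minimal sketch. It is not the code merged in this PR: the names `iter_unquoted_chunks` and `sketch_unquote_to_bytes` are hypothetical, and the regex-based scanning is an assumption chosen for brevity. The point is only that decoded pieces are produced one at a time and appended to a single growing buffer, rather than first building a list of `O(len(string))` fragments.

```python
import re
from urllib.parse import unquote_to_bytes

# Matches a percent sign followed by exactly two hex digits.
_HEX_ESCAPE = re.compile(rb"%([0-9a-fA-F]{2})")


def iter_unquoted_chunks(data: bytes):
    """Yield literal runs and decoded escape bytes one at a time (hypothetical helper)."""
    pos = 0
    for match in _HEX_ESCAPE.finditer(data):
        if match.start() > pos:
            # Literal bytes between the previous escape and this one.
            yield data[pos:match.start()]
        # The single byte encoded by this %XX escape.
        yield bytes([int(match.group(1), 16)])
        pos = match.end()
    if pos < len(data):
        # Trailing literal bytes, including any malformed '%' sequences.
        yield data[pos:]


def sketch_unquote_to_bytes(string) -> bytes:
    """Illustrative only: accumulate chunks into one expanding bytearray."""
    if isinstance(string, str):
        string = string.encode("utf-8")
    out = bytearray()
    for chunk in iter_unquoted_chunks(string):
        out += chunk
    return bytes(out)


# The antagonistic input from the description decodes the same way
# as the standard library function.
mess = "\u0141%%%20a%fe" * 1000
assert sketch_unquote_to_bytes(mess) == unquote_to_bytes(mess)
```

Because each chunk is discarded as soon as it has been appended, the number of live intermediate objects stays small regardless of how many `%` escapes the input contains.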
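The numbers above were gathered with `/usr/bin/time -v`, which reports the peak resident set size of the whole process. As a rough in-process alternative, the standard library `tracemalloc` module can report the peak of Python-level allocations around a single call. This is a different technique from the one used in this PR, so absolute values will not match; the helper name `peak_traced_bytes` is hypothetical, and the input is scaled down from the description's `500_000` repetitions to keep the traced run short.

```python
import tracemalloc
from urllib.parse import unquote, unquote_to_bytes


def peak_traced_bytes(func, arg) -> int:
    """Return the peak number of bytes traced by tracemalloc while func(arg) runs."""
    tracemalloc.start()
    try:
        func(arg)
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


# Same pattern as the description's test input, with fewer repetitions.
v = "\u0141%01\u0161%20" * 50_000

for func in (unquote, unquote_to_bytes):
    print(f"{func.__name__}: peak ~{peak_traced_bytes(func, v):,} bytes")
```

Running this under a Python build from before and after the change gives a comparable pair of figures, though tracing adds overhead and only counts memory allocated through Python's allocators.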