gh-88500: Reduce memory use of urllib.unquote #96763

Merged
5 commits merged into python:main on Dec 11, 2022

Conversation

gpshead (Member) commented Sep 12, 2022

urllib.parse.unquote_to_bytes and urllib.parse.unquote could both generate O(len(string)) intermediate bytes or str objects while computing the final unquoted result, depending on the input provided. As Python objects are relatively large, this could consume a lot of RAM.

This switches the implementation to an expanding bytearray and an internal generator instead of precomputed split()-style operations.
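
For illustration, here is a minimal sketch of the expanding-bytearray idea. The names unquote_to_bytes_sketch and _hextobyte are assumptions for this example, not the code merged here, and the merged change also threads a generator through the str-returning unquote() path, which is omitted:

```python
# Sketch of the memory-saving idea only -- not the exact code merged in this
# PR.  Instead of collecting every decoded piece in a list and joining it at
# the end, decoded pieces are copied straight into one expanding bytearray,
# so no O(len(string)) collection of small bytes objects is kept alive.
_hextobyte = {f'{a}{b}'.encode(): bytes([int(a + b, 16)])
              for a in '0123456789abcdefABCDEF'
              for b in '0123456789abcdefABCDEF'}

def unquote_to_bytes_sketch(string: str) -> bytes:
    data = string.encode('utf-8')
    bits = data.split(b'%')
    result = bytearray(bits[0])        # single expanding buffer
    for item in bits[1:]:
        code = item[:2]
        if code in _hextobyte:         # valid %XX escape
            result += _hextobyte[code]
            result += item[2:]
        else:                          # malformed escape is kept verbatim
            result += b'%' + item
    return bytes(result)

assert unquote_to_bytes_sketch('a%20b%zz\u0141') == b'a b%zz\xc5\x81'
```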

Microbenchmarks with antagonistic inputs such as mess = "\u0141%%%20a%fe"*1000 show this is 10-20% slower for unquote and unquote_to_bytes, and no different for typical inputs that are short or contain little unicode or % escaping. The functions are already quite fast anyway, so this is not a big deal. The slowdown scales linearly with input size, as expected.
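
A rough, assumed harness for that kind of speed microbenchmark (the PR only names the antagonistic input, not the exact timing commands used):

```python
# Assumed micro-benchmark harness -- purely illustrative, not from the PR.
import timeit
from urllib.parse import unquote, unquote_to_bytes

mess = "\u0141%%%20a%fe" * 1000   # unicode plus lots of % escaping

for func in (unquote, unquote_to_bytes):
    n = 2_000
    secs = timeit.timeit(lambda: func(mess), number=n)
    print(f"{func.__name__}: {secs / n * 1e6:.1f} us per call")
```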

Memory usage was observed manually using /usr/bin/time -v on python -m timeit runs with larger inputs. Unit-testing memory consumption is difficult and does not seem worthwhile.

Observed memory usage is roughly half that of the previous implementation for unquote() and less than a third for unquote_to_bytes(), using python -m timeit -s 'from urllib.parse import unquote, unquote_to_bytes; v="\u0141%01\u0161%20"*500_000' 'unquote_to_bytes(v)' as a test.
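
The numbers above come from whole-process RSS via /usr/bin/time -v; as an in-process alternative sketch (not part of the PR), tracemalloc can report the peak allocation of a single call:

```python
# Alternative, in-process peak-allocation check using tracemalloc.  The PR
# itself measured whole-process memory with /usr/bin/time -v instead.
import tracemalloc
from urllib.parse import unquote_to_bytes

v = "\u0141%01\u0161%20" * 500_000   # same large input as the timeit run above

tracemalloc.start()
unquote_to_bytes(v)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"peak traced allocations: {peak / 2**20:.1f} MiB")
```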

Closes #88500.


gpshead added the type-feature (A feature request or enhancement), performance (Performance or resource usage), and stdlib (Python modules in the Lib dir) labels on Sep 16, 2022
gpshead marked this pull request as ready for review on September 16, 2022 at 08:28
gpshead (Member, Author) commented Oct 1, 2022

any thoughts from reviewers?

gpshead requested review from ambv and removed the review requests for ethanfurman and sweeneyde on November 11, 2022 at 09:21
gpshead merged commit 2e279e8 into python:main on Dec 11, 2022
gpshead deleted the gh/88500/unquote_mem_use branch on December 11, 2022 at 00:17
Labels
performance (Performance or resource usage), stdlib (Python modules in the Lib dir), type-feature (A feature request or enhancement)
Development

Successfully merging this pull request may close these issues.

Reduce memory usage of urllib.unquote and unquote_to_bytes