gh-88500: Reduce memory use of urllib.unquote #96763

Merged
5 commits merged into python:main on Dec 11, 2022

Conversation

gpshead (Member) commented Sep 12, 2022

urllib.parse.unquote_to_bytes and urllib.parse.unquote could both generate O(len(string)) intermediate bytes or str objects while computing the final unquoted result, depending on the input provided. As Python objects are relatively large, this could consume a lot of RAM.

This switches the implementation to an expanding bytearray and an internal generator instead of precomputed split()-style operations.
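
For illustration, here is a minimal sketch of the expanding-bytearray idea. The names unquote_to_bytes_sketch and _hextobyte are assumptions for this example, not the code merged here, and the merged change also threads a generator through the str-returning unquote() path, which is omitted:

```python
# Sketch of the memory-saving idea only -- not the exact code merged in this
# PR.  Instead of collecting every decoded piece in a list and joining it at
# the end, decoded pieces are copied straight into one expanding bytearray,
# so no O(len(string)) collection of small bytes objects is kept alive.
_hextobyte = {f'{a}{b}'.encode(): bytes([int(a + b, 16)])
              for a in '0123456789abcdefABCDEF'
              for b in '0123456789abcdefABCDEF'}

def unquote_to_bytes_sketch(string: str) -> bytes:
    data = string.encode('utf-8')
    bits = data.split(b'%')
    result = bytearray(bits[0])        # single expanding buffer
    for item in bits[1:]:
        code = item[:2]
        if code in _hextobyte:         # valid %XX escape
            result += _hextobyte[code]
            result += item[2:]
        else:                          # malformed escape is kept verbatim
            result += b'%' + item
    return bytes(result)

assert unquote_to_bytes_sketch('a%20b%zz\u0141') == b'a b%zz\xc5\x81'
```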

Microbenchmarks with antagonistic inputs such as mess = "\u0141%%%20a%fe"*1000 show this is 10-20% slower for unquote and unquote_to_bytes, and no different for typical inputs that are short or contain little unicode or % escaping. The functions are already quite fast anyway, so this is not a big deal. The slowdown scales linearly with input size, as expected.
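
A rough, assumed harness for that kind of speed microbenchmark (the PR only names the antagonistic input, not the exact timing commands used):

```python
# Assumed micro-benchmark harness -- purely illustrative, not from the PR.
import timeit
from urllib.parse import unquote, unquote_to_bytes

mess = "\u0141%%%20a%fe" * 1000   # unicode plus lots of % escaping

for func in (unquote, unquote_to_bytes):
    n = 2_000
    secs = timeit.timeit(lambda: func(mess), number=n)
    print(f"{func.__name__}: {secs / n * 1e6:.1f} us per call")
```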

Memory usage was observed manually using /usr/bin/time -v on python -m timeit runs with larger inputs. Unit-testing memory consumption is difficult and does not seem worthwhile.

Observed memory usage is roughly half that of the previous implementation for unquote() and less than a third for unquote_to_bytes(), using python -m timeit -s 'from urllib.parse import unquote, unquote_to_bytes; v="\u0141%01\u0161%20"*500_000' 'unquote_to_bytes(v)' as a test.
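
The numbers above come from whole-process RSS via /usr/bin/time -v; as an in-process alternative sketch (not part of the PR), tracemalloc can report the peak allocation of a single call:

```python
# Alternative, in-process peak-allocation check using tracemalloc.  The PR
# itself measured whole-process memory with /usr/bin/time -v instead.
import tracemalloc
from urllib.parse import unquote_to_bytes

v = "\u0141%01\u0161%20" * 500_000   # same large input as the timeit run above

tracemalloc.start()
unquote_to_bytes(v)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"peak traced allocations: {peak / 2**20:.1f} MiB")
```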

Closes #88500.


gpshead added the type-feature (A feature request or enhancement), performance (Performance or resource usage), and stdlib (Python modules in the Lib dir) labels on Sep 16, 2022
gpshead marked this pull request as ready for review on September 16, 2022 at 08:28
gpshead (Member, Author) commented Oct 1, 2022

any thoughts from reviewers?

gpshead requested review from ambv and removed the review requests for ethanfurman and sweeneyde on November 11, 2022 at 09:21
gpshead merged commit 2e279e8 into python:main on Dec 11, 2022
gpshead deleted the gh/88500/unquote_mem_use branch on December 11, 2022 at 00:17
Labels
performance (Performance or resource usage), stdlib (Python modules in the Lib dir), type-feature (A feature request or enhancement)
Development

Successfully merging this pull request may close these issues.

Reduce memory usage of urllib.unquote and unquote_to_bytes