Skip to content

Never try to decode bytes as UTF-8 #14

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 28, 2020

Conversation

WGH-
Copy link
Contributor

@WGH- WGH- commented May 9, 2020

There're two issues in the existing code.

First, this heuristic often fails on short random strings,
making their type unreliable. For example, "t" (transaction ID) in
DHT protocol is just two bytes.

Second, the code assumed that they keys of dictionaries are always
UTF-8 encoded. Although that's true in many BitTorrent-related protocols,
bencoding doesn't mandate it, and exceptions do in fact exist:
see #6.

Also, change the code to consistently return bytes/str in all cases,
leaving support for passing encoding str/unicode strings, which are
transparently encoded to UTF-8.

There're two issues in the existing code.

First, this heuristic often fails on short random strings,
making their type unreliable. For example, "t" (transaction ID) in
DHT protocol is just two bytes.

Second, the code assumed that they keys of dictionaries are always
UTF-8 encoded. Although that's true in many BitTorrent-related protocols,
bencoding doesn't mandate it, and exceptions do in fact exist:
see fuzeman#6.

Also, change the code to consistently return bytes/str in all cases,
leaving support for passing encoding str/unicode strings, which are
transparently encoded to UTF-8.
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.03%) to 91.304% when pulling 44230fa on WGH-:remove-utf8-sniffing into e8290df on fuzeman:master.

2 similar comments
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.03%) to 91.304% when pulling 44230fa on WGH-:remove-utf8-sniffing into e8290df on fuzeman:master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.03%) to 91.304% when pulling 44230fa on WGH-:remove-utf8-sniffing into e8290df on fuzeman:master.

@fuzeman fuzeman merged commit feabc11 into fuzeman:master May 28, 2020
@fuzeman
Copy link
Owner

fuzeman commented May 28, 2020

Thanks for the contribution.

Note: I've made some minor changes to pass flake8, but it should functionally still be the same.

@sximba
Copy link

sximba commented Jun 11, 2020

I know this has been merged but I just thought I should comment here.

I know that this is technically correct, according to the bencode spec. However I believe this should not have been merged in this form. This behaviour should have been put behind a flag so that opting to the old behaviour is still possible. And maybe even allow passing the encoding scheme to try to decode the strings. The old behaviour was clearly working for a lot of people and was an issue for 1 person in 2018. Also, if the content being decode is heavily nested it will be inefficient to parse after and convert both the keys and values from bytes to strings.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants