Don't choke on (legitimately) invalidly encoded Unicode paths #467

nvie · 2016-06-07T08:06:37Z

We've come across path names that contain bytes that are invalid in UTF-8 encoded strings, even though they're very rare. My assumption here is these commits have been created by an old (buggy?) version of Git, and now live in the tree objects with this data. Since we return only unicode strings for the a_path and b_path properties, we're not able to decode this string and thus choke when asking for the diff.

This PR fixes that by using "replace" semantics when decoding. This will effectively replace the illegal bytes \200 (or \x80) by \ufffd (= �).

Follow-up discussion

However, this also means that if you would want to git-blame this file, there's no good way of referencing this path, since it's inherently a bytes path. Normally, when we pass unicode paths to git-blame via GitPython's blame API, the paths get converted to UTF-8 right before issuing the external command. But there's no way of getting the original bytes back after the "replace" operation happened.

Example:

The input path is b'illegal-\x80.txt' (containing illegal byte \x80)
When decoded to UTF-8 with characters replaced, we get the unicode string u'illegal-\ufffd.txt' (= "illegal-�.txt")
When encoding that in UTF-8, we find b'illegal-\xef\xbf\xbd.txt'

When we next pass illegal-\xef\xbf\xbd.txt to git-blame, it will not be able to find this path. Perhaps it would be a good idea to not only return the decoded path strings, but also provide access to the raw bytes found, i.e. by exposing a_rawpath and b_rawpath, which would always be bytes? That way, you could still have the friendly "unicode paths" for most use cases, but use bytes if you need to speak the language of Git more accurately.

Byron · 2016-06-14T05:36:56Z

@nvie Thanks for bringing this up ! I believe it's one of the worst decisions made in GitPython to attempt to decode everything using UTF-8 (or sometimes using whatever encoding Git hints us at). Especially for paths, doing this is plain wrong as the filesystem may use arbitrary encodings. This is the reason Rust uses OsStrings to represent these, as opposed to Strings which need to be valid UTF-8.
Unfortunately, back in the days Rust wasn't on my Radar, and also I didn't know what I know now.

That said, fixing the API in that regard might be a somewhat major undertaking certainly worth making. Without a major version upgrade, the proposed adjustment might be good enough.

The question is, what you as an API user think about these options - I'd be happy to hear your thoughts.

nvie · 2016-06-14T07:16:16Z

I don't think it's too bad in practice. In most parts where paths hit the UI, you'd want to use a decode(errors='replace')'ed version of the raw bytes anyway. If there are invalidly encoded bytes, it's really just fine to see the � chars in there. So I don't think it's too bad that the result of a Diff returns unicode strings for the paths. In practice, they work in >99.9% of all cases.

However, sometimes there just is a need to use the raw bytes, especially when the result of command A (say, a diff), defines the input of command B (say, a blame). Only in these cases, you need to have the exact bytes. I would still show the decode(errors='replace')'ed version of the path in the UI, but in order to ask Git for a blame on that file, I need access to the raw version.

I'll have a stab at adding those extra raw paths as properties.

Previously, the following fields on Diff instances were assumed to be passed in as unicode strings: - `a_path` - `b_path` - `rename_from` - `rename_to` However, since Git natively records paths as bytes, these may potentially not have a valid unicode representation. This patch changes the Diff instance to instead take the following equivalent fields that should be raw bytes instead: - `a_rawpath` - `b_rawpath` - `raw_rename_from` - `raw_rename_to` NOTE ON BACKWARD COMPATIBILITY: The original `a_path`, `b_path`, etc. fields are still available as properties (rather than slots). These properties now dynamically decode the raw bytes into a unicode string (performing the potentially destructive operation of replacing invalid unicode chars by "�"'s). This means that all code using Diffs should remain backward compatible. The only exception is when people would manually construct Diff instances by calling the constructor directly, in which case they should now pass in bytes rather than unicode strings. See also the discussion on #467

yarikoptic · 2016-09-01T20:47:40Z

I guess related: #505 -- I have tried to run tests using TMPDIR with unicode in it... actually may be it is only marginally related since paths in that case are unicode outside of the repository, not within (as it is discussed here), correct?

Don't choke on (legitimately) invalidly encoded Unicode paths

200d3c6

nvie assigned Byron Jun 7, 2016

Byron added the acknowledged label Jun 14, 2016

Byron added this to the v2.0.6 - Bugfixes milestone Jun 14, 2016

Byron merged commit d22c40b into master Jun 14, 2016

Byron deleted the fix-dont-choke-on-invalid-unicode-paths branch June 14, 2016 05:38

nvie mentioned this pull request Jun 14, 2016

Store raw path bytes in Diff instances #474

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Don't choke on (legitimately) invalidly encoded Unicode paths #467

Don't choke on (legitimately) invalidly encoded Unicode paths #467

nvie commented Jun 7, 2016 •

edited

Loading

Byron commented Jun 14, 2016

nvie commented Jun 14, 2016 •

edited

Loading

yarikoptic commented Sep 1, 2016 •

edited

Loading

Don't choke on (legitimately) invalidly encoded Unicode paths #467

Don't choke on (legitimately) invalidly encoded Unicode paths #467

Conversation

nvie commented Jun 7, 2016 • edited Loading

Follow-up discussion

Byron commented Jun 14, 2016

nvie commented Jun 14, 2016 • edited Loading

yarikoptic commented Sep 1, 2016 • edited Loading

nvie commented Jun 7, 2016 •

edited

Loading

nvie commented Jun 14, 2016 •

edited

Loading

yarikoptic commented Sep 1, 2016 •

edited

Loading