Skip to content

Don't choke on (legitimately) invalidly encoded Unicode paths #467

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 14, 2016

Conversation

nvie
Copy link
Contributor

@nvie nvie commented Jun 7, 2016

We've come across path names that contain bytes that are invalid in UTF-8 encoded strings, even though they're very rare. My assumption here is these commits have been created by an old (buggy?) version of Git, and now live in the tree objects with this data. Since we return only unicode strings for the a_path and b_path properties, we're not able to decode this string and thus choke when asking for the diff.

This PR fixes that by using "replace" semantics when decoding. This will effectively replace the illegal bytes \200 (or \x80) by \ufffd (= �).

Follow-up discussion

However, this also means that if you would want to git-blame this file, there's no good way of referencing this path, since it's inherently a bytes path. Normally, when we pass unicode paths to git-blame via GitPython's blame API, the paths get converted to UTF-8 right before issuing the external command. But there's no way of getting the original bytes back after the "replace" operation happened.

Example:

  • The input path is b'illegal-\x80.txt' (containing illegal byte \x80)
  • When decoded to UTF-8 with characters replaced, we get the unicode string u'illegal-\ufffd.txt' (= "illegal-�.txt")
  • When encoding that in UTF-8, we find b'illegal-\xef\xbf\xbd.txt'

When we next pass illegal-\xef\xbf\xbd.txt to git-blame, it will not be able to find this path. Perhaps it would be a good idea to not only return the decoded path strings, but also provide access to the raw bytes found, i.e. by exposing a_rawpath and b_rawpath, which would always be bytes? That way, you could still have the friendly "unicode paths" for most use cases, but use bytes if you need to speak the language of Git more accurately.

@Byron Byron added this to the v2.0.6 - Bugfixes milestone Jun 14, 2016
@Byron
Copy link
Member

Byron commented Jun 14, 2016

@nvie Thanks for bringing this up ! I believe it's one of the worst decisions made in GitPython to attempt to decode everything using UTF-8 (or sometimes using whatever encoding Git hints us at). Especially for paths, doing this is plain wrong as the filesystem may use arbitrary encodings. This is the reason Rust uses OsStrings to represent these, as opposed to Strings which need to be valid UTF-8.
Unfortunately, back in the days Rust wasn't on my Radar, and also I didn't know what I know now.

That said, fixing the API in that regard might be a somewhat major undertaking certainly worth making. Without a major version upgrade, the proposed adjustment might be good enough.

The question is, what you as an API user think about these options - I'd be happy to hear your thoughts.

@Byron Byron merged commit d22c40b into master Jun 14, 2016
@Byron Byron deleted the fix-dont-choke-on-invalid-unicode-paths branch June 14, 2016 05:38
@nvie
Copy link
Contributor Author

nvie commented Jun 14, 2016

I don't think it's too bad in practice. In most parts where paths hit the UI, you'd want to use a decode(errors='replace')'ed version of the raw bytes anyway. If there are invalidly encoded bytes, it's really just fine to see the � chars in there. So I don't think it's too bad that the result of a Diff returns unicode strings for the paths. In practice, they work in >99.9% of all cases.

However, sometimes there just is a need to use the raw bytes, especially when the result of command A (say, a diff), defines the input of command B (say, a blame). Only in these cases, you need to have the exact bytes. I would still show the decode(errors='replace')'ed version of the path in the UI, but in order to ask Git for a blame on that file, I need access to the raw version.

I'll have a stab at adding those extra raw paths as properties.

nvie added a commit that referenced this pull request Jun 14, 2016
Previously, the following fields on Diff instances were assumed to be
passed in as unicode strings:

  - `a_path`
  - `b_path`
  - `rename_from`
  - `rename_to`

However, since Git natively records paths as bytes, these may
potentially not have a valid unicode representation.

This patch changes the Diff instance to instead take the following
equivalent fields that should be raw bytes instead:

  - `a_rawpath`
  - `b_rawpath`
  - `raw_rename_from`
  - `raw_rename_to`

NOTE ON BACKWARD COMPATIBILITY:
The original `a_path`, `b_path`, etc. fields are still available as
properties (rather than slots).  These properties now dynamically decode
the raw bytes into a unicode string (performing the potentially
destructive operation of replacing invalid unicode chars by "�"'s).
This means that all code using Diffs should remain backward compatible.
The only exception is when people would manually construct Diff
instances by calling the constructor directly, in which case they should
now pass in bytes rather than unicode strings.

See also the discussion on
#467
@yarikoptic
Copy link
Contributor

yarikoptic commented Sep 1, 2016

I guess related: #505 -- I have tried to run tests using TMPDIR with unicode in it... actually may be it is only marginally related since paths in that case are unicode outside of the repository, not within (as it is discussed here), correct?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

3 participants