-
-
Notifications
You must be signed in to change notification settings - Fork 31.9k
gh-85287: Change codecs to raise precise UnicodeEncodeError and UnicodeDecodeError #113674
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This comment was marked as off-topic.
This comment was marked as off-topic.
This comment was marked as off-topic.
This comment was marked as off-topic.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
Lib/encodings/idna.py
Outdated
@@ -156,7 +170,7 @@ def encode(self, input, errors='strict'): | |||
|
|||
if errors != 'strict': | |||
# IDNA is quite clear that implementations must be strict | |||
raise UnicodeError("unsupported error handling "+errors) | |||
raise UnicodeEncodeError("idna", input, 0, 1, f"unsupported error handling {errors}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
raise UnicodeEncodeError("idna", input, 0, 1, f"unsupported error handling {errors}") | |
raise UnicodeEncodeError("idna", input, 0, 0, f"unsupported error handling {errors}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UnicodeEncodeError implies that there is an error in the encoded string.
Other codecs usually raise LookupError for unknown error handlers, but here an exception is raised even for known error handlers.
Perhaps ValueError better suits here. UnicodeError is a subclass of ValueError, so we can keep it for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed the cases where there was not an issue with the input string back to UnicodeError
.
Lib/encodings/idna.py
Outdated
for label in labels[:-1]: | ||
if not (0 < len(label) < 64): | ||
raise UnicodeError("label empty or too long") | ||
raise UnicodeEncodeError("idna", input, index, index+len(label), "label empty or too long") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
raise UnicodeEncodeError("idna", input, index, index+len(label), "label empty or too long") | |
raise UnicodeEncodeError( | |
"idna", input, index, index+len(label), | |
"label empty or too long") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree we using absolute offset.
But it still reports weird range for empty label. I think that it is better to include an opening or closing dot.
Also, I think that it is better to use different messages for empty and too long labels.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Split out the empty vs too long cases. Note that UnicodeEncodeError
takes a python-style [start, end) range, but outputs a (start)-(end-1) inclusive range in the error message, so using something like 0, 0,
gives the weird-looking 0--1
error message whereas 0, 1,
gives position 0
.
Lib/encodings/idna.py
Outdated
if len(labels[-1]) >= 64: | ||
raise UnicodeError("label too long") | ||
raise UnicodeEncodeError("idna", input, index, len(input), "label too long") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
raise UnicodeEncodeError("idna", input, index, len(input), "label too long") | |
raise UnicodeEncodeError("idna", input, index, len(input), | |
"label too long") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed the linebreaks of some of the longer exceptions to look like this
Thanks for the reviews @methane and @serhiy-storchaka, comments addressed and ready for another look |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot simply translate offsets from a label to the full input, because the label is transformed several times during encoding: some characters are ignored, others are combined in the normalization or transformed in the punycode encoding. It is difficult to track offsets of original characters that cannot be encoded. So I suggest to simply report the range of the whole label that cannot be encoded. The main goal is changing the type of the exception to more specific, and it should be enough for now. Later we can narrow the range of error if it is useful and possible.
I decided to add some tests for the error offset in one more hope to get the offsets figured out properly, or at least document the current behavior. They seem to be working now at least for those cases, but I'm not an expert in IDNA or Punycode so I'm not sure if there are important edgecases I'm missing. My goal was to avoid the need to revisit this in the future to further tighten up the offsets, but @serhiy-storchaka let me know if you'd still prefer I switch it to use the whole range of the problematic label. Ready for another look @methane @serhiy-storchaka |
Modules/cjkcodecs/multibytecodec.c
Outdated
excobj = PyObject_CallFunction(PyExc_UnicodeEncodeError, | ||
"ssnns", | ||
ctx->codec->encoding, | ||
PyUnicode_AsUTF8(inbuf), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need copy inbuf
? If no, use just inbuf
with O
format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done 👍 I think I tried this originally, but I must have been doing something different because it seems to work fine now.
Modules/cjkcodecs/multibytecodec.c
Outdated
(const char *)buf->inbuf_top, bufsize, | ||
0, bufsize, "pending buffer overflow"); | ||
PyErr_SetObject(PyExc_UnicodeDecodeError, excobj); | ||
goto errorexit; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here.
Modules/cjkcodecs/multibytecodec.c
Outdated
ctx->codec->encoding, | ||
PyUnicode_AsUTF8(inbuf), | ||
inpos, datalen, | ||
"pending buffer overflow"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing if (excobj == NULL) goto errorexit;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, for all new and pre-existing code
Modules/cjkcodecs/multibytecodec.c
Outdated
0, pendingsize, | ||
"pending buffer too large"); | ||
PyErr_SetObject(PyExc_UnicodeEncodeError, excobj); | ||
goto errorexit; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here.
This comment was marked as off-topic.
This comment was marked as off-topic.
Thanks @methane , ready for another look |
Modules/cjkcodecs/multibytecodec.c
Outdated
0, sizeof(statebytes), | ||
"pending buffer too large"); | ||
if (excobj == NULL) goto errorexit; | ||
PyErr_SetObject(PyExc_UnicodeEncodeError, excobj); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't feel this is valid UnicodeEncodeError.
Should we really avoid UnicodeError here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If UnicodeEncodeError
is because the user passed an input that directly causes the encoding problem, rather than any problem that occurs during encoding, then UnicodeError
makes sense here. Changed this one back to UnicodeError
.
Comment addressed @methane , thanks, ready for review @serhiy-storchaka I think the new IDNA errors work now at least for the main cases, but let me know if you still want me to go back and change the IDNA encoding errors to report the whole label as the problematic range |
punycode looks ugly. Although it doesn't have |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. But I will wait for a few weeks to have chance other core developer review my changes.
… UnicodeDecodeError (python#113674) Co-authored-by: Inada Naoki <[email protected]>
… UnicodeDecodeError (python#113674) Co-authored-by: Inada Naoki <[email protected]>
… UnicodeDecodeError (python#113674) Co-authored-by: Inada Naoki <[email protected]>
Change codecs to raise
UnicodeEncodeError
andUnicodeDecodeError
, depending on what operation the API user is attempting, instead of justUnicodeError
. This involved providing more information to the exceptions, including the problematic string and start and end offsets within the string, which are not always relevant depending on the nature of the exception.Some of the codecs will perform multiple encodes and decodes during an operation as part of their implementation. In these cases, I've opted to throw the exception that matches the API function the error is occurring in.
A tricky thing to watch out for while reviewing is that
UnicodeEncodeError
takes astr
andUnicodeDecodeError
takes abytes
, which is supposed to be the original object being en/decoded. However, most of these codec functions modify their arguments, so by the time the exception block is reached, the original argument is no longer available and needs to be reconstructed. Alternatively, I could change the PR to save the argument, but I don't think it's worth the extra memory just for the exception case (may be more pythonic though).Another thing to watch for in the review is to make sure I'm not doing anything in the exception blocks that could trigger more en/decoding exceptions when I'm reconstructing the argument strings.