gh-98188: Fix EmailMessage.get_payload to decode data #127547
Fix `email.message.EmailMessage.get_payload` failing to decode data when there is trailing whitespace following the `<mechanism>`. For backward compatibility, `str(cte_header)` still returns the original value; `get_payload` uses `cte_header.cte` to retrieve the parsed CTE.
Thanks for doing this. I'm wondering a little bit about the wisdom of using the cte if there is extra text, but since I made the decision to expose it as the 'cte' attribute even if the header is defective, I guess it does make sense to go ahead and use it for the decoding. Or, at least, it is more consistent to do so, and that would follow the principle of least surprise.
Looks like I forgot to click request changes when I submitted the review.
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
Some additional comments. Depending on whether additional junk after a known mechanism should be eagerly rejected or not, the NEWS entry would need to be amended and a What's New entry should be added.
Thinking out loud:
Current behaviour
- "base64 " is not recognized and the payload is not decoded properly
- "base64 some text" is not recognized and the payload is not decoded properly
Proposed behaviour
- "base64 " is recognized as "base64": ok for this
- "base64 some text" is recognized as "base64" and has a defect due to "some text"
I suggest rejecting "base64 some text" altogether, without recognizing the "base64" mechanism at all. Ignoring whitespace is probably fine, but I'd prefer notifying the user that unexpected junk text was present (without trying to decode the email). But if @bitdancer is fine with ignoring the additional junk, I'm also ok.
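The stricter behaviour suggested here can also be applied on the caller side, whichever way the library decides. A minimal sketch, assuming the message is parsed with `policy.default` so that header objects expose `.defects`; `decode_strict` and `build` are hypothetical helpers for illustration, not part of the `email` package:

```python
import base64
from email import message_from_string, policy


def decode_strict(msg):
    """Decode the payload, but refuse if the CTE header carries defects.

    Hypothetical helper for illustration; not part of the email package.
    """
    header = msg["Content-Transfer-Encoding"]
    if header is not None and header.defects:
        raise ValueError(
            f"defective Content-Transfer-Encoding: {str(header)!r}"
        )
    return msg.get_payload(decode=True)


def build(cte_value):
    """Build a small text/plain message with the given CTE header value."""
    body = base64.b64encode(b"Hello. Testing").decode("ascii")
    raw = (
        "Content-Type: text/plain\n"
        f"Content-Transfer-Encoding: {cte_value}\n"
        "\n"
        f"{body}\n"
    )
    return message_from_string(raw, policy=policy.default)


# A clean header decodes as usual.
clean = decode_strict(build("base64"))

# Extra text after the mechanism is recorded as a defect, so we refuse.
try:
    decode_strict(build("base64 some text"))
    rejected = False
except ValueError:
    rejected = True
```

This keeps the library lenient while letting an application that cares about junk after the mechanism opt into rejection.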
Misc/NEWS.d/next/Library/2024-12-03-14-45-16.gh-issue-98188.GX9i2b.rst
…9i2b.rst Co-authored-by: Bénédikt Tran <[email protected]>
I have made the requested changes; please review again
Thanks for making the requested changes! @bitdancer: please review the changes made to this pull request.
LGTM
Thanks @RanKKI for the PR, and @bitdancer for merging it 🌮🎉.. I'm working now to backport this PR to: 3.12, 3.13.
…value has extra text (pythonGH-127547)

Up to this point message handling has been very strict with regards to content encoding values: mixed case was accepted, but trailing blanks or other text would cause decoding failure, even if the first token was a valid encoding. By Postel's Rule we should go ahead and decode as long as we can recognize that first token.

We have not thought of any security or backward compatibility concerns with this fix.

This fix does introduce a new technique/pattern to the Message code: we look to see if the header has a 'cte' attribute, and if so we use that. This effectively promotes the header API exposed by HeaderRegistry to an API that any header parser "should" support. This seems like a reasonable thing to do. It is not, however, a requirement, as the string value of the header is still used if there is no cte attribute.

The full fix (ignore any trailing blanks or blank-separated trailing text) applies only to the non-compat32 API. compat32 is only fixed to the extent that it now ignores trailing spaces.

Note that the HeaderRegistry parsing still records a HeaderDefect if there is extra text.

(cherry picked from commit a62ba52)

Co-authored-by: RanKKI <[email protected]>
Co-authored-by: Bénédikt Tran <[email protected]>
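The "use the `cte` attribute if present, otherwise fall back to the string value" pattern described in this commit message can be sketched outside the library. This is an illustration under stated assumptions, not the actual code in `Lib/email/message.py`; `effective_cte` is a hypothetical name:

```python
import base64
from email import message_from_string, policy


def effective_cte(msg):
    """Return the transfer encoding a decoder would act on.

    Hypothetical helper: prefer the parsed `cte` attribute that
    HeaderRegistry headers provide; otherwise fall back to the plain
    string value with surrounding blanks stripped.
    """
    header = msg.get("Content-Transfer-Encoding", "")
    cte = getattr(header, "cte", None)
    if cte is None:
        cte = str(header).strip().lower()
    return cte


BODY = base64.b64encode(b"Hello. Testing").decode("ascii")


def parse(cte_value, pol):
    raw = (
        "Content-Type: text/plain\n"
        f"Content-Transfer-Encoding: {cte_value}\n"
        "\n"
        f"{BODY}\n"
    )
    return message_from_string(raw, policy=pol)


# Non-compat32 API: the header object has a parsed `cte`, so trailing
# text after the mechanism does not hide it.
modern = parse("base64 some text", policy.default)

# compat32: the header is a plain string, so only trailing blanks are
# tolerated by the fallback branch.
legacy = parse("base64  ", policy.compat32)

for msg in (modern, legacy):
    if effective_cte(msg) == "base64":
        print(base64.b64decode(msg.get_payload()))
```

The fallback branch mirrors the split the commit message describes: tolerance for blank-separated trailing text depends on a header parser that exposes `cte`, while a string-only header is handled just by stripping blanks.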
GH-128528 is a backport of this pull request to the 3.13 branch.
GH-128529 is a backport of this pull request to the 3.12 branch.
…value has extra text (python#127547) Up to this point message handling has been very strict with regards to content encoding values: mixed case was accepted, but trailing blanks or other text would cause decoding failure, even if the first token was a valid encoding. By Postel's Rule we should go ahead and decode as long as we can recognize that first token. We have not thought of any security or backward compatibility concerns with this fix. This fix does introduce a new technique/pattern to the Message code: we look to see if the header has a 'cte' attribute, and if so we use that. This effectively promotes the header API exposed by HeaderRegistry to an API that any header parser "should" support. This seems like a reasonable thing to do. It is not, however, a requirement, as the string value of the header is still used if there is no cte attribute. The full fix (ignore any trailing blanks or blank-separated trailing text) applies only to the non-compat32 API. compat32 is only fixed to the extent that it now ignores trailing spaces. Note that the HeaderRegistry parsing still records a HeaderDefect if there is extra text. Co-authored-by: Bénédikt Tran <[email protected]>
… value has extra text (GH-127547) (#128528) gh-98188: Fix EmailMessage.get_payload to decode data when CTE value has extra text (GH-127547) Up to this point message handling has been very strict with regards to content encoding values: mixed case was accepted, but trailing blanks or other text would cause decoding failure, even if the first token was a valid encoding. By Postel's Rule we should go ahead and decode as long as we can recognize that first token. We have not thought of any security or backward compatibility concerns with this fix. This fix does introduce a new technique/pattern to the Message code: we look to see if the header has a 'cte' attribute, and if so we use that. This effectively promotes the header API exposed by HeaderRegistry to an API that any header parser "should" support. This seems like a reasonable thing to do. It is not, however, a requirement, as the string value of the header is still used if there is no cte attribute. The full fix (ignore any trailing blanks or blank-separated trailing text) applies only to the non-compat32 API. compat32 is only fixed to the extent that it now ignores trailing spaces. Note that the HeaderRegistry parsing still records a HeaderDefect if there is extra text. (cherry picked from commit a62ba52) Co-authored-by: RanKKI <[email protected]> Co-authored-by: Bénédikt Tran <[email protected]>
… value has extra text (GH-127547) (#128529) gh-98188: Fix EmailMessage.get_payload to decode data when CTE value has extra text (GH-127547) Up to this point message handling has been very strict with regards to content encoding values: mixed case was accepted, but trailing blanks or other text would cause decoding failure, even if the first token was a valid encoding. By Postel's Rule we should go ahead and decode as long as we can recognize that first token. We have not thought of any security or backward compatibility concerns with this fix. This fix does introduce a new technique/pattern to the Message code: we look to see if the header has a 'cte' attribute, and if so we use that. This effectively promotes the header API exposed by HeaderRegistry to an API that any header parser "should" support. This seems like a reasonable thing to do. It is not, however, a requirement, as the string value of the header is still used if there is no cte attribute. The full fix (ignore any trailing blanks or blank-separated trailing text) applies only to the non-compat32 API. compat32 is only fixed to the extent that it now ignores trailing spaces. Note that the HeaderRegistry parsing still records a HeaderDefect if there is extra text. (cherry picked from commit a62ba52) Co-authored-by: RanKKI <[email protected]> Co-authored-by: Bénédikt Tran <[email protected]>
Fix `email.message.EmailMessage.get_payload` failing to decode data when there is trailing whitespace and/or extra text following the `<mechanism>` of `Content-Transfer-Encoding`.

The `header.defects` attribute does have an `InvalidHeaderDefect` error, but `header.cte` is still a valid mechanism. Therefore, it is better to decode the content even if there is an error.

The fix in ietf-tools/mailarchive#3550 overrides the `__str__` method to return `self.cte`, which resolves this issue. However, it might have some backward compatibility issues. So, it is better to ensure `str(header)` still returns the original value while using `header.cte` to retrieve the parsed CTE in the `get_payload(decode=True)` method.

The output of `msg.get_payload(decode=True)` is `b'Hello. Testing'` after this fix.
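For reference, a small reproduction of the situation described above. The message text is an assumed example (the base64 body encodes `Hello. Testing`); what the final `get_payload(decode=True)` call returns depends on whether the running Python includes this fix, so no output is claimed for it here:

```python
from email import message_from_string, policy
from email.errors import InvalidHeaderDefect

# Assumed sample message: a valid mechanism followed by extra text.
raw = (
    "Content-Type: text/plain\n"
    "Content-Transfer-Encoding: base64 some text\n"
    "\n"
    "SGVsbG8uIFRlc3Rpbmc=\n"
)
msg = message_from_string(raw, policy=policy.default)
header = msg["Content-Transfer-Encoding"]

# The string form keeps the original header value...
print(repr(str(header)))
# ...while the parsed mechanism is exposed separately.
print(header.cte)
# The extra text is still recorded as a defect on the header.
print([type(d).__name__ for d in header.defects])
# Decoded bytes on a Python with the fix; otherwise the payload may
# come back undecoded.
print(msg.get_payload(decode=True))
```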