Skip to content

gh-98188: Fix EmailMessage.get_payload to decode data #127547

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Jan 6, 2025

Conversation

RanKKI
Copy link
Contributor

@RanKKI RanKKI commented Dec 3, 2024

Fix email.message.EmailMessage.get_payload failing to decode data when there is trailing whitespace and/or extra text following the <mechanism> of Content-Transfer-Encoding

>>> msg = email.message_from_string(textwrap.dedent("""\
... Content-Transfer-Encoding: base64 some text
... 
... SGVsbG8uIFRlc3Rpbmc=
... """), policy=policy.default)
>>> msg.get_payload(decode=True)
b'SGVsbG8uIFRlc3Rpbmc=\n'
>>> header = msg.get("content-transfer-encoding")
>>> print(f'"{header.cte}"')
"base64"
>>> print(f'"{str(header)}"')
"base64 some text"
>>> header.defects
(InvalidHeaderDefect('Extra text after content transfer encoding'),)

The header.defects attribute does have an InvalidHeaderDefect error, but header.cte is still a valid mechanism. Therefore, it is better to decode the content even if there is an error.

The fix in ietf-tools/mailarchive#3550 overrides the __str__ method to return the self.cte, which resolves this issue. However, it might have some backward compatibility issues. So, it is better to ensure str(header) still returns the original value while using header.cte to retrieve the parsed CTE in the get_payload(decode=True) method.

The output of msg.get_payload(decode=True) is b'Hello. Testing' after this fix

Fix `email.message.EmailMessage.get_payload` failing
to decode data when there is a trailing whitespace
following the `<mechanism>`.

For backward compatibility, `str(cte_header)` still
returns the original value; `get_payload` uses `cte_header.cte`
to retrieve the parsed CTE.
@ZeroIntensity ZeroIntensity added needs backport to 3.12 only security fixes needs backport to 3.13 bugs and security fixes labels Dec 3, 2024
@ZeroIntensity ZeroIntensity requested a review from picnixz December 3, 2024 14:58
Copy link
Member

@bitdancer bitdancer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this. I'm wondering a little bit about the wisdom of using the cte if there is extra text, but since I made the decision to expose it as the 'cte' attribute even if the header is defective, I guess it does make sense to go ahead and use it for the decoding. Or, at least, it is more consistent to do so, and that would follow the principle of least surprise.

Copy link
Member

@bitdancer bitdancer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like I forgot to click request changes when I submitted the review.

@bedevere-app
Copy link

bedevere-app bot commented Dec 16, 2024

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

Copy link
Member

@picnixz picnixz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some additional comments. Depending on whether additional junk after a known mechanism should be eagerly rejected or not, the NEWS entry would need to be amended and a What's New entry should be added.

Me thinking loud:

Current behaviour

  • "base64 " is not recognized and the payload is not decoded properly
  • "base64 some text" is not recognized and the payload is not decoded properly

Proposed behaviour

  • "base64 " is recognized as "base64": ok for this
  • "base64 some text" is recognized as "base64" and has a defect due to "some text"

I suggest rejecting "base64 some text" altogether without recognizing the "base64" mechanism at all. Ignoring whitespaces is probably fine but I'd prefer notifying the user that junk text was added and not expected (without trying to decode the email). But if @bitdancer is fine with ignoring the additional junk, I'm also ok.

@RanKKI
Copy link
Contributor Author

RanKKI commented Dec 22, 2024

I have made the requested changes; please review again

@bedevere-app
Copy link

bedevere-app bot commented Dec 22, 2024

Thanks for making the requested changes!

@bitdancer: please review the changes made to this pull request.

@bedevere-app bedevere-app bot requested a review from bitdancer December 22, 2024 12:11
Copy link
Member

@bitdancer bitdancer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@bitdancer bitdancer merged commit a62ba52 into python:main Jan 6, 2025
40 of 42 checks passed
@miss-islington-app
Copy link

Thanks @RanKKI for the PR, and @bitdancer for merging it 🌮🎉.. I'm working now to backport this PR to: 3.12, 3.13.
🐍🍒⛏🤖

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Jan 6, 2025
…value has extra text (pythonGH-127547)

Up to this point message handling has been very strict with regards to content encoding values: mixed case was accepted, but trailing blanks or other text would cause decoding failure, even if the first token was a valid encoding.  By Postel's Rule we should go ahead and decode as long as we can recognize that first token.  We have not thought of any security or backward compatibility concerns with this fix.

This fix does introduce a new technique/pattern to the Message code: we look to see if the header has a 'cte' attribute, and if so we use that.  This effectively promotes the header API exposed by HeaderRegistry to an API that any header parser "should" support.  This seems like a reasonable thing to do.  It is not, however, a requirement, as the string value of the header is still used if there is no cte attribute.

The full fix (ignore any trailing blanks or blank-separated trailing text) applies only to the non-compat32 API.  compat32 is only fixed to the extent that it now ignores trailing spaces.  Note that the HeaderRegistry parsing still records a HeaderDefect if there is extra text.

(cherry picked from commit a62ba52)

Co-authored-by: RanKKI <[email protected]>
Co-authored-by: Bénédikt Tran <[email protected]>
miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Jan 6, 2025
…value has extra text (pythonGH-127547)

Up to this point message handling has been very strict with regards to content encoding values: mixed case was accepted, but trailing blanks or other text would cause decoding failure, even if the first token was a valid encoding.  By Postel's Rule we should go ahead and decode as long as we can recognize that first token.  We have not thought of any security or backward compatibility concerns with this fix.

This fix does introduce a new technique/pattern to the Message code: we look to see if the header has a 'cte' attribute, and if so we use that.  This effectively promotes the header API exposed by HeaderRegistry to an API that any header parser "should" support.  This seems like a reasonable thing to do.  It is not, however, a requirement, as the string value of the header is still used if there is no cte attribute.

The full fix (ignore any trailing blanks or blank-separated trailing text) applies only to the non-compat32 API.  compat32 is only fixed to the extent that it now ignores trailing spaces.  Note that the HeaderRegistry parsing still records a HeaderDefect if there is extra text.

(cherry picked from commit a62ba52)

Co-authored-by: RanKKI <[email protected]>
Co-authored-by: Bénédikt Tran <[email protected]>
@bedevere-app
Copy link

bedevere-app bot commented Jan 6, 2025

GH-128528 is a backport of this pull request to the 3.13 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.13 bugs and security fixes label Jan 6, 2025
@bedevere-app
Copy link

bedevere-app bot commented Jan 6, 2025

GH-128529 is a backport of this pull request to the 3.12 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.12 only security fixes label Jan 6, 2025
srinivasreddy pushed a commit to srinivasreddy/cpython that referenced this pull request Jan 6, 2025
…value has extra text (python#127547)

Up to this point message handling has been very strict with regards to content encoding values: mixed case was accepted, but trailing blanks or other text would cause decoding failure, even if the first token was a valid encoding.  By Postel's Rule we should go ahead and decode as long as we can recognize that first token.  We have not thought of any security or backward compatibility concerns with this fix.

This fix does introduce a new technique/pattern to the Message code: we look to see if the header has a 'cte' attribute, and if so we use that.  This effectively promotes the header API exposed by HeaderRegistry to an API that any header parser "should" support.  This seems like a reasonable thing to do.  It is not, however, a requirement, as the string value of the header is still used if there is no cte attribute.

The full fix (ignore any trailing blanks or blank-separated trailing text) applies only to the non-compat32 API.  compat32 is only fixed to the extent that it now ignores trailing spaces.  Note that the HeaderRegistry parsing still records a HeaderDefect if there is extra text.

Co-authored-by: Bénédikt Tran <[email protected]>
bitdancer pushed a commit that referenced this pull request Jan 7, 2025
… value has extra text (GH-127547) (#128528)

gh-98188: Fix EmailMessage.get_payload to decode data when CTE value has extra text (GH-127547)

Up to this point message handling has been very strict with regards to content encoding values: mixed case was accepted, but trailing blanks or other text would cause decoding failure, even if the first token was a valid encoding.  By Postel's Rule we should go ahead and decode as long as we can recognize that first token.  We have not thought of any security or backward compatibility concerns with this fix.

This fix does introduce a new technique/pattern to the Message code: we look to see if the header has a 'cte' attribute, and if so we use that.  This effectively promotes the header API exposed by HeaderRegistry to an API that any header parser "should" support.  This seems like a reasonable thing to do.  It is not, however, a requirement, as the string value of the header is still used if there is no cte attribute.

The full fix (ignore any trailing blanks or blank-separated trailing text) applies only to the non-compat32 API.  compat32 is only fixed to the extent that it now ignores trailing spaces.  Note that the HeaderRegistry parsing still records a HeaderDefect if there is extra text.

(cherry picked from commit a62ba52)

Co-authored-by: RanKKI <[email protected]>
Co-authored-by: Bénédikt Tran <[email protected]>
bitdancer pushed a commit that referenced this pull request Jan 7, 2025
… value has extra text (GH-127547) (#128529)

gh-98188: Fix EmailMessage.get_payload to decode data when CTE value has extra text (GH-127547)

Up to this point message handling has been very strict with regards to content encoding values: mixed case was accepted, but trailing blanks or other text would cause decoding failure, even if the first token was a valid encoding.  By Postel's Rule we should go ahead and decode as long as we can recognize that first token.  We have not thought of any security or backward compatibility concerns with this fix.

This fix does introduce a new technique/pattern to the Message code: we look to see if the header has a 'cte' attribute, and if so we use that.  This effectively promotes the header API exposed by HeaderRegistry to an API that any header parser "should" support.  This seems like a reasonable thing to do.  It is not, however, a requirement, as the string value of the header is still used if there is no cte attribute.

The full fix (ignore any trailing blanks or blank-separated trailing text) applies only to the non-compat32 API.  compat32 is only fixed to the extent that it now ignores trailing spaces.  Note that the HeaderRegistry parsing still records a HeaderDefect if there is extra text.

(cherry picked from commit a62ba52)

Co-authored-by: RanKKI <[email protected]>
Co-authored-by: Bénédikt Tran <[email protected]>
@bedevere-bot
Copy link

⚠️⚠️⚠️ Buildbot failure ⚠️⚠️⚠️

Hi! The buildbot ARM64 macOS 3.13 has failed when building commit ad3bbb6.

What do you need to do:

  1. Don't panic.
  2. Check the buildbot page in the devguide if you don't know what the buildbots are or how they work.
  3. Go to the page of the buildbot that failed (https://buildbot.python.org/#/builders/1404/builds/592) and take a look at the build logs.
  4. Check if the failure is related to this commit (ad3bbb6) or if it is a false positive.
  5. If the failure is related to this commit, please, reflect that on the issue and make a new Pull Request with a fix.

You can take a look at the buildbot page here:

https://buildbot.python.org/#/builders/1404/builds/592

Failed tests:

  • test_ssl

Failed subtests:

  • test_preauth_data_to_tls_server - test.test_ssl.TestPreHandshakeClose.test_preauth_data_to_tls_server

Summary of the results of the build (if available):

==

Click to see traceback logs
Traceback (most recent call last):
  File "/Users/buildbot/buildarea/3.13.pablogsal-macos-m1.macos-with-brew/build/Lib/test/test_ssl.py", line 5121, in test_preauth_data_to_tls_server
    self.assertIn("before TLS handshake with data", wrap_error.args[1])
    ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: 'before TLS handshake with data' not found in '[SSL] record layer failure (_ssl.c:1028)'

srinivasreddy pushed a commit to srinivasreddy/cpython that referenced this pull request Jan 8, 2025
…value has extra text (python#127547)

Up to this point message handling has been very strict with regards to content encoding values: mixed case was accepted, but trailing blanks or other text would cause decoding failure, even if the first token was a valid encoding.  By Postel's Rule we should go ahead and decode as long as we can recognize that first token.  We have not thought of any security or backward compatibility concerns with this fix.

This fix does introduce a new technique/pattern to the Message code: we look to see if the header has a 'cte' attribute, and if so we use that.  This effectively promotes the header API exposed by HeaderRegistry to an API that any header parser "should" support.  This seems like a reasonable thing to do.  It is not, however, a requirement, as the string value of the header is still used if there is no cte attribute.

The full fix (ignore any trailing blanks or blank-separated trailing text) applies only to the non-compat32 API.  compat32 is only fixed to the extent that it now ignores trailing spaces.  Note that the HeaderRegistry parsing still records a HeaderDefect if there is extra text.

Co-authored-by: Bénédikt Tran <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants