gh-98188: Fix EmailMessage.get_payload to decode data #127547

RanKKI · 2024-12-03T04:41:49Z

Fix email.message.EmailMessage.get_payload failing to decode data when trailing whitespace and/or extra text follows the <mechanism> in the Content-Transfer-Encoding header. Behaviour before this fix:

>>> import email, textwrap
>>> from email import policy
>>> msg = email.message_from_string(textwrap.dedent("""\
... Content-Transfer-Encoding: base64 some text
... 
... SGVsbG8uIFRlc3Rpbmc=
... """), policy=policy.default)
>>> msg.get_payload(decode=True)
b'SGVsbG8uIFRlc3Rpbmc=\n'
>>> header = msg.get("content-transfer-encoding")
>>> print(f'"{header.cte}"')
"base64"
>>> print(f'"{str(header)}"')
"base64 some text"
>>> header.defects
(InvalidHeaderDefect('Extra text after content transfer encoding'),)

header.defects does record an InvalidHeaderDefect, but header.cte still holds a valid mechanism. It is therefore better to decode the content despite the defect.

The fix in ietf-tools/mailarchive#3550 overrides the __str__ method to return self.cte, which resolves this issue but might cause backward compatibility problems. It is better to keep str(header) returning the original value and have get_payload(decode=True) use header.cte to retrieve the parsed CTE.

With this fix, msg.get_payload(decode=True) returns b'Hello. Testing'.
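On interpreters that do not yet include this change, the same idea can be approximated from calling code. The following is only a minimal sketch of the approach, not the patch itself: `decode_using_parsed_cte` is a hypothetical helper name, and it assumes a non-multipart message parsed with `policy.default`, so that the header object exposes `cte`.

```python
import base64
import quopri
import textwrap
from email import message_from_string, policy


def decode_using_parsed_cte(msg):
    """Hypothetical helper: decode a non-multipart payload via header.cte."""
    header = msg.get("content-transfer-encoding")
    # header.cte holds only the parsed mechanism (e.g. "base64"), even when
    # str(header) still carries trailing text such as "base64 some text".
    cte = header.cte if header is not None else "7bit"
    raw = msg.get_payload().encode("ascii", "surrogateescape")
    if cte == "base64":
        return base64.b64decode(raw)
    if cte == "quoted-printable":
        return quopri.decodestring(raw)
    return raw  # 7bit, 8bit, binary: no transfer decoding needed


msg = message_from_string(textwrap.dedent("""\
    Content-Transfer-Encoding: base64 some text

    SGVsbG8uIFRlc3Rpbmc=
    """), policy=policy.default)
decoded = decode_using_parsed_cte(msg)
print(decoded)
```

Because the helper never consults `str(header)`, the original header text stays available to callers unchanged.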

Issue: email: get_payload(decode=True) doesn't handle Content-Transfer-Encoding with trailing white space #98188

Fix `email.message.EmailMessage.get_payload` failing to decode data when there is trailing whitespace following the `<mechanism>`. For backward compatibility, `str(cte_header)` still returns the original value; `get_payload` uses `cte_header.cte` to retrieve the parsed CTE.

bitdancer

Thanks for doing this. I'm wondering a little bit about the wisdom of using the cte if there is extra text, but since I made the decision to expose it as the 'cte' attribute even if the header is defective, I guess it does make sense to go ahead and use it for the decoding. Or, at least, it is more consistent to do so, and that would follow the principle of least surprise.

Lib/email/message.py

Lib/test/test_email/test_message.py

bitdancer

Looks like I forgot to click request changes when I submitted the review.

bedevere-app · 2024-12-16T16:27:32Z

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

picnixz

Some additional comments. Depending on whether additional junk after a known mechanism should be eagerly rejected or not, the NEWS entry would need to be amended and a What's New entry should be added.

Thinking out loud:

Current behaviour

"base64 " is not recognized and the payload is not decoded properly
"base64 some text" is not recognized and the payload is not decoded properly

Proposed behaviour

"base64 " is recognized as "base64": ok for this
"base64 some text" is recognized as "base64" and has a defect due to "some text"

I suggest rejecting "base64 some text" altogether, without recognizing the "base64" mechanism at all. Ignoring whitespace is probably fine, but I'd prefer to notify the user that unexpected junk text was present (without trying to decode the email). But if @bitdancer is fine with ignoring the additional junk, I'm also OK with it.
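A caller who prefers the stricter behaviour discussed here can already enforce it by checking the header's `defects` before decoding. The sketch below is illustrative only: `cte_or_reject` and `make_message` are hypothetical names, and raising `ValueError` is an arbitrary choice for the example rather than anything the email package does.

```python
from email import message_from_string, policy


def cte_or_reject(msg):
    """Hypothetical helper: return the parsed CTE, refusing defective headers."""
    header = msg.get("content-transfer-encoding")
    if header is None:
        return "7bit"  # RFC 2045 default when the header is absent
    if header.defects:
        # For example: InvalidHeaderDefect('Extra text after content
        # transfer encoding') when junk follows the mechanism.
        raise ValueError(f"defective Content-Transfer-Encoding: {str(header)!r}")
    return header.cte


def make_message(cte_value):
    """Build a tiny message with the given Content-Transfer-Encoding value."""
    return message_from_string(
        f"Content-Transfer-Encoding: {cte_value}\n\nSGVsbG8uIFRlc3Rpbmc=\n",
        policy=policy.default,
    )


clean = cte_or_reject(make_message("base64"))
try:
    cte_or_reject(make_message("base64 some text"))
    rejected = False
except ValueError:
    rejected = True
```

This keeps the permissive parsing in the library while leaving the accept-or-reject decision to the application.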

Misc/NEWS.d/next/Library/2024-12-03-14-45-16.gh-issue-98188.GX9i2b.rst

Lib/test/test_email/test_message.py

RanKKI · 2024-12-22T12:11:28Z

I have made the requested changes; please review again

bedevere-app · 2024-12-22T12:11:33Z

Thanks for making the requested changes!

@bitdancer: please review the changes made to this pull request.

bitdancer

LGTM

miss-islington-app · 2025-01-06T01:32:19Z

Thanks @RanKKI for the PR, and @bitdancer for merging it 🌮🎉.. I'm working now to backport this PR to: 3.12, 3.13.
🐍🍒⛏🤖

…value has extra text (pythonGH-127547)

Up to this point message handling has been very strict with regards to content encoding values: mixed case was accepted, but trailing blanks or other text would cause decoding failure, even if the first token was a valid encoding. By Postel's Rule we should go ahead and decode as long as we can recognize that first token. We have not thought of any security or backward compatibility concerns with this fix.

This fix does introduce a new technique/pattern to the Message code: we look to see if the header has a 'cte' attribute, and if so we use that. This effectively promotes the header API exposed by HeaderRegistry to an API that any header parser "should" support. This seems like a reasonable thing to do. It is not, however, a requirement, as the string value of the header is still used if there is no cte attribute.

The full fix (ignore any trailing blanks or blank-separated trailing text) applies only to the non-compat32 API. compat32 is only fixed to the extent that it now ignores trailing spaces.

Note that the HeaderRegistry parsing still records a HeaderDefect if there is extra text.

(cherry picked from commit a62ba52)

Co-authored-by: RanKKI <[email protected]>
Co-authored-by: Bénédikt Tran <[email protected]>
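The "use the cte attribute if present, otherwise the string value" pattern from the commit message can be sketched as follows. This is an approximation written for illustration, not the code in Lib/email/message.py: `effective_cte` is a hypothetical name, and the exact normalisation (strip, then lowercase) is an assumption based on the description above.

```python
from email import message_from_string, policy


def effective_cte(msg):
    """Hypothetical helper: pick the CTE the way the commit message describes."""
    header = msg.get("content-transfer-encoding")
    if header is None:
        return ""
    # HeaderRegistry headers (policy.default) expose a parsed `cte`;
    # compat32 returns a plain str, so fall back to its stripped value.
    return getattr(header, "cte", str(header).strip()).lower()


SOURCE = "Content-Transfer-Encoding: {}\n\nSGVsbG8uIFRlc3Rpbmc=\n"

# Non-compat32 API: trailing text is ignored once the first token parses.
modern = message_from_string(SOURCE.format("base64 some text"), policy=policy.default)

# compat32 (the default policy of message_from_string): only surrounding
# whitespace is dropped, so trailing text still defeats recognition.
legacy_spaces = message_from_string(SOURCE.format("base64  "))
legacy_text = message_from_string(SOURCE.format("base64 some text"))

print(effective_cte(modern), effective_cte(legacy_spaces), effective_cte(legacy_text))
```

The `getattr` fallback is what keeps header parsers without a `cte` attribute working, matching the note that supporting the attribute is not a requirement.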

bedevere-app · 2025-01-06T01:32:31Z

GH-128528 is a backport of this pull request to the 3.13 branch.

bedevere-app · 2025-01-06T01:32:36Z

GH-128529 is a backport of this pull request to the 3.12 branch.

… value has extra text (GH-127547) (#128528)

gh-98188: Fix EmailMessage.get_payload to decode data when CTE value has extra text (GH-127547)

(cherry picked from commit a62ba52)

… value has extra text (GH-127547) (#128529)

gh-98188: Fix EmailMessage.get_payload to decode data when CTE value has extra text (GH-127547)

(cherry picked from commit a62ba52)

bedevere-bot · 2025-01-07T17:54:35Z

⚠️⚠️⚠️ Buildbot failure ⚠️⚠️⚠️

Hi! The buildbot ARM64 macOS 3.13 has failed when building commit ad3bbb6.

What do you need to do:

Don't panic.
Check the buildbot page in the devguide if you don't know what the buildbots are or how they work.
Go to the page of the buildbot that failed (https://buildbot.python.org/#/builders/1404/builds/592) and take a look at the build logs.
Check if the failure is related to this commit (ad3bbb6) or if it is a false positive.
If the failure is related to this commit, please, reflect that on the issue and make a new Pull Request with a fix.

You can take a look at the buildbot page here:

https://buildbot.python.org/#/builders/1404/builds/592

Failed tests:

test_ssl

Failed subtests:

test_preauth_data_to_tls_server - test.test_ssl.TestPreHandshakeClose.test_preauth_data_to_tls_server

Traceback logs:

Traceback (most recent call last):
  File "/Users/buildbot/buildarea/3.13.pablogsal-macos-m1.macos-with-brew/build/Lib/test/test_ssl.py", line 5121, in test_preauth_data_to_tls_server
    self.assertIn("before TLS handshake with data", wrap_error.args[1])
    ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: 'before TLS handshake with data' not found in '[SSL] record layer failure (_ssl.c:1028)'

RanKKI requested a review from a team as a code owner December 3, 2024 04:41

bedevere-app bot mentioned this pull request Dec 3, 2024

email: get_payload(decode=True) doesn't handle Content-Transfer-Encoding with trailing white space #98188

Closed

bedevere-app bot added the awaiting review label Dec 3, 2024

RanKKI added 2 commits December 3, 2024 15:42

docs: update NEWS.d

e655493

docs: update NEWS.d to fix linked method

6bf441b

ZeroIntensity added the "needs backport to 3.12" (only security fixes) and "needs backport to 3.13" (bugs and security fixes) labels Dec 3, 2024

ZeroIntensity requested a review from picnixz December 3, 2024 14:58

bitdancer reviewed Dec 16, 2024

bitdancer requested changes Dec 16, 2024

bedevere-app bot removed the awaiting review label Dec 16, 2024

bedevere-app bot added the awaiting changes label Dec 16, 2024

picnixz reviewed Dec 16, 2024

Misc/NEWS.d/next/Library/2024-12-03-14-45-16.gh-issue-98188.GX9i2b.rst (outdated, resolved)

Lib/test/test_email/test_message.py (outdated, resolved)

RanKKI and others added 3 commits December 19, 2024 18:04

Update Misc/NEWS.d/next/Library/2024-12-03-14-45-16.gh-issue-98188.GX…

53755c1

…9i2b.rst Co-authored-by: Bénédikt Tran <[email protected]>

refactor: move test cases

a81ad68

chore: remove unused import

6631ef6

bedevere-app bot added awaiting change review and removed awaiting changes labels Dec 22, 2024

bedevere-app bot requested a review from bitdancer December 22, 2024 12:11

Merge branch 'main' into fix-issue-98188

f85343e

bitdancer approved these changes Jan 6, 2025

bedevere-app bot added awaiting merge and removed awaiting change review labels Jan 6, 2025

bitdancer merged commit a62ba52 into python:main Jan 6, 2025
40 of 42 checks passed

bedevere-app bot removed the awaiting merge label Jan 6, 2025

bedevere-app bot removed the "needs backport to 3.13" (bugs and security fixes) label Jan 6, 2025

bedevere-app bot removed the "needs backport to 3.12" (only security fixes) label Jan 6, 2025
