Skip to content

Make email/message.py read headers more robustly #123742

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
zsolt-sb opened this issue Sep 5, 2024 · 11 comments
Closed

Make email/message.py read headers more robustly #123742

zsolt-sb opened this issue Sep 5, 2024 · 11 comments
Labels
stdlib Python modules in the Lib dir topic-email type-bug An unexpected behavior, bug, or error

Comments

@zsolt-sb
Copy link

zsolt-sb commented Sep 5, 2024

Bug report

Bug description:

if variable 'attachment' is of type email.message.Message

attachment.get_payload(decode=True) 

Will yield the base64 content rather than the unecoded content if the sender provides the header 'base64 '

IE

attachment['Content-Transfer-Encoding']
'base64 '  <-notice the extra space

If we adjusted this code
https://github.com/python/cpython/blob/3.12/Lib/email/message.py#L290
to also strip(), this would resolve those edge cases where email clients erroneously add spaces.

This was an actual problem I have run into with a vendor.

CPython versions tested on:

3.9, 3.12

Operating systems tested on:

macOS

Linked PRs

@zsolt-sb zsolt-sb added the type-bug An unexpected behavior, bug, or error label Sep 5, 2024
@picnixz picnixz added stdlib Python modules in the Lib dir topic-email labels Sep 5, 2024
@picnixz
Copy link
Member

picnixz commented Sep 5, 2024

CPython versions tested on: 3.9

Does the behaviour change in 3.10, 3.11, 3.12, 3.13 or 3.14? (I don't know by heart every feature that has been added and whether this behaviour has been changed). In addition, I'm not sure whether this should be considered as a security issue or not (if this is not the case, 3.8 to 3.11 won't get any updates since they are security-only).

@zsolt-sb
Copy link
Author

zsolt-sb commented Sep 5, 2024

Ah yes, I was using 3.9, but the code example I am pointing to is 3.12. It has not changed since [3.9](https://github.com/python/cpython/blob/3.9/Lib/email/message.py#L258)

@rruuaanng
Copy link
Contributor

rruuaanng commented Sep 6, 2024

Bug report

Bug description:

if variable 'attachment' is of type email.message.Message

attachment.get_payload(decode=True) 

Will yield the base64 content rather than the unecoded content if the sender provides the header 'base64 '

IE

attachment['Content-Transfer-Encoding']
'base64 '  <-notice the extra space

If we adjusted this code https://github.com/python/cpython/blob/3.12/Lib/email/message.py#L290 to also strip(), this would resolve those edge cases where email clients erroneously add spaces.

This was an actual problem I have run into with a vendor.

CPython versions tested on:

3.9, 3.12

Operating systems tested on:

macOS

Bug report

Bug description:

if variable 'attachment' is of type email.message.Message

attachment.get_payload(decode=True) 

Will yield the base64 content rather than the unecoded content if the sender provides the header 'base64 '

IE

attachment['Content-Transfer-Encoding']
'base64 '  <-notice the extra space

If we adjusted this code https://github.com/python/cpython/blob/3.12/Lib/email/message.py#L290 to also strip(), this would resolve those edge cases where email clients erroneously add spaces.

This was an actual problem I have run into with a vendor.

CPython versions tested on:

3.9, 3.12

Operating systems tested on:

macOS

Could you show me the code that's causing the error?1

Footnotes

  1. Excessive verbose quoting hidden by @erlend-aasland

@zsolt-sb
Copy link
Author

zsolt-sb commented Sep 6, 2024

Could you show me the code that's causing the error?

Minimal python code

import email
import mimetypes
import os

with open('example.eml') as fp:
    msg = email.message_from_string(fp.read())
    counter = 1
    for part in msg.walk():
        if part.get_content_maintype() == 'multipart':
            continue
        filename = part.get_filename()
        if not filename:
            ext = mimetypes.guess_extension(part.get_content_type())
            filename = f'part-{counter:03d}{ext}'
        counter += 1
        with open(os.path.join('.', filename), 'wb') as fp:
            fp.write(part.get_payload(decode=True))

This code will write out the the attachment as well as the text. What is important here though is the actual data
Attached is a bare bones email file. Please remove the .txt once downloaded.

If you open with something like the thunderbird email client, you can successfully download report named TheReport.csv. If you run the above sample code you will see TheReport.csv will come back still base64 encoded. If you look into the file with vi, you can see the space after the base64 declaration.

example.eml.txt

@rruuaanng

This comment was marked as outdated.

@erlend-aasland
Copy link
Contributor

Instead of silently accepting and normalising ill-formatted fields, another option could be to raise an exception. What do you think, @picnixz?

@picnixz
Copy link
Member

picnixz commented Nov 12, 2024

Since the current behaviour is anyway buggy, whether we fix it or raise an exception or do something else shouldn't really break any existing code (existing code is already "broken" in some sense). So here are a few suggestions:

  • Let's have a look at whether the RFC for email and headers (if any) specify how to handle those cases. If the grammar does not allow for extra whitespaces or if one shouldn't sanitize the input, then I think we should not implicitly fix the input.

  • Note that we already do some normalization by lowercasing the name so it's not entirely inconsistent nor incoherent to also strip leading and trailing whitespaces.

  • More generally, sanitizing ill-formed fields may probably be what users expect. However, the boundary between what we clean up and what we don't should be well drawn and probably documented (just in case and because it could help debugging if someone looks at their input and wonder why this or that happened).


So, I have the following proposal: we either raise an exception saying that base64 is not recognized or we emit a warning saying that it's being ignored.

Now, I'm always pondering whether warnings are even used in practice. I feel that there are two categories of people: those for which warnings are ignored and those for which no warning should be emitted. So, emitting a warning for the first category is useless while for the other, an exception would be better.

One way to be flexible is to add a strict keyword which makes the method raise an exception if the CTE is not recognized and a normalize keyword which would normalize the headers as much as possible (in addition to lowercasing which will always be done for comparison purposes). But those additional parameters may not necessarily be changed in practice so we should take strict=False and normalize=True by default.

For instance:

def get_payload(self, i=None, decode=False, *, sanitize=True, strict=False):
	... 
	raw_cte = self.get('content-transfer-encoding')
    cte = str(raw_cte or '').lower()
	if sanitize:
		cte = cte.strip()
		# do other stuff in the future maybe
	...
	if cte == ...: ...
	elif cte in (...): ...
	elif cte in (...): ...
	elif raw_cte is not None:
		if strict:
			raise ValueError(f"unknown content-transfer-encoding: {raw_cte!r}")
		else:
			import warnings
			# maybe introduce a specific warning class instead of using UserWarning
			warnings.warn(f"unknown content-transfer-encoding: {raw_cte!r}")
	...

Emitting a warning may not be needed and we could just silently ignore the CTE. What do you think of this proposal @erlend-aasland?

@rruuaanng
Copy link
Contributor

That is to say, we should default that the message body must be correct, otherwise it will not conform to rfc, right?
(Instead of implicitly processing it.)

@picnixz
Copy link
Member

picnixz commented Nov 26, 2024

I don't know if the corresponding RFC (if any) specifies the allowed header formats. For instance, if the RFC allows for trailing whitespaces to exist and if it also says that they are to be ignored, then we can just call strip(). If it explicitly states that the header must follow some specific pattern, then training whitespaces should be warned (and stripped) or an exception should be raised.

From a practical PoV it makes sense to automatically strip them. It's the easiest and probably the most natural way to parse such header. But if we want to be RFC-compliant, we may need to do something else. I don't have time for checking the RFC so any decision should be made after the RFC is consulted (again, this becomes a non-issue if the RFC is vague enough, and we would likely choose auto-normalization for practical reasons).

I don't think we'll have an issue in the future because I don't see how a trailing whitespace should be considered (namely, what would base64 (with trailing whitespace) mean if not base64?). In addition, we already use a case-insensitive comparison.

Btw, it may be interesting to see how Perl deals with that (I usually use Perl for mail-related stuff).

@rruuaanng
Copy link
Contributor

As far as I remember, RFC822 describes everything about MIME. However, it's quite lenient regarding the handling of spaces, meaning the behavior is determined by the parser. I believe it's
essential to remove field spaces, as it can affect the parsing behavior.

@rruuaanng
Copy link
Contributor

We can close this issue as it is a duplicate of issue #98188, and the upstream has already resolved it. Thank you all for your attention to this matter :). @picnixz @erlend-aasland

@erlend-aasland erlend-aasland closed this as not planned Won't fix, can't repro, duplicate, stale Jan 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stdlib Python modules in the Lib dir topic-email type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

4 participants