-
-
Notifications
You must be signed in to change notification settings - Fork 31.8k
[CVE-2023-27043] gh-102988: Reject malformed addresses in email.parseaddr() #111116
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
663910d
730500c
1a377ec
da25ed3
2b199bb
9b95376
cf50482
0dd5745
637bdb8
8df7013
59da8f1
361b517
3105a75
fc6e86b
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -58,13 +58,18 @@ of the new API. | |
begins with angle brackets, they are stripped off. | ||
|
||
|
||
.. function:: parseaddr(address) | ||
.. function:: parseaddr(address, *, strict=True) | ||
|
||
Parse address -- which should be the value of some address-containing field such | ||
as :mailheader:`To` or :mailheader:`Cc` -- into its constituent *realname* and | ||
*email address* parts. Returns a tuple of that information, unless the parse | ||
fails, in which case a 2-tuple of ``('', '')`` is returned. | ||
|
||
If *strict* is true, use a strict parser which rejects malformed inputs. | ||
vstinner marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
.. versionchanged:: 3.13 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If it is backported, it should be 3.12.1. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Right, the version number should be updated in each backport. I removed the "needs backport" label to avoid automated backports. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Alpha versions do not exist for common user, so I would use the 3.12.x version in which the fix will occur in the main branch too. It will help users that drop support of 3.11. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree with the concept but that depends on the backport situation. It is fair to start with 3.13 here and if we do decide to backport it further as a security fix later on, the docs can be updated to say 3.12.x. |
||
Add *strict* optional parameter and reject malformed inputs by default. | ||
|
||
|
||
.. function:: formataddr(pair, charset='utf-8') | ||
|
||
|
@@ -82,12 +87,15 @@ of the new API. | |
Added the *charset* option. | ||
|
||
|
||
.. function:: getaddresses(fieldvalues) | ||
.. function:: getaddresses(fieldvalues, *, strict=True) | ||
|
||
This method returns a list of 2-tuples of the form returned by ``parseaddr()``. | ||
*fieldvalues* is a sequence of header field values as might be returned by | ||
:meth:`Message.get_all <email.message.Message.get_all>`. Here's a simple | ||
example that gets all the recipients of a message:: | ||
:meth:`Message.get_all <email.message.Message.get_all>`. | ||
|
||
If *strict* is true, use a strict parser which rejects malformed inputs. | ||
|
||
Here's a simple example that gets all the recipients of a message:: | ||
|
||
from email.utils import getaddresses | ||
|
||
|
@@ -97,6 +105,9 @@ of the new API. | |
resent_ccs = msg.get_all('resent-cc', []) | ||
all_recipients = getaddresses(tos + ccs + resent_tos + resent_ccs) | ||
|
||
.. versionchanged:: 3.13 | ||
Add *strict* optional parameter and reject malformed inputs by default. | ||
|
||
|
||
.. function:: parsedate(date) | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -43,6 +43,7 @@ | |
specialsre = re.compile(r'[][\\()<>@,:;".]') | ||
escapesre = re.compile(r'[\\"]') | ||
|
||
|
||
def _has_surrogates(s): | ||
"""Return True if s contains surrogate-escaped binary data.""" | ||
# This check is based on the fact that unless there are surrogates, utf8 | ||
|
@@ -103,12 +104,127 @@ def formataddr(pair, charset='utf-8'): | |
return address | ||
|
||
|
||
def _iter_escaped_chars(addr): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It can use regular expressions: for m in re.finditer(r'(?s)\\?.', addr):
yield (m.begin(), m.group()) Perhaps functions that use But do what looks good to you, I can optimize it later. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
This PR is based on PR #108250 which uses the following code to strip quoted real names: # When a comma is used in the Real Name part it is not a deliminator
# So strip those out before counting the commas
pattern = r'"[^"]*,[^"]*"'
...
for v in fieldvalues:
v = re.sub(pattern, '', v)
... But this regular expression doesn't handle escape quote character ( I'm not comfortable with regex to write a parser when the grammar is that complicated. For example, antislash can also be escaped: Also, my priority here is more to get a fix as soon as possible, since the vulnerability was reported in March, and I have to fix the CVE at work as soon as possible. |
||
pos = 0 | ||
escape = False | ||
for pos, ch in enumerate(addr): | ||
if escape: | ||
yield (pos, '\\' + ch) | ||
escape = False | ||
elif ch == '\\': | ||
escape = True | ||
else: | ||
yield (pos, ch) | ||
if escape: | ||
yield (pos, '\\') | ||
|
||
|
||
def _strip_quoted_realnames(addr): | ||
"""Strip real names between quotes.""" | ||
if '"' not in addr: | ||
# Fast path | ||
return addr | ||
|
||
start = 0 | ||
open_pos = None | ||
result = [] | ||
for pos, ch in _iter_escaped_chars(addr): | ||
if ch == '"': | ||
if open_pos is None: | ||
open_pos = pos | ||
else: | ||
if start != open_pos: | ||
result.append(addr[start:open_pos]) | ||
start = pos + 1 | ||
open_pos = None | ||
|
||
if start < len(addr): | ||
result.append(addr[start:]) | ||
|
||
return ''.join(result) | ||
vstinner marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
def getaddresses(fieldvalues): | ||
"""Return a list of (REALNAME, EMAIL) for each fieldvalue.""" | ||
all = COMMASPACE.join(str(v) for v in fieldvalues) | ||
a = _AddressList(all) | ||
return a.addresslist | ||
|
||
supports_strict_parsing = True | ||
|
||
def getaddresses(fieldvalues, *, strict=True): | ||
"""Return a list of (REALNAME, EMAIL) or ('','') for each fieldvalue. | ||
|
||
When parsing fails for a fieldvalue, a 2-tuple of ('', '') is returned in | ||
its place. | ||
|
||
If strict is true, use a strict parser which rejects malformed inputs. | ||
""" | ||
|
||
# If strict is true, if the resulting list of parsed addresses is greater | ||
# than the number of fieldvalues in the input list, a parsing error has | ||
# occurred and consequently a list containing a single empty 2-tuple [('', | ||
# '')] is returned in its place. This is done to avoid invalid output. | ||
# | ||
# Malformed input: getaddresses(['[email protected] <[email protected]>']) | ||
# Invalid output: [('', '[email protected]'), ('', '[email protected]')] | ||
# Safe output: [('', '')] | ||
|
||
if not strict: | ||
all = COMMASPACE.join(str(v) for v in fieldvalues) | ||
a = _AddressList(all) | ||
return a.addresslist | ||
|
||
fieldvalues = [str(v) for v in fieldvalues] | ||
fieldvalues = _pre_parse_validation(fieldvalues) | ||
addr = COMMASPACE.join(fieldvalues) | ||
a = _AddressList(addr) | ||
result = _post_parse_validation(a.addresslist) | ||
|
||
# Treat output as invalid if the number of addresses is not equal to the | ||
# expected number of addresses. | ||
n = 0 | ||
for v in fieldvalues: | ||
# When a comma is used in the Real Name part it is not a deliminator. | ||
# So strip those out before counting the commas. | ||
v = _strip_quoted_realnames(v) | ||
# Expected number of addresses: 1 + number of commas | ||
n += 1 + v.count(',') | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can there be escaped commas? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The parser doesn't support escaped commas: >>> email.utils.getaddresses([r'Real\, name <[email protected]>'])
[('', 'Real\\'), ('name', '[email protected]')] |
||
if len(result) != n: | ||
return [('', '')] | ||
|
||
return result | ||
|
||
|
||
def _check_parenthesis(addr): | ||
# Ignore parenthesis in quoted real names. | ||
addr = _strip_quoted_realnames(addr) | ||
|
||
opens = 0 | ||
for pos, ch in _iter_escaped_chars(addr): | ||
if ch == '(': | ||
opens += 1 | ||
elif ch == ')': | ||
opens -= 1 | ||
if opens < 0: | ||
return False | ||
return (opens == 0) | ||
|
||
|
||
def _pre_parse_validation(email_header_fields): | ||
accepted_values = [] | ||
for v in email_header_fields: | ||
if not _check_parenthesis(v): | ||
v = "('', '')" | ||
accepted_values.append(v) | ||
|
||
return accepted_values | ||
|
||
|
||
def _post_parse_validation(parsed_email_header_tuples): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. _post_parse_validation() output should be a list. There is no advantage to convert it to a generator. If _pre_parse_validation() becomes a generator, parseaddr() has to create a list anyway for |
||
accepted_values = [] | ||
# The parser would have parsed a correctly formatted domain-literal | ||
# The existence of an [ after parsing indicates a parsing failure | ||
for v in parsed_email_header_tuples: | ||
if '[' in v[1]: | ||
v = ('', '') | ||
accepted_values.append(v) | ||
|
||
return accepted_values | ||
|
||
|
||
def _format_timetuple_and_zone(timetuple, zone): | ||
|
@@ -207,16 +323,33 @@ def parsedate_to_datetime(data): | |
tzinfo=datetime.timezone(datetime.timedelta(seconds=tz))) | ||
|
||
|
||
def parseaddr(addr): | ||
def parseaddr(addr, *, strict=True): | ||
""" | ||
Parse addr into its constituent realname and email address parts. | ||
|
||
Return a tuple of realname and email address, unless the parse fails, in | ||
which case return a 2-tuple of ('', ''). | ||
|
||
If strict is True, use a strict parser which rejects malformed inputs. | ||
""" | ||
addrs = _AddressList(addr).addresslist | ||
if not addrs: | ||
return '', '' | ||
if not strict: | ||
addrs = _AddressList(addr).addresslist | ||
if not addrs: | ||
return ('', '') | ||
return addrs[0] | ||
|
||
if isinstance(addr, list): | ||
addr = addr[0] | ||
|
||
if not isinstance(addr, str): | ||
return ('', '') | ||
|
||
addr = _pre_parse_validation([addr])[0] | ||
addrs = _post_parse_validation(_AddressList(addr).addresslist) | ||
|
||
if not addrs or len(addrs) > 1: | ||
return ('', '') | ||
|
||
return addrs[0] | ||
|
||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't
strict=True
make this is a backwards-incompatible change?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mind to elaborate? It's a keyword-only parameter. It sounds very unlikely that someone calls this function with strict=True since the parameter doesn't exist anymore. I don't see the addition of this parameter as a backward incompatible change.
The incompatible change is obviously the fact the emails accepted previously are now rejected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
If the default stayed
strict=False
, or to only warn by default on invalid addresses (for 2 releases), this would match the process in PEP 387. Withstrict=True
, it seems that you'll need a SC exception for backwards incompatible behaviour -- and that can take time.Or do you consider this a plain bug fix? In that case, why preserve the old behaviour?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change falls into the "unfortunately interesting" category of being unsure we're doing everything correctly so it provides an opt-out. If we have confidence in our change and test suite of valid and invalid email address coverage we could proceed without the behavior flag.
Do we have such confidence?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have 0% confience that this change make the parser fully compatible with all RFC. I have 100% confidence that this change rejects email addresses which were accepted before.
Sorry, as I wrote before, the whole parser should be rewritten from scratch to use grammar suggested by RFC, rather than the current design.
This change only make the situation "less bad".
Also it's common that when Python becomes stricter, someone comes and explain their valid use case relying on the old behaviour.