bpo-36338: urllib.urlparse rejects invalid IPv6 addresses #16780

vstinner · 2019-10-14T14:01:49Z

The urllib.urlparse module now rejects invalid IPv6 addresses and
invalid port numbers when parsing a URL.

https://bugs.python.org/issue36338

vstinner · 2019-10-14T19:45:52Z

@corona10: I wrote the scope test differently to make it more readable.

corona10

@corona10: I wrote the scope test differently to make it more readable.

@vstinner Awesome!

vstinner · 2019-10-15T13:47:05Z

@vadmium, @zooba, @tirkarthi: Would you mind reviewing this change? What do you think of my approach of using ipaddress and excluding some characters from the IPv6 scope ("zone identifier")?

vstinner · 2019-10-15T17:09:47Z

I modified my PR to also fix https://bugs.python.org/issue33342 : "urllib IPv6 parsing fails with special characters in passwords".

tirkarthi · 2019-10-15T17:47:21Z

Lib/test/test_urlparse.py

-        p = urllib.parse.urlsplit(url)
-        with self.assertRaisesRegex(ValueError, "out of range"):
-            p.port
+        with self.assertRaisesRegex(ValueError, "Port out of range 0-65535"):


This is slightly backwards incompatible, but I am okay with the intention of the PR to validate the port during parsing instead of when accessing the port attribute, since it just means the invalid URL is detected earlier.

I'm checking the port to reject [ and ] in the port number. Rejecting port numbers outside the range [0; 65535] is a side effect. IMHO it's a good thing to reject an invalid URL, no?

Yes, to be clearer: I am fine with the change. Previously, port validation was done when accessing the port attribute, which allowed an invalid URL to be parsed; now it is done during parsing itself, which is better.
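
To illustrate the difference being discussed, here is a minimal sketch of a wrapper that surfaces an invalid port as a single ValueError, regardless of whether the check happens inside urlsplit() or only when the port attribute is read. The helper name checked_port is hypothetical; it is not part of this PR or of urllib.

```python
from urllib.parse import urlsplit


def checked_port(url):
    """Return the port of url as an int (or None if the URL has no port).

    Both steps may raise ValueError, so the caller sees the same kind of
    error whether validation happens at parsing time or at attribute access.
    """
    parts = urlsplit(url)  # may raise ValueError for a malformed netloc
    return parts.port      # may raise ValueError for an invalid port


for url in ("http://example.com:8080/", "http://example.com:99999/"):
    try:
        print(url, "->", checked_port(url))
    except ValueError as exc:
        print(url, "-> rejected:", exc)
```

With a helper like this, calling code does not need to know at which step the port is validated.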

tirkarthi · 2019-10-15T17:54:24Z

Lib/urllib/parse.py

+    if not(host.startswith('[') and host.endswith(']')):
+        return False
+
+    parts = host[1:-1].split('%', 1)


Slightly off-topic, but this just reminded me that the ipaddress module doesn't support a scope ID in IPv6 addresses yet. Maybe once #13772 is merged we can just catch ValidationError and return False, since the ipaddress parser would already check for % in the scope, and the validation here could then be removed.

The allowed characters in the scope part are not well defined. I read https://tools.ietf.org/html/rfc4007 and https://tools.ietf.org/html/rfc6874. RFC 6874 says:
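
The approach in the excerpt above can be sketched as follows. This is only an illustration under stated assumptions, not the PR's actual code: the function name and the exact set of characters rejected in the scope ("%", "[" and "]") are assumptions, and the scope is split off before calling ipaddress so the sketch does not depend on scope-ID support in that module. Note that ipaddress reports an invalid address with AddressValueError, a subclass of ValueError.

```python
import ipaddress


def looks_like_bracketed_ipv6(host):
    """Hypothetical check: is host of the form "[IPv6]" or "[IPv6%scope]"?"""
    if not (host.startswith('[') and host.endswith(']')):
        return False

    # Split "address%scope" on the first "%" only.
    address, sep, scope = host[1:-1].partition('%')

    # Assumption: if a scope is present it must be non-empty and must not
    # contain the delimiter "%" or the bracket characters.
    if sep and (not scope or any(c in scope for c in '%[]')):
        return False

    try:
        ipaddress.IPv6Address(address)
    except ipaddress.AddressValueError:
        return False
    return True


for candidate in ('[::1]', '[::1%eth0]', '[::1%]', '[::1%sc[o]pe]', '[zzz]'):
    print(candidate, looks_like_bracketed_ipv6(candidate))
```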

A <zone_id> SHOULD contain only ASCII characters classified as
"unreserved" for use in URIs [RFC3986]. This excludes characters
such as "]" or even "%" that would complicate parsing. However, the
syntax described below does allow such characters to be percent-
encoded, for compatibility with existing devices that use them.

If an operating system uses any other characters in zone or interface
identifiers that are not in the "unreserved" character set, they MUST
be represented using percent encoding [RFC3986].

So... is % allowed in an unquoted URL?

However, the syntax described below does allow such characters to be percent-
encoded, for compatibility with existing devices that use them.

@vstinner - I read the rfc6874 and the part that you quoted. If we want strict adherence to the RFC, I think, allowing % in unquoted URL is correct.

I looked at #13772 , which is trying to bring in the scope id to ipaddress module. There is an agreement and a documented statement that states: "If present, the scope ID must be non-empty, and may not contain %."

Let's keep the current change and not allow '%', assuming that it won't be common for all practical purposes. It will be consistent within the standard library. If allowing '%' in the scope ID is desired, that could be changed in the ipaddress module, and once the ipaddress module's scope-ID support is merged, we could use that facility here.

I think, allowing % in unquoted URL is correct.

Sorry but I'm not used to the urllib module. Is urllib.parse.urlsplit() supposed to get a "quoted" or "unquoted" URL?

Is urllib.parse.urlsplit() supposed to get a "quoted" or "unquoted" URL?

It is supposed to get the unquoted URL. I relied on tests of urlparse to state this.

My reading of RFC 6874, and especially the part quoted, makes me think that a percent-encoded character like %40 or %25 could be present in the zone-id component beyond the first '%'. If that is the case, the current implementation will return False for a valid IPv6 URL, but this is consistent with what the ipaddress module returns, and it is documented by the ipaddress module.

I lack the experience to say something confidently about percent-encoded characters in the zone-id, and I think being consistent within modules and with documentation is the most appropriate thing to do.

I am +1 on committing this change. Please let me know if you have any hesitation or want me (or us) to research further.

urlsplit accepts an unquoted URL and returns the components of the URI. The change proposed in this patch and its tests does this accurately.

The urlopen and open interfaces will quote the URL and percent-encode the special characters. (So, for the Firefox and Chromium examples presented in the discussion, I would expect those URIs to work in those browsers if they are percent-encoded.)

@orsenthil, @vstinner: In #13772 I came to the decision that the % character shall not be allowed in the <zone_id> part, according to paragraph 5 of Section 11.2 of RFC 4007:

An implementation MAY support other kinds of non-null strings as
<zone_id>. However, the strings must not conflict with the delimiter
character.

At the same time, Section 2 of RFC 6874, especially paragraph 4, presumes that the <zone_id> part may contain % represented using percent encoding.

That is confusing and seems to conflict with RFC 4007.

That's why I decided to assume that RFC 6874 concerns only the representation of IPv6 Zone Identifiers in URIs.

vstinner · 2019-10-18T13:12:49Z

@orsenthil: Would you mind reviewing this change? Any idea about the allowed characters in an IPv6 scope?

orsenthil · 2019-10-18T13:15:05Z

@vstinner - Sure, I will review. I will refer to the RFC for the valid characters for the IPv6 scope.

vstinner · 2019-10-21T09:33:29Z

I tried to allow [ and ] in the user:password part, but then the URL parser is confused by the URL:

http://[::1%sc[o]pe]

It reads it as IPv6 ::1%sc[o.
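
A small illustration of the ambiguity, assuming a parser that ends the bracketed host at the first "]" (this is a sketch of the effect described above, not necessarily how urllib.parse scans the netloc):

```python
netloc = "[::1%sc[o]pe]"

# Ending the host at the first "]" cuts the scope short.
first_close = netloc.index("]")
host_first = netloc[1:first_close]

# Ending it at the last "]" keeps the whole scope in this particular case.
last_close = netloc.rindex("]")
host_last = netloc[1:last_close]

print(host_first)  # ::1%sc[o
print(host_last)   # ::1%sc[o]pe
```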

* bpo-36338: The urllib.urlparse module now rejects invalid IPv6 addresses and invalid port numbers when parsing a URL.
* bpo-33342: Fix urlparse() for IPv6 address with user:password when user and/or password contain "[" and/or "]" characters.

vstinner · 2019-10-21T09:36:51Z

I rebased my PR to fix the merge conflict.

@orsenthil: Ping for review.

vstinner · 2019-10-21T09:43:10Z

Firefox doesn't seem to accept % in the IPv6 part of a URL. When I type the following URL, it opens Google with the URL as a search...

http://[::1%1]:8000/

vstinner · 2019-10-21T09:43:51Z

Firefox doesn't seem to accept % in the IPv6 part of a URL. When I type the following URL, it opens Google with the URL as a search...

Same behavior in Chromium.

orsenthil

LGTM. Thank you.

orsenthil · 2019-10-21T16:28:07Z

Ping for review.

Done. Thank you, Victor.

orsenthil · 2019-10-23T02:32:04Z

Firefox doesn't seem to accept % in the IPv6 part of a URL.

Same behavior in Chromium.

I expected these browsers to percent-encode these and work.

https://bugzilla.mozilla.org/show_bug.cgi?id=700999

Also, "Microsoft Edge (as well as Microsoft Explorer) works well with link local IPV6 addresses."

This is the most relevant information I found

https://en.wikipedia.org/wiki/IPv6_address#Use_of_zone_indices_in_URIs

When used in uniform resource identifiers (URI), the use of the percent sign causes a syntax conflict, therefore it must be escaped via percent-encoding,[11] e.g.:

http://[fe80::1ff:fe23:4567:890a%25eth0]/

vstinner · 2019-10-23T12:30:51Z

The living URL Standard doesn't implement IPv6 scope on purpose:

Support for <zone_id> is intentionally omitted.

This comment points to https://www.w3.org/Bugs/Public/show_bug.cgi?id=27234#c2 which is a comment written by Ryan Sleevi on 2015-08-14:

Yes, we're especially not keen to support these in Chrome and have repeatedly decided not to. The platform-specific nature of <zone_id> makes it difficult to impossible to validate the well-formedness of the URL (see https://tools.ietf.org/html/rfc4007#section-11.2 , as referenced in 6874, to fully appreciate this special hell). Even if we could reliably parse these (from a URL spec standpoint), it then has to be handed 'somewhere', and that opens a new can of worms.

Even 6874 notes how unlikely it is to encounter these in practice

   Thus, URIs including a
   ZoneID are unlikely to be encountered in HTML documents.  However, if
   they do (for example, in a diagnostic script coded in HTML), it would
   be appropriate to treat them exactly as above.

Note that a 'dumb' parser may not be sufficient, as the Security Considerations of 6874 note:

   To limit this risk, implementations MUST NOT allow use of this format
   except for well-defined usages, such as sending to link-local
   addresses under prefix fe80::/10.  At the time of writing, this is
   the only well-defined usage known.

And also

   An HTTP client, proxy, or other intermediary MUST remove any ZoneID
   attached to an outgoing URI, as it has only local significance at the
   sending host.

This requires a transformative rewrite of any URLs going out the wire. That's pretty substantial. Anne, do you recall the bug talking about IP canonicalization (e.g. http://127.0.0.1 vs http://[::127.0.0.1] vs http://012345 and friends?) This is conceptually a similar issue - except it's explicitly required in the context of <zone_id> that the <zone_id> not be emitted.

There's also the issue that zone_id precludes/requires the use of APIs that user agents would otherwise prefer to avoid, in order to 'properly' handle the zone_id interpretation. For example, Chromium on some platforms uses a built in DNS resolver, and so our address lookup functions would need to define and support <zone_id>'s and map them to system concepts. In doing so, you could end up with weird situations where a URL works in Firefox but not Chrome, even though both 'hypothetically' supported <zone_id>'s, because FF may use an OS routine and Chrome may use a built-in routine and they diverge.

Overall, our internal consensus is that <zone_id>'s are bonkers on many grounds - the technical ambiguity (and RFC 6874 doesn't really resolve the ambiguity as much as it fully owns it and just says #YOLOSWAG) - and supporting them would add a lot of complexity for what is explicitly and admittedly a limited value use case.

Firefox feature request https://bugzilla.mozilla.org/show_bug.cgi?id=700999 was also rejected, citing this comment, on 2015-08-14.

Currently, only Microsoft Edge supports IPv6 scope: Firefox and Chromium don't.

I suggest following the example of Firefox, Chromium and the living URL Standard: don't support IPv6 scope.

My current implementation doesn't implement RFC 6874, which suggests using %25 between the IPv6 address and the scope. For example, address ::1 with scope eth0 should be written ::1%25eth0. This syntax is hard to read if you use numeric scopes, which are common: ::1 with scope 2 should be written ::1%252 :-(
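
For reference, the RFC 6874 notation described here can be sketched as a pair of helpers. Both function names are hypothetical and exist only for illustration; this is not something the PR implements, and it assumes the zone is percent-encoded with urllib.parse.quote so that only unreserved characters remain.

```python
from urllib.parse import quote, unquote


def ipv6_host_for_uri(address, zone=None):
    """Format an IPv6 address, with optional zone, as a bracketed URI host."""
    if zone is None:
        return f"[{address}]"
    # "%25" is the percent-encoded "%" delimiter; the zone itself is
    # percent-encoded too, leaving only unreserved characters.
    return f"[{address}%25{quote(zone, safe='')}]"


def split_uri_ipv6_host(host):
    """Inverse of ipv6_host_for_uri(): return (address, zone or None)."""
    address, sep, zone = host[1:-1].partition("%25")
    return address, (unquote(zone) if sep else None)


print(ipv6_host_for_uri("::1", "eth0"))  # [::1%25eth0]
print(ipv6_host_for_uri("::1", "2"))     # [::1%252]
print(split_uri_ipv6_host("[::1%252]"))  # ('::1', '2')
```

The numeric case shows the readability problem: "%252" is the delimiter "%25" followed by zone "2", not a doubly encoded character.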

vstinner · 2019-10-23T12:35:03Z

It is supposed to get the unquoted URL. I relied on tests of urlparse to state this.

This makes me even more uncomfortable about supporting IPv6 scope: it is not well defined whether urlsplit() is expected to be used on a quoted or unquoted URL. This is a major difference for RFC 6874, which is tied to quoted characters. Not well defined means: we should not have to dig into tests to reverse engineer the "expected" function behavior. It should be well documented and well tested.

I mean that if someone wants to support IPv6 scope in URLs, I suggest first clarifying what urlsplit() expects. IMHO fixing this is out of the scope of fixing the https://bugs.python.org/issue36338 security vulnerability.

vstinner · 2021-09-21T21:58:22Z

I failed to find time to finish the PR. I prefer to abandon it.

the-knights-who-say-ni added the CLA signed label Oct 14, 2019

bedevere-bot added the awaiting core review label Oct 14, 2019

vstinner mentioned this pull request Oct 14, 2019

bpo-36338: Reject hostname with [ at position > 0 #14896

Closed

corona10 approved these changes Oct 15, 2019

tirkarthi reviewed Oct 15, 2019

orsenthil approved these changes Oct 21, 2019

bedevere-bot added awaiting merge and removed awaiting core review labels Oct 21, 2019

miguendes mentioned this pull request Jul 10, 2021

gh-88037: Move port validation logic to parsing time #25774

Open

vstinner closed this Sep 21, 2021

vstinner deleted the urlparse_ipv6 branch September 21, 2021 21:58

sanebow mannequin mentioned this pull request Apr 10, 2022

urlparse of urllib returns wrong hostname #80519

Open
