Skip to content

bpo-36338: urllib.urlparse rejects invalid IPv6 addresses #16780

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed

bpo-36338: urllib.urlparse rejects invalid IPv6 addresses #16780

wants to merge 1 commit into from

Conversation

vstinner
Copy link
Member

@vstinner vstinner commented Oct 14, 2019

The urllib.urlparse module now rejects invalid IPv6 addresses and
invalid port numbers when parsing an URL.

https://bugs.python.org/issue36338

@vstinner
Copy link
Member Author

@corona10: I wrote the scope test differently for make it more readable.

Copy link
Member

@corona10 corona10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@corona10: I wrote the scope test differently for make it more readable.

@vstinner Awesome!

@vstinner
Copy link
Member Author

@vadmium, @zooba, @tirkarthi: Would you mind to review this change? What do you think of my approach using ipaddress and excluding some characters from the IPv6 scope ("zone identifier")?

@vstinner
Copy link
Member Author

I modified my PR to also fix https://bugs.python.org/issue33342 : "urllib IPv6 parsing fails with special characters in passwords".

p = urllib.parse.urlsplit(url)
with self.assertRaisesRegex(ValueError, "out of range"):
p.port
with self.assertRaisesRegex(ValueError, "Port out of range 0-65535"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This slightly backwards incompatible but I am okay with the intention of the PR in validating port during parsing instead of accessing the port attribute since it just means the URL is invalid and is known earlier.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm checking the port to reject [ and ] in the port number. Reject port number outside the [0; 65535] is a side effect. IMHO it's a good thing to reject an invalid URL, no?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, to be more clear I am fine with the change. It's that previously port validation is done while accessing port attribute allowing invalid URL to be parsed but now it's done in parsing itself which is better as per this PR.

if not(host.startswith('[') and host.endswith(']')):
return False

parts = host[1:-1].split('%', 1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slightly offtopic but this just reminded me that ipaddress module doesn't support scope id in IPV6 address yet. Maybe once #13772 is merged we can just catch ValidationError and return False since the same validation would already be done in the parser to check for % in scope and remove validation here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The allowed characters in the scope part is not well defined. I read https://tools.ietf.org/html/rfc4007 and https://tools.ietf.org/html/rfc6874 RFC 6874:

A <zone_id> SHOULD contain only ASCII characters classified as
"unreserved" for use in URIs [RFC3986]. This excludes characters
such as "]" or even "%" that would complicate parsing. However, the
syntax described below does allow such characters to be percent-
encoded, for compatibility with existing devices that use them.

If an operating system uses any other characters in zone or interface
identifiers that are not in the "unreserved" character set, they MUST
be represented using percent encoding [RFC3986].

So... is % allowed in an unquoted URL?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, the syntax described below does allow such characters to be percent-
encoded, for compatibility with existing devices that use them.

@vstinner - I read the rfc6874 and the part that you quoted. If we want strict adherence to the RFC, I think, allowing % in unquoted URL is correct.

I looked at #13772 , which is trying to bring in the scope id to ipaddress module. There is an agreement and a documented statement that states: "If present, the scope ID must be non-empty, and may not contain %."

Let's keep the current change, and not allow '%', assuming that it wont be common for all practical purposes. It will be consistent within the standard library. If the allowance of '%' in scope-id is desired, that could be changed in ipaddress module, and once the ipaddress module scope-id support is merged, we could use the facility here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, allowing % in unquoted URL is correct.

Sorry but I'm not used to the urllib module. Is urllib.parse.urlsplit() supposed to get a "quoted" or "unquoted" URL?

Copy link
Member

@orsenthil orsenthil Oct 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is urllib.parse.urlsplit() supposed to get a "quoted" or "unquoted" URL?

It is supposed to get the unquoted URL. I relied on tests of urlparse to state this.


My reading of RFC 6874 and especially, the part quoted makes me think that 'percent-encoded' character like %40 or %25 could be present in the zone-id component beyond the first '%'. - If it is the case, the current implementation will False for it a valid IPv6 URL, but this is consistent with what ipaddress module return and it is documented by the ipaddress module.

I lack the experience to say something confidently about 'percent-encoded characters' in zone-id, and i think being consistent within modules and with documentation is the most appropriate thing to do.

I am +1 with committing this change. Please let me know you have hesitation or do you want (me/) us to research further.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. urllsplit accepts unquoted URl and returns the components of the URI. The change proposed in this patch and tests is doing this accurately.
  2. urlopen and open interface will quote the URL and percent-encode them for the special characters. (So, the examples of firefox and chromium presented in the discussion, I will expected those URIs to work in those browsers if they are percent-encoded.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@orsenthil , @vstinner In #13772 I come to the decision, that % character shall not be allowed in <zone_id> part according to paragraph 5 of Section 11.2 RFC 4007:

An implementation MAY support other kinds of non-null strings as
<zone_id>. However, the strings must not conflict with the delimiter
character
.

At the same time, Section 2 RFC 6874, especially paragraph 4, presumes, that <zone_id> part may contain % represented using percent encoding.

That is confusing and seems to conflict with RFC 4007.

That's why, I decided to assume, that RFC 6874 conserns only representing IPv6 Zone Identifiers in URI`s.

@vstinner
Copy link
Member Author

@orsenthil: Would you mind to review this change? Any idea for the allowed characters in an IPv6 scope?

@orsenthil
Copy link
Member

@vstinner - Sure, I will review. I will refer to the RFC for the valid characters for the IPv6 scope.

@vstinner
Copy link
Member Author

I tried to allow [ and ] in the user:password part, but then the URL parser is confused by the URL:

http://[::1%sc[o]pe]

It reads it as IPv6 ::1%sc[o.

* bpo-36338: The urllib.urlparse module now rejects invalid IPv6
  addresses and invalid port numbers when parsing an URL.
* bpo-33342: Fix urlparse() for IPv6 address with user:password when
  user and/or password contain "[" and/or "]" characters.
@vstinner
Copy link
Member Author

I rebased my PR to fix the merge conflict.

@orsenthil: Ping for review.

@vstinner
Copy link
Member Author

Firefox doesn't seem to accept % in the IPv6 part of an URL. When I type the following URL, it opens Google with the URL as a search...

http://[::1%1]:8000/

@vstinner
Copy link
Member Author

Firefox doesn't seem to accept % in the IPv6 part of an URL. When I type the following URL, it opens Google with the URL as a search...

Same behavior in Chromium.

Copy link
Member

@orsenthil orsenthil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you.

@orsenthil
Copy link
Member

Ping for review.

Done. Thank you, Victor.

@orsenthil
Copy link
Member

orsenthil commented Oct 23, 2019

Firefox doesn't seem to accept % in the IPv6 part of an URL.

Same behavior in Chromium.

I expected these browsers to have percent-encode these and work.

https://bugzilla.mozilla.org/show_bug.cgi?id=700999

Also, "Microsoft Edge (as well as Microsoft Explorer) works well with link local IPV6 addresses."

This is the most relevant information I found

https://en.wikipedia.org/wiki/IPv6_address#Use_of_zone_indices_in_URIs

When used in uniform resource identifiers (URI), the use of the percent sign causes a syntax conflict, therefore it must be escaped via percent-encoding,[11] e.g.:

http://[fe80::1ff:fe23:4567:890a%25eth0]/

@vstinner
Copy link
Member Author

The living URL Standard doesn't implement IPv6 scope on purpose:

Support for <zone_id> is intentionally omitted.

This comment points to https://www.w3.org/Bugs/Public/show_bug.cgi?id=27234#c2 which is a comment written by Ryan Sleevi at 2015-08-14:

Yes, we're especially not keen to support these in Chrome and have repeatedly decided not to. The platform-specific nature of <zone_id> makes it difficult to impossible to validate the well-formedness of the URL (see https://tools.ietf.org/html/rfc4007#section-11.2 , as referenced in 6874, to fully appreciate this special hell). Even if we could reliably parse these (from a URL spec standpoint), it then has to be handed 'somewhere', and that opens a new can of worms.

Even 6874 notes how unlikely it is to encounter these in practice

   Thus, URIs including a
   ZoneID are unlikely to be encountered in HTML documents.  However, if
   they do (for example, in a diagnostic script coded in HTML), it would
   be appropriate to treat them exactly as above.

Note that a 'dumb' parser may not be sufficient, as the Security Considerations of 6874 note:

   To limit this risk, implementations MUST NOT allow use of this format
   except for well-defined usages, such as sending to link-local
   addresses under prefix fe80::/10.  At the time of writing, this is
   the only well-defined usage known.

And also

   An HTTP client, proxy, or other intermediary MUST remove any ZoneID
   attached to an outgoing URI, as it has only local significance at the
   sending host.

This requires a transformative rewrite of any URLs going out the wire. That's pretty substantial. Anne, do you recall the bug talking about IP canonicalization (e.g. http://127.0.0.1 vs http://[::127.0.0.1] vs http://012345 and friends?) This is conceptually a similar issue - except it's explicitly required in the context of <zone_id> that the <zone_id> not be emitted.

There's also the issue that zone_id precludes/requires the use of APIs that user agents would otherwise prefer to avoid, in order to 'properly' handle the zone_id interpretation. For example, Chromium on some platforms uses a built in DNS resolver, and so our address lookup functions would need to define and support <zone_id>'s and map them to system concepts. In doing so, you could end up with weird situations where a URL works in Firefox but not Chrome, even though both 'hypothetically' supported <zone_id>'s, because FF may use an OS routine and Chrome may use a built-in routine and they diverge.

Overall, our internal consensus is that <zone_id>'s are bonkers on many grounds - the technical ambiguity (and RFC 6874 doesn't really resolve the ambiguity as much as it fully owns it and just says #YOLOSWAG) - and supporting them would add a lot of complexity for what is explicitly and admittedly a limited value use case.

Firefox feature request https://bugzilla.mozilla.org/show_bug.cgi?id=700999 has been rejected using this comment as well at 2015-08-14.


Currently, only Microsoft Edge supports IPv6 scope: Firefox and Chromium don't.

I suggest to follow Firefox, Chromium and living URL Standard example: don't support IPv6 scope.

My current implementation doesn't implement the RFC 6874 which suggests to use %25 between the IPv6 and the scope. For example address ::1 with scope eth0 should be written ::1%25eth0. This syntax is hard to read if you use numeric scopes which are common: ::1 with scope 2 should be written ::1%252 :-(

@vstinner
Copy link
Member Author

It is supposed to get the unquoted URL. I relied on tests of urlparse to state this.

This makes me even more uncomfortable to support IPv6 scope: it is not well defined if urlsplit() is expected to be used on a quote or unquoted URL. This is a major difference for RFC 6874 which is tied to quoted characters. Not well defined means: we should not have to dig into tests to reverse engineer the "expected" function behavior. It should be well documented and well tested.

I mean that if someone wants to support IPv6 scope in URL, I suggest to first clarify what urlsplit() expects. IMHO fixing this is out of the scope of fixing https://bugs.python.org/issue36338 security vulnerability.

@vstinner
Copy link
Member Author

I failed finding time to finish the PR. I prefer to abandon it.

@vstinner vstinner closed this Sep 21, 2021
@vstinner vstinner deleted the urlparse_ipv6 branch September 21, 2021 21:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants