-
-
Notifications
You must be signed in to change notification settings - Fork 32k
bpo-36338: urllib.urlparse rejects invalid IPv6 addresses #16780
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@corona10: I wrote the scope test differently for make it more readable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vadmium, @zooba, @tirkarthi: Would you mind to review this change? What do you think of my approach using ipaddress and excluding some characters from the IPv6 scope ("zone identifier")? |
I modified my PR to also fix https://bugs.python.org/issue33342 : "urllib IPv6 parsing fails with special characters in passwords". |
p = urllib.parse.urlsplit(url) | ||
with self.assertRaisesRegex(ValueError, "out of range"): | ||
p.port | ||
with self.assertRaisesRegex(ValueError, "Port out of range 0-65535"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This slightly backwards incompatible but I am okay with the intention of the PR in validating port during parsing instead of accessing the port attribute since it just means the URL is invalid and is known earlier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm checking the port to reject [ and ] in the port number. Reject port number outside the [0; 65535] is a side effect. IMHO it's a good thing to reject an invalid URL, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, to be more clear I am fine with the change. It's that previously port validation is done while accessing port attribute allowing invalid URL to be parsed but now it's done in parsing itself which is better as per this PR.
if not(host.startswith('[') and host.endswith(']')): | ||
return False | ||
|
||
parts = host[1:-1].split('%', 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slightly offtopic but this just reminded me that ipaddress module doesn't support scope id in IPV6 address yet. Maybe once #13772 is merged we can just catch ValidationError and return False since the same validation would already be done in the parser to check for % in scope and remove validation here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The allowed characters in the scope part is not well defined. I read https://tools.ietf.org/html/rfc4007 and https://tools.ietf.org/html/rfc6874 RFC 6874:
A <zone_id> SHOULD contain only ASCII characters classified as
"unreserved" for use in URIs [RFC3986]. This excludes characters
such as "]" or even "%" that would complicate parsing. However, the
syntax described below does allow such characters to be percent-
encoded, for compatibility with existing devices that use them.
If an operating system uses any other characters in zone or interface
identifiers that are not in the "unreserved" character set, they MUST
be represented using percent encoding [RFC3986].
So... is % allowed in an unquoted URL?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, the syntax described below does allow such characters to be percent-
encoded, for compatibility with existing devices that use them.
@vstinner - I read the rfc6874 and the part that you quoted. If we want strict adherence to the RFC, I think, allowing % in unquoted URL is correct.
I looked at #13772 , which is trying to bring in the scope id to ipaddress module. There is an agreement and a documented statement that states: "If present, the scope ID must be non-empty, and may not contain %
."
Let's keep the current change, and not allow '%', assuming that it wont be common for all practical purposes. It will be consistent within the standard library. If the allowance of '%' in scope-id is desired, that could be changed in ipaddress module, and once the ipaddress module scope-id support is merged, we could use the facility here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, allowing % in unquoted URL is correct.
Sorry but I'm not used to the urllib module. Is urllib.parse.urlsplit() supposed to get a "quoted" or "unquoted" URL?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is urllib.parse.urlsplit() supposed to get a "quoted" or "unquoted" URL?
It is supposed to get the unquoted URL. I relied on tests of urlparse to state this.
My reading of RFC 6874 and especially, the part quoted makes me think that 'percent-encoded' character like %40
or %25
could be present in the zone-id component beyond the first '%'. - If it is the case, the current implementation will False for it a valid IPv6 URL, but this is consistent with what ipaddress module return and it is documented by the ipaddress module.
I lack the experience to say something confidently about 'percent-encoded characters' in zone-id, and i think being consistent within modules and with documentation is the most appropriate thing to do.
I am +1 with committing this change. Please let me know you have hesitation or do you want (me/) us to research further.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- urllsplit accepts unquoted URl and returns the components of the URI. The change proposed in this patch and tests is doing this accurately.
- urlopen and open interface will quote the URL and percent-encode them for the special characters. (So, the examples of firefox and chromium presented in the discussion, I will expected those URIs to work in those browsers if they are percent-encoded.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@orsenthil , @vstinner In #13772 I come to the decision, that %
character shall not be allowed in <zone_id> part according to paragraph 5 of Section 11.2 RFC 4007:
An implementation MAY support other kinds of non-null strings as
<zone_id>. However, the strings must not conflict with the delimiter
character.
At the same time, Section 2 RFC 6874, especially paragraph 4, presumes, that <zone_id> part may contain %
represented using percent encoding.
That is confusing and seems to conflict with RFC 4007.
That's why, I decided to assume, that RFC 6874 conserns only representing IPv6 Zone Identifiers in URI`s.
@orsenthil: Would you mind to review this change? Any idea for the allowed characters in an IPv6 scope? |
@vstinner - Sure, I will review. I will refer to the RFC for the valid characters for the IPv6 scope. |
I tried to allow [ and ] in the user:password part, but then the URL parser is confused by the URL:
It reads it as IPv6 |
I rebased my PR to fix the merge conflict. @orsenthil: Ping for review. |
Firefox doesn't seem to accept % in the IPv6 part of an URL. When I type the following URL, it opens Google with the URL as a search...
|
Same behavior in Chromium. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thank you.
Done. Thank you, Victor. |
I expected these browsers to have percent-encode these and work. https://bugzilla.mozilla.org/show_bug.cgi?id=700999 Also, "Microsoft Edge (as well as Microsoft Explorer) works well with link local IPV6 addresses." This is the most relevant information I found https://en.wikipedia.org/wiki/IPv6_address#Use_of_zone_indices_in_URIs
|
The living URL Standard doesn't implement IPv6 scope on purpose:
This comment points to https://www.w3.org/Bugs/Public/show_bug.cgi?id=27234#c2 which is a comment written by Ryan Sleevi at 2015-08-14:
Firefox feature request https://bugzilla.mozilla.org/show_bug.cgi?id=700999 has been rejected using this comment as well at 2015-08-14. Currently, only Microsoft Edge supports IPv6 scope: Firefox and Chromium don't. I suggest to follow Firefox, Chromium and living URL Standard example: don't support IPv6 scope. My current implementation doesn't implement the RFC 6874 which suggests to use |
This makes me even more uncomfortable to support IPv6 scope: it is not well defined if urlsplit() is expected to be used on a quote or unquoted URL. This is a major difference for RFC 6874 which is tied to quoted characters. Not well defined means: we should not have to dig into tests to reverse engineer the "expected" function behavior. It should be well documented and well tested. I mean that if someone wants to support IPv6 scope in URL, I suggest to first clarify what urlsplit() expects. IMHO fixing this is out of the scope of fixing https://bugs.python.org/issue36338 security vulnerability. |
I failed finding time to finish the PR. I prefer to abandon it. |
The urllib.urlparse module now rejects invalid IPv6 addresses and
invalid port numbers when parsing an URL.
https://bugs.python.org/issue36338