Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

℅ does not encode as a domain name (Python's built-in idna encoding is insufficient) #19

Closed
wsanchez opened this issue Jun 27, 2017 · 16 comments

Comments

@wsanchez
Copy link
Contributor

I'm not entirely sure this is a bug, but it sure seems like one:

>>> from hyperlink import URL
>>> text = u'http://\u2105'
>>> url = URL.fromText(text)
>>> url.asIRI()
URL.from_text('http://℅')
>>> url.asURI()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/wsanchez/Dropbox/Developer/BurningMan/ranger-ims-server/.tox/coverage-py36/lib/python3.6/site-packages/hyperlink/_url.py", line 1070, in to_uri
    fragment=_encode_fragment_part(self.fragment, maximal=True)
  File "/Users/wsanchez/Dropbox/Developer/BurningMan/ranger-ims-server/.tox/coverage-py36/lib/python3.6/site-packages/hyperlink/_url.py", line 861, in replace
    userinfo=_optional(userinfo, self.userinfo),
  File "/Users/wsanchez/Dropbox/Developer/BurningMan/ranger-ims-server/.tox/coverage-py36/lib/python3.6/site-packages/hyperlink/_url.py", line 601, in __init__
    self._host = _textcheck("host", host, '/?#@')
  File "/Users/wsanchez/Dropbox/Developer/BurningMan/ranger-ims-server/.tox/coverage-py36/lib/python3.6/site-packages/hyperlink/_url.py", line 410, in _textcheck
    % (''.join(delims), name, value))
ValueError: one or more reserved delimiters /?#@ present in host: 'c/o'

The cause is probably in the IDNA encoding:

>>> "℅".encode("idna")
b'c/o'

This surprises me, but I might just be ignorant about how this works…?

@mahmoud
Copy link
Member

mahmoud commented Jun 27, 2017

Count me among the surprised as well. Looking at the idna package (a possible dependency of hyperlink in the future), it says: idna.core.InvalidCodepoint: Codepoint U+2105 at position 1 of u'\u2105' not allowed

It's possible the builtin idna encoding is dropping the ball hard.

@wsanchez
Copy link
Contributor Author

wsanchez commented Jun 27, 2017

Here's another one:

>>> "\N{DOUBLE QUESTION MARK}".encode("idna")
b'??'

@wsanchez
Copy link
Contributor Author

The built-in IDNA encoder seems fairly weak.

@wsanchez
Copy link
Contributor Author

I switched to the IDNA module to check input data and it's working far better. (It's also slower, but correct trumps fast.)

@wsanchez
Copy link
Contributor Author

wsanchez commented Jun 28, 2017

This may be of interest to you, by the way:

https://github.com/twisted/klein/blob/46b62cc126dae29a5aeea829612568c0e20ed23e/src/klein/test/_strategies.py

I'm still fleshing it out.

@wsanchez
Copy link
Contributor Author

wsanchez commented Jun 28, 2017

More info: so I can't create a URL with host="۽0":

  File "/Users/wsanchez/Dropbox/Developer/Twisted/klein/.tox/trial-py36-tw166/lib/python3.6/site-packages/hyperlink/_url.py", line 1063, in to_uri
    host=self.host.encode("idna").decode("ascii")
builtins.UnicodeError: encoding with 'idna' codec failed (UnicodeError: Violation of BIDI requirement 3)

If I replace self.host.encode("idna").decode("ascii") with idna.encode(self.host).decode("ascii") in that line, it works.

So, unless these hostnames are actually not valid, I'm in favor of using the idna module over the built-in encoder. It also would remove the need for you to check for "/?#@", since it catches those.

@mahmoud
Copy link
Member

mahmoud commented Jun 28, 2017

Haha, from the very first report I had a suspicion hypothesis was at work. The idna package was already on the table in my mind due to the split between IDNA 2003 and 2008, along with some hyperlink API changes, but I'll bump it up the priority list even further after this onslaught.

@wsanchez
Copy link
Contributor Author

:-) thanks

@Lukasa
Copy link
Member

Lukasa commented Jun 28, 2017

IDNA 2008 is strictly the right thing to use now. Just an FYI. Requests already uses it. ;)

@mahmoud
Copy link
Member

mahmoud commented Jun 28, 2017

Ah, I thought we were still in some partial state where different TLDs were adhering to different standards. If not, all the better. :)

@wsanchez
Copy link
Contributor Author

I took a quick stab at it:

diff --git a/hyperlink/_url.py b/hyperlink/_url.py
index 8da9949..0dcf715 100644
--- a/hyperlink/_url.py
+++ b/hyperlink/_url.py
@@ -52,6 +52,8 @@ except ImportError:
         raise socket.error('unknown address family')
 
 
+from idna import encode as idna_encode, decode as idna_decode, IDNAError
+
 try:
     from urllib import unquote as urlunquote
 except ImportError:
@@ -1059,9 +1061,20 @@ class URL(object):
                                   self.userinfo.split(':', 1)])
         new_path = _encode_path_parts(self.path, has_scheme=bool(self.scheme),
                                       rooted=False, joined=False, maximal=True)
+        if self.host:
+            try:
+                host = idna_encode(self.host).decode('ascii')
+            except IDNAError as e:
+                raise ValueError(
+                    'Unable to encode {url} as URL: {error}'
+                    .format(url=self, error=e)
+                )
+        else:
+            # idna_encode(host) raises if host is empty
+            host = self.host
         return self.replace(
             userinfo=new_userinfo,
-            host=self.host.encode("idna").decode("ascii"),
+            host=host,
             path=new_path,
             query=tuple([tuple(_encode_query_part(x, maximal=True)
                                if x is not None else None
@@ -1101,7 +1114,7 @@ class URL(object):
             textHost = self.host
         else:
             try:
-                textHost = asciiHost.decode("idna")
+                textHost = idna_decode(asciiHost)
             except ValueError:
                 # only reached on "narrow" (UCS-2) Python builds <3.4, see #7
                 textHost = self.host
diff --git a/setup.py b/setup.py
index d060ca5..7c02ac7 100644
--- a/setup.py
+++ b/setup.py
@@ -29,6 +29,7 @@ setup(name='hyperlink',
       zip_safe=False,
       license=__license__,
       platforms='any',
+      install_requires=['idna'],
       classifiers=[
           'Topic :: Utilities',
           'Intended Audience :: Developers',

But a couple of tests don't pass:

___________________________________________________________________________ [doctest] hyperlink._url.URL.to_iri ___________________________________________________________________________
Expected:
    https://→example.com/foo⇧bar/
Got:
    https://xn--example-dk9c.com/foo⇧bar/

/Users/wsanchez/Dropbox/Developer/Twisted/hyperlink/.tox/py26/lib/python2.6/site-packages/hyperlink/_url.py:1094: DocTestFailure
___________________________________________________________________________ [doctest] hyperlink._url.URL.to_uri ___________________________________________________________________________
UNEXPECTED EXCEPTION: ValueError("Unable to encode URL.from_text(u'https://\\u2192example.com/foo\\u21e7bar/') as URL: Codepoint U+2192 at position 1 of u'\\u2192example' not allowed",)
Traceback (most recent call last):

  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py", line 1253, in __run
    compileflags, 1) in test.globs

  File "<doctest hyperlink._url.URL.to_uri[0]>", line 1, in <module>

  File "/Users/wsanchez/Dropbox/Developer/Twisted/hyperlink/.tox/py26/lib/python2.6/site-packages/hyperlink/_url.py", line 1070, in to_uri
    .format(url=self, error=e)

ValueError: Unable to encode URL.from_text(u'https://\u2192example.com/foo\u21e7bar/') as URL: Codepoint U+2192 at position 1 of u'\u2192example' not allowed

/Users/wsanchez/Dropbox/Developer/Twisted/hyperlink/.tox/py26/lib/python2.6/site-packages/hyperlink/_url.py:1052: UnexpectedException
=========================================================================== 2 failed, 73 passed in 0.94 seconds ===========================================================================

These are both the apparently same thing; which is an error trying to decode xn--example-dk9c.com:

>>> idna.decode("xn--example-dk9c.com")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/wsanchez/Dropbox/Developer/Twisted/hyperlink/.tox/py36/lib/python3.6/site-packages/idna/core.py", line 384, in decode
    result.append(ulabel(label))
  File "/Users/wsanchez/Dropbox/Developer/Twisted/hyperlink/.tox/py36/lib/python3.6/site-packages/idna/core.py", line 303, in ulabel
    check_label(label)
  File "/Users/wsanchez/Dropbox/Developer/Twisted/hyperlink/.tox/py36/lib/python3.6/site-packages/idna/core.py", line 253, in check_label
    raise InvalidCodepoint('Codepoint {0} at position {1} of {2} not allowed'.format(_unot(cp_value), pos+1, repr(label)))
idna.core.InvalidCodepoint: Codepoint U+2192 at position 1 of '→example' not allowed

Looks like the idna thinks that this is not a valid host name, because you can't have that kind of funky character at the start of the string:

>>> idna.decode("foo-xn--example-dk9c.com")
'foo-xn--example-dk9c.com'

That suggests the test is wrong?

@mahmoud
Copy link
Member

mahmoud commented Jun 29, 2017

It may well be! I don't recall writing it, so it's probably from Twisted's doctests which were probably just tested against the built-in behavior. I'm working on a local branch to get the decoding straightened out, and once we have correct behavior we'll see about profiling and idna performance possibilities.

@wsanchez
Copy link
Contributor Author

Cool, thanks for looking into it

@wsanchez
Copy link
Contributor Author

poke… this is challenging to work around when you hit it (for me, anyway).

@mahmoud
Copy link
Member

mahmoud commented Sep 16, 2017

Acknowledged. Alas, if only core python had used hypothesis... Anyways, I'm going to rename this slightly to represent the domain name/IDNA-specific-ness of this error and once #39 is out of the way, hopefully we can integrate the idna package soon.

@mahmoud mahmoud changed the title ℅ does not encode as a URI ℅ does not encode as a domain name (Python's built-in idna encoding is insufficient) Sep 16, 2017
@wsanchez
Copy link
Contributor Author

Awesome, thanks. I'll try to file a bug against Python next week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants