Skip to content

GH-126367: Fix urllib.request.url2pathname() colon handling in URLs on Windows #127752

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

samtstephens
Copy link

@samtstephens samtstephens commented Dec 9, 2024

@bedevere-app
Copy link

bedevere-app bot commented Dec 9, 2024

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

url = url.replace(':', '|')
if not '|' in url:
url = url.replace(':', '|', 1)
if not '|' in url or ('|' in url and not (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this not be simplified? I had a hard time understanding the code:

url = url.replace(':', '|', 1)
if not '|' in url or ('|' in url and not (
    url.startswith('|') # error case
    or '/|' in url      # error case
    or (len(url) > 2 and url[0] == '/' and url[1].isalpha() and url[2] == '|')
)):

@barneygale barneygale self-assigned this Dec 10, 2024
Copy link
Contributor

@barneygale barneygale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a look at this

It's perhaps worth taking a step back to think about what we're aiming to do here, and why Windows drive paths need special handling.

The simplest way to create a file: URI is to prepend file: to a path, so:

  • foo/bar.txt becomes file:foo/bar.txt (relative file URI, not widely supported)
  • c:/foo becomes file:c:/foo (Windows)
  • /etc/hosts becomes file:/etc/hosts (POSIX)

For conformance with RFC 1738, folks began adding an empty authority section for absolute POSIX paths, so:

  • /etc/hosts becomes file:///etc/hosts (or file://localhost/etc/hosts more explicitly)

This works nicely on POSIX because absolute paths must start with a slash, just like the path component of URLs.

But what about Windows? A path like C:/foo can't be directly mapped to a URL path component because it doesn't start with a slash. The solution folks came up with: just add a slash prefix!

  • C:/foo becomes file:///C:/foo (or file://localhost/C:/foo more explicitly)

Our task in this PR is to undo that operation. We need to:

  1. Detect when a URL's path component begins with a slash and then a Windows drive, like /A:/foo/bar or /B|/doo
    • Remove the leading slash if so
  2. If second character is |, convert it to :

Once those steps are done, all that remains is to convert slashes and unquote URL entities.

As far as detecting a Windows drives goes, I suggest having a look at the implementation of os.path.splitroot() or os.path.isabs() in ntpath.py:

cpython/Lib/ntpath.py

Lines 80 to 95 in 5c89adf

def isabs(s):
"""Test whether a path is absolute"""
s = os.fspath(s)
if isinstance(s, bytes):
sep = b'\\'
altsep = b'/'
colon_sep = b':\\'
double_sep = b'\\\\'
else:
sep = '\\'
altsep = '/'
colon_sep = ':\\'
double_sep = '\\\\'
s = s[:3].replace(altsep, sep)
# Absolute: UNC, device, and paths with a drive and root.
return s.startswith(colon_sep, 1) or s.startswith(double_sep)

We don't need to check whether the drive letter is alphabetic IMHO. It's probably sufficient to check that 3rd character in the URL path is a colon or a pipe (after checking that the first character is a slash).

@bedevere-app
Copy link

bedevere-app bot commented Dec 10, 2024

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

And if you don't make the requested changes, you will be put in the comfy chair!

@barneygale
Copy link
Contributor

Don't worry about breaking the following test case - there's no reason for url2pathname() to be stricter about Windows drives than functions in ntpath.py. It would be good if url2pathname() didn't raise OSError at all.

# Non-ASCII drive letter
self.assertRaises(IOError, fn, "///\u00e8|/")

@barneygale
Copy link
Contributor

All this code smells funny and should be binned IMO:

# Windows itself uses ":" even in URLs.
url = url.replace(':', '|')
if not '|' in url:
# No drive specifier, just convert slashes
# make sure not to convert quoted slashes :-)
return urllib.parse.unquote(url.replace('/', '\\'))
comp = url.split('|')
if len(comp) != 2 or comp[0][-1] not in string.ascii_letters:
error = 'Bad URL: ' + url
raise OSError(error)
drive = comp[0][-1]
tail = urllib.parse.unquote(comp[1].replace('/', '\\'))
return drive + ':' + tail

@samtstephens
Copy link
Author

Hi, thank you for the detailed response, I really appreciate it!! We will look into these changes and we'll see if we can come up with a better solution. I can definitely say we felt a bit out of our depth, not having a lot of in-depth knowledge with how the files should look, so this background is appreciated.

@barneygale
Copy link
Contributor

You didn't pick the easiest issue to work on to be honest 😅 file URIs are a bit of a dark art...

@barneygale
Copy link
Contributor

Hey @samtstephens, is there anything I can do to help with this? Ta

@barneygale
Copy link
Contributor

I'm keen to get this sorted in time for python 3.14, so I've opened a PR here: #131428

@barneygale
Copy link
Contributor

Closing this PR as the bug has been fixed elsewhere. Thanks all the same for taking a look!

@barneygale barneygale closed this Mar 18, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants