URL crate is failing to parse these existing URLs #489
Comments
And one more note: all of them are marked as malware or phishing, so be careful when accessing them. Prefer the command line to a browser, or use a sandboxed instance... ;-) |
These URLs are reported as invalid because they violate the rules for "Right-to-Left Scripts for Internationalized Domain Names" specified in RFC-5893.
So, in the URL "http://mail.163.com.xn----9mcjf9b4dbm09f.com/iloystgnjfrgthteawvo/indexx.php" the domain name is a Bidi domain, and it contains the label "163", which violates Rule #1, so the URL is reported as invalid. But actually, I think there is some misinterpretation of the standard. Yes, the label "163" violates the Bidi rules, but the following paragraph of Section 2 of RFC-5893 says the following:
So, in order to be a valid Bidi domain name, all labels in the domain name should satisfy the Bidi rules, with the only exception that an LDH label (a label that contains only ASCII letters, digits and hyphens) can start with a European digit if it comes before any RTL label.
Section 7, "Compatibility Considerations", is the most straightforward
So, in general it is not recommended (though not strictly forbidden) to use RTL labels as a subdomain of a digit-leading label, but not vice versa. But another issue is that while RFC-5893 does allow labels starting with a digit, the "Unicode IDNA Compatibility Processing" standard, which is the basis for this crate's implementation, provides publicly available test files containing some domain names that are marked as violating Bidi rule #1, while they should be considered valid according to RFC-5893. For example, the domain "0a.xn--4db" is part of the IDNA tests as a domain that violates Bidi rule #1, while according to RFC-5893 it should be considered valid because a label starting with a digit precedes an RTL label. I tried to look at other implementations of UTS46, for example the Python implementation. Anyway, I personally think that practicality should outweigh strictly following the specification, and if the current implementation considers existing domains invalid, it's better to relax some parsing restrictions. |
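To make the two readings discussed above concrete, here is a minimal Rust sketch. It is not how the url or idna crates implement the check. It assumes labels are already decoded to Unicode (no punycode handling), it approximates "RTL character" by a few Hebrew/Arabic-area code point ranges instead of real Unicode Bidi classes, and it only looks at Bidi rule #1. All function names are made up for illustration.

```rust
/// Crude stand-in for Bidi classes R/AL (assumption: a handful of
/// right-to-left script ranges, not the real Unicode Bidi_Class data).
fn is_rtl_char(c: char) -> bool {
    matches!(c as u32, 0x0590..=0x08FF | 0xFB1D..=0xFDFF | 0xFE70..=0xFEFF)
}

/// A label counts as RTL here if it contains any RTL character.
fn is_rtl_label(label: &str) -> bool {
    label.chars().any(is_rtl_char)
}

/// LDH label: only ASCII letters, digits and hyphens.
fn is_ldh_label(label: &str) -> bool {
    !label.is_empty() && label.chars().all(|c| c.is_ascii_alphanumeric() || c == '-')
}

/// A domain is treated as a Bidi domain if at least one label is RTL.
fn is_bidi_domain(domain: &str) -> bool {
    domain.split('.').any(is_rtl_label)
}

/// Approximation of Bidi rule #1: the first character must be L, R or AL.
/// Here "L" is narrowed to ASCII letters for simplicity.
fn passes_rule1(label: &str) -> bool {
    match label.chars().next() {
        Some(c) => c.is_ascii_alphabetic() || is_rtl_char(c),
        None => false,
    }
}

/// Strict reading: in a Bidi domain, every label must pass rule #1.
fn strict_ok(domain: &str) -> bool {
    !is_bidi_domain(domain) || domain.split('.').all(passes_rule1)
}

/// Relaxed reading from the comment above: a digit-leading LDH label is
/// tolerated as long as it comes before any RTL label.
fn relaxed_ok(domain: &str) -> bool {
    if !is_bidi_domain(domain) {
        return true;
    }
    let mut seen_rtl = false;
    for label in domain.split('.') {
        if is_rtl_label(label) {
            seen_rtl = true;
        }
        if passes_rule1(label) {
            continue;
        }
        let digit_leading_ldh =
            is_ldh_label(label) && label.starts_with(|c: char| c.is_ascii_digit());
        if !digit_leading_ldh || seen_rtl {
            return false;
        }
    }
    true
}
```

With a Hebrew letter standing in for a decoded RTL label, `strict_ok("0a.\u{05D0}")` is false while `relaxed_ok("0a.\u{05D0}")` is true, and both functions reject the reversed order `"\u{05D0}.0a"`.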
Newer versions of https://www.unicode.org/reports/tr46/#ToASCII take a Regarding practicality, if we have reason to believe the spec makes a particular choice wrong, it’s preferable to file a bug on the spec to discuss changing it (especially if it helps align with other implementations) rather than just unilaterally making a different choice. |
I'm getting confused here a bit. If I understand the previous comment, the old spec makes these URLs valid but not recommended and suggests not allowing their registration. But if I have to work with existing URLs, I can't un-register them so it makes sense to allow them. If I'm a registrar, I might decide not to allow them. The new version would disallow them, but then, what happens to the existing-but-not-recommended URLs? Do I understand it correctly? |
Not exactly. RFC-5893 says that URLs containing labels that start with a digit are absolutely valid if such a label comes before any RTL label, while domain names that contain digit-leading labels after an RTL label should be forbidden for registration. Also, I'm not sure about the terms "new" and "old" spec. As far as I know, RFC-5893 is the current spec for bidirectional internationalized domain names, and it hasn't been replaced by any newer spec. |
I don't want to sound demanding, or anything, but is there some way forward on this? This seems to be somewhat stuck. If it's a matter of manpower (e.g. needing to do some specific research, or implementing), I can try doing it, since I'm the one who needs it. Even closing as wontfix would probably be better than this uncertainty, as I would know I have to look for a different solution to my problem. Thanks |
What’s needed is research to find out whether an up-to-date (which this crate is known not to be: #290, #163) and correct implementation of https://url.spec.whatwg.org/ would also reject those URLs.
|
Regardless of whether the crate is compliant with the current URL spec and whether the current URL spec should allow certain domain names, I think the crate needs an override mode in which it will accept any domain name that is technically possible, because we know for a fact that noncompliant domain names do exist in the real DNS. Google isn't going to stop using host names of the form |
“Need” is an interesting word. I’ll just leave this here: https://users.rust-lang.org/t/help-wanted-maintaining-rust-url/10707 |
I'll try to find some time to help. This post sheds new light on why pull request #497 sits there… at the time I thought there was disagreement about it and discussion happening here, not that you'd want someone else to take over the review. It'll probably take some time, though (I need to get familiar with the crate internals). Thanks for letting it be known. |
I've tried digging through the standards. It seems to me that Section 2 of RFC-5893 disallows this URL. The bullet point says that if one allowed such a domain as the one here, it would still satisfy the goals in Section 3, but the rules don't say one should allow them. The rules are referenced as "All six numbered properties in section 2 must be satisfied", which rules out using the bullet point even if it said it should be allowed. Sections 5 and 7 are not normative. However, the described bullet point and the text in Section 7 kind of hint at the intention of allowing these URLs, even though the rules don't. I'll try to reach the authors of the RFC or find out whether there was some discussion about this already. |
This crate implements https://url.spec.whatwg.org/ (which in turn references https://www.unicode.org/reports/tr46/) rather than IETF RFCs |
Yes, but at depth 3 from there it references RFC-5893. From TR46:
|
Ah sorry, never mind my previous comment. |
See also https://github.com/whatwg/url/labels/topic%3A%20idna. This might warrant an additional URL Standard issue as these domain names do seem to work in browsers. Perhaps the URL Standard should take more direct ownership of these algorithms as UTR46 does not seem to cut it... |
So far, I haven't seen this reported as a Firefox bug, which suggests that non-phishing sites aren't relying significantly on being able to violate the current bidi rule. There's been discussion in the Unicode Technical Committee: https://www.unicode.org/L2/L2024/24064r2-utc179-properties-recs.pdf A change was not made in Unicode 16, but the issue remains pending. |
Safari now considers the inputs in OP to also not be URLs by the way. |
Hi, I'm looking at an issue where The context where this is being evaluated is SMTP, as an email domain, rather than as part of a URL, so this use case is possibly a little outside the scope of this crate. This particular domain has no MX or A records in DNS, so it's not a critical issue that this operation does not succeed for it, but I will note that if I I suppose that my question here is: is there a way to understand programmatically why, specifically, this operation fails, so that the context can be reported in a more actionable way? Are there alternative ways to use either this crate, or alternative crates, that might reduce the strictness and/or show more context on the underlying issue? I did also try:
    use idna::uts46::{AsciiDenyList, DnsLength, Hyphens, Uts46};

    let uts = Uts46::new();
    match uts.to_ascii(
        name.as_bytes(),
        AsciiDenyList::EMPTY,           // no extra ASCII characters denied
        Hyphens::Allow,                 // do not check hyphen placement
        DnsLength::VerifyAllowRootDot,  // check DNS length limits, allow a trailing dot
    ) {...}
which seemed like the most permissive way to operate this, but it had the same result. Thanks! |
Yeah, that is probably the same issue since it has RTL in it. The URL standard has a large number of failure cases, I think returning more specific errors for stuff like this may not be practical. This error ends up in a particular subclause of punycode processing, for example. There's not really any intention to make this crate support more than the WHATWG URL standard. |
Previously, we'd use the default presentation for the Error type reported by the idna crate. Unfortunately, that type holds no context: it is a zero-sized empty struct type, and it chooses to render itself as `Errors`, which isn't super helpful, but it is hard to convey every possible error clause from the underlying spec, so I understand why this is this way. This commit handles that error case to present a slightly more informative error message. It doesn't provide context on what specifically is bad about the input, but it at least helps to characterize that the domain is bad, rather than imply that we don't know how to deal with punycode at all. The specific problem with the domain in the included test case is most likely BIDI related per discussion in servo/rust-url#489
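The kind of error wrapping this commit message describes could look roughly like the sketch below. It is not the code from the commit. `InvalidDomain` and `with_context` are hypothetical names, and the helper is generic over the error type so that it can wrap any contextless error (such as the zero-sized `Errors` mentioned above) without relying on a particular idna API.

```rust
use std::fmt;

/// Hypothetical error type that keeps the offending domain for reporting.
#[derive(Debug, PartialEq)]
struct InvalidDomain {
    domain: String,
    /// True if any label starts with the "xn--" punycode prefix.
    has_punycode_label: bool,
}

impl fmt::Display for InvalidDomain {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "domain `{}` is not a valid IDNA name", self.domain)?;
        if self.has_punycode_label {
            // Only a hint: the underlying error does not say which rule failed.
            write!(f, " (it has punycode labels; a Bidi or other validity rule may have failed)")?;
        }
        Ok(())
    }
}

/// Replace a contextless conversion error with one that names the domain.
/// `result` is whatever the IDNA conversion returned; its error is discarded.
fn with_context<E>(domain: &str, result: Result<String, E>) -> Result<String, InvalidDomain> {
    result.map_err(|_| InvalidDomain {
        domain: domain.to_string(),
        has_punycode_label: domain.split('.').any(|label| {
            // `get` avoids a panic when byte 4 is not a character boundary.
            label.get(..4).map_or(false, |p| p.eq_ignore_ascii_case("xn--"))
        }),
    })
}
```

A caller would pass the result of its IDNA conversion as the second argument and print the returned error with `{}`. The message still cannot say which rule failed, only that the domain itself was rejected.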
Hi there, I have these five URLs that exist but can't be parsed by the url crate due to an invalid domain name. This issue is similar to #483 with subdomains ending in a - character, but these ones use punycode. All of them ping and respond.