Add javadoc to org.springframework.web.util.UrlParser to indicate that it should only be used with modern browsers, not anything else #33542
Comments
Thanks for this feedback, @joakime. Could you elaborate on the limitations of using this parser for REST clients? What kind of valid URLs wouldn't be parsed correctly with this parser in the context of REST clients? Thanks! |
Take for example a URL with no hier-part, or no authority.
The UrlParser sees that as a host with Here's what java.net.URI does ...
Here's what java.net.URL does ...
Those built-in parsers, along with the existing URL / URI parsers in various Servlet libraries, follow the spec in RFC3986, which parses that as a URL with no authority, just a path. WhatWG is good if you want to follow browser behaviors, but not good outside of that limited scope. There are more examples than just this one, but just know that WhatWG isn't a great choice for general internet behaviors: non-browser clients, HTTP hardware, security tooling, caching servers, load balancers, etc. |
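To make the "no authority, just a path" reading concrete, here is a minimal sketch using java.net.URI. The input string is a hypothetical stand-in chosen for illustration (a scheme followed by a single slash); it is not the example from the comment above, which did not survive in this copy of the thread. The class and method names are likewise made up for this sketch.

```java
import java.net.URI;

public class NoAuthorityExample {

    /** Returns {scheme, authority, host, path} as reported by java.net.URI. */
    static String[] components(String input) {
        URI uri = URI.create(input);
        return new String[] { uri.getScheme(), uri.getAuthority(), uri.getHost(), uri.getPath() };
    }

    public static void main(String[] args) {
        // Hypothetical input: "http:" followed by one slash, so there is no "//authority" part.
        String[] parts = components("http:/example.com/path");
        System.out.println("scheme    = " + parts[0]);
        System.out.println("authority = " + parts[1]);
        System.out.println("host      = " + parts[2]);
        System.out.println("path      = " + parts[3]);
    }
}
```

With this input, java.net.URI reports no authority and no host; the text after the scheme is treated entirely as the path.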
There are many decisions in WhatWG that are designed to "clean up" bad behaviors from users, typos and whatnot (eg: eliminate duplicates, normalize away extra slashes, eliminate whitespace, etc). Those are great choices for a browser dealing with HTML and JavaScript, but not appropriate outside of a browser. |
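For contrast with the clean-up behavior described above, the sketch below shows only what java.net.URI does with two hypothetical inputs: one containing a raw space and one containing repeated slashes. It makes no claim about what UrlParser returns for the same strings; the class and helper names are assumptions made for this illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class StrictParsingExample {

    /** True if java.net.URI accepts the input without throwing. */
    static boolean accepts(String input) {
        try {
            new URI(input);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    /** Path exactly as java.net.URI reports it; normalize() is deliberately not called. */
    static String rawPath(String input) {
        return URI.create(input).getRawPath();
    }

    public static void main(String[] args) {
        // A raw space is rejected rather than repaired.
        System.out.println(accepts("http://example.com/a b"));
        // Repeated slashes are preserved in the parsed path.
        System.out.println(rawPath("http://example.com//a///b"));
    }
}
```

The strict parser neither repairs nor rewrites the input: the first string is rejected and the second keeps its path as written.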
Thanks for reaching out. This is an area we've been thinking about after CVE-2024-22243, CVE-2024-22259, and CVE-2024-22262, all of which relate to misaligned parsing choices between server and client.
For the specific example, the RFC syntax defines several choices for
The CVEs highlighted this for us, and it's how we came across the WhatWG parsing algorithm, which is backed by an extensive set of test cases and provides an opportunity to align with browsers, where security issues originate most of the time. The extensive test cases show just how massive the issue with parsing malicious input is, and the WhatWG algorithm provides a basis for alignment. There is little chance for clients and intermediaries to agree otherwise, and I'm sure they don't in many cases. I'm interested in your feedback, but so far this looks like a good direction to us.
For documentation, I think it would be a challenge to advise using UrlParser for browser clients only. It's not always easy to know the source of a URL that's being parsed, and even if it were known, it's still not possible to know what parsing choices the client would make, and how to align with them. We could mention that the algorithm is based on WhatWG and that it leniently parses many URLs that deviate from the spec and create ambiguity. |
The RFC is very clear on this specific example; it is not an ambiguous, or even a vague, URI. Ambiguous URIs do exist, and are documented in RFC3986; most of those are due to path section anomalies. (Anomalies that WhatWG does not address, but most servers do.)
Interestingly, this topic, the differences in URI handling between the RFC (used by all internet hardware, nearly all internet servers, all non-browser user-agents) and WhatWG (used only by browsers), has come up recently on the ietf-http-wg
So far, only the browser vendors have advocated for WhatWG; the rest are still on RFC3986. |
I think it is clear that it is invalid. I gave specific reasons. Could you clarify why you think otherwise? Thanks for the pointer to the IETF discussion. I will check it out. |
The example you linked to RFC3986: Section 5.3 is for recomposition of a URI from components, not parsing it. The example URI is The overall ABNF is ...
The
... which makes the example URI scheme Next is the raw Next is the
The example URI of We now parse the Next, in our example, we have passed the authority (it was empty), and now we have to parse the remaining characters
So we have a parsed URI path of The If there was no |
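The component-by-component walk described above can be approximated with the regular expression that RFC 3986 gives in Appendix B for splitting a URI reference into its parts. The sketch below is a simplification of the full ABNF, and the two input strings are hypothetical; they were chosen to show the difference between an empty authority ("//" followed directly by the path) and an absent one. The class and method names are assumptions made for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Rfc3986Split {

    // Regular expression from RFC 3986, Appendix B.
    private static final Pattern URI_PARTS = Pattern.compile(
            "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?");

    /** Returns {scheme, authority, path, query, fragment}; null marks an absent component. */
    static String[] split(String input) {
        Matcher m = URI_PARTS.matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a URI reference: " + input);
        }
        return new String[] { m.group(2), m.group(4), m.group(5), m.group(7), m.group(9) };
    }

    public static void main(String[] args) {
        // Hypothetical inputs, not the URI discussed in the thread.
        for (String input : new String[] { "http:///path/only", "http:/path/only" }) {
            String[] p = split(input);
            System.out.println(input + " -> scheme=" + p[0] + ", authority=" + p[1]
                    + ", path=" + p[2] + ", query=" + p[3] + ", fragment=" + p[4]);
        }
    }
}
```

In the first input the authority group matches the empty string, while in the second it does not participate at all and comes back as null. In both cases everything after the scheme and the optional "//" ends up in the path, never in a host.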
A collection of related thoughts.
Browser handling of URIs has been inconsistent every time I've looked at it.
While I have seen statements that there are issues with RFC 3986 (and hence the need for another URI spec) I haven't yet seen a valid example of any such issue.
The Servlet spec references (and Tomcat follows) RFC 3986. Tomcat does have
In hindsight, I'm not sure I advocated for the correct approach to path parameters with the Servlet spec as it creates some issues with reverse proxies. It might have been better to say path parameters are not removed and are treated like the rest of the segment for mapping purposes. But we are where we are.
I would expect to see
Getting back to the original point, I'd agree that some sort of warning that a URL parser doesn't follow RFC 3986 is a good idea. |
Some other thoughts to try to convey in the javadoc ...
The WhatWG URL document is good if you are writing a user-agent (especially so if you are dealing with user-provided URLs from things like the Location bar, user-edited configuration, HTML, and JavaScript).
RFC3986 is what you use if you want to use internet protocols that use URLs/URIs (eg: HTTP).
The WhatWG URL document is not a spec, and is subject to wild changes as the living document updates; what works today is not guaranteed to work in 2 years' time. (This is the reality of the WhatWG URL document over its history. Even the test cases have changed dramatically: what was once allowed no longer is, and vice versa.)
IMO the WhatWG URL document is trying to do too much.
RFC3986 addresses points 2 and 3, but not 1. |
Only a minor point but having looked more carefully at RFC 3986 I have reached the conclusion the |
For
This is very useful context. Greg's comment in particular matches the concerns in #33542 (comment).
This is not always easy to separate. For our own purposes, we parse only in our clients (RestClient, WebClient, etc), and when processing forwarded headers. We have no knowledge of where the string came from. More broadly, our URI parsing can be used in applications, in other Spring projects like Spring Security, or in any other framework that chooses to use it. The URI string may have been passed through the query or the request body; it may be parsed for validation, and then used in a redirect or included in a response to a browser, which leads to security issues. Given the lack of certainty about how the URI string was prepared and whether it is malicious or not, if you can't be 100% certain and you don't want to reject it, then you need to parse it leniently, and for that WhatWG at least offers something to align with. Other clients with strict parsing will reject invalid URLs, and at least that won't lead to SSRF or other attacks.
That sounds reasonable to me, but at the moment it's the only alternative. I hope there is eventually some movement in this space. We'll discuss this as a team to decide on the way forward. There is still some time before the 6.2 release, so thanks for the timely feedback. I'll share more on that. |
@joakime we are going to add an RFC-based parser, taking inspiration from Jetty's |
Affects: Any with UrlParser
The recently added org.springframework.web.util.UrlParser is not spec compliant outside of the limited scope of modern browsers. The living URL document at WhatWG is incompatible with the IETF URI spec, Java itself, the Servlet spec, and various other non-browser use cases.
Users that want to use the new UrlParser should not be using it for non-browser use cases (eg: REST clients).