hc-dev mailing list archives

From "Gordon Mohr (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HTTPCLIENT-587) derelativizing of relative URIs with a scheme is incorrect
Date Fri, 16 Jun 2006 21:47:31 GMT
    [ http://issues.apache.org/jira/browse/HTTPCLIENT-587?page=comments#action_12416592 ] 

Gordon Mohr commented on HTTPCLIENT-587:

> What's wrong with the JDK URI class?

(a) It still has bugs where it fails to implement the spec as well as httpclient.URI does. One
recent example, still a problem in current JDK 1.6 betas:


java.net.URI base = new java.net.URI("http://www.example.com/some/page");
java.net.URI rel = new java.net.URI("");
java.net.URI derel = base.resolve(rel);
(java.lang.String) http://www.example.com/some/   // INCORRECT

org.apache.commons.httpclient.URI base = new org.apache.commons.httpclient.URI("http://www.example.com/some/page");
org.apache.commons.httpclient.URI rel = new org.apache.commons.httpclient.URI("");
org.apache.commons.httpclient.URI derel = new org.apache.commons.httpclient.URI(base,rel);
(java.lang.String) http://www.example.com/some/page  // CORRECT
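
For code that has to stay on java.net.URI, one possible workaround for the empty-reference case is to special-case it before delegating to resolve(). The following is only a sketch under that assumption, not something taken from HttpClient or the JDK; the names EmptyRefResolver and resolve are invented for illustration. It treats an empty reference as the base document itself, minus any fragment.

```java
import java.net.URI;

// Hypothetical helper, not part of the JDK or HttpClient.
final class EmptyRefResolver {

    // Resolves ref against base. An empty reference is taken to mean
    // "the base document itself, without its fragment"; everything
    // else is left to java.net.URI.
    static URI resolve(URI base, String ref) {
        if (ref.isEmpty()) {
            String s = base.toString();
            int hash = s.indexOf('#');
            return URI.create(hash < 0 ? s : s.substring(0, hash));
        }
        return base.resolve(ref);
    }

    public static void main(String[] args) {
        URI base = URI.create("http://www.example.com/some/page");
        System.out.println(resolve(base, ""));      // http://www.example.com/some/page
        System.out.println(resolve(base, "other")); // http://www.example.com/some/other
    }
}
```

This only covers the one case shown above; other resolution differences would need their own handling.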

(b) java.net.URI and its maintainers reject the idea that the URI class should offer any
facility for tolerating the kinds of formal spec deviations often seen in real URIs
and domain names.

As one example, domain names with '_' are technically illegal but have often been tolerated
by DNS-related software and we have run across functioning websites at subdomains with '_'
in their name. Browsers browse these sites fine, so we want to be able to crawl them. java.net.URI
can't help us.
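
To make the '_' case concrete: as far as I know, java.net.URI does not reject such a name outright, but it declines to treat the authority as a server and returns null from getHost(). The sketch below shows that, plus a hand-rolled fallback. UnderscoreHostDemo, hostOf and tolerantHostOf are hypothetical names, and the fallback is an assumption about what "tolerant" could mean here, not anything HttpClient does.

```java
import java.net.URI;

// Hypothetical demo class, not part of the JDK or HttpClient.
final class UnderscoreHostDemo {

    // Host as reported by java.net.URI; null when the authority was
    // not recognized as a server-based authority.
    static String hostOf(String uri) {
        return URI.create(uri).getHost();
    }

    // Tolerant fallback (an assumption for illustration): use the raw
    // authority and strip userinfo and port by hand.
    // Does not attempt to handle IPv6 literals.
    static String tolerantHostOf(String uri) {
        URI u = URI.create(uri);
        if (u.getHost() != null) {
            return u.getHost();
        }
        String auth = u.getRawAuthority();
        if (auth == null) {
            return null;
        }
        int at = auth.lastIndexOf('@');
        if (at >= 0) {
            auth = auth.substring(at + 1);
        }
        int colon = auth.lastIndexOf(':');
        if (colon >= 0) {
            auth = auth.substring(0, colon);
        }
        return auth;
    }

    public static void main(String[] args) {
        String u = "http://foo_bar.example.com/index.html";
        System.out.println(hostOf(u));
        System.out.println(tolerantHostOf(u));
    }
}
```

Every caller that needs the host has to go through such a helper, which is the kind of friction described above.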

Now of course, it's legitimate and useful to provide a class which rigorously implements all
written standards. Not everyone wants a class that also tolerates de facto practices. But
that leads us to the ultimate problem with java.net.URI:

(c) java.net.URI licensing and language declarations make it resistant to reuse and adaptation
to other legitimate uses.

It's not open source and major portions of its implementation are 'private' or 'final'. So
it's impossible to reuse 99% of it (such as its various RFC syntax character-class definitions,
fields, and working parsing code) while also either patching bugs like the one in (a) above or
overriding the strictness that makes it unsuitable for purposes like those in (b) above.

In comparison, the org.apache.commons.httpclient.URI class is friendly to subclassing (which
we've used to work around bugs and change the behavior to better fit our problem domain) and
if that didn't work with respect to a bug, we'd at least have the option of patching it ourselves
and redistributing the fix. 

So our project would very much miss the pretty-good (and at least serviceable when broken)
httpclient.URI class if it were dropped in favor of the JDK java.net.URI class. 

> Have you looked at HttpCore?

Only a little. Until it has an official test release, and comes close to matching the HttpClient
facilities for cookies, URIs, etc., it probably won't be suitable to replace our HttpClient
3.x use.

(The ability to issue unvalidated request strings would be useful -- but we've already patched
this into HttpClient 3.x to the extent we need it. Also, we still need to perform best-effort,
highly-tolerant parsing of URIs into their traditional constituent parts for various decisions
and kinds of analysis.)

> derelativizing of relative URIs with a scheme is incorrect
> ----------------------------------------------------------
>          Key: HTTPCLIENT-587
>          URL: http://issues.apache.org/jira/browse/HTTPCLIENT-587
>      Project: Jakarta HttpClient
>         Type: Bug

>     Versions: 3.0.1
>     Reporter: Gordon Mohr

> URI constructor "public URI(URI base, URI relative) throws URIException" assumes that
> if given 'relative' URI has a scheme, it should provide an authority and complete path to
> the constructed URI. However, a URI can have a scheme but still be relative, requiring the
> authority and base path of the 'base' URI.
> Demonstration code:
> URI base = new URI("http://www.example.com/some/page");
> URI rel = new URI("http:boo");
> URI derel = new URI(base,rel);
> derel.toString();
> (java.lang.String) http:boo
> In fact, derel should be "http://www.example.com/some/boo". 
> RFC2396 is a little confused about this; section 3.1 states "Relative URI references
> are distinguished from absolute URI in that they do not begin with a scheme name." But, in
> section 5, there are several sentences talking about relative URIs that begin with schemes
> (and how this prevents using relative URIs that have leading path segments that look like
> scheme identifiers).
> RFC3986, which supersedes RFC2396, removes the implication a relative URI cannot begin
> with a scheme, leaving the other text explicitly discussing relative URIs with schemes.
> Both Firefox (1.5) and IE (6.0) treat "http:boo" the same as "boo" for purposes of derelativization
> against an HTTP base URI, which would give the final URI "http://www.example.com/some/boo"
> in the example above.
> Even relative URIs like "http:../../boo" are explicitly legal. 
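
The browser behavior described in the quoted report (treating "http:boo" like "boo" when the base is also http) can be approximated on top of java.net.URI by dropping a leading scheme that matches the base's scheme before resolving. This is a rough sketch under that assumption, not the fix applied to HttpClient; LenientSchemeResolver and resolveLenient are names invented here.

```java
import java.net.URI;

// Hypothetical helper, not part of the JDK or HttpClient.
final class LenientSchemeResolver {

    // If ref starts with the same scheme as base (compared without
    // regard to case), strip "scheme:" so the rest is resolved as an
    // ordinary relative reference. Other references are untouched.
    static URI resolveLenient(URI base, String ref) {
        String scheme = base.getScheme();
        if (scheme != null) {
            String prefix = scheme + ":";
            if (ref.regionMatches(true, 0, prefix, 0, prefix.length())) {
                ref = ref.substring(prefix.length());
            }
        }
        return base.resolve(ref);
    }

    public static void main(String[] args) {
        URI base = URI.create("http://www.example.com/some/page");
        System.out.println(resolveLenient(base, "http:boo")); // http://www.example.com/some/boo
        System.out.println(resolveLenient(base, "ftp:boo"));  // ftp:boo
    }
}
```

Dot-segment handling is still left to java.net.URI, so a reference such as "http:../../boo" that climbs above the root of the base path may not come out the way the RFCs or the browsers would produce it.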
