hc-dev mailing list archives

From Oleg Kalnichevski <ol...@apache.org>
Subject Re: URLEncodeUtils - change in format behaviour since 4.2
Date Tue, 26 Jun 2012 14:45:38 GMT
On Tue, 2012-06-26 at 15:21 +0100, sebb wrote:
> On 26 June 2012 14:37, Oleg Kalnichevski <olegk@apache.org> wrote:
> > On Tue, 2012-06-26 at 14:21 +0100, sebb wrote:
> >> On 26 June 2012 13:33, Oleg Kalnichevski <olegk@apache.org> wrote:
> >> > On Tue, 2012-06-26 at 11:41 +0100, sebb wrote:
> >> >> On 26 June 2012 08:46, Oleg Kalnichevski <olegk@apache.org> wrote:
> >> >> > On Tue, 2012-06-26 at 02:00 +0100, sebb wrote:
> >> >> >> The escaping of non-alphabetic characters by the format methods is no
> >> >> >> longer quite the same as that done by java.net.URLEncoder.encode.
> >> >> >>
> >> >> >> The former allows the chars in ".-*_!'()" to pass through without
> >> >> >> conversion, whereas the latter only allows ".-*_" unchanged.
> >> >> >> The latter is also how browsers behave when escaping form fields.
> >> >> >>
> >> >> >> I think the behaviour should be consistent with URLEncoder and browsers.
> >> >> >> That was in fact the behaviour with 4.2, which delegated the escaping
> >> >> >> to URLEncoder.
> >> >> >> I think the code should revert to using URLEncoder/URLDecoder.
> >> >> >>
> >> >> >> There is still a need for the extended path, query and fragment
> >> >> >> escape/unescape methods, but perhaps these belong in URIBuilder?
> >> >> >> If not, maybe they should be in a separate class anyway?
> >> >> >>
> >> >> >
> >> >> > Would not that lead to inconsistent behavior when the same query form
> >> >> > gets encoded differently depending on whether it is enclosed in the
> >> >> > request URI or in the request body?
> >> >>
> >> >> I don't think so, I think encodeFormFields could use a different safe
> >> >> character set without problems, so long as the safe set is a subset of
> >> >> all possible safe query characters. In fact the UNRESERVED BitSet is
> >> >> only currently used in URLEncodedUtils#encodeFormFields(), so I don't
> >> >> see how changing encodeFormFields to use a different safe set can
> >> >> affect anything.
> >> >>
> >> >> Besides, AFAIK 4.2 did not have a problem with using a more limited safe set.
> >> >>
> >> >> > Browsers do a lot of silly stuff to maximize compatibility with all
> >> >> > sorts of broken software out there. I am not sure we need to do
> >> >> > likewise.
> >> >>
> >> >> Well-written software will be able to deal with form data that has
> >> >> some additional safe characters encoded, so I don't think there is any
> >> >> problem in playing safe here.
> >> >>
> >> >> [But if we do decide to change the safe list from the one previously
> >> >> used, it needs to be flagged up in the release notes.]
> >> >>
> >> >
> >> > Likewise well-written software should be able to deal with the form data
> >> > containing valid URL encoded content. To me this is more about doing the
> >> > right thing rather than making sure some broken code is unaffected.
> >>
> >> http://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1
> >> says that reserved chars are to be encoded as per RFC 1738 section 2.2.
> >>
> >> This implies that the safe set of chars is "$-_.+!*'()," plus "=" as
> >> it is reserved for the delimiter.
> >> 4.2.1 doesn't currently allow "$", so arguably is not "doing the right
> >> thing" anyway.
> >>
> >
> > RFC 1738 was superseded by RFC 2396 (which is what java.net.URI is based
> > on and this is what we ought to use as a basis as well). RFC 2396
> > clearly states "$" is one of the reserved characters.
> >
> > ---
> > 2.2. Reserved Characters
> >
> >   Many URI include components consisting of or delimited by, certain
> >   special characters.  These characters are called "reserved", since
> >   their usage within the URI component is limited to their reserved
> >   purpose.  If the data for a URI component would conflict with the
> >   reserved purpose, then the conflicting data must be escaped before
> >   forming the URI.
> >
> >      reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
> >                    "$" | ","
> 
> But AFAIK "$" is not reserved within form data (or a general query),
> so does not need to be escaped.
> Also "~" is not reserved, but is escaped by browsers and 4.2 and 4.2.1.
> 

Are you sure about 4.2.1? As far as I can tell it should not escape "~",
since that character is clearly included in the UNRESERVED set.
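
For reference, the java.net.URLEncoder behaviour that 4.2 is said above to
have delegated to can be checked with a short sketch like the one below. The
class and method names are made up for illustration; only
URLEncoder.encode(String, String) is a real JDK call. It shows nothing about
what 4.2.1 itself does with "~".

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper class, not part of HttpClient.
final class UrlEncoderSafeChars {

    // Encodes a string with java.net.URLEncoder using UTF-8
    // (application/x-www-form-urlencoded rules).
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory for every JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encode(".-*_"));    // .-*_  (passed through unchanged)
        System.out.println(encode("!'()~$"));  // %21%27%28%29%7E%24
        System.out.println(encode("a b"));     // a+b   (space becomes '+')
    }
}
```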

> More fun: RFC 2396 is superseded by RFC 3986.
> The lists of allowable characters for path and query have not changed,
> but the reserved list is now larger.
> The only unreserved characters are now ".-_~", i.e. "!'()*" are now
> reserved (as are "#[]") ...
> 

I am aware of RFC 2396 having been superseded by RFC 3986. However, as
long as we target Java 1.5 as the minimal runtime level, we should stick
to the same compliance level as java.net.URI, which is RFC 2396 for
Java 1.5.
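
To make the difference between the two compliance levels concrete, here is a
minimal sketch of a BitSet-driven percent-encoder with the unreserved sets
from RFC 2396 section 2.3 and RFC 3986 section 2.3. All names are invented
for the example and this is not the URLEncodedUtils code. It emits %20 for a
space rather than "+", and it uses StandardCharsets for brevity, so it needs
Java 7 or later.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Hypothetical illustration class, not the HttpClient implementation.
final class UnreservedSets {

    // RFC 2396 section 2.3: unreserved = alphanum | mark
    static final BitSet RFC2396_UNRESERVED = alphanumPlus("-_.!~*'()");

    // RFC 3986 section 2.3: unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
    static final BitSet RFC3986_UNRESERVED = alphanumPlus("-._~");

    // Builds a safe set of ASCII letters and digits plus the given extras.
    static BitSet alphanumPlus(String extra) {
        BitSet set = new BitSet(128);
        for (char c = 'a'; c <= 'z'; c++) set.set(c);
        for (char c = 'A'; c <= 'Z'; c++) set.set(c);
        for (char c = '0'; c <= '9'; c++) set.set(c);
        for (int i = 0; i < extra.length(); i++) set.set(extra.charAt(i));
        return set;
    }

    // Percent-encodes every UTF-8 byte that is not in the safe set.
    static String encode(String s, BitSet safe) {
        StringBuilder out = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;
            if (safe.get(v)) {
                out.append((char) v);
            } else {
                out.append('%');
                out.append(Character.toUpperCase(Character.forDigit(v >> 4, 16)));
                out.append(Character.toUpperCase(Character.forDigit(v & 0xF, 16)));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("!'()*~", RFC2396_UNRESERVED)); // !'()*~
        System.out.println(encode("!'()*~", RFC3986_UNRESERVED)); // %21%27%28%29%2A~
        System.out.println(encode("$", RFC2396_UNRESERVED));      // %24
    }
}
```

Swapping the BitSet is the only change needed to move between the two
levels, which is why the choice of safe set can be discussed separately from
the encoding loop.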

Oleg 



