hc-dev mailing list archives

From sebb <seb...@gmail.com>
Subject Re: URLEncodeUtils - change in format behaviour since 4.2
Date Tue, 26 Jun 2012 15:20:35 GMT
On 26 June 2012 15:45, Oleg Kalnichevski <olegk@apache.org> wrote:
> On Tue, 2012-06-26 at 15:21 +0100, sebb wrote:
>> On 26 June 2012 14:37, Oleg Kalnichevski <olegk@apache.org> wrote:
>> > On Tue, 2012-06-26 at 14:21 +0100, sebb wrote:
>> >> On 26 June 2012 13:33, Oleg Kalnichevski <olegk@apache.org> wrote:
>> >> > On Tue, 2012-06-26 at 11:41 +0100, sebb wrote:
>> >> >> On 26 June 2012 08:46, Oleg Kalnichevski <olegk@apache.org> wrote:
>> >> >> > On Tue, 2012-06-26 at 02:00 +0100, sebb wrote:
>> >> >> >> The escaping of non-alphabetic characters by the format methods
>> >> >> >> is no longer quite the same as that done by
>> >> >> >> java.net.URLEncoder.encode.
>> >> >> >>
>> >> >> >> The former allows the chars in ".-*_!'()" to pass through without
>> >> >> >> conversion, whereas the latter only allows ".-*_" unchanged.
>> >> >> >> The latter is also how browsers behave when escaping form fields.
>> >> >> >>
>> >> >> >> I think the behaviour should be consistent with URLEncoder and
>> >> >> >> browsers.
>> >> >> >> That was in fact the behaviour with 4.2, which delegated the
>> >> >> >> escaping to URLEncoder.
>> >> >> >> I think the code should revert to using URLEncoder/URLDecoder.
>> >> >> >>
>> >> >> >> There is still a need for the extended path, query and fragment
>> >> >> >> escape/unescape methods, but perhaps these belong in URIBuilder?
>> >> >> >> If not, maybe they should be in a separate class anyway?
>> >> >> >>
>> >> >> >
>> >> >> > Would not that lead to inconsistent behavior when the same query
>> >> >> > form gets encoded differently depending on whether it is enclosed
>> >> >> > in the request URI or in the request body?
>> >> >>
>> >> >> I don't think so, I think encodeFormFields could use a different safe
>> >> >> character set without problems, so long as the safe set is a subset of
>> >> >> all possible safe query characters. In fact the UNRESERVED BitSet is
>> >> >> only currently used in URLEncodedUtils#encodeFormFields(), so I don't
>> >> >> see how changing encodeFormFields to use a different safe set can
>> >> >> affect anything.
>> >> >>
>> >> >> Besides, AFAIK 4.2 did not have a problem with using a more limited
>> >> >> safe set.
>> >> >>
>> >> >> > Browsers do a lot of silly stuff to maximize compatibility with all
>> >> >> > sorts of broken software out there. I am not sure we need to do
>> >> >> > likewise.
>> >> >>
>> >> >> Well-written software will be able to deal with form data that has
>> >> >> some additional safe characters encoded, so I don't think there is any
>> >> >> problem in playing safe here.
>> >> >>
>> >> >> [But if we do decide to change the safe list from the one previously
>> >> >> used, it needs to be flagged up in the release notes.]
>> >> >>
>> >> >
>> >> > Likewise well-written software should be able to deal with the form data
>> >> > containing valid URL encoded content. To me this is more about doing the
>> >> > right thing rather than making sure some broken code is unaffected.
>> >>
>> >> http://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1
>> >> says that reserved chars are to be encoded as per RFC 1738 section 2.2.
>> >>
>> >> This implies that the safe set of chars is "$-_.+!*'()," plus "=" as
>> >> it is reserved for the delimiter.
>> >> 4.2.1 doesn't currently allow "$", so arguably is not "doing the right
>> >> thing" anyway.
>> >>
>> >
>> > RFC 1738 was superseded by RFC 2396 (which is what java.net.URI is based
>> > on and this is what we ought to use as a basis as well). RFC 2396
>> > clearly states "$" is one of the reserved characters.
>> >
>> > ---
>> > 2.2. Reserved Characters
>> >
>> >   Many URI include components consisting of or delimited by, certain
>> >   special characters.  These characters are called "reserved", since
>> >   their usage within the URI component is limited to their reserved
>> >   purpose.  If the data for a URI component would conflict with the
>> >   reserved purpose, then the conflicting data must be escaped before
>> >   forming the URI.
>> >
>> >      reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
>> >                    "$" | ","
>>
>> But AFAIK "$" is not reserved within form data (or a general query),
>> so does not need to be escaped.
>> Also "~" is not reserved, but is escaped by browsers and 4.2 and 4.2.1.
>>
>
> Are you sure about 4.2.1? As far as I can tell it should not as it is
> clearly included in the UNRESERVED set.

My bad, "~" is treated as safe by 4.2.1.
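
[Editorial sketch, not part of the original message: for readers who want
to check the java.net.URLEncoder side of the comparison discussed in this
thread, the snippet below encodes a few punctuation characters with the
JDK class. The class name UrlEncoderSafeChars and the helper encode() are
made up for illustration; only URLEncoder.encode(String, String) is a real
JDK call. It shows nothing about HttpClient's own format methods.]

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical demo class; not part of HttpClient.
class UrlEncoderSafeChars {

    // Encode a string as application/x-www-form-urlencoded using UTF-8.
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encode(".-*_")); // .-*_  (passed through unchanged)
        System.out.println(encode("!'()")); // %21%27%28%29
        System.out.println(encode("~$"));   // %7E%24
        System.out.println(encode("a b"));  // a+b  (space becomes '+')
    }
}
```

[So with URLEncoder only ".-*_" survive among the punctuation characters
tried here, while "!'()", "~" and "$" are all percent-escaped.]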

>> More fun: RFC 2396 is superseded by RFC 3986.
>> The lists of allowable characters for path and query have not changed,
>> but the reserved list is now larger.
>> The only unreserved characters are now ".-_~", i.e. "!'()*" are now
>> reserved (as are "#[]") ...
>>
>
> I am aware of RFC 2396 having been superseded by RFC 3986. However as
> long as we target Java 1.5 as the minimal runtime level, we should stick
> to the same compliance level as the java.net.URI, which is RFC 2396 for
> Java 1.5.

OK.

BTW Java 1.6 URI still references 2396.
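
[Editorial sketch, not part of the original message: a small way to
observe the RFC 2396 style quoting of java.net.URI mentioned above. The
class name Rfc2396QueryDemo, the helper build(), the host example.com and
the path /p are placeholders chosen for illustration. The five-argument
URI constructor quotes only characters that are not legal URI characters,
so "$", "!" and "~" are left alone in the query while a space is escaped.]

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo class; host and path are placeholders.
class Rfc2396QueryDemo {

    // Build a URI from components and let java.net.URI quote the query.
    static String build(String query) {
        try {
            return new URI("http", "example.com", "/p", query, null)
                    .toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // "$", "!" and "~" are legal in a query; the space is not.
        System.out.println(build("a=$!~ b")); // http://example.com/p?a=$!~%20b
    }
}
```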


> Oleg
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@hc.apache.org
> For additional commands, e-mail: dev-help@hc.apache.org
>
