From sebb <seb...@gmail.com>
Subject Re: URLEncodeUtils - change in format behaviour since 4.2
Date Tue, 26 Jun 2012 13:21:09 GMT
On 26 June 2012 13:33, Oleg Kalnichevski <olegk@apache.org> wrote:
> On Tue, 2012-06-26 at 11:41 +0100, sebb wrote:
>> On 26 June 2012 08:46, Oleg Kalnichevski <olegk@apache.org> wrote:
>> > On Tue, 2012-06-26 at 02:00 +0100, sebb wrote:
>> >> The escaping of non-alphabetic characters by the format methods is no
>> >> longer quite the same as that done by java.net.URLEncoder.encode.
>> >>
>> >> The former allows the chars in ".-*_!'()" to pass through without
>> >> conversion, whereas the latter only allows ".-*_" unchanged.
>> >> The latter is also how browsers behave when escaping form fields.
>> >>
>> >> I think the behaviour should be consistent with URLEncoder and browsers.
>> >> That was in fact the behaviour with 4.2, which delegated the escaping
>> >> to URLEncoder.
>> >> I think the code should revert to using URLEncoder/URLDecoder.
>> >>
>> >> There is still a need for the extended path, query and fragment
>> >> escape/unescape methods, but perhaps these belong in URIBuilder?
>> >> If not, maybe they should be in a separate class anyway?
>> >>
>> >
>> > Would not that lead to inconsistent behavior when the same query form
>> > gets encoded differently depending on whether it is enclosed in the
>> > request URI or in the request body?
>> I don't think so, I think encodeFormFields could use a different safe
>> character set without problems, so long as the safe set is a subset of
>> all possible safe query characters. In fact the UNRESERVED BitSet is
>> only currently used in URLEncodedUtils#encodeFormFields(), so I don't
>> see how changing encodeFormFields to use a different safe set can
>> affect anything.
>> Besides, AFAIK 4.2 did not have a problem with using a more limited safe set.
>> > Browsers do a lot of silly stuff to maximize compatibility with all
>> > sorts of broken software out there. I am not sure we need to do
>> > likewise.
>> Well-written software will be able to deal with form data that has
>> some additional safe characters encoded, so I don't think there is any
>> problem in playing safe here.
>> [But if we do decide to change the safe list from the one previously
>> used, it needs to be flagged up in the release notes.]
> Likewise well-written software should be able to deal with the form data
> containing valid URL encoded content. To me this is more about doing the
> right thing rather than making sure some broken code is unaffected.

says that reserved chars are to be encoded as per RFC 1738 section 2.2.

This implies that the safe set of chars is "$-_.+!*'()," plus "=", which
is reserved as the delimiter.
4.2.1 doesn't currently allow "$", so arguably it is not "doing the right
thing" anyway.

However, RFC 1738 also says that characters may be encoded so long as they
are not being used for their reserved purpose.
So at least in that regard we were doing the right thing.
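
As a rough illustration of why encoding more characters than strictly
necessary should be harmless, here is a minimal sketch using only
java.net.URLDecoder. The class name OverEncodingDemo and the sample
strings are made up for the example; a compliant decoder should yield
the same value whether or not "!'()" were percent-encoded.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class; not part of HttpClient.
public class OverEncodingDemo {

    // Decode a form-encoded value as UTF-8, wrapping the checked exception.
    static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    public static void main(String[] args) {
        // "!'()" percent-encoded, as a stricter safe set would produce
        String strict = "it%27s+%28ok%29%21";
        // the same value with "!'()" passed through unencoded
        String relaxed = "it's+(ok)!";
        System.out.println(decode(strict));
        System.out.println(decode(relaxed));
        System.out.println(decode(strict).equals(decode(relaxed)));
    }
}
```

Both inputs decode to the same string, so a receiver that decodes
correctly cannot tell which safe set the sender used.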

> Having said all that I see no problem reducing the set of safe
> characters in URI query to the bare minimum.

Strictly speaking, that would just be "=" for form data, but I assume
you mean the safe set as implemented by browsers/java.net.URLEncoder.
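
For reference, a small sketch of what java.net.URLEncoder does with the
characters under discussion. The class name UrlEncoderSafeSet and the
helper encode() are invented for the example; only URLEncoder.encode
itself is the real JDK call.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical demo class; not part of HttpClient.
public class UrlEncoderSafeSet {

    // Encode a value as UTF-8 form data, wrapping the checked exception.
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    public static void main(String[] args) {
        System.out.println(encode(".-*_"));  // passed through unchanged
        System.out.println(encode("!'()"));  // each char is percent-encoded
        System.out.println(encode("a b$c")); // space becomes "+", "$" is encoded
    }
}
```

So ".-*_" survive as-is, while "!'()" and "$" are all escaped, which is
the narrower safe set referred to above.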

My take is that version 4.2 was compliant, if perhaps too strict.
There is a risk that not encoding the extra characters may cause problems.

> Oleg
