hc-dev mailing list archives

From Oleg Kalnichevski <ol...@apache.org>
Subject Re: URLEncodeUtils - change in format behaviour since 4.2
Date Tue, 26 Jun 2012 13:37:36 GMT
On Tue, 2012-06-26 at 14:21 +0100, sebb wrote:
> On 26 June 2012 13:33, Oleg Kalnichevski <olegk@apache.org> wrote:
> > On Tue, 2012-06-26 at 11:41 +0100, sebb wrote:
> >> On 26 June 2012 08:46, Oleg Kalnichevski <olegk@apache.org> wrote:
> >> > On Tue, 2012-06-26 at 02:00 +0100, sebb wrote:
> >> >> The escaping of non-alphabetic characters by the format methods is no
> >> >> longer quite the same as that done by java.net.URLEncoder.encode.
> >> >>
> >> >> The former allows the chars in ".-*_!'()" to pass through without
> >> >> conversion, whereas the latter only allows ".-*_" unchanged.
> >> >> The latter is also how browsers behave when escaping form fields.
> >> >>
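The URLEncoder side of the comparison above can be checked directly. The following is a minimal sketch, not code from HttpClient; the class name UrlEncoderSafeChars and the helper encode are made up for illustration, and UTF-8 is an assumed charset.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class UrlEncoderSafeChars {
    // Hypothetical helper: encodes with java.net.URLEncoder, assuming UTF-8.
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encode(".-*_")); // prints .-*_
        System.out.println(encode("!'()")); // prints %21%27%28%29
    }
}
```

The four characters "!'()" are the ones that, per the message above, the newer format methods pass through unchanged.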
> >> >> I think the behaviour should be consistent with URLEncoder and browsers.
> >> >> That was in fact the behaviour with 4.2, which delegated the escaping
> >> >> to URLEncoder.
> >> >> I think the code should revert to using URLEncoder/URLDecoder.
> >> >>
> >> >> There is still a need for the extended path, query and fragment
> >> >> escape/unescape methods, but perhaps these belong in URIBuilder?
> >> >> If not, maybe they should be in a separate class anyway?
> >> >>
> >> >
> >> > Would not that lead to inconsistent behavior when the same query form
> >> > gets encoded differently depending on whether it is enclosed in the
> >> > request URI or in the request body?
> >>
> >> I don't think so, I think encodeFormFields could use a different safe
> >> character set without problems, so long as the safe set is a subset of
> >> all possible safe query characters. In fact the UNRESERVED BitSet is
> >> only currently used in URLEncodedUtils#encodeFormFields(), so I don't
> >> see how changing encodeFormFields to use a different safe set can
> >> affect anything.
> >>
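To illustrate the point about safe sets, here is a rough sketch of a form-style encoder parameterised by a BitSet of safe characters. It is not the URLEncodedUtils implementation; SafeSetEncoder, safeSet and encode are hypothetical names, and UTF-8 plus the space-to-"+" rule are assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

final class SafeSetEncoder {
    // Hypothetical helper: ASCII letters and digits plus the given extra characters.
    static BitSet safeSet(String extras) {
        BitSet set = new BitSet(128);
        for (char c = 'a'; c <= 'z'; c++) set.set(c);
        for (char c = 'A'; c <= 'Z'; c++) set.set(c);
        for (char c = '0'; c <= '9'; c++) set.set(c);
        for (char c : extras.toCharArray()) set.set(c);
        return set;
    }

    // Percent-encodes every UTF-8 byte that is not in the safe set; space becomes '+'.
    static String encode(String text, BitSet safe) {
        StringBuilder out = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;
            if (safe.get(v)) {
                out.append((char) v);
            } else if (v == ' ') {
                out.append('+');
            } else {
                out.append('%');
                out.append(Character.toUpperCase(Character.forDigit(v >> 4, 16)));
                out.append(Character.toUpperCase(Character.forDigit(v & 0xF, 16)));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("it's (ok)!", safeSet(".-*_")));     // narrower safe set
        System.out.println(encode("it's (ok)!", safeSet(".-*_!'()"))); // wider safe set
    }
}
```

Both outputs should decode to the same string with a standard form decoder, which is why a narrower safe set is harmless to a compliant receiver.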
> >> Besides, AFAIK 4.2 did not have a problem with using a more limited safe set.
> >>
> >> > Browsers do a lot of silly stuff to maximize compatibility with all
> >> > sorts of broken software out there. I am not sure we need to do
> >> > likewise.
> >>
> >> Well-written software will be able to deal with form data that has
> >> some additional safe characters encoded, so I don't think there is any
> >> problem in playing safe here.
> >>
> >> [But if we do decide to change the safe list from the one previously
> >> used, it needs to be flagged up in the release notes.]
> >>
> >
> > Likewise, well-written software should be able to deal with form data
> > containing valid URL-encoded content. To me this is more about doing the
> > right thing than making sure some broken code is unaffected.
> 
> http://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1
> says that reserved chars are to be encoded as per RFC 1738 section 2.2.
> 
> This implies that the safe set of chars is "$-_.+!*'()," plus "=" as
> it is reserved for the delimiter.
> 4.2.1 doesn't currently allow "$", so arguably is not "doing the right
> thing" anyway.
> 

RFC 1738 was superseded by RFC 2396, which is what java.net.URI is based
on and what we ought to use as a basis as well. RFC 2396 clearly states
that "$" is one of the reserved characters.

---
2.2. Reserved Characters

   Many URI include components consisting of or delimited by, certain
   special characters.  These characters are called "reserved", since
   their usage within the URI component is limited to their reserved
   purpose.  If the data for a URI component would conflict with the
   reserved purpose, then the conflicting data must be escaped before
   forming the URI.

      reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
                    "$" | ","
---

Oleg
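
The reserved rule quoted above can be turned into a small lookup to see which characters of the RFC 1738 list mentioned earlier in the thread fall into it. This is only a sketch; Rfc2396Reserved and isReserved are hypothetical names and not part of HttpClient.

```java
import java.util.BitSet;

final class Rfc2396Reserved {
    // reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | ","
    static final BitSet RESERVED = new BitSet(128);
    static {
        for (char c : ";/?:@&=+$,".toCharArray()) {
            RESERVED.set(c);
        }
    }

    // Hypothetical helper: true if c is in the RFC 2396 section 2.2 reserved set.
    static boolean isReserved(char c) {
        return RESERVED.get(c);
    }

    public static void main(String[] args) {
        // Characters from the RFC 1738 based list quoted earlier in the thread.
        for (char c : "$-_.+!*'(),".toCharArray()) {
            System.out.println(c + " reserved in RFC 2396: " + isReserved(c));
        }
    }
}
```

Of that list, "$", "+" and "," are in the reserved set; the rest are not.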


> However, 1738 also says that characters may be encoded so long as they
> are not being used for their reserved purpose.
> So at least in that regard we were doing the right thing.
> 
> > Having said all that I see no problem reducing the set of safe
> > characters in URI query to the bare minimum.
> 
> Strictly speaking, that would just be "=" for form data, but I assume
> you mean the safe set as implemented by browsers/java.net.URLEncoder.
> 
> My take is that version 4.2 was compliant, if perhaps too strict.
> There is a risk that not encoding the extra characters may cause problems.
> 
> > Oleg
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@hc.apache.org
> > For additional commands, e-mail: dev-help@hc.apache.org
> >
> 