httpd-dev mailing list archives

From William A Rowe Jr
Subject StrictURI in the wild [Was: Backporting HttpProtocolOptions survey]
Date Mon, 12 Sep 2016 15:49:47 GMT
On Mon, Aug 29, 2016 at 1:04 PM, Ruediger Pluem wrote:

> On 08/29/2016 06:25 PM, William A Rowe Jr wrote:
> > Thanks all for the feedback. Status and follow-up questions inline
> >
> > On Thu, Aug 25, 2016 at 10:02 PM, William A Rowe Jr wrote:
> >
> >     4. Should the next 2.4/2.2 releases default to Strict[URI] at all?
> >
> >     Real world direct observation especially appreciated from actual
> deployments.
> >
> > Strict (and StrictURI) remain the default.
> StrictURI as a default only makes sense if we have our own house in order
> (see above), otherwise it should be opt in.

So it's not only our house [our %3B encoding in httpd isn't a showstopper
here]... but also whether widely used user-agent browsers and tooling have
their houses in order, so I started to study the current browser behaviors.
The applicable spec is

Checked the unreserved set with '?' and '/' observing special meanings.
Nothing here should become escaped when given as a URI;

Checked the invalid set of characters all of which must be encoded
per the spec, and verify that #frag is not passed to the server;
http://localhost:8080/gen-delims-[]/invalid- "<>\^`{|}#frag

Checked the reserved set including '#' '%' '?' by their encoded value
to determine if there are any unpleasant reverse surprises lurking;

Checked a list of unreserved/unassigned gen-delims and sub-delims
to determine if the user agent normalizes while composing the request;

Using the simplistic $ nc -kl localhost 8080, here are the results
I obtained from a couple of current browsers; more observations of
other user-agents would be appreciated on this list.
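
For anyone without a netcat that supports -k, a minimal Python sketch of
the same capture follows. It is not what was used for the results below;
the function names are made up for illustration, and it assumes the whole
request line arrives in the first read, which is typical for a browser
but not guaranteed.

```python
import socket


def first_request_line(raw: bytes) -> str:
    """Return the request line (everything before the first CRLF)."""
    line, _, _ = raw.partition(b"\r\n")
    # latin-1 maps every byte to a character, so nothing is lost or rejected.
    return line.decode("latin-1")


def listen_and_print(host: str = "localhost", port: int = 8080) -> None:
    """Accept connections forever and print each request line as received."""
    with socket.create_server((host, port)) as srv:
        while True:
            conn, _ = srv.accept()
            with conn:
                # Assumption: the request line fits in the first read.
                data = conn.recv(65536)
                print(first_request_line(data))
                # Answer with an empty response so the browser does not hang.
                conn.sendall(
                    b"HTTP/1.1 204 No Content\r\nConnection: close\r\n\r\n"
                )
```

Calling listen_and_print() and then entering the test URLs in a browser
prints one request line per request, as the user agent actually sent it.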

Chrome 53:
GET /unreserved-._~/sub-delims-!$&'()*+,;=/gen-delims-:@?query HTTP/1.1
GET /gen-delims-[]/invalid-%20%22%3C%3E/%5E%60%7B%7C%7D HTTP/1.1
odd>            ^^                     ^
GET /encoded-%23%25%2F%3A%3B%3D%3F%40%5B%5C%5D%7B%7C%7D HTTP/1.1
GET /plain-%21%24%26%27%28%29%2A%2B%2C-.123ABC_abc~ HTTP/1.1
odd>        ^  ^  ^  ^  ^  ^  ^  ^  ^

Firefox 48:
GET /unreserved-._~/sub-delims-!$&'()*+,;=/gen-delims-:@?query HTTP/1.1
GET /gen-delims-[]/invalid-%20%22%3C%3E/%5E%60%7B|%7D HTTP/1.1
odd>            ^^                     ^         ^
GET /encoded-%23%25%2F%3A%3B%3D%3F%40%5B%5C%5D%7B%7C%7D HTTP/1.1
odd>        ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^

The character '\' is converted to a '/' by both browsers, in a nod either
to Microsoft insanity, or a less-accessible '/' key. (Which suggests that
the yen sign might be treated similarly in some jp locales.) Invalid as a
literal '\' character, both browsers support an explicit %5C for those who
really want to use that in a URI. No actual issue here.

Interestingly, gen-delims '@' and ':' are explicitly allowed by the 3.3
grammar (as I've tested above), while '[' and ']' are omitted and therefore
not according to spec. (On this, StrictURI won't care yet, because we are
simply checking for any valid URI character, not by section, and '[' ']' are
obviously allowed for the IPv6 address literal - so we don't reject yet.)
When we add strict parsing to the apr uri parsing function, we will trip
over this, from all browsers, in spite of these being prohibited and
unwise for the past 18 years or more.
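
To make the per-section point concrete, here is a rough Python sketch of
a path-only check against the RFC 3986 section 3.3 pchar rule (unreserved,
pct-encoded, sub-delims, ':' and '@', plus '/' between segments). This is
an illustration only, not the httpd or apr code, and the function name is
hypothetical.

```python
import string

UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
SUB_DELIMS = set("!$&'()*+,;=")
# Literal characters a path segment may carry besides pct-encoded triplets.
PCHAR_LITERALS = UNRESERVED | SUB_DELIMS | set(":@")
HEXDIG = set(string.hexdigits)


def disallowed_in_path(path):
    """Return, in order, the characters that are not valid raw in a path."""
    bad = []
    i = 0
    while i < len(path):
        ch = path[i]
        if ch == "/" or ch in PCHAR_LITERALS:
            i += 1
        elif (ch == "%" and i + 2 < len(path)
              and path[i + 1] in HEXDIG and path[i + 2] in HEXDIG):
            i += 3  # a well-formed pct-encoded triplet
        else:
            bad.append(ch)
            i += 1
    return bad
```

Fed the path from the Firefox 48 capture above, this reports '[', ']' and
'|'; for the Chrome 53 capture it reports only '[' and ']'. Pass the path
alone, without the query, since '?' is not part of a path segment.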

The character '|' is also invalid. However, Firefox fails to follow the spec
again here (although Chrome gets it right).

With respect to these characters, recall this 18 year old document,
whose last paragraph describes the rationale;

   unwise      = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"

   Data corresponding to excluded characters must be escaped in order to
   be properly represented within a URI.

Which replaced now
almost 22 years old, without changing the rules;


   Characters can be unsafe for a number of reasons.  The space
   character is unsafe because significant spaces may disappear and
   insignificant spaces may be introduced when URLs are transcribed or
   typeset or subjected to the treatment of word-processing programs.
   The characters "<" and ">" are unsafe because they are used as the
   delimiters around URLs in free text; the quote mark (""") is used to
   delimit URLs in some systems.  The character "#" is unsafe and should
   always be encoded because it is used in World Wide Web and in other
   systems to delimit a URL from a fragment/anchor identifier that might
   follow it.  The character "%" is unsafe because it is used for
   encodings of other characters.  Other characters are unsafe because
   gateways and other transport agents are known to sometimes modify
   such characters. These characters are "{", "}", "|", "\", "^", "~",
   "[", "]", and "`".

   All unsafe characters must always be encoded within a URL.

While it was labeled 'unsafe', 'unwise', and now disallowed-by-omission
from RFC3986, the 'must' designation couldn't have been any clearer.
We've had this right for 2 decades at httpd.
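
For tooling authors who want to stay on the right side of that 'must', a
small Python sketch of escaping a raw path before sending it follows. It
relies on the standard library's urllib.parse.quote; the helper name and
the choice of safe characters are my own assumptions, taken from the
section 3.3 grammar, and the input is assumed not to be escaped already
(a literal '%' would be encoded again).

```python
from urllib.parse import quote

# Characters left as-is in a path: '/', ':' and '@', and the sub-delims.
# quote() never escapes letters, digits, or the unreserved marks "-._~".
PATH_SAFE = "/:@!$&'()*+,;="


def escape_path(raw_path):
    """Percent-encode everything a URI path may not carry literally."""
    return quote(raw_path, safe=PATH_SAFE)
```

With this, escape_path('/gen-delims-[]') yields '/gen-delims-%5B%5D', and
the space, quote mark, '<', '>', '^', '`', '{', '|' and '}' are all
encoded as well.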

Second paragraph of
goes into some detail about this change, and while it is hard to parse,
the paragraph is stating that '[' ']' were once invalid, now are reserved,
and remain disallowed in all other path segments and use cases.

The upshot: right now StrictURI will accept '[' and ']', but this won't
survive a rewrite of the apr parser operating with a 'strict' toggle.
StrictURI does not accept '|'. The remaining question is what to do, if
anything, about carving a specific exception here due to modern Firefox
issues.

Thoughts/Comments/Additional test data?  TIA!
