httpd-bugs mailing list archives

From bugzi...@apache.org
Subject DO NOT REPLY [Bug 49396] New: PATH_INFO normalization, especially relating to void path segments
Date Sun, 06 Jun 2010 22:00:12 GMT
https://issues.apache.org/bugzilla/show_bug.cgi?id=49396

           Summary: PATH_INFO normalization, especially relating to void
                    path segments
           Product: Apache httpd-2
           Version: 2.2.15
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Core
        AssignedTo: bugs@httpd.apache.org
        ReportedBy: theimp@iinet.net.au


The PATH_INFO request variable is treated by httpd as a path, which is
normalized to have dot segments and void path segments reduced (an empty path
segment has traditionally, on UNIX, been treated as synonymous with a dot
segment, i.e. /./ ). This is almost always the desired behavior, but the
resulting value is technically incorrect (the problem is the variable value
itself, not how it is reduced), and it causes problems when a script/module
cannot match PATH_INFO against REQUEST_URI. My proposed solution is to add a
RAW_PATH_INFO variable, which contains the PATH_INFO portion of the
REQUEST_URI exactly as it appears in REQUEST_URI, undecoded and unresolved
(i.e. as received on the Request Line).

The rest of this report is my rationale/testing and is probably superfluous and
certainly badly edited for brevity, so please feel free to ignore it unless you
think you need some background.

The following URL:

/index.html/1/2//3/./4/../5

has a PATH_INFO of:

/1/2/3/5
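
To make the reduction above concrete, here is a minimal Python sketch of the
normalization as this report describes it. It is not httpd's actual code; the
function name normalize_path is made up for illustration, and trailing-slash
handling is simplified.

```python
def normalize_path(path: str) -> str:
    """Drop void ('') and '.' segments and resolve '..' segments.

    Hypothetical helper: it mimics the behavior described in this report,
    not the real httpd implementation.
    """
    kept = []
    for segment in path.split("/"):
        if segment in ("", "."):
            continue              # void and single-dot segments vanish
        if segment == "..":
            if kept:
                kept.pop()        # step back one level, never above the root
            continue
        kept.append(segment)
    normalized = "/" + "/".join(kept)
    # Simplification: keep a trailing slash only if the input ended with one.
    if path.endswith("/") and normalized != "/":
        normalized += "/"
    return normalized


print(normalize_path("/1/2//3/./4/../5"))  # /1/2/3/5
```

Note that the void segment (//) and the dot segments are all gone from the
result, so the output is shorter than the input.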

The removal of the dot segments is correct per RFC 3986, which doesn't
recognize PATH_INFO other than as part of a path, and requires that dot
segments be normalized irrespective of whether they are path components or
opaque tokens (it's hierarchical so it is considered that it doesn't make a
difference which type they are).

Note that most clients and/or intervening proxies will remove dot segments as
part of their own resolution before they ever send the request to httpd.

So far, this is all correct behavior.

However, in the case of a void path segment (//), there is no normalization
procedure defined as per RFC 3986 (or any of the others that deal with the
subject - it's almost as if they're deliberately avoiding addressing it…).

So, a URL such as the following:

/index.html/http://example.com/index2.html
                 ^^
would have a PATH_INFO of:

/http:/example.com/index2.html
      ^

And since there are fewer characters in PATH_INFO than there are in the
PATH_INFO portion of REQUEST_URI, even after decoding REQUEST_URI, it becomes
extremely difficult to examine REQUEST_URI to determine the non-PATH_INFO
portion of the path, or the original PATH_INFO.

Now, in this example, the slashes after http: are character data and not path
separators, and so they should be encoded as %2F, but there is no way for the
client to know to do this because it cannot differentiate between what is the
PATH_INFO and what is the path - only the server knows this, and it only knows
it when it decides what script to call. The author of the URL is at fault, but
the script has to deal with it anyhow, just like any other invalid data. And
while the script might just be able to throw back a HTTP 400 error (or other
error of its choice), scripts that need the original URI (for example, for
logging) without the PATH_INFO portion can't get it from REQUEST_URI (or
anywhere else) even after normalization, because the normal procedure of simply
removing (length PATH_INFO) characters from a normalized REQUEST_URI won't work
if extra characters have been removed.
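
The failure of that "normal procedure" can be sketched in a few lines of
Python. The function name naive_script_path is made up, and the PATH_INFO
value is written with the leading slash that CGI gives it; the values are the
ones from the example in this report, not captured server output.

```python
from urllib.parse import unquote


def naive_script_path(request_uri: str, path_info: str) -> str:
    """Strip len(PATH_INFO) characters from the end of the decoded URI path.

    Hypothetical helper showing the usual shortcut, which is only correct
    when normalization removed nothing from PATH_INFO.
    """
    path = unquote(request_uri.split("?", 1)[0])  # drop query, decode %xx
    return path[: len(path) - len(path_info)]


# Works when nothing was collapsed:
print(naive_script_path("/index.html/1/2/3", "/1/2/3"))
# -> /index.html

# Off by one when a void segment was collapsed out of PATH_INFO:
print(naive_script_path("/index.html/http://example.com/index2.html",
                        "/http:/example.com/index2.html"))
# -> /index.html/
```

Every extra segment that was reduced shifts the cut point further into the
PATH_INFO portion, so the result gets worse with each dot or void segment.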

(Not that the default httpd configuration would support such a PATH_INFO if it
did have encoded slashes, but if you're expecting to deal with non-filesystem
PATH_INFOs, it'd be up to you to know that you'd have AllowEncodedSlashes on.)

The only way that a script can recover the URL sans PATH_INFO is as follows.
First, compare the end of a decoded REQUEST_URI (as many characters from the
right as there are in PATH_INFO) with the PATH_INFO. If they don't match, work
backwards along REQUEST_URI looking for dot and void segments to add back into
PATH_INFO until it matches (with special handling for segments at the very
beginning of the PATH_INFO). Only then is what's left of REQUEST_URI the
non-PATH_INFO portion of the URL. Finally, apply your own segment resolution
to the recovered PATH_INFO, without collapsing void segments, to get the
intended PATH_INFO. (Even this fails if the last character of the script name
as given in REQUEST_URI is an unencoded period ".", which would be rare and
silly, but could happen.)
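
A rough Python sketch of such a recovery follows. It swaps in a simpler
technique than the backwards walk described above: it tries every "/" in the
decoded REQUEST_URI as a split point, from the left, and accepts the first one
whose normalized suffix equals PATH_INFO. Both function names are made up, the
normalization is my approximation of what httpd does, and the leftmost-match
rule is only a heuristic; some URLs have more than one matching split.

```python
from urllib.parse import unquote


def normalize_path(path):
    """Hypothetical stand-in for the server's dot/void segment reduction."""
    kept = []
    for segment in path.split("/"):
        if segment in ("", "."):
            continue
        if segment == "..":
            if kept:
                kept.pop()
            continue
        kept.append(segment)
    normalized = "/" + "/".join(kept)
    if path.endswith("/") and normalized != "/":
        normalized += "/"
    return normalized


def split_script_and_path_info(request_uri, path_info):
    """Return (script_part, raw_path_info), or None if no split matches."""
    path = unquote(request_uri.split("?", 1)[0])
    if not path_info:
        return path, ""
    for i, ch in enumerate(path):
        # Only a "/" can start the PATH_INFO portion.
        if ch == "/" and normalize_path(path[i:]) == path_info:
            return path[:i], path[i:]
    return None


print(split_script_and_path_info(
    "/index.html/http://example.com/index2.html",
    "/http:/example.com/index2.html"))
# -> ('/index.html', '/http://example.com/index2.html')
```

Even as a sketch this is a lot of work to redo something the server already
knew when it chose the script.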

Certainly, I would agree that it's dumb to use the PATH_INFO for anything other
than true files, as implied by RFC 3875 (you should use the query string
instead). The point is that even if you ARE using PATH_INFO only for normal
files, when you do get certain kinds of requests (valid files or not), you
can't isolate PATH_INFO from the REQUEST_URI. This realization came from
debugging deliberately malformed URLs as a robustness test.

Changing the path resolution engine to not reduce void path segments in
PATH_INFO means that special code must be written for the resolution of
PATH_INFO (and it looks like a whole new subrequest, at least). Also, using a
different resolution for PATH_INFO from what is used for all other
resolutions will probably break almost every existing script in the universe
that uses it, if they encounter such a URL. It is very unfortunate that a
fundamental assumption of RFC 3875 is that all URLs implicitly map to files on
a filesystem, but that is indeed by far the most common use case (or certainly
was, at the time).

By far the easiest, most compatible way of dealing with this is to add a
variable like RAW_PATH_INFO that doesn't feature path normalization or escape
decoding; it's simply lopped off of the end of REQUEST_URI. Anyone who has
never cared can continue to not care, and anyone else can easily get what they
need.
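
As a sketch of what a script could do with the proposed variable, assuming the
server exported it in the CGI environment: RAW_PATH_INFO is only the name
proposed in this report and does not exist in httpd 2.2.15, and the function
name script_uri is made up for illustration.

```python
def script_uri(environ):
    """Return the REQUEST_URI path without its PATH_INFO portion.

    Assumes a hypothetical RAW_PATH_INFO variable holding the PATH_INFO
    portion exactly as it appeared on the request line.
    """
    path = environ["REQUEST_URI"].split("?", 1)[0]  # drop the query string
    raw = environ.get("RAW_PATH_INFO", "")
    if not path.endswith(raw):
        raise ValueError("RAW_PATH_INFO is not a suffix of REQUEST_URI")
    return path[: len(path) - len(raw)]


env = {
    "REQUEST_URI": "/index.html/http://example.com/index2.html?x=1",
    "RAW_PATH_INFO": "/http://example.com/index2.html",
}
print(script_uri(env))  # /index.html
```

Because the raw value is a literal suffix of REQUEST_URI, no decoding or
segment resolution is needed to find the cut point.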

I'm not really sure of whether this constitutes a bug or a feature request.

Strictly speaking, reducing void path segments is not required by URL-related
specs, and is implicitly prohibited (that is, they MAY be significant, and you
can't just remove significant data because of assumptions like that they
represent a filesystem path). So, technically, the specific behavior of
removing void path segments from PATH_INFO is a bug.

On the other hand, it IS the desired behavior; the PATH_INFO is specifically
intended to represent a filesystem path. Almost every script/module ever
written assumes that it will be a properly-formatted path (especially since RFC
3875 requires that it be unencoded). And the way that it is currently
determined makes it very inefficient (or, complex) to fix.

Also, changing the resolving for PATH_INFO to preserve void segments will not
entirely solve the discussed problem with it, because dot segments will still,
correctly, be removed and the length of the PATH_INFO in the REQUEST_URI will
remain as inscrutable as ever for such URLs.

So, adding the above-mentioned RAW_PATH_INFO would defer the argument over
whether void path segments are significant, but that's nothing less than a
naked feature request. So I classed it as a feature request.
