Mailing-List: contact new-httpd-help@apache.org; run by ezmlm
Precedence: bulk
Reply-To: new-httpd@apache.org
Date: Thu, 5 Apr 2001 21:29:27 +0100
From: Francis Daly <deva@daoine.org>
To: new-httpd@apache.org
Subject: [PATCH] mod_negotiation and order of suffixes
Message-ID: <20010405212927.A9306@kerna.ie>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii


Hi there,

for your consideration, appended to this mail is a patch to remove
the requirements on the order of suffixes when using MultiViews /
mod_negotiation.  This corresponds to (part of) PR3430.

The patch is relative to the version of mod_negotiation.c distributed
with apache-2.0.15.  I believe that's identical to the one from
apache-2.0.16.

But first, some notes:

The current method takes the "file" part of r->filename (either the bit
after the final / in the URL, or the value of DirectoryIndex).  First,
if the exact filename matches, mod_negotiation declines to handle it.
Second, for each file in the directory, it checks for (regex syntax)
/^file\./, and only considers ones that match.
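
As a minimal sketch of that current check, pulled out of the directory
loop into a standalone function (the name matches_current and its
arguments are made up for illustration; they are not in
mod_negotiation.c):

```c
#include <string.h>

/* Hypothetical helper, not part of mod_negotiation.c.
 * Returns 1 if entry_name starts with "<requested>.", else 0.
 * This mirrors the strncmp() plus '.' test in the current code.
 */
static int matches_current(const char *entry_name, const char *requested)
{
    size_t prefix_len = strlen(requested);

    /* The directory entry must begin with the requested name... */
    if (strncmp(entry_name, requested, prefix_len) != 0) {
        return 0;
    }
    /* ...and the very next character must be a dot. */
    return entry_name[prefix_len] == '.';
}
```

So matches_current("name.en.html", "name.en") is 1, but
matches_current("name.en.html", "name.html") is 0.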

For a requested file "file" (no dots), the patched method does the same
thing, after a single extra if(strchr()).

However, if r->filename is actually "file.s1.s2.sZ" (with dots), the
current way looks for /^file\.s1\.s2\.sZ\./; the patched way looks for
/^file\./ and then for each of /\.s1/, /\.s2/, /\.sZ/ anywhere in the
name (a plain strstr(), so not anchored).  It bails out at the first
failure.
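
The patched rule can be sketched as a standalone function that leaves
the request string untouched.  This illustrates the rule and is not the
patch itself: matches_patched and contains_span are hypothetical names,
and the patch gets the same effect with strstr() after temporarily
writing a '\0' into filp.

```c
#include <string.h>

/* Hypothetical helper: 1 if the first needle_len bytes of needle
 * occur anywhere in haystack, else 0.
 */
static int contains_span(const char *haystack, const char *needle,
                         size_t needle_len)
{
    size_t hay_len = strlen(haystack);
    size_t i;

    for (i = 0; i + needle_len <= hay_len; i++) {
        if (memcmp(haystack + i, needle, needle_len) == 0) {
            return 1;
        }
    }
    return 0;
}

/* Hypothetical helper: 1 if entry_name passes the patched string
 * match for the requested name, else 0.
 */
static int matches_patched(const char *entry_name, const char *requested)
{
    const char *dot = strchr(requested, '.');
    size_t prefix_len;

    if (dot == NULL) {
        /* No dot in the request: same test as the current code. */
        prefix_len = strlen(requested);
        return strncmp(entry_name, requested, prefix_len) == 0
            && entry_name[prefix_len] == '.';
    }

    /* "name." must match at the start of the directory entry. */
    prefix_len = (size_t)(dot - requested) + 1;
    if (strncmp(entry_name, requested, prefix_len) != 0) {
        return 0;
    }

    /* Each ".suffix" of the request must occur somewhere in the entry. */
    while (dot != NULL) {
        const char *next = strchr(dot + 1, '.');
        size_t len = next ? (size_t)(next - dot) : strlen(dot);

        if (!contains_span(entry_name, dot, len)) {
            return 0;   /* bail out at the first failure */
        }
        dot = next;
    }
    return 1;
}
```

This covers only the string match.  Whether the file then turns out to
have a known type and encoding is decided by later code.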

In this case, the patched code does an extra strchr, strlen, strstr,
some pointer arithmetic and pointer movement, changes a character, and
changes it back.  That happens per dot in r->filename, per file in the
directory.  I don't have numbers for how expensive that extra string
manipulation is.

Some consequences of this implementation are:

Current method: file "name.en.html" is only accessible through (partial)
URIs "name", "name.en", or "name.en.html".

Patched method: The same three work, as do "name.html" and
"name.html.en".  That's good.  However, so do "name.htm", "name.htm.en",
and "name.en.htm".  That may or may not be considered good.  Further
still, so do "name.h", "name.h.h", "name...h.e.e..e.h.h.", and an
infinite number of similar variations.  That may not be considered good.

[ side note -- most of that infinitude could be eliminated, if desired,
by (for example) checking that the length of r->filename (prefix_len, in
the code) is not more than the length of dirent.name, immediately before
the while loop which looks for dots in filp.  I would consider that an
enhancement to, rather than an integral part of, the patch, so didn't
include it here.  Opinions may differ ]
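
A rough sketch of the length check described in that side note, as a
standalone function.  The name request_fits_entry is hypothetical; in
the real code the comparison would use prefix_len and dirent.name
directly, and this check is not part of the appended patch.

```c
#include <string.h>

/* Hypothetical helper, not part of the patch.
 * Returns 1 if the requested name is no longer than the directory
 * entry's name, so the entry is still worth checking; else 0.
 */
static int request_fits_entry(const char *entry_name, const char *requested)
{
    return strlen(requested) <= strlen(entry_name);
}
```

With this, "name...h.e.e..e.h.h." is rejected for "name.en.html" because
it is longer than the real filename, while short variations such as
"name.h.h" still get through, which is why it only removes most of the
infinitude.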

In each case, the content is returned with a Content-Location: header
indicating the canonical filename.

The requirements are: (1) r->filename up to the first dot must match the
real filename up to the first dot; (2) each .suffix in r->filename must
exist (string match) in the real filename; (3) the real filename must
correspond to a known mime-type, encoding, etc -- which I think means
that the final suffix must be known, and only suffixes followed by known
suffixes are considered.

As a real example, testing with the apache "It worked!" page (named
index.html.LANG), if I request index.html.fr, I get the page back.
If I request index.fr.html, or just index.fr, I get back the 406 Not
Acceptable page, with a link to index.html.fr, _unless_ I include fr
as an acceptable language.  That's PR6282, which is mentioned but not
addressed in this patch.  If I include fr as a language, I can request
/index.fr, /index.fr.html, or /index.html.fr successfully.  If I include
fr as my preferred language, I can additionally request /, /index, and
/index.html.  (As well as the .h, .ht, .htm, .f variants referred to
earlier).  If I request /index.d, I get a 406 with links to index.html.de
and index.html.dk.

As a faked example, consider five files in the DocumentRoot, with no
special customisations to the (MIME) configuration:

files a.b.c, d.e.html, g.h.i.j.k.en, m.n.o.p.q.html, s.t.html.u.v

The following requests have the indicated results:

GET /a            -> not found
GET /a.b          -> not found
GET /a.c          -> not found
GET /a.b.c        -> success
GET /d            -> success
GET /d.e          -> success
GET /d.h          -> success
GET /d.html       -> success
GET /d....html    -> success
GET /g            -> not found
GET /g.h          -> not found
GET /g.h.i.j.k    -> not found
GET /g.h.i.j.k.en -> success
GET /g.h.i.k.j.en -> not found
GET /m            -> success
GET /m.html       -> success
GET /m.o.q.p.n    -> success
GET /m.o.r.p.n    -> not found
GET /s.t.html.u.v -> success
GET /s            -> not found
GET /s.t.html.u   -> not found

Note that in the "not found" cases above (except for /m.o.r.p.n), the
patched code does pass the file down as being potentially valid --
it's later code which decides that it doesn't know how to treat the
final suffix, and fails it.

As another faked example, with files ..d.f.html and .e.txt, I can
successfully issue GETs for /.d, /.f, /.h, /.e and /.t, as well as
things like /....t. (whether or not the final . there is read as
sentence punctuation).

So that's it.  Any comments, criticism, or ridicule related to the
patch, please send my way, or to the list.

All the best,

	f
-- 
Francis Daly        deva@daoine.org


--- modules/mappers/mod_negotiation.c.orig	Wed Apr  4 18:59:20 2001
+++ modules/mappers/mod_negotiation.c	Thu Apr  5 20:51:13 2001
@@ -911,6 +911,9 @@
     struct var_rec mime_info;
     struct accept_rec accept_info;
     void *new_var;
+    char *pos;
+    int pos_len;
+    int not_this_dirent;
 
     clean_var_rec(&mime_info);
 
@@ -935,12 +938,76 @@
         request_rec *sub_req;
         
         /* Do we have a match? */
-        if (strncmp(dirent.name, filp, prefix_len)) {
-            continue;
-        }
-        if (dirent.name[prefix_len] != '.') {
-            continue;
+
+        if ((pos = strchr(filp, '.'))) {
+
+        /* Given "name.suf1.suf2.suffix", check for "name." */
+
+            pos_len = pos - filp + 1;
+            if (strncmp(dirent.name, filp, pos_len)) {
+                continue;
+            }
+
+            not_this_dirent = 0;
+            filp = ++pos;
+
+        /* Check for each internal ".sufN" from r->filename */
+            while ((pos = strchr(filp, '.'))) {
+                --filp;
+                pos_len = pos - filp ;
+                filp[pos_len]='\0';
+                if (!strstr(dirent.name, filp)) {
+                    not_this_dirent=1;
+                }
+
+        /* XXX: Right now, filp points to a suffix (encoding indicator,
+         * handler indicator, mime-type indicator, whatever), starting
+         * with a ".". If we want to do stuff, like consider that to be
+         * an implicit additional Accept: header, here would be a good
+         * place to do it.  See PR6282 for an example of what I mean.
+         * Note, this would have to be repeated once more, just after the
+         * check for the final ".suffix" and before filp gets moved back
+         * again. 
+         */
+
+                filp[pos_len] = '.';
+                filp += pos_len + 1;
+                
+                if (not_this_dirent) {
+                    /* get to next dirent */
+                    break;
+                }
+            }
+            if (not_this_dirent) {
+                /* reset filp */
+                pos_len = strlen(filp);
+                filp -= prefix_len - pos_len;
+                /* next dirent */
+                continue;
+            }
+            --filp;
+            pos_len = strlen(filp);
+
+        /* Check for the final ".suffix" from r->filename */
+            if (!strstr(dirent.name, filp)) {
+                filp -= prefix_len - pos_len;
+                continue;
+            }
+            filp -= prefix_len - pos_len;
+
+        } else {
+
+        /* Alternatively, given just "name", check for "name." 
+         * Just like it used to be  
+         */
+            if (strncmp(dirent.name, filp, prefix_len)) {
+                continue;
+            }
+            if (dirent.name[prefix_len] != '.') {
+                continue;
+            }
         }
+
 
         /* Yep.  See if it's something which we have access to, and 
          * which has a known type and encoding (as opposed to something