httpd-dev mailing list archives

From "Roy T. Fielding" <field...@avron.ICS.UCI.EDU>
Subject many URL per resource
Date Wed, 11 Oct 1995 22:14:28 GMT
>> URL != File.  However, if by chance a particular URL == File and File is
>> incapable of making use of extra path, then any extra path is bad.
>> In fact, I would extend that to SSI and only allow it for ASIS and CGI,
> But CGI and SSI are the same damned thing surely.  They're notionally programs
> that run when you access them, take parameters sucked outta the URL and spit
> output back to the browser?

In theory, yes, but that is not how SSI is used.  It is extremely rare
for people to use includes that expect extra path info as an argument
to the included exec'd CGI programs.  In fact, I don't know of anyone
that does so.  Extra path is useful for ASIS files because
they can be used to redirect or close off an entire resource tree.

>> if I thought I could get away with it.  Same goes for / vs /index.html
>> but I know I can't get away with that one.
>> Why is it bad?  Because you can't index sites with infinite URLs.
> Huh?  Stemming?  Fuzzy matching?  Publicised search engines? Is it possible
> to 'index' a site that makes extensive use of multiviews?  I'm confused by
> what you mean by 'indexing' in this instance.

When a spider traverses a site, it uses the URL to distinguish between
documents.  Multiple URLs per document will degrade the quality of later
search results if more than one of those URLs is used to reference that
document (most spiders are smart enough to combine .../ and .../index.html).
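As a rough illustration of that combining step, here is a minimal sketch
in Python.  The function name canonical() and the assumption that the
directory index is named "index.html" are mine, purely for illustration;
a real spider may use entirely different heuristics.

```python
from urllib.parse import urlsplit, urlunsplit

def canonical(url, index_name="index.html"):
    """Collapse .../index.html onto .../ so both URLs share one key.

    Hypothetical helper: assumes the server's directory index file is
    called index_name, which is a per-site configuration choice.
    """
    parts = urlsplit(url)
    path = parts.path
    if path.endswith("/" + index_name):
        # Drop the index file name but keep the trailing slash.
        path = path[: -len(index_name)]
    return urlunsplit((parts.scheme, parts.netloc, path,
                       parts.query, parts.fragment))

# Both spellings map to the same key, so the document is stored once.
seen = {canonical("http://site/A/"), canonical("http://site/A/index.html")}
print(len(seen))
```

Note that this only handles the one well-known alias; it does nothing
for arbitrary extra path info, which is the harder case.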

Any server resource that uses extra path info is able to create a web
black hole by ignoring (on purpose or not) the extra path and including
relative path references in the returned entity.  For example, let's say
we have <http://site/A/index.html> containing

    <A href="B/">...</A>

and that, by chance, there is no actual /A/B/ resource.
Now, if a spider encounters this it will retrieve

    http://site/A/B/          (getting A again)
    http://site/A/B/B/        (and again)
    http://site/A/B/B/B/      (and again)

and so on until it (hopefully) reaches some internal depth limit.
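The sequence above falls straight out of ordinary relative-URL
resolution.  The following Python sketch reproduces it under the
assumption that the server ignores the extra path and returns the same
page (with the same relative link) for every request; crawl_trap() and
its max_depth parameter are hypothetical names standing in for a
spider's internal depth limit.

```python
from urllib.parse import urljoin

def crawl_trap(start_url, href, max_depth):
    """Resolve the same relative href against each URL in turn.

    Models a spider following a page whose server ignores extra path
    info, so every response contains the identical relative link.
    max_depth is the only thing that stops the loop.
    """
    visited = []
    url = start_url
    for _ in range(max_depth):
        url = urljoin(url, href)   # resolve relative to the current URL
        visited.append(url)
    return visited

for u in crawl_trap("http://site/A/index.html", "B/", 3):
    print(u)
```

Every URL in the list is distinct, so a spider that distinguishes
documents by URL alone has no way to notice it is fetching the same
entity each time.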

In general, this is not a problem for CGI because CGI scripts only
return relative paths when they are already using the extra path info
for something meaningful.  In contrast, people using SSI (or, prior to
the submitted patch, *any* file) never even think about extra path info,
and thus would never be aware of such a trap until they looked at their
logfiles or did a search for their stuff on Lycos.  At that point, we 
encounter the "astonishment test".

