httpd-dev mailing list archives

From r..@ai.mit.edu (Robert S. Thau)
Subject Re: No HOST header solutions?
Date Sun, 02 Jun 1996 15:49:36 GMT
  I guess robots.txt is from pre-CGI days... the "Useragent" thing
  seems more than a little useless.

There was actually a fair bit of talk at the workshop about people
wanting to shoehorn their own little bit into robots.txt --- something
which Martijn Koster, the author of the current robots.txt spec, was
opposed to (the way he put it in his talk was "it doesn't fix the net,
and if you want to fix the net, there are better ways to do it").

Another thing that was discussed, btw, was enriching the set of <META>
tags which are commonly respected by spiders to allow fine-grained
control of their behavior --- in particular, a <META NAME="robots" ...>
tag which would allow people to inhibit a spider from indexing the page
it occurred on, or following any of the links.
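
To make the <META> idea concrete, here is a minimal sketch of how a spider might read such a tag from a page it has already retrieved. The directive names "noindex" and "nofollow" are an assumption (they are the values that later became conventional, not something settled in the discussion above), and RobotsMetaParser and robots_meta_policy are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <META NAME="robots" CONTENT="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands us tag and attribute names already lowercased.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        # Assumed syntax: a comma-separated list of directives in CONTENT.
        for token in (attr.get("content") or "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_meta_policy(html_text):
    """Return (may_index, may_follow) for a page that was already fetched."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    may_index = "noindex" not in parser.directives   # assumed directive name
    may_follow = "nofollow" not in parser.directives  # assumed directive name
    return may_index, may_follow


page = '<HTML><HEAD><META NAME="robots" CONTENT="noindex, nofollow"></HEAD></HTML>'
print(robots_meta_policy(page))
```

Note that the function takes the page text as its argument: the spider has to have the document in hand before it can learn anything from the tag.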

The problem with this sort of notion, from the perspective of the spider
maintainers, is that they still have to retrieve a page which is marked
this way before they can find out whether they're supposed to be dealing
with it or not --- and that wastes time, particularly compared to
robots.txt-style maintenance, which *does* tell them that they don't even
have to look at the thing.  (Believe it or not, they actually don't *want*
to be retrieving data that shouldn't be indexed --- it wastes their
cycles as well as the targets', and in *far* greater abundance).
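
For contrast, a rough sketch of the robots.txt-style check, which happens before any page is requested. It uses Python's standard urllib.robotparser module; the robots.txt contents, the agent name "examplebot" and the example.com URLs are made up for illustration, and may_fetch is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt contents, as a server administrator might write them.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt, agent, url):
    """Decide from robots.txt alone whether the URL should be retrieved."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())  # parse text we already have; no network access
    return rules.can_fetch(agent, url)


# The decision is made without touching either page.
print(may_fetch(ROBOTS_TXT, "examplebot", "http://example.com/private/notes.html"))
print(may_fetch(ROBOTS_TXT, "examplebot", "http://example.com/index.html"))
```

One fetch of robots.txt covers every URL on the server, whereas the per-page tag costs one retrieval for each page it protects.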

Still, they're likely to be doing something of this nature eventually,
if only to satisfy Lycos' paranoid lawyers that *anybody* who wants to
keep their content out of the index has a last-ditch way to arrange for
that, whether or not they have enough administrative control over the
server it resides on to be able to scribble in robots.txt.

rst
