www-apache-bugdb mailing list archives

From Olly Betts <o...@muscat.co.uk>
Subject mod_dir/1057: Web robots should be told not to index auto-generated index pages
Date Tue, 26 Aug 1997 15:10:02 GMT

>Number:         1057
>Category:       mod_dir
>Synopsis:       Web robots should be told not to index auto-generated index pages
>Confidential:   no
>Severity:       non-critical
>Priority:       medium
>Responsible:    apache (Apache HTTP Project)
>State:          open
>Class:          change-request
>Submitter-Id:   apache
>Arrival-Date:   Tue Aug 26 08:10:01 1997
>Originator:     olly@muscat.co.uk
>Organization:
apache
>Release:        1.3a1
>Environment:
Linux noxious.muscat.co.uk 2.0.18 #1 Tue Sep 10 10:15:48 EDT 1996 i586
>Description:
A web robot rarely wants to add auto-generated pages to its database.  But it
can't reliably spot them.  Apache could help a lot by marking such pages as
not to be indexed by putting:

<META NAME=robots CONTENT=noindex>

into the HTML <HEAD>...</HEAD> section.  This still allows compliant robots to
follow links on the page, which is probably what's wanted.

See <URL:http://info.webcrawler.com/mak/projects/robots/exclusion.html#meta>
for details of the protocol.
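
For illustration only, here is a rough sketch of how a robot might honour the
tag. It is not taken from any real robot: the names find_ci and
page_is_noindex are hypothetical, and the matching is deliberately crude (any
tag starting with "<meta" that mentions both "robots" and "noindex" counts; it
does not parse attributes properly).

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive search for `needle` within the first `len` bytes of `hay`. */
static const char *find_ci(const char *hay, size_t len, const char *needle)
{
    size_t n = strlen(needle);
    size_t i, j;

    if (n == 0 || n > len)
        return NULL;
    for (i = 0; i + n <= len; i++) {
        for (j = 0; j < n; j++) {
            if (tolower((unsigned char) hay[i + j]) !=
                tolower((unsigned char) needle[j]))
                break;
        }
        if (j == n)
            return hay + i;
    }
    return NULL;
}

/*
 * Hypothetical helper: return 1 if some META tag in `html` mentions both
 * "robots" and "noindex", 0 otherwise.  Crude by design (assumption: no
 * real attribute parsing is needed for the sketch).
 */
int page_is_noindex(const char *html)
{
    const char *p = html;
    const char *end = html + strlen(html);

    while ((p = find_ci(p, (size_t) (end - p), "<meta")) != NULL) {
        const char *close = memchr(p, '>', (size_t) (end - p));
        size_t taglen;

        if (close == NULL)
            break;                  /* unterminated tag: stop scanning */
        taglen = (size_t) (close - p);
        if (find_ci(p, taglen, "robots") != NULL &&
            find_ci(p, taglen, "noindex") != NULL)
            return 1;               /* robot should skip indexing this page */
        p = close + 1;
    }
    return 0;
}
```

A robot using something like this would still fetch the page and follow its
links; it would only skip adding the page itself to its database.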
>How-To-Repeat:
Look at:

http://www.altavista.digital.com/cgi-bin/query?pg=q&what=web&kl=XX&q=title%3A%22Index+of%22+%22parent+directory%22

which gives "about 274150" examples.
>Fix:
Here's a patch to 1.3a1 -- the change is actually to mod_autoindex, but that's
not available in the picker on the bug report form.

--- src/mod_autoindex.c Mon Jul 21 06:53:49 1997
+++ src.mod/mod_autoindex.c     Tue Aug 26 11:43:28 1997
@@ -122,6 +122,9 @@
  * This routine puts the standard HTML header at the top of the index page.
  * We include the DOCTYPE because we may be using features therefrom (i.e.,
  * HEIGHT and WIDTH attributes on the icons if we're FancyIndexing).
+ * "<META NAME=robots CONTENT=noindex>" tells robots which support the protocol
+ * that they shouldn't index this page (but that they can follow links).
+ * See <URL:http://info.webcrawler.com/mak/projects/robots/exclusion.html#meta>
  */
 static void emit_preamble(request_rec *r, char *title)
 {
@@ -131,7 +134,7 @@
             "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 3.2 Final//EN\">\n",
             "<HTML>\n <HEAD>\n  <TITLE>Index of ",
             title,
-            "</TITLE>\n </HEAD>\n <BODY>\n",
+            "</TITLE>\n  <META NAME=robots CONTENT=noindex>\n </HEAD>\n <BODY>\n",
             NULL
         );
 }

>Audit-Trail:
>Unformatted:


