httpd-dev mailing list archives

From Brian Behlendorf <br...@collab.net>
Subject recursive robot queries
Date Sun, 31 Dec 2000 21:18:14 GMT

Doing a tail -f of /logs/www/weblogs on apache.org is a lesson
in... something.  Mainly robot insanity.  Every time I've checked
recently, about 1 out of every 20-30 accesses looks like this:

xml.apache.org 139.179.10.17 - - [31/Dec/2000:12:49:03 -0800] "HEAD /xerces-c/faq-other.html/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/graphics/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/images/images/build.html
HTTP/1.0" 200 0 "http://xml.apache.org:80/xerces-c/faq-other.html/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/graphics/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/resources/images/images/install.html"
"Wget/1.4.5"

and

www.apache.org 210.73.88.163 - - [31/Dec/2000:12:49:04 -0800] "GET /index/full/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/images/images/images/images/foundation/images/images/images/images/foundation/images/apache_pb.gif
HTTP/1.0" 403 1282 "http://www.apache.org:80/index/full/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/foundation/images/images/images/images/images/foundation/images/images/images/images/foundation/FAQ.html"
"Wget/1.5.3"

These requests succeed because of content negotiation: any extra
path after a valid link is presumed to simply be PATH_INFO
information.  So in the www.apache.org example, the above URL will pull up
the page "/index", i.e. index.html, with "/full/foundation/...." as the
PATH_INFO.  How did this recursion start?
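
To make the PATH_INFO split concrete, here is a minimal sketch in Python. It is a toy model, not Apache's actual lookup: the function name split_path_info and the set of known resources are hypothetical, and the set simply stands in for "paths the server can map to a real page".

```python
def split_path_info(request_path, known_resources):
    """Toy model: find the longest leading part of request_path that is a
    known resource and treat everything after it as PATH_INFO.

    Returns (resource, path_info), or None if no prefix matches.
    """
    segments = request_path.split("/")[1:]  # drop the empty piece before the leading "/"
    for end in range(len(segments), 0, -1):
        candidate = "/" + "/".join(segments[:end])
        if candidate in known_resources:
            rest = segments[end:]
            path_info = "/" + "/".join(rest) if rest else ""
            return candidate, path_info
    return None


# Hypothetical resource table: "/index" stands in for the negotiated index.html.
known = {"/index"}
print(split_path_info("/index/full/foundation/images/asf_logo.gif", known))
```

Under this model, no matter how many extra segments a robot appends, the request still maps to "/index" and comes back as a 200.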

I narrowed it down to this sequence of accesses from that host:

httpd.apache.org 210.73.88.163 - - [31/Dec/2000:08:07:15 -0800] "GET /docs/misc/known_client_problems.html
HTTP/1.0" 200 13973 "http://httpd.apache.org/docs/misc/compat_notes.html" "Wget/1.5.3"
www.apache.org 210.73.88.163 - - [31/Dec/2000:08:07:25 -0800] "GET /index/full/4118 HTTP/1.0"
200 3785 "http://httpd.apache.org/docs/misc/known_client_problems.html" "Wget/1.5.3"
www.apache.org 210.73.88.163 - - [31/Dec/2000:08:07:26 -0800] "GET /index/full/foundation/images/asf_logo.gif
HTTP/1.0" 200 3785 "http://www.apache.org:80/index/full/4118" "Wget/1.5.3"

Somehow Wget is munging the link from known_client_problems.html to
http://bugs.apache.org/index/full/4118 (a perfectly valid link) into a
link to http://www.apache.org/index/full/4118, and that URL renders the
same page as http://www.apache.org/index.  But the relative URL on that
page to foundation/images/asf_logo.gif then resolves to
http://www.apache.org/index/full/foundation/images/asf_logo.gif, and
getting that page leads to....
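
The resolution step can be checked with Python's standard urllib.parse.urljoin, which follows the usual relative-URL rules. This is only a sketch of what a client does with the relative link; it says nothing about Wget's internals.

```python
from urllib.parse import urljoin

# The munged URL that Wget requested, and the relative link on the page it got back.
base = "http://www.apache.org/index/full/4118"
relative = "foundation/images/asf_logo.gif"

# The last segment of the base ("4118") is dropped and the relative link is
# appended to what remains ("/index/full/").
resolved = urljoin(base, relative)
print(resolved)
```

The result is the same URL that shows up as the third request in the log excerpt above.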

Gar.  This is silly.  OK, so I can fix this by redirecting any requests to
www.apache.org/index/full to www.apache.org/, but that feels like and is
an ugly hack.  What's a more general way of solving this?  Is this a bug
in Wget?
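
One possible log-side check, offered only as a rough heuristic and not as the general fix asked about above: flag request paths in which a single segment repeats many times. The function name looks_recursive and the threshold are hypothetical, and a legitimate URL could trip it.

```python
from collections import Counter


def looks_recursive(path, max_repeats=3):
    """Return True if any one path segment occurs more than max_repeats times.

    max_repeats is an arbitrary, hypothetical threshold.
    """
    segments = [s for s in path.split("/") if s]
    if not segments:
        return False
    return max(Counter(segments).values()) > max_repeats


print(looks_recursive("/xerces-c/faq-other.html/resources/resources/resources/resources/script.js"))
print(looks_recursive("/index/full/foundation/images/asf_logo.gif"))
```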

In the XML case, I see the following chain:

xml.apache.org 139.179.10.17 - - [31/Dec/2000:02:15:31 -0800] "GET /xerces-c/feedback.html
HTTP/1.0" 200 15788 "http://xml.apache.org:80/xerces-c/releases.html" "Wget/1.4.5"
xml.apache.org 139.179.10.17 - - [31/Dec/2000:02:18:20 -0800] "GET /xerces-c/faq-other.html/
HTTP/1.0" 200 29459 "http://xml.apache.org:80/xerces-c/feedback.html" "Wget/1.4.5"
xml.apache.org 139.179.10.17 - - [31/Dec/2000:02:18:28 -0800] "GET /xerces-c/faq-other.html/resources/script.js
HTTP/1.0" 200 29459 "http://xml.apache.org:80/xerces-c/faq-other.html/" "Wget/1.4.5"
xml.apache.org 139.179.10.17 - - [31/Dec/2000:02:18:36 -0800] "GET /xerces-c/faq-other.html/resources/resources/script.js
HTTP/1.0" 200 29459 "http://xml.apache.org:80/xerces-c/faq-other.html/resources/script.js"
"Wget/1.4.5"
xml.apache.org 139.179.10.17 - - [31/Dec/2000:02:18:43 -0800] "GET /xerces-c/faq-other.html/resources/resources/resources/script.js
HTTP/1.0" 200 29459 "http://xml.apache.org:80/xerces-c/faq-other.html/resources/resources/script.js"
"Wget/1.4.5"

This is clearly a typo in /xerces-c/feedback.html.  I'll ask to have this
fixed, but it's painful to see how such an easy typo to make can cause
such a cascade.
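
The cascade itself can be reproduced with urllib.parse.urljoin. This sketch assumes the page carries a relative link to resources/script.js (inferred from the log, not checked against the page source) and that every URL in the chain returns that same page, as the identical 29459-byte responses suggest. The helper name follow_relative_link is hypothetical.

```python
from urllib.parse import urljoin


def follow_relative_link(start_url, relative_link, hops):
    """Resolve the same relative link against each URL in turn, the way a
    robot does when every URL returns the same page."""
    urls = [start_url]
    for _ in range(hops):
        urls.append(urljoin(urls[-1], relative_link))
    return urls


chain = follow_relative_link(
    "http://xml.apache.org:80/xerces-c/faq-other.html/",  # the mistyped link, with trailing slash
    "resources/script.js",                                # assumed relative link on the page
    3,
)
for url in chain:
    print(url)
```

Each hop adds one more "resources/" segment, matching the requests in the log excerpt above.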

Anyways, just thought I'd post about this, coz I thought it was a humorous
problem.

	Brian


