nutch-agent mailing list archives

From Erik Lundberg <lundb...@cs.washington.edu>
Subject [sin #177] [6293] Your Nutch Crawler is Out of Control - Apache Notified (fwd)
Date Wed, 28 Sep 2005 17:01:36 GMT
Dear Web Experts America,

Please see the message below, regarding your complaint about a Nutch
Crawler running on host 'turingc@cs.washington.edu'.

If you can provide us with more detailed information about the
incident, we can investigate further.  

 Erik Lundberg
 Director, CS Laboratory
 Department of Computer Science & Engineering
 University of Washington

     ---------- Original Message ----------
     Date: Fri, 23 Sep 2005 12:25:49 -0700
     From: WebExpertsAmerica <expert@WebExpertsAmerica.com>
     To: abuse@cac.washington.edu, noc@cac.washington.edu
     Cc: nutch-agent@lucene.apache.org
     Subject: Your Nutch Crawler is Out of Control - Apache Notified

     Your crawler is ignoring our robots.txt file.

     http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)"
     128.95.1.189

     You are eating bandwidth at our domain in incredible amounts. This is
     rude.

     Please stop or we will be forced to block your IP and the crawler you
     are using.

     Best Regards,

     Web Experts America
     ---------


 ----Forwarded Message

This may refer to a crawling task I ran intermittently over the last
three weeks.  We're definitely observing robots.txt, with code that's
been widely tested.  (Nutch is an Apache project that's been around
for 3 years.)

It's possible there's a bug in the robots code, but I'd find that
somewhat surprising.  The only other thing I can think of is that
WebExpertsAmerica is a Search Engine Optimization company, and they
might be doing something slightly tricky or unusual that confuses
Nutch's politeness guarantees.

It's hard for me to say much else (e.g., how many of their pages we
actually crawled, whether this is a widely seen problem) without a
little more info (e.g., which domains they're complaining about, what
other complaints we might have received).  I'm happy to talk to you
or anyone at CAC about any further action needed.

Note that the task has been complete for some time, and I have no more
crawling plans anytime soon.

   --Mike

