nutch-agent mailing list archives

From "Fred Tyre" <fred.t...@hlipublishing.com>
Subject Nutch Problems (0.8-dev)
Date Wed, 26 Jul 2006 22:29:32 GMT

Our web server has been receiving a lot of failed requests from shopping.com
and irl.cs.tamu.edu.

I believe your crawler is reading the "&sect" at the start of "&section" as
the entity for "§", so "&section" becomes "§ion":

http://www.businessair.com/avdealers.cfm?alpha_choice=ALL§ion=AC&first_sort_by_column=DEALRSTATE&sort_by_columns=DEALRSTATEDESC,DEALRNAME,CITY,DESCRIPTION

The URL should be...

http://www.businessair.com/avdealers.cfm?alpha_choice=ALL&section=AC&first_sort_by_column=DEALRSTATE&sort_by_columns=DEALRSTATEDESC,DEALRNAME,CITY,DESCRIPTION

This would only be a minor problem, except that your bot is sending many
requests in a row and waiting only about a second between them.

A typical user can only click on a page a few seconds after the request
has been fulfilled.  A bot should therefore make at most one request every
15-20 seconds.

It doesn't look like your bot even waited for the page to finish loading.

A system admin could see the above actions as a Denial of Service
attack.
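
To illustrate the request spacing asked for above, here is a minimal sketch
of a per-host delay tracker. It is not Nutch code: the class name HostDelay,
the method reserve, and the 15-second figure in the usage note are
assumptions made for illustration. Time is passed in explicitly so the
logic is easy to check.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: spaces out requests to the same host. */
public class HostDelay {
    private final long minDelayMillis;
    // Earliest time (ms) at which each host may be contacted again.
    private final Map<String, Long> nextAllowed = new HashMap<>();

    public HostDelay(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    /**
     * Reserves the next request slot for the host and returns how many
     * milliseconds the caller should wait before sending the request.
     */
    public synchronized long reserve(String host, long nowMillis) {
        long earliest = nextAllowed.getOrDefault(host, nowMillis);
        long start = Math.max(earliest, nowMillis);
        // The following request to this host must wait a full delay after this one starts.
        nextAllowed.put(host, start + minDelayMillis);
        return start - nowMillis;
    }
}
```

With new HostDelay(15000), a second request to the same host one second
after the first would be told to wait 14 seconds, while a request to a
different host could go out immediately. This sketch spaces requests by
their start time; waiting until the previous response has finished would
need the completion time recorded instead.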

As for "&section" being replaced with "§ion"...

Under the file ...
   nutch\html\Entities.java

there is an area adding special characters.  However, I believe that those
special characters are supposed to start with & and end in ; (e.g. &sect; or
&nbsp;).  I have not recompiled the code yet, but I believe that requiring
the trailing ; before replacing anything should remedy the problem.
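
The rule described above (replace an entity only when it starts with & and
ends in ;) can be sketched as follows. This is not the code in
Entities.java, which I have not reproduced; the class StrictEntityDecoder,
its small entity table, and the name-length limit are assumptions made for
illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical decoder: replaces an entity only when it is terminated by ';'. */
public class StrictEntityDecoder {
    // Small illustrative table; a real decoder would cover the full entity set.
    private static final Map<String, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("amp", "&");
        ENTITIES.put("lt", "<");
        ENTITIES.put("gt", ">");
        ENTITIES.put("quot", "\"");
        ENTITIES.put("nbsp", "\u00A0");
        ENTITIES.put("sect", "\u00A7");
    }

    // Assumed upper bound on entity name length, to avoid long scans.
    private static final int MAX_NAME_LENGTH = 8;

    public static String decode(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '&') {
                int semi = input.indexOf(';', i + 1);
                // Decode only if a ';' closes a short, known entity name.
                if (semi > i + 1 && semi - i - 1 <= MAX_NAME_LENGTH) {
                    String replacement = ENTITIES.get(input.substring(i + 1, semi));
                    if (replacement != null) {
                        out.append(replacement);
                        i = semi + 1;
                        continue;
                    }
                }
            }
            // Anything else, including "&section", is copied through unchanged.
            out.append(c);
            i++;
        }
        return out.toString();
    }
}
```

Under this rule, decode("alpha_choice=ALL&section=AC") returns the input
unchanged, while decode("&sect;5") still yields "§5".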

Please keep me informed of your progress, or I will be forced to block your
bots (which I would prefer not to do).

Thanks.

Sincerely,
Fred

><><><><><><><><><><><><><><><><><><
   Fred Tyre
   Information Services
   Heartland Communications, Inc.
   515-574-2147
   Fred.Tyre@hlipublishing.com
><><><><><><><><><><><><><><><><><><
