nutch-agent mailing list archives

From Rod Taylor <...@sitesell.com>
Subject Re: Crawler submits forms?
Date Tue, 13 Dec 2005 17:13:48 GMT
On Tue, 2005-12-13 at 16:57 +0000, Andy Read wrote:
> Hi,
> 
> I'm using nutch to create a site search facility for a couple of sites.
> 
> I upgraded from 0.6 to 0.7.1 a few days ago and have just noticed that blank
> users are being registered on my site at the exact times the cron job runs
> the crawl tool to re-index the site.  This means that the crawler is now
> submitting a post request from the registration form!  Is this a new
> 'feature' of 0.7 or 0.7.1?  I can't find any mention in changes.txt and I
> can't find any config option referring to it.  Surely the crawler should
> never submit form input?

Nutch follows links. You can argue that it should not extract links from
POST-style forms (this change has been made), but in the end it makes
little difference: if you link to that script in any other way (an a
href, etc.), it will be followed and you get the same result.
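
To illustrate the point, here is a rough sketch of link extraction that treats a form's action as just another URL. This is not Nutch's actual parser; the class name, the skip_post_forms flag, and the /register.php path are made up for the example. Even with POST forms skipped, a plain link to the same script still ends up in the fetch list.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect URLs a naive crawler might follow (hypothetical, not Nutch code)."""

    def __init__(self, skip_post_forms=True):
        super().__init__()
        self.skip_post_forms = skip_post_forms
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Ordinary hyperlinks are always followed.
            self.links.append(attrs["href"])
        elif tag == "form" and attrs.get("action"):
            # A form without a method attribute defaults to GET.
            method = (attrs.get("method") or "get").lower()
            if method == "post" and self.skip_post_forms:
                return
            self.links.append(attrs["action"])


def extract_links(html, skip_post_forms=True):
    parser = LinkExtractor(skip_post_forms)
    parser.feed(html)
    return parser.links


# Assumed page: a POST registration form plus a normal link to the same script.
PAGE = """
<form method="post" action="/register.php"><input name="user"></form>
<a href="/register.php">Sign up</a>
"""

print(extract_links(PAGE))
print(extract_links(PAGE, skip_post_forms=False))
```

Either way the crawler ends up requesting the script with no form data, which is why the fix belongs on the server side.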

Your registration form script is broken: it should not accept invalid
input (or GET requests at all), and robots.txt should be used to protect
dynamic areas from inadvertent use.
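
A minimal sketch of both fixes, assuming a script at /register.php with username and email fields (the path, field names, and function names are guesses for illustration, not taken from your site). The handler refuses anything that is not a POST and rejects blank fields; the second part checks a sample robots.txt with Python's standard urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser


def handle_register(method, form):
    """Return (status, message) for a registration request (hypothetical handler)."""
    if method != "POST":
        # A crawler following a link sends GET, so it never creates a user.
        return 405, "Method Not Allowed"
    username = (form.get("username") or "").strip()
    email = (form.get("email") or "").strip()
    if not username or not email:
        # Blank submissions are rejected instead of being stored.
        return 400, "username and email are required"
    return 200, "registered"


# Assumed robots.txt content keeping well-behaved crawlers away from the script.
ROBOTS_TXT = """User-agent: *
Disallow: /register.php
"""


def crawler_may_fetch(path, robots_txt=ROBOTS_TXT, agent="*"):
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


print(handle_register("GET", {}))
print(handle_register("POST", {"username": "", "email": ""}))
print(crawler_may_fetch("/register.php"))
```

The robots.txt entry only helps with crawlers that honor it, so the input validation is still needed regardless.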

-- 
Rod Taylor <rbt@sitesell.com>

