nutch-user mailing list archives

From John Mendenhall <j...@surfutopia.net>
Subject Re: Ignoring Robots.txt
Date Fri, 11 Sep 2009 17:17:56 GMT
Zee,

> My sysadmin refuses to change the robots.txt, citing the following reason:
> 
> The moment he allows a specific agent, a lot of crawlers impersonate
> that user agent and try to crawl the site.
> 
> Are you saying there is no way to configure nutch to ignore robots.txt?

We had a similar situation.

We modified the parse-html plugin to add a configurable flag that
controls whether or not it adheres to robots.txt.  It works great.
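
Roughly, the idea is something like the sketch below.  The property name
(parser.ignore.robots) and the helper class are just for illustration, not
the exact change we made; the flag is read through the standard Hadoop
Configuration object that Nutch plugins already receive:

  import org.apache.hadoop.conf.Configuration;

  /**
   * Illustrative helper only: shows how a "skip robots.txt" flag could be
   * read from the Nutch configuration.  The property name below is made up
   * for this example.
   */
  public class RobotsFlag {

    /** Hypothetical property controlling whether robots.txt is enforced. */
    public static final String IGNORE_ROBOTS_KEY = "parser.ignore.robots";

    private final boolean ignoreRobots;

    public RobotsFlag(Configuration conf) {
      // Default to false so robots.txt is still respected unless the
      // flag is explicitly flipped in nutch-site.xml.
      this.ignoreRobots = conf.getBoolean(IGNORE_ROBOTS_KEY, false);
    }

    /** True if the plugin should skip the robots.txt check entirely. */
    public boolean shouldSkipRobotsCheck() {
      return ignoreRobots;
    }
  }

The nice part of keeping it as a configuration property is that the
behavior can be toggled per deployment in conf/nutch-site.xml, without
rebuilding the plugin.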

JohnM

-- 
john mendenhall
john@surfutopia.net
surf utopia
internet services
