nutch-user mailing list archives

From John Mendenhall
Subject Re: Ignoring Robots.txt
Date Fri, 11 Sep 2009 17:17:56 GMT

> My sysadmin refuses to change the robots.txt, citing the following reason:
> the moment he allows a specific agent, a lot of crawlers impersonate
> that user agent and try to crawl the site.
> Are you saying there is no way to configure nutch to ignore robots.txt?

We had a similar situation.

We modified the parse-html plugin, adding a configurable flag
that controls whether or not it adheres to robots.txt.  Works
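
A minimal sketch of what a flag like this could look like, written in
plain Java rather than against Nutch's actual plugin code. The class
name RobotsPolicy and the property name "robots.obey" are hypothetical
and chosen only for illustration; they are not real Nutch settings, and
this is not the patch described above. The flag defaults to obeying
robots.txt, so ignoring it has to be switched on explicitly.

```java
import java.util.Properties;

// Hypothetical helper: decides whether a fetch may proceed, given a
// configurable "obey robots.txt" flag. Not part of Nutch.
public class RobotsPolicy {
    // Assumed property name, for illustration only.
    static final String OBEY_KEY = "robots.obey";

    private final boolean obeyRobots;

    public RobotsPolicy(Properties conf) {
        // Default to "true" so robots.txt is honored unless disabled.
        this.obeyRobots = Boolean.parseBoolean(conf.getProperty(OBEY_KEY, "true"));
    }

    // robotsAllows: the result of whatever robots.txt check the crawler
    // already performs for the URL in question.
    public boolean mayFetch(boolean robotsAllows) {
        return !obeyRobots || robotsAllows;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(OBEY_KEY, "false");
        RobotsPolicy policy = new RobotsPolicy(conf);
        // robots.txt disallows the URL, but the flag says to ignore it.
        System.out.println(policy.mayFetch(false)); // prints "true"
    }
}
```

A switch like this is only appropriate for sites you operate or have
explicit permission to crawl.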


john mendenhall
surf utopia
internet services
