nutch-user mailing list archives

From Super Man <>
Subject Ignoring Robots.txt
Date Fri, 11 Sep 2009 09:30:05 GMT

I want to crawl a website that denies access to all crawlers. The
website is our own, so there is no issue with crawling it, but the
sysadmin doesn't want to change robots.txt for fear that, once we
allow one crawler, many others will impersonate it.

Is it possible to configure Nutch to ignore robots.txt? I set the
Protocol.CHECK_ROBOTS property to false in nutch-site.xml, but that
doesn't seem to help.
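For reference, here is the override I tried, as a sketch. I am assuming the Java constant Protocol.CHECK_ROBOTS resolves to the configuration key protocol.plugin.check.robots; that is what it looks like in my copy of the source, but I may be reading it wrong, so please correct me if the key is different in your version:

```xml
<!-- Fragment of conf/nutch-site.xml: attempted override.
     Assumes Protocol.CHECK_ROBOTS maps to this key; verify
     against the Protocol class in your Nutch source tree. -->
<property>
  <name>protocol.plugin.check.robots</name>
  <value>false</value>
  <description>Attempt to disable robots.txt checking during fetch.</description>
</property>
```

With this in place the fetcher still skips the pages that robots.txt disallows, which is why I suspect the property is not honored (or I have the wrong key).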

Any clues?

