nutch-user mailing list archives

From "Fuad Efendi" <f...@efendi.ca>
Subject RE: Ignoring Robots.txt
Date Fri, 11 Sep 2009 17:18:16 GMT
> 
> My sysadm refuses to change the robots.txt citing the following reason:
> 
> The moment he allows a specific agent, a lot of crawlers impersonate
> that user agent and try to crawl the site.



Extremely strange reasoning from some smart sysadmin minds...

If a crawler wants to impersonate another agent, it will, and it will ignore
robots.txt anyway; the sysadmin can then ban its IP. I don't know of any
public crawler that does this, only desktop download agents such as WebCEO
or Teleport, or even IE and Firefox...

No way, Nutch must follow robots.txt.
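
To illustrate why an allow rule gives an impersonator nothing new: robots.txt
rules are matched only against the name the client chooses to announce, and
the server never enforces them. Below is a minimal sketch using Python's
standard urllib.robotparser. The agent name "MyNutchBot", the robots.txt
content and the example.com URL are all made up for illustration; they are
not taken from the site under discussion.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: allow one named agent, disallow everyone else.
ROBOTS_TXT = """\
User-agent: MyNutchBot
Disallow:

User-agent: *
Disallow: /
"""


def allowed(agent: str, url: str) -> bool:
    """Return True if robots.txt permits `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


url = "http://example.com/page.html"

# A polite crawler checks the rules under its own name and obeys the answer.
print("MyNutchBot:", allowed("MyNutchBot", url))
print("SomeOtherBot:", allowed("SomeOtherBot", url))

# An impolite crawler never performs this check at all. It just sends
# whatever User-Agent header it likes and requests the page, so only
# server-side measures (such as an IP ban) can actually stop it.
```

The check runs entirely on the crawler's side, so a client willing to lie
about its name is equally free to skip the check.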


