nutch-user mailing list archives

From "Fuad Efendi" <>
Subject RE: Ignoring Robots.txt
Date Fri, 11 Sep 2009 17:18:16 GMT
> My sysadm refuses to change the robots.txt citing the following reason:
> The moment he allows a specific agent, a lot of crawlers impersonate
> as that user agent and tries to crawl that site.

Extremely strange thoughts of some smart sys-minds...

If a crawler wants to impersonate... it will, and it will ignore robots.txt, and
the sysadmin may ban such an IP... I don't know of any such public crawler except
some desktop-based download agents such as WebCEO or Teleport or even IE and

No way, Nutch must follow robots.txt.
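
To illustrate the point, here is a minimal sketch (in Python rather than Nutch's own Java code) of what following robots.txt means for a well-behaved crawler. The robots.txt content, the agent names "MyNutchBot" and "OtherBot", and the example.com URL are all hypothetical. The check relies only on the name the crawler declares for itself, so a crawler that wants to impersonate can simply skip this check or declare another name:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named agent is allowed everywhere,
# every other agent is disallowed everywhere.
ROBOTS_TXT = """\
User-agent: MyNutchBot
Disallow:

User-agent: *
Disallow: /
"""


def allowed(agent: str, url: str) -> bool:
    """Return True if robots.txt permits `agent` to fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as a list of lines.
    parser.parse(ROBOTS_TXT.splitlines())
    # The decision depends only on the self-declared agent name.
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    print(allowed("MyNutchBot", "http://example.com/page"))
    print(allowed("OtherBot", "http://example.com/page"))
```

A polite crawler runs a check like this before every fetch; an impolite one never does, which is why banning by IP is the sysadmin's real defense.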
