nutch-user mailing list archives

From Guillermo Garrido <ggarr...@lsi.uned.es>
Subject Re: Ignoring Robots.txt
Date Fri, 11 Sep 2009 17:42:10 GMT
Certainly, Nutch must follow robots.txt.

Otherwise you risk getting your IP banned, or worse.

I find it quite illogical to refuse to change robots.txt because an agent
can declare a fake agent name while, on the other hand, letting a crawler
that ignores robots.txt run over your site.
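
To make the point concrete, here is a minimal sketch using Python's standard
urllib.robotparser rather than Nutch itself. The robots.txt content, the agent
names ExampleBot and OtherBot, and the URL are all hypothetical. It shows that
the check runs entirely on the crawler's side against whatever agent name the
crawler chooses to declare.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named agent is allowed, all others are blocked.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow:

User-agent: *
Disallow: /
"""


def make_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = make_parser(ROBOTS_TXT)
url = "http://example.com/page.html"  # hypothetical URL

# A polite crawler asks before fetching, using the name it declares.
allowed_named = parser.can_fetch("ExampleBot", url)
allowed_other = parser.can_fetch("OtherBot", url)
print(allowed_named, allowed_other)
```

A crawler that simply declares itself as ExampleBot passes the same check, and
one that never consults robots.txt is not affected by it at all, so keeping the
file restrictive only stops the crawlers that were going to behave anyway.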

2009/9/11 Fuad Efendi <fuad@efendi.ca>

> >
> > My sysadm refuses to change the robots.txt citing the following reason:
> >
> > The moment he allows a specific agent, a lot of crawlers impersonate
> > that user agent and try to crawl that site.
>
>
>
> Extremely strange thoughts of some smart sys-minds...
>
> If a crawler wants to impersonate... it will, and it will ignore robots.txt,
> and the sysadmin may ban such an IP... I don't know of any such public
> crawler, except some desktop-based download agents such as WebCEO or
> Teleport, or even IE and Firefox...
>
> No way, Nutch must follow robots.txt.
>
>
>
