nutch-user mailing list archives

From Super Man <z35...@gmail.com>
Subject Re: Ignoring Robots.txt
Date Fri, 11 Sep 2009 17:05:01 GMT
My sysadmin refuses to change the robots.txt, citing the following reason:

The moment he allows a specific agent, a lot of crawlers impersonate
that user agent and try to crawl the site.

Are you saying there is no way to configure Nutch to ignore robots.txt?

Thanks,
Zee

On Fri, Sep 11, 2009 at 9:10 PM, David M. Cole <dmc@colegroup.com> wrote:
> At 3:00 PM +0530 9/11/09, Super Man wrote:
>>
>> Any clues?
>
> Zee:
>
> The robots.txt protocol allows different user-agents to be identified
> within a single file, each getting its own set of privileges (see
> http://www.robotstxt.org/ for more info).
>
> Ask your sysadmin to add a privilege record for the robot name you
> choose, granting your robot access where others are not allowed.
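>
> For example, a robots.txt record set that admits one named robot while
> blocking all others might look like this (the name "my-robot" is only a
> placeholder -- use whatever robot name you configure in Nutch):
>
>     User-agent: my-robot
>     Disallow:
>
>     User-agent: *
>     Disallow: /
>
> An empty Disallow line means nothing is disallowed, so my-robot may
> crawl the whole site while every other agent is kept out.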
>
> You can set the user-agent in the nutch-default.xml file, changing the
> http.robots.agents tag accordingly. As Jake Jacobson found out in June, you
> *must* end the series of user-agents in the http.robots.agents tag with an
> asterisk (*), i.e.:
>
> <property>
>     <name>http.robots.agents</name>
>     <value>my-robot,*</value>
> </property>
>
> Hope this helps.
>
> \dmc
>
> --
> *+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
>   David M. Cole                                            dmc@colegroup.com
>   Editor & Publisher, NewsInc. <http://newsinc.net>        V: (650) 557-2993
>   Consultant: The Cole Group <http://colegroup.com/>       F: (650) 475-8479
> *+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
>
