nutch-user mailing list archives

From "David M. Cole" <...@colegroup.com>
Subject Re: Ignoring Robots.txt
Date Fri, 11 Sep 2009 15:40:02 GMT
At 3:00 PM +0530 9/11/09, Super Man wrote:
>Any clues?

Zee:

The robots.txt protocol allows a single file to contain separate 
records for different user-agents, each with its own set of 
privileges (see http://www.robotstxt.org/ for more info).

Ask your sysadmin to include an additional robots privilege record 
for the robot-name you choose that allows your robot access where 
others are not allowed.
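
As a rough illustration of such a per-agent record, here is a minimal 
sketch using Python's standard urllib.robotparser. The robots.txt 
content, the /private/ path, the example.com host and the "other-bot" 
name are assumptions made up for the example; only "my-robot" comes 
from the property shown further down. It shows how a parser might 
treat the two records, not how Nutch itself evaluates them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one record for our own robot, one for everyone else.
ROBOTS_TXT = """\
User-agent: my-robot
Disallow:

User-agent: *
Disallow: /private/
"""


def is_allowed(agent, url, robots_txt=ROBOTS_TXT):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    url = "http://example.com/private/page.html"  # hypothetical URL
    for agent in ("my-robot", "other-bot"):
        print(agent, is_allowed(agent, url))
```

The empty Disallow line in the first record places no restriction on 
my-robot, while every other agent falls through to the catch-all 
record and is kept out of /private/.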

You can set the user-agent by changing the http.robots.agents property 
(defined in nutch-default.xml; local overrides normally go in 
nutch-site.xml). As Jake Jacobson found out in June, you *must* end 
the list of user-agents in the http.robots.agents value with an 
asterisk (*), e.g.:

<property>
      <name>http.robots.agents</name>
      <value>my-robot,*</value>
</property>

Hope this helps.

\dmc

-- 
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
    David M. Cole                                            dmc@colegroup.com
    Editor & Publisher, NewsInc. <http://newsinc.net>        V: (650) 557-2993
    Consultant: The Cole Group <http://colegroup.com/>       F: (650) 475-8479
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
