nutch-user mailing list archives

From "David M. Cole" <>
Subject Re: Ignoring Robots.txt
Date Fri, 11 Sep 2009 15:40:02 GMT
At 3:00 PM +0530 9/11/09, Super Man wrote:
>Any clues?


The robots.txt protocol allows for identifying different user-agents 
within the one file, with each getting their own individual set of 
privileges (see for more info).

Ask your sysadmin to include an additional robots privilege record 
for the robot-name you choose that allows your robot access where 
others are not allowed.
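
As a rough sketch, such a record might look like the following. The 
robot name "mybot" is hypothetical, and so is the assumption that the 
site otherwise shuts out all robots; substitute whatever name and paths 
apply to your site. An empty Disallow line means nothing is off limits 
for that user-agent:

```
# Hypothetical record: "mybot" may fetch everything
User-agent: mybot
Disallow:

# Every other robot is kept out of the whole site
User-agent: *
Disallow: /
```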

You can set the user-agent in the nutch-default.xml file, changing 
the http.robots.agents property accordingly. As Jake Jacobson found out 
in June, you *must* end the comma-separated series of user-agents in 
the http.robots.agents property with an asterisk (*).
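
A minimal sketch of what such a property could look like, assuming the 
same hypothetical robot name "mybot" (use the name your sysadmin added 
to robots.txt). Local overrides are usually placed in nutch-site.xml 
rather than edited into nutch-default.xml directly, but the property 
has the same form in either file:

```xml
<!-- "mybot" is a placeholder robot name; the list must end with * -->
<property>
  <name>http.robots.agents</name>
  <value>mybot,*</value>
</property>
```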


Hope this helps.


    David M. Cole                                  
    Editor & Publisher, NewsInc. <>        V: (650) 557-2993
    Consultant: The Cole Group <>       F: (650) 475-8479
