nutch-user mailing list archives

From Kirby Bohling <>
Subject Re: Ignoring Robots.txt
Date Fri, 11 Sep 2009 18:03:44 GMT
On Fri, Sep 11, 2009 at 12:05 PM, Super Man <> wrote:
> My sysadmin refuses to change the robots.txt, citing the following reason:
> the moment he allows a specific agent, a lot of crawlers impersonate
> that user agent and try to crawl that site.
> Are you saying there is no way to configure nutch to ignore robots.txt?
> Thanks,
> Zee

robots.txt isn't enforced; it's merely a convention that polite people
agree upon.  I'm shocked that crawlers which would respect robots.txt
would then go to the trouble of impersonating a different crawler.  I
suppose it makes some sense: just act like Google's bot if that's
allowed, and then Google gets the blame, not them.

Heck, just serve up a different robots.txt file internally than
externally and move on with life (it's not that difficult to do in
most web servers).  I'm sure that Nutch's robots.txt handling has no
built-in work-around precisely to make it more difficult for people
who don't understand what robots.txt is for to just disable it.
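
For instance, with Apache httpd and mod_rewrite you can hand internal
clients a different file.  A minimal sketch; the 10.0.0.0/8 range and
the robots-internal.txt filename are assumptions to adapt:

RewriteEngine On
# Internal clients (10.0.0.0/8 here) get the permissive copy
RewriteCond %{REMOTE_ADDR} ^10\.
RewriteRule ^/robots\.txt$ /robots-internal.txt [L]
# Everyone else falls through to the regular /robots.txt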

The interface that controls robots.txt behavior is RobotRules.

It's not terribly difficult to identify various ways to fix that
(there are two obvious ones just from looking at the interface that
would gut all robots.txt handling for all websites everywhere).  I
would instead convince the SA to make the changes required (drop on
the floor any request that uses our robot's User-Agent from the wrong
IP range, or serve up different robots.txt files to internal IPs).
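
A sketch of the first option, again for Apache httpd with mod_rewrite
(the "my-robot" agent string and the 10.0.0.0/8 internal range are
placeholders):

RewriteEngine On
# Requests that claim our robot's User-Agent ("my-robot" is a placeholder)...
RewriteCond %{HTTP_USER_AGENT} my-robot [NC]
# ...but come from outside the internal 10.0.0.0/8 range are refused outright
RewriteCond %{REMOTE_ADDR} !^10\.
RewriteRule .* - [F]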

Barring that, I'd write another implementation of RobotRules for
specific domains listed in a configuration file; the appropriate
methods would just return "the robots.txt says we are allowed".  I'd
go to the trouble of scoping it to listed domains just in case I
misconfigured which domains it was allowed to crawl.  At least then
I'd stand a chance of not ignoring robots.txt on every server on the
planet.
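
Something like the following, as a rough sketch only: the method
signatures follow my memory of the org.apache.nutch.protocol.RobotRules
interface in Nutch 1.0, so check them against the release you're
running, and WhitelistRobotRules is just an illustrative name.

import java.net.URL;
import java.util.Set;

import org.apache.nutch.protocol.RobotRules;

public class WhitelistRobotRules implements RobotRules {

  // Hosts we are explicitly permitted to crawl without consulting robots.txt
  private final Set<String> whitelistedHosts;

  public WhitelistRobotRules(Set<String> whitelistedHosts) {
    this.whitelistedHosts = whitelistedHosts;
  }

  // "The robots.txt says we are allowed" -- but only for whitelisted hosts,
  // so a misconfigured crawl still respects robots.txt everywhere else.
  public boolean isAllowed(URL url) {
    return whitelistedHosts.contains(url.getHost().toLowerCase());
  }

  // No Crawl-Delay directive applies to these synthetic rules.
  public long getCrawlDelay() {
    return -1;
  }

  // Never treat the synthetic rules as expired.
  public long getExpireTime() {
    return Long.MAX_VALUE;
  }
}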


> On Fri, Sep 11, 2009 at 9:10 PM, David M. Cole <> wrote:
>> At 3:00 PM +0530 9/11/09, Super Man wrote:
>>> Any clues?
>> Zee:
>> The robots.txt protocol allows for identifying different user-agents within
>> a single file, with each getting its own individual set of privileges (see
>> the robots.txt protocol documentation for more info).
>> Ask your sysadmin to include an additional robots privilege record for the
>> robot-name you choose that allows your robot access where others are not
>> allowed.
>> You can set the user-agent in the nutch-default.xml file, changing the
>> http.robots.agents tag accordingly. As Jake Jacobson found out in June, you
>> *must* end the series of user-agents in the http.robots.agents tag with an
>> asterisk (*), i.e.:
>> <property>
>>     <name>http.robots.agents</name>
>>     <value>my-robot,*</value>
>> </property>
>> Hope this helps.
>> \dmc
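
For reference, a minimal robots.txt with the kind of per-agent record
David describes, letting one named robot in while keeping everyone else
out ("my-robot" here matches the agent name used in the property above):

# Allow my-robot everywhere; an empty Disallow means nothing is disallowed
User-agent: my-robot
Disallow:

# Every other crawler is shut out entirely
User-agent: *
Disallow: /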
