nutch-user mailing list archives

From "David M. Cole" <...@colegroup.com>
Subject Re: Difference between Dieselpoint and Nutch?
Date Fri, 18 Sep 2009 16:06:35 GMT
At 11:30 AM -0400 9/18/09, Paul Tomblin wrote:
>Is anybody here familiar with how Dieselpoint (DP) works?

Dieselpoint is designed specifically for intranets and therefore 
doesn't take robots.txt into account because the Dieselpoint 
administrator and the web administrator (theoretically) work toward 
the same goals (see the thread from last Friday, "Ignoring 
Robots.txt" for an instance where that wasn't the case).

Nutch is designed specifically for all-web crawling (like Google or 
Bing) and respects robots.txt because Nutch needs to be polite when 
indexing sites over which it has no control.

Your client has a robots.txt file to control Google and/or Bing, so 
Nutch is respecting it the same way Google or Bing would.
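
For reference, the robots.txt that's getting in the way doesn't have 
to be anything exotic; a garden-variety file like this (the paths are 
invented for illustration) is enough to shut Nutch out of those 
directories:

    User-agent: *
    Disallow: /private/
    Disallow: /search

Nutch honors that wildcard record exactly the way Google or Bing would.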

While Nutch is not designed as an intranet indexer, it can be used 
that way, but the Nutch administrator must make some compromises. You 
won't be able to get around this by scripting wget to collect URLs 
for Nutch to index, because Nutch will still respect the robots.txt 
file when it fetches them. You have to work around the problem at the 
robots.txt level.

You can attack the problem in one of these ways:

*Modify the nutch-default.xml file, changing the http.robots.agents 
property accordingly (search the list for "Jake Jacobson" to see how 
to do this). Then create a specific record in your client's 
robots.txt file that cites the Nutch user agent and allows a crawl of 
everything.
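
Roughly, that looks like the following. The agent name "mynutch" is 
just a placeholder; use whatever you've set as http.agent.name. In 
conf/nutch-default.xml (or an override in conf/nutch-site.xml):

    <property>
      <name>http.agent.name</name>
      <value>mynutch</value>
    </property>
    <property>
      <name>http.robots.agents</name>
      <value>mynutch,*</value>
    </property>

And in the client's robots.txt, a record for that agent that allows 
everything (an empty Disallow means nothing is off limits):

    User-agent: mynutch
    Disallow: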

*Modify Nutch to ignore robots.txt files; the robots handling lives 
in the HTTP protocol layer (the lib-http plugin's RobotRulesParser), 
so that is the code you will need to work on.

*Modify the robots.txt file either by hand or by script. If you're 
only crawling once (unlikely), just open the robots.txt file, comment 
out the offending lines, save it, run Nutch, then reopen the file, 
uncomment the lines, and save again. If you're crawling on a cron 
job, write one script, scheduled just before Nutch's crawl, that 
alters the robots.txt file, and another that changes it back after 
Nutch finishes.
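
A minimal sketch of the cron version, assuming you keep a permissive 
copy and the normal copy side by side (the paths and times here are 
made up):

    # crontab on the web server
    # swap in the wide-open robots.txt just before the 2:00 a.m. crawl
    55 1 * * *   cp /var/www/robots-open.txt   /var/www/robots.txt
    # put the regular one back once the crawl should be long finished
    30 5 * * *   cp /var/www/robots-normal.txt /var/www/robots.txt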

*In Apache (or whatever HTTP server you use), create a ruleset that 
delivers one robots.txt file (which allows crawling everywhere) to 
the IP address where Nutch is running and the regular one to all 
other IP addresses.
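
With Apache and mod_rewrite, for instance, something along these 
lines in the virtual host would do it (the IP address and the 
alternate filename are invented):

    RewriteEngine On
    # Hand the wide-open robots file only to the machine running Nutch
    RewriteCond %{REMOTE_ADDR} ^192\.168\.1\.50$
    RewriteRule ^/robots\.txt$ /robots-nutch.txt [L]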

There may be other kludges available.

Hope this helps.

\dmc

-- 
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
    David M. Cole                                            dmc@colegroup.com
    Editor & Publisher, NewsInc. <http://newsinc.net>        V: (650) 557-2993
    Consultant: The Cole Group <http://colegroup.com/>       F: (650) 475-8479
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
