nutch-user mailing list archives

From "David M. Cole" <...@colegroup.com>
Subject Re: Difference between Deiselpoint and Nutch?
Date Fri, 18 Sep 2009 17:16:05 GMT
At 12:46 PM -0400 9/18/09, Paul Tomblin wrote:
>Nutch is, I think, doing the right thing by not
>crawling it, but I can't convince her of this because she's convinced that
>DP is commercial and Nutch is "only" Open Source, so obviously DP is right.

Just the opposite ... the commercial product is doing it *wrong* (not 
respecting robots.txt) while the open source product is doing it 
*right* (respecting the file).

The client is ornery and is doing something patently against the 
wishes (expressed in the robots.txt file) of the owner(s) of the 
content (unless she has permission, in which case get the owner[s] of 
the content to include your Nutch agent name in their robots.txt 
file[s]).
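
For what it's worth, here is a rough sketch of what "include your Nutch 
agent name in their robots.txt" would look like and how to check it, 
using Python's standard urllib.robotparser. The agent name 
"MyNutchCrawler", the example.com URL and the robots.txt contents are 
all made up for illustration; the real agent name would be whatever is 
set for http.agent.name in the Nutch configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt the content owner might publish: every crawler
# is shut out except one named agent. An empty "Disallow:" means
# "nothing is disallowed" for that agent.
ROBOTS_TXT = """\
User-agent: MyNutchCrawler
Disallow:

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    page = "http://example.com/page.html"  # placeholder URL
    for agent in ("MyNutchCrawler", "OtherBot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, page))
```

With a file like that in place, the named agent is allowed in and 
everything else stays excluded, which is the owner saying "yes" in 
writing rather than the crawler ignoring a "no".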

I know how few and far between paying clients are these days, but 
personally -- under the circumstances you've described -- I think 
I'd walk away from this project.

\dmc

PS: The robots.txt file shouldn't have any mention of a sitemap, 
except possibly a line giving the sitemap's URL.

-- 
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
    David M. Cole                                            dmc@colegroup.com
    Editor & Publisher, NewsInc. <http://newsinc.net>        V: (650) 557-2993
    Consultant: The Cole Group <http://colegroup.com/>       F: (650) 475-8479
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
