nutch-user mailing list archives

From Paul Tomblin <ptomb...@xcski.com>
Subject Re: Difference between Dieselpoint and Nutch?
Date Fri, 18 Sep 2009 16:46:49 GMT
On Fri, Sep 18, 2009 at 12:06 PM, David M. Cole <dmc@colegroup.com> wrote:

> At 11:30 AM -0400 9/18/09, Paul Tomblin wrote:
>
>> Is anybody here familiar with how Dieselpoint (DP) works?
>>
>
> Dieselpoint is designed specifically for intranets and therefore doesn't
> take robots.txt into account, because the Dieselpoint administrator and the
> web administrator (theoretically) work toward the same goals (see last
> Friday's thread, "Ignoring Robots.txt," for an instance where that wasn't
> the case).
>
> Nutch is designed specifically for all-web crawling (like Google or Bing)
> and respects robots.txt because Nutch needs to be polite when indexing sites
> over which it has no control.
>
> Your client has a robots.txt file to control Google and/or Bing, so Nutch
> is respecting it the same way Google or Bing would.
>
>
I'm afraid I wasn't clear.  The site that the client is indexing with DP is
an external site, not hers.  Nutch is, I think, doing the right thing by not
crawling it, but I can't convince her of this because she's convinced that
DP is commercial and Nutch is "only" Open Source, so obviously DP is right.
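
For what it's worth, the rule Nutch is honoring is presumably nothing more
exotic than something like this (a made-up robots.txt just to illustrate; the
real one on her target site will differ):

    User-agent: *
    Disallow: /private/

Any URL under /private/ gets skipped, and as I understand it the same goes for
rules aimed at whatever agent name is configured via http.agent.name in
nutch-site.xml.  Google and Bing would obey the same directives.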

The site in question does have several sitemaps.  Can Nutch do anything with
sitemaps?  (By the way, what does it mean when the robots.txt file lists
more than one sitemap?)
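
My guess, and I may be wrong, is that multiple Sitemap: lines simply mean the
site publishes several sitemap files (or sitemap indexes), each listed on its
own line, e.g. (the example.com URLs here are invented):

    Sitemap: http://www.example.com/sitemap-products.xml
    Sitemap: http://www.example.com/sitemap-news.xml

where each of those can itself be a sitemap index pointing at further sitemap
files, per the sitemaps.org protocol:

    <?xml version="1.0" encoding="UTF-8"?>
    <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <sitemap>
        <loc>http://www.example.com/sitemap-part1.xml</loc>
      </sitemap>
      <sitemap>
        <loc>http://www.example.com/sitemap-part2.xml</loc>
      </sitemap>
    </sitemapindex>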

-- 
http://www.linkedin.com/in/paultomblin
