nutch-user mailing list archives

From Paul Tomblin <ptomb...@xcski.com>
Subject Difference between Dieselpoint and Nutch?
Date Fri, 18 Sep 2009 15:30:46 GMT
Is anybody here familiar with how Dieselpoint (DP) works?  I'm working on a
contract to replace DP with Nutch because the person paying me decided that
she didn't want to pay the licensing costs for DP.  But one huge bone of
contention has come up - on one of the sites that she tells DP to index, she
only wants the one page (it's evidently a search page that she passes some
parameters to).  DP is happy to do it, but Nutch looks at the robots.txt
file, says "hey, I'm not supposed to crawl this directory", and won't
download the page.  So she's mad at me because it's somehow my fault that DP
works differently than Nutch.  She keeps saying "DP is a proper commercial
product, they wouldn't be doing something they're not supposed to do" (to
which I think but don't say "tell that to all the companies that have been
screwed by Microsoft").  So is DP doing the right thing by fetching the
requested page or not?
I'm tempted to just write a script that does a wget to fetch that one page
to a local directory, and then tell Nutch to crawl that directory.
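
For reference, the robots.txt check described above can be reproduced outside Nutch. This is only a rough sketch using Python's standard urllib.robotparser; the robots.txt content, the "Nutch" agent string, and the example.com URLs are all made-up placeholders, since the real site's file isn't shown here. It is not how Nutch itself implements the check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (assumption: the real file is not shown
# in this message, only that it disallows the directory holding the page).
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Placeholder URLs: one inside the disallowed directory, one outside it.
    for url in (
        "http://example.com/search/results?q=foo",
        "http://example.com/index.html",
    ):
        print(url, may_fetch(ROBOTS_TXT, "Nutch", url))
```

With a file like the placeholder above, a crawler that honors robots.txt would skip the search page even when it is the only URL it was asked for, which matches the behavior I'm seeing.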

-- 
http://www.linkedin.com/in/paultomblin
