nutch-user mailing list archives

From Magnús Skúlason <magg...@gmail.com>
Subject nutch reindexes all documents after each crawl
Date Fri, 30 Dec 2011 13:16:00 GMT
Hi,

I am using nutch to crawl a set of web sites and index them to solr,
using the default crawl command:
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5

I decided to use the default command since the sites I crawl are
relatively few (< 1000).

I have noticed that after each crawl, Nutch reindexes every document to
Solr, not only the ones fetched or parsed during the last crawl. Is
this normal behaviour? If so, is there any way to turn it off, i.e.
can I add a parameter to the command to tell Nutch to index only new
content?

If not, what would be the easiest way to modify this behaviour?

One solution that comes to mind would be:
bin/nutch crawl urls -depth 3 -topN 5
find crawl/segments/ -maxdepth 1 -mmin -300 -type d -name '20*' -exec \
  runtime/local/bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl \
  crawl/linkdb {} \;

That is, skip indexing in the crawl command and run the Solr indexing
only on segments changed in the last X minutes (here 300, the
estimated duration of my crawl). Would this produce the desired results?
If I do this, will I have to invert links before calling the solrindex
command, or does the crawl command take care of that?
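
To make the idea concrete, the segment selection step could be sketched
as a small shell function. This is only a sketch under assumptions: the
function name recent_segments is hypothetical, segment directories are
assumed to have timestamp names starting with "20" as in the find
command in this message, and -mmin / -maxdepth are extensions provided
by GNU and BSD find rather than POSIX.

```shell
#!/bin/sh
# Hypothetical helper: print the segment directories directly under $1
# that were modified within the last $2 minutes, one per line, sorted.
# Assumes segment names start with "20" (timestamp-style names).
recent_segments() {
    segments_dir=$1
    max_age_min=$2
    find "$segments_dir" -mindepth 1 -maxdepth 1 -type d \
        -name '20*' -mmin "-$max_age_min" | sort
}

# Example dry run (left commented out): print, rather than execute, the
# indexing command for each recent segment. The solrindex arguments are
# copied unchanged from the command earlier in this message.
# recent_segments crawl/segments 300 | while read -r seg; do
#     echo runtime/local/bin/nutch solrindex \
#         http://127.0.0.1:8983/solr/ crawl crawl/linkdb "$seg"
# done
```

Printing the commands first makes it possible to check which segments
would be selected before anything is sent to Solr.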

An additional question: how can I get a list of fetched URLs from a segment?

best regards,
Magnus
