Thanks for your answer. I use this script to crawl my sites:
$NUTCH_HOME/bin/nutch inject $NUTCH_HOME/bin/crawl1/crawldb \
  $NUTCH_HOME/bin/seedUrls
for((i=0; i < $depth; i++))
do
echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
$NUTCH_HOME/bin/nutch generate $NUTCH_HOME/bin/crawl1/crawldb \
  $NUTCH_HOME/bin/crawl1/segments $topN
if [ $? -ne 0 ]
then
echo "deepcrawler: Stopping at depth `expr $i + 1` of $depth. No more URLs to fetch."
break
fi
segment1=`ls -d $NUTCH_HOME/bin/crawl1/segments/* | tail -1`
$NUTCH_HOME/bin/nutch fetch $segment1
if [ $? -ne 0 ]
then
echo "deepcrawler: fetch $segment1 at depth `expr $i + 1` failed."
echo "deepcrawler: Deleting segment $segment1."
rm $RMARGS $segment1
continue
fi
$NUTCH_HOME/bin/nutch parse $segment1
$NUTCH_HOME/bin/nutch updatedb $NUTCH_HOME/bin/crawl1/crawldb $segment1
done
echo "----- Merge Segments (Step 5 of $steps) -----"
$NUTCH_HOME/bin/nutch mergesegs $NUTCH_HOME/bin/crawl1/MERGEDsegments \
  $NUTCH_HOME/bin/crawl1/segments/*
if [ "$safe" != "yes" ]
then
rm $RMARGS $NUTCH_HOME/bin/crawl1/segments
else
rm $RMARGS $NUTCH_HOME/bin/crawl1/BACKUPsegments
mv $MVARGS $NUTCH_HOME/bin/crawl1/segments \
  $NUTCH_HOME/bin/crawl1/BACKUPsegments
fi
mv $MVARGS $NUTCH_HOME/bin/crawl1/MERGEDsegments \
  $NUTCH_HOME/bin/crawl1/segments
echo "----- Invert Links (Step 6 of $steps) -----"
$NUTCH_HOME/bin/nutch invertlinks $NUTCH_HOME/bin/crawl1/linkdb \
  $NUTCH_HOME/bin/crawl1/segments/*
if [ "$safe" != "yes" ]
then
rm $RMARGS $NUTCH_HOME/bin/crawl1/NEWindexes
rm $RMARGS $NUTCH_HOME/bin/crawl1/index
else
rm $RMARGS $NUTCH_HOME/bin/crawl1/BACKUPindexes
rm $RMARGS $NUTCH_HOME/bin/crawl1/BACKUPindex
mv $MVARGS $NUTCH_HOME/bin/crawl1/NEWindexes \
  $NUTCH_HOME/bin/crawl1/BACKUPindexes
mv $MVARGS $NUTCH_HOME/bin/crawl1/index $NUTCH_HOME/bin/crawl1/BACKUPindex
fi
$NUTCH_HOME/bin/nutch solrindex http://$HOST:8983/solr/ \
  $NUTCH_HOME/bin/crawl1/crawldb $NUTCH_HOME/bin/crawl1/linkdb \
  $NUTCH_HOME/bin/crawl1/segments/*
But Nutch doesn't crawl all the pages of each site. For example, with
topN=1000, Nutch crawls 700 pages from site1, 250 from site2, 40 from
site3 and 10 from site4. I want Nutch to crawl 1000 pages from each
site. Please help.
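
In case it helps to reproduce the numbers above, here is a rough sketch of how per-site counts like these can be tallied. It assumes a plain text list of fetched URLs, one per line. The function name count_per_host and the file fetched_urls.txt are made up for illustration and are not part of Nutch.

```shell
# Hypothetical helper, not part of Nutch: tally URLs per host.
# Reads URLs (one per line) on stdin and prints "count host",
# sorted by count (descending), then by host name.
count_per_host() {
  # With "/" as the field separator, "http://site1.com/a" splits into
  # "http:", "", "site1.com", "a", so the host is field 3.
  awk -F/ '/^https?:\/\// { hosts[$3]++ }
           END { for (h in hosts) print hosts[h], h }' |
    sort -k1,1nr -k2,2
}
```

It would be called as `count_per_host < fetched_urls.txt`. The counts it prints add up to the total for the run, which in my case is the topN value rather than topN per site.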
--
View this message in context: http://lucene.472066.n3.nabble.com/how-give-several-sites-to-nutch-to-crawl-tp3556697p3558152.html
Sent from the Nutch - User mailing list archive at Nabble.com.