nutch-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From mina <tahereganji...@gmail.com>
Subject Re: how give several sites to nutch to crawl?
Date Sat, 03 Dec 2011 22:28:43 GMT
thanks for your answer. i use this script to crawl my sites:

$NUTCH_HOME/bin/nutch inject $NUTCH_HOME/bin/crawl1/crawldb
$NUTCH_HOME/bin/seedUrls
for((i=0; i < $depth; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
$NUTCH_HOME/bin/nutch generate $NUTCH_HOME/bin/crawl1/crawldb
$NUTCH_HOME/bin/crawl1/segments $topN

 if [ $? -ne 0 ]
  then
    echo "deepcrawler: Stopping at depth $depth. No more URLs to fetch."
    break
  fi
  segment1=`ls -d $NUTCH_HOME/bin/crawl1/segments/* | tail -1`

$NUTCH_HOME/bin/nutch fetch $segment1
  if [ $? -ne 0 ]
 then
    echo "deepcrawler: fetch $segment1 at depth `expr $i + 1` failed."
    echo "deepcrawler: Deleting segment $segment1."
    rm $RMARGS $segment1
    continue
  fi
$NUTCH_HOME/bin/nutch parse $segment1
$NUTCH_HOME/bin/nutch updatedb $NUTCH_HOME/bin/crawl1/crawldb $segment1
done
echo "----- Merge Segments (Step 5 of $steps) -----"
$NUTCH_HOME/bin/nutch mergesegs $NUTCH_HOME/bin/crawl1/MERGEDsegments
$NUTCH_HOME/bin/crawl1/segments/*

if [ "$safe" != "yes" ]
then
  rm $RMARGS $NUTCH_HOME/bin/crawl1/segments

else
  rm $RMARGS $NUTCH_HOME/bin/crawl1/BACKUPsegments
  mv $MVARGS $NUTCH_HOME/bin/crawl1/segments
$NUTCH_HOME/bin/crawl1/BACKUPsegments

fi

mv $MVARGS $NUTCH_HOME/bin/crawl1/MERGEDsegments
$NUTCH_HOME/bin/crawl1/segments

echo "----- Invert Links (Step 6 of $steps) -----"
$NUTCH_HOME/bin/nutch invertlinks $NUTCH_HOME/bin/crawl1/linkdb
$NUTCH_HOME/bin/crawl1/segments/*

if [ "$safe" != "yes" ]
then
  rm $RMARGS $NUTCH_HOME/bin/crawl1/NEWindexes
  rm $RMARGS $NUTCH_HOME/bin/crawl1/index


else

  rm $RMARGS $NUTCH_HOME/bin/crawl1/BACKUPindexes
  rm $RMARGS $NUTCH_HOME/bin/crawl1/BACKUPindex
  mv $MVARGS $NUTCH_HOME/bin/crawl1/NEWindexes
$NUTCH_HOME/bin/crawl1/BACKUPindexes
  mv $MVARGS $NUTCH_HOME/bin/crawl1/index $NUTCH_HOME/bin/crawl1/BACKUPindex


fi

$NUTCH_HOME/bin/nutch solrindex http://$HOST:8983/solr/
$NUTCH_HOME/bin/crawl1/crawldb $NUTCH_HOME/bin/crawl1/linkdb
$NUTCH_HOME/bin/crawl1/segments/*



but nutch don't crawl all page in any site, for example when topN=1000,
nutch crawl 700 page from site1 and 250 from site2 and 40 from site3 and 10
page from site4. i want nutch crawl 1000 page from any site.help me.

--
View this message in context: http://lucene.472066.n3.nabble.com/how-give-several-sites-to-nutch-to-crawl-tp3556697p3558152.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Mime
View raw message