nutch-user mailing list archives

From Andrey Sapegin
Subject search not working with merged indexes (Total hits: 0)
Date Tue, 18 Jan 2011 11:28:59 GMT
Dear all.

I have a problem with my Nutch Internet crawl/recrawl script (I wanted to 
understand how it works, so I wrote it myself).

After I merge the indexes (merging segments seems to work fine), search 
no longer works for me:
    $ bin/nutch org.apache.nutch.searcher.NutchBean http
    Total hits: 0

Before recrawling, I was able to search (the index was located at crawl/indexes).

My script:
export JAVA_HOME=/usr/lib/jvm/java-6-sun

#Inject new urls
bin/nutch inject crawl/crawldb dmoz/urls
echo "new URLs injected (dmoz/urls)"

#generate segments
bin/nutch generate crawl/crawldb crawl/segments -topN $3
echo "segments generated"

#generate fetch-list
s1=`ls -d crawl/segments/2* | tail -1`
echo $s1
echo "fetch-list generated"

bin/nutch fetch $s1 -threads $2
echo "fetching done"

#update the database with results of fetch
bin/nutch updatedb crawl/crawldb $s1
echo "database updated"

#merge segments
bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
rm -r crawl/segments
mv crawl/MERGEDsegments crawl/segments
echo "segments merged"

#inverting links
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
echo "links inverted"

bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb crawl/segments/*
echo "indexing done"

#dedup - delete duplicate documents in the index
bin/nutch dedup crawl/NEWindexes
echo "dedup done"

#merging indexes
bin/nutch merge crawl/MERGEDindexes crawl/NEWindexes
echo "indexes merged"

# replace indexes with indexes_merged
mv --verbose crawl/indexes crawl/OLDindexes
mv --verbose crawl/MERGEDindexes crawl/indexes/part-00000

#clean up
rm -rf crawl/NEWindexes
rm -rf crawl/OLDindexes

What's wrong with the script?
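
One thing that may be worth checking is the final swap: crawl/indexes is moved aside first, so the parent directory of crawl/indexes/part-00000 may no longer exist when the second mv runs. The following is only a sketch that reproduces that sequence in a scratch directory, without Nutch; the placeholder file name and the variables work, first_attempt and second_attempt are made up for illustration, and whether this is the actual cause of "Total hits: 0" is an assumption.

```shell
#!/bin/sh
# Sketch: reproduce the index swap in a scratch directory (no Nutch needed).
# All names below are illustrative; only the directory layout mirrors the script.
work=$(mktemp -d)
mkdir -p "$work/crawl/indexes/part-00000" "$work/crawl/MERGEDindexes"
touch "$work/crawl/MERGEDindexes/placeholder"   # stands in for real index files

cd "$work" || exit 1

# Same sequence as the script: move the old index directory aside...
mv crawl/indexes crawl/OLDindexes

# ...after which crawl/indexes is gone, so the target's parent is missing.
if mv crawl/MERGEDindexes crawl/indexes/part-00000 2>/dev/null; then
    first_attempt=ok
else
    first_attempt=failed
fi
echo "swap without mkdir: $first_attempt"

# Recreating the parent directory first lets the move go through.
mkdir -p crawl/indexes
if mv crawl/MERGEDindexes crawl/indexes/part-00000; then
    second_attempt=ok
else
    second_attempt=failed
fi
echo "swap with mkdir: $second_attempt"
```

If the first move fails like this in the real script, crawl/MERGEDindexes would be left in place and crawl/indexes would not exist afterwards, which might explain the empty result.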

Thank you in advance,
Kind Regards,


Andrey Sapegin,
Software Developer,

Unister GmbH
