Message-ID: <4D35797B.10303@unister-gmbh.de>
Date: Tue, 18 Jan 2011 12:28:59 +0100
From: Andrey Sapegin
To: user@nutch.apache.org
Subject: search not working with merged indexes (Total hits: 0)

Dear all,

I have a problem with my Nutch Internet crawl/recrawl script (I wanted to understand how it works, so I wrote it myself).
After I merge indexes (merging the segments seems to be fine), search doesn't work for me:

$ bin/nutch org.apache.nutch.searcher.NutchBean http
Total hits: 0

Before recrawling I was able to search (the index was located at crawl/indexes).

My script:
---------------------------------------------
#!/bin/bash
export JAVA_HOME=/usr/lib/jvm/java-6-sun

# Inject new URLs
bin/nutch inject crawl/crawldb dmoz/urls
echo "new URLs injected (dmoz/urls)"

# Generate segments
bin/nutch generate crawl/crawldb crawl/segments -topN $3
echo "segments generated"

# Pick the newest segment (fetch list)
s1=`ls -d crawl/segments/2* | tail -1`
echo $s1
echo "fetch-list generated"

# Fetch
bin/nutch fetch $s1 -threads $2
echo "fetching done"

# Update the database with the results of the fetch
bin/nutch updatedb crawl/crawldb $s1
echo "database updated"

# Merge segments
bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
rm -r crawl/segments
mv crawl/MERGEDsegments crawl/segments
echo "segments merged"

# Invert links
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
echo "links inverted"

# Index
bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb crawl/segments/*
echo "indexing done"

# Dedup - delete duplicate documents in the index
bin/nutch dedup crawl/NEWindexes
echo "dedup done"

# Merge indexes
bin/nutch merge crawl/MERGEDindexes crawl/NEWindexes
echo "indexes merged"

# Replace the old indexes with the merged ones
mv --verbose crawl/indexes crawl/OLDindexes
mv --verbose crawl/MERGEDindexes crawl/indexes/part-00000

# Clean up
rm -rf crawl/NEWindexes
rm -rf crawl/OLDindexes
-------------------------------------------------

What's wrong with the script?

Thank you in advance,
Kind Regards,

-- 
Andrey Sapegin,
Software Developer,
Unister GmbH

andrey.sapegin@unister-gmbh.de
www.unister.de
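P.S. In case it helps pin things down: the final index-replacement step of the script can be reproduced in isolation with empty throwaway directories (no Nutch involved; the directory names only mimic my layout):

```shell
#!/bin/bash
# Recreate the end-of-script situation with empty placeholder directories
# in a scratch dir (these stand in for the real Lucene index directories).
work=$(mktemp -d)
cd "$work"
mkdir -p crawl/indexes/part-00000 crawl/MERGEDindexes

# The same two moves as in the script:
mv --verbose crawl/indexes crawl/OLDindexes
mv --verbose crawl/MERGEDindexes crawl/indexes/part-00000
echo "second mv exit status: $?"

# What is actually left under crawl/ afterwards:
ls crawl
```

If the second mv fails here too (mv does not create the missing crawl/indexes parent directory after the first mv renamed it away), then the clean-up step would leave no crawl/indexes at all, which might be related to the empty search results.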