hadoop-common-dev mailing list archives

From Dennis Kubes <nutch-...@dragonflymc.com>
Subject Re: distributed search
Date Mon, 04 Dec 2006 21:52:00 GMT
The distributed searching section assumes that you have split the index
into multiple pieces and that there is one piece on each machine.  The
tutorial doesn't tell you how to split the indexes because there is no
tool to do that yet.  I was trying to lay out a general architecture for
how to do distributed searching instead of giving a step-by-step method.
What I would do for now is create multiple indexes of, say, 2-4 million
pages each and put each index on a separate machine.  You would also need
to copy all of the supporting database files, such as the crawl db and
link db, to each machine.
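
To make the duplicate-results problem concrete, here is a rough Python
sketch.  It is not Nutch code: plain dicts stand in for the per-machine
indexes, and search_server, distributed_search and the example URLs are
made-up names for illustration only.  It just shows that querying N
copies of the same index returns every hit N times, while querying
disjoint pieces returns each hit once.

```python
# Illustrative only: dicts of {url: page text} stand in for per-machine indexes.
# None of these names come from Nutch.

def search_server(index, term):
    """Return (url, score) hits from one server's local index."""
    return [(url, text.count(term)) for url, text in index.items() if term in text]

def distributed_search(servers, term):
    """Query every server and merge the hits, highest score first."""
    hits = []
    for index in servers:
        hits.extend(search_server(index, term))
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))

full_index = {
    "http://a.example/": "hadoop nutch hadoop",
    "http://b.example/": "nutch search",
    "http://c.example/": "hadoop search",
}

# Case 1: the same index copied to three servers.
copied = [full_index, full_index, full_index]

# Case 2: three disjoint pieces, one per server.
split = [{url: text} for url, text in full_index.items()]

dup_hits = distributed_search(copied, "hadoop")
split_hits = distributed_search(split, "hadoop")

print(len(dup_hits))    # every matching page appears once per server
print(split_hits)       # every matching page appears exactly once
```

The merge step here does no de-duplication, so the only way to get
clean results is for the pieces themselves not to overlap.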

Having a new index on each machine and having to create separate indexes
is not the most elegant way to implement this architecture.  The best
way that we have found is to have a splitter job that indexes and
splits the index and supporting databases into multiple parts on the
fly.  These parts are then moved out to the search servers.  We have
some base code for this, but it is not in the Nutch codebase yet.
If you want to go down this path, send me an email.
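
For a feel of what such a splitter has to guarantee, here is a small
Python sketch.  This is not our base code and not anything in Nutch; it
assumes, purely for illustration, that records are keyed by URL and that
a part is chosen by hashing the URL.  part_for and split_records are
hypothetical names.  The point is only that the index and the
supporting databases are split with the same rule, so everything for a
given URL ends up in the same part.

```python
import zlib

def part_for(url, num_parts):
    """Pick a part from a stable hash of the URL (crc32, not Python's salted hash())."""
    return zlib.crc32(url.encode("utf-8")) % num_parts

def split_records(records, num_parts):
    """Route each {url: value} record into one of num_parts dicts."""
    parts = [dict() for _ in range(num_parts)]
    for url, value in records.items():
        parts[part_for(url, num_parts)][url] = value
    return parts

# Stand-ins for the index and one supporting database, both keyed by URL.
index_docs = {
    "http://a.example/": "hadoop nutch",
    "http://b.example/": "nutch search",
    "http://c.example/": "hadoop search",
    "http://d.example/": "lucene index",
}
link_db = {url: ["http://inlink.example/"] for url in index_docs}

NUM_PARTS = 2
index_parts = split_records(index_docs, NUM_PARTS)
link_parts = split_records(link_db, NUM_PARTS)

for i in range(NUM_PARTS):
    # Each part holds the same URLs in the index and in the link db.
    print(i, sorted(index_parts[i]), sorted(link_parts[i]) == sorted(index_parts[i]))
```

A hash of the URL is just one possible rule; any rule works as long as
it is deterministic and is applied to the index and every supporting
database alike.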


Giuseppe Cannella wrote:
> in http://wiki.apache.org/nutch/NutchHadoopTutorial page
> at 'Distributed Searching' section
> i read:
> "On each of the search servers you would use the startup the distributed search server
> by using the nutch server command like this:
> bin/nutch server 1234 /d01/local/crawled"
> but /d01/local/crawled has been created only for the first server, how could i create
> it for all server?
> if i use "bin/hadoop dfs -copyToLocal crawled /d01/local/" on every server, the search
> finds N identical results (where N is how many servers are into the cluster)
