hadoop-common-user mailing list archives

From Dennis Kubes <nutch-...@dragonflymc.com>
Subject Re: lucene index on hadoop
Date Mon, 20 Nov 2006 23:51:00 GMT
I should have been more specific.  Create the indexes using MapReduce, 
then store them on the DFS using the indexer job.  To have a cluster of 
servers answer a single query, we have found it a best practice to 
split the index and its associated databases into smaller pieces, keep 
those pieces on the local file systems of the search nodes, and front 
them with distributed search servers.  A search website then uses the 
search servers to answer the query.  An example of this setup can be 
found in the NutchHadoopTutorial on the Nutch wiki.
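
For anyone who wants to try this, here is a rough sketch of the 
build-locally-then-copy step, written against the Lucene 2.x and 
Hadoop FileSystem APIs.  The class name and paths below are purely 
illustrative, not something taken from the tutorial:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class LocalIndexToDfs {
  public static void main(String[] args) throws Exception {
    String localIndexDir = "/tmp/index-part-00000"; // local scratch space
    String dfsIndexPath  = "/indexes/part-00000";   // destination on the DFS

    // 1. Build the index against the local file system, which is fast.
    IndexWriter writer =
        new IndexWriter(localIndexDir, new StandardAnalyzer(), true);
    Document doc = new Document();
    doc.add(new Field("url", "http://example.com/",
        Field.Store.YES, Field.Index.UN_TOKENIZED));
    writer.addDocument(doc);
    writer.optimize();
    writer.close();

    // 2. Copy the finished index up to the DFS in one pass.
    FileSystem fs = FileSystem.get(new Configuration());
    fs.copyFromLocalFile(new Path(localIndexDir), new Path(dfsIndexPath));
  }
}

The same copy can also be done from the command line with 
bin/hadoop dfs -put.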

Dennis

Doug Cutting wrote:
> Dennis Kubes wrote:
>> You would build the indexes on Hadoop but then move them to local 
>> file systems for searching.  You wouldn't want to perform searches 
>> using the DFS.
>
> Creating Lucene indexes directly in DFS would be pretty slow.  Nutch 
> creates them locally, then copies them to DFS to avoid this.
>
> One could create a Lucene Directory implementation optimized for 
> updates, where new files are written locally, and only flushed to DFS 
> when the Directory is closed.  When updating, Lucene creates and reads 
> lots of files that might not last very long, so there's little point 
> in replicating them on the network.  For many applications, that 
> should be considerably faster than either updating indexes directly in 
> HDFS, or copying the entire index locally, modifying it, then copying 
> it back.
>
> Lucene search works from HDFS-resident indexes, but it is slow, 
> especially if the indexes were created on a different node than the 
> one searching them.  (HDFS tries to write one replica of each block 
> locally on the node where it is created.)
>
> Doug
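
For the curious, below is a minimal sketch of the kind of Directory 
Doug describes, written against the Lucene 2.x Directory API.  The 
class is hypothetical (nothing like it ships with Lucene or Nutch): 
every operation is delegated to a local FSDirectory, and only close() 
pushes whatever files are still present up to the DFS, so short-lived 
intermediate files never touch the network:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;
import org.apache.lucene.store.Lock;

public class LocalBufferedDfsDirectory extends Directory {
  private final FSDirectory local;  // all reads and writes happen here
  private final FileSystem dfs;     // only written on close()
  private final String localPath;
  private final Path dfsPath;

  public LocalBufferedDfsDirectory(String localPath, Path dfsPath,
      Configuration conf) throws IOException {
    this.local = FSDirectory.getDirectory(localPath, true);
    this.dfs = FileSystem.get(conf);
    this.localPath = localPath;
    this.dfsPath = dfsPath;
  }

  // Normal Directory operations all go to the local file system.
  public String[] list() throws IOException { return local.list(); }
  public boolean fileExists(String n) throws IOException { return local.fileExists(n); }
  public long fileModified(String n) throws IOException { return local.fileModified(n); }
  public void touchFile(String n) throws IOException { local.touchFile(n); }
  public void deleteFile(String n) throws IOException { local.deleteFile(n); }
  public void renameFile(String from, String to) throws IOException { local.renameFile(from, to); }
  public long fileLength(String n) throws IOException { return local.fileLength(n); }
  public IndexOutput createOutput(String n) throws IOException { return local.createOutput(n); }
  public IndexInput openInput(String n) throws IOException { return local.openInput(n); }
  public Lock makeLock(String n) { return local.makeLock(n); }

  // Files deleted before close() never make the trip to the DFS; only
  // the segments that survive merging are replicated.
  public void close() throws IOException {
    local.close();
    dfs.copyFromLocalFile(new Path(localPath), dfsPath);
  }
}

An IndexWriter opened on such a Directory would merge and optimize at 
local-disk speed, and the finished segments would land on the DFS in 
one sequential copy when it is closed.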
