hadoop-common-user mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: lucene index on hadoop
Date Mon, 20 Nov 2006 19:02:42 GMT
Dennis Kubes wrote:
> You would build the indexes on hadoop but then move them to local file 
> systems for searching.  You wouldn't want to perform searches using the 
> DFS.

Creating Lucene indexes directly in DFS would be pretty slow.  Nutch 
creates them locally, then copies them to DFS to avoid this.

One could create a Lucene Directory implementation optimized for 
updates, where new files are written locally, and only flushed to DFS 
when the Directory is closed.  When updating, Lucene creates and reads 
lots of files that might not last very long, so there's little point in 
replicating them on the network.  For many applications, that should be 
considerably faster than either updating indexes directly in HDFS, or 
copying the entire index locally, modifying it, then copying it back.
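The write-locally, flush-on-close pattern described above could be sketched roughly as follows. This is not the real Lucene Directory API or a Hadoop client; LocalBufferedDirectory, stageDir, and publishDir are illustrative names, and an ordinary local directory stands in for DFS:

```java
import java.io.IOException;
import java.nio.file.*;

// Hypothetical sketch: intermediate files live on fast local disk, and
// only the files still present at close() are copied to the destination
// (a stand-in for DFS here), so short-lived merge files are never
// replicated over the network.
public class LocalBufferedDirectory implements AutoCloseable {
    private final Path stageDir;   // local scratch space for updates
    private final Path publishDir; // stands in for the DFS target

    public LocalBufferedDirectory(Path publishDir) throws IOException {
        this.stageDir = Files.createTempDirectory("lucene-stage");
        this.publishDir = Files.createDirectories(publishDir);
    }

    public void writeFile(String name, byte[] data) throws IOException {
        Files.write(stageDir.resolve(name), data);
    }

    public void deleteFile(String name) throws IOException {
        // A transient file deleted before close() never leaves local disk.
        Files.deleteIfExists(stageDir.resolve(name));
    }

    @Override
    public void close() throws IOException {
        // Flush the surviving files to the durable store in one pass.
        try (DirectoryStream<Path> files = Files.newDirectoryStream(stageDir)) {
            for (Path f : files) {
                Files.copy(f, publishDir.resolve(f.getFileName()),
                           StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("lucene-out");
        try (LocalBufferedDirectory dir = new LocalBufferedDirectory(out)) {
            dir.writeFile("segments_1", "live".getBytes());
            dir.writeFile("tmp_merge", "scratch".getBytes());
            dir.deleteFile("tmp_merge"); // short-lived file, dropped before close
        }
        System.out.println(Files.exists(out.resolve("segments_1"))); // expect: true
        System.out.println(Files.exists(out.resolve("tmp_merge")));  // expect: false
    }
}
```

A real implementation would subclass Lucene's Directory and write to HDFS on close, but the payoff is the same: transient files created during merging pay only local I/O, and network replication happens once, for the final segment files.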

Lucene search works from HDFS-resident indexes, but it is slow, especially 
if the indexes were created on a different node than the one searching 
them.  (HDFS tries to write one replica of each block locally on the 
node where it is created.)

