hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Doug Cutting <cutt...@apache.org>
Subject Re: Creating Lucene index in Hadoop
Date Mon, 16 Mar 2009 21:36:48 GMT
Ning Li wrote:
> With
> http://issues.apache.org/jira/browse/HADOOP-4801, however, it may
> become feasible to search on HDFS directly.

I don't think HADOOP-4801 is required.  It would help, certainly, but 
it's so fraught with security and other issues that I doubt it will be 
committed anytime soon.

What would probably help HDFS random access performance for Lucene 
significantly would be:
  1. A cache of connections to datanodes, so that each seek() does not 
require an open().  If we move HDFS data transfer to be RPC-based (see, 
e.g., http://issues.apache.org/jira/browse/HADOOP-4386), then this will 
come for free, since RPC already caches connections.  We hope to do this 
for Hadoop 1.0, so that we use a single transport for all Hadoop's core 
operations, to simplify security.
  2. A local cache of read-only HDFS data, equivalent to kernel's buffer 
cache.  This might be implemented as a Lucene Directory that keeps an 
LRU cache of buffers from a wrapped filesystem, perhaps a subclass of 

With these, performance would still be slower than a local drive, but 
perhaps not so dramatically.


View raw message