hadoop-common-user mailing list archives

From Ning Li <ning.li...@gmail.com>
Subject Re: Creating Lucene index in Hadoop
Date Mon, 16 Mar 2009 23:21:35 GMT
1 sounds good. But for 2:
  - Won't it raise security concerns as well? Or is this not a general
local cache?
  - You are referring to caching in RAM, not caching in the local FS,
right? In general, a Lucene index can be quite large. We may
have to cache a lot of data to reach a reasonable hit ratio...


On Mon, Mar 16, 2009 at 5:36 PM, Doug Cutting <cutting@apache.org> wrote:
> Ning Li wrote:
>> With
>> http://issues.apache.org/jira/browse/HADOOP-4801, however, it may
>> become feasible to search on HDFS directly.
> I don't think HADOOP-4801 is required.  It would help, certainly, but it's
> so fraught with security and other issues that I doubt it will be committed
> anytime soon.
> What would probably help HDFS random access performance for Lucene
> significantly would be:
>  1. A cache of connections to datanodes, so that each seek() does not
> require an open().  If we move HDFS data transfer to be RPC-based (see,
> e.g., http://issues.apache.org/jira/browse/HADOOP-4386), then this will come
> for free, since RPC already caches connections.  We hope to do this for
> Hadoop 1.0, so that we use a single transport for all Hadoop's core
> operations, to simplify security.
>  2. A local cache of read-only HDFS data, equivalent to the kernel's buffer
> cache.  This might be implemented as a Lucene Directory that keeps an LRU
> cache of buffers from a wrapped filesystem, perhaps a subclass of
> RAMDirectory.
> With these, performance would still be slower than a local drive, but
> perhaps not so dramatically.
> Doug
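
The LRU buffer cache in point 2 could look roughly like the sketch below. It
is only an illustration under stated assumptions: LruBlockCache and
BlockSource are hypothetical names, not Hadoop or Lucene classes, and
BlockSource stands in for a positioned read of one fixed-size block from the
wrapped filesystem. A real version would sit behind a Lucene Directory, as
suggested above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: LRU cache of blocks in front of a slow reader. */
class LruBlockCache {
    /** Assumed stand-in for reading one block from the wrapped filesystem. */
    interface BlockSource {
        byte[] readBlock(long blockIndex);
    }

    private final BlockSource source;
    private final Map<Long, byte[]> blocks;
    private long hits;
    private long misses;

    LruBlockCache(BlockSource source, final int maxBlocks) {
        this.source = source;
        // accessOrder = true keeps entries ordered from least to most
        // recently used, so the "eldest" entry is the LRU one.
        this.blocks = new LinkedHashMap<Long, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
                return size() > maxBlocks;
            }
        };
    }

    /** Returns the block, reading it from the source only on a miss. */
    synchronized byte[] getBlock(long blockIndex) {
        byte[] block = blocks.get(blockIndex);
        if (block == null) {
            misses++;
            // Simplification: the lock is held during the slow read.
            block = source.readBlock(blockIndex);
            blocks.put(blockIndex, block);
        } else {
            hits++;
        }
        return block;
    }

    synchronized long hits() { return hits; }

    synchronized long misses() { return misses; }
}
```

The hits() and misses() counters make it possible to measure the hit ratio
for a given cache size, which is the concern raised in the reply above.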

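The connection cache in point 1 can be sketched in the same spirit. This is
a minimal illustration, not HDFS client code: ConnectionCache, Connection
and Connector are hypothetical names, and the datanode is identified by a
plain string. The idea is only that repeated seeks against the same datanode
reuse one open connection instead of opening a new one each time.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: reuse one open connection per datanode. */
class ConnectionCache {
    /** Assumed stand-in for an open data-transfer connection. */
    interface Connection {
        byte[] read(long offset, int length);
    }

    /** Assumed stand-in for the expensive open() call. */
    interface Connector {
        Connection open(String datanode);
    }

    private final Connector connector;
    private final Map<String, Connection> connections =
        new HashMap<String, Connection>();
    private int opens;

    ConnectionCache(Connector connector) {
        this.connector = connector;
    }

    /** Returns a cached connection, opening one only on first use. */
    synchronized Connection get(String datanode) {
        Connection c = connections.get(datanode);
        if (c == null) {
            c = connector.open(datanode);
            opens++;
            connections.put(datanode, c);
        }
        return c;
    }

    /** Number of times the underlying open() was actually called. */
    synchronized int opens() { return opens; }
}
```

A real cache would also need to close idle connections and replace broken
ones; both are left out here to keep the sketch short.
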