Date: Mon, 16 Mar 2009 19:21:35 -0400
Subject: Re: Creating Lucene index in Hadoop
From: Ning Li
To: core-user@hadoop.apache.org

1 is good. But for 2:
  - Won't it have a security concern as well? Or is this not a general
local cache?
  - You are referring to caching in RAM, not caching in local FS,
right? In general, a Lucene index size could be quite large. We may
have to cache a lot of data to reach a reasonable hit ratio...

Cheers,
Ning


On Mon, Mar 16, 2009 at 5:36 PM, Doug Cutting wrote:
> Ning Li wrote:
>>
>> With
>> http://issues.apache.org/jira/browse/HADOOP-4801, however, it may
>> become feasible to search on HDFS directly.
>
> I don't think HADOOP-4801 is required.  It would help, certainly, but it's
> so fraught with security and other issues that I doubt it will be committed
> anytime soon.
>
> What would probably help HDFS random access performance for Lucene
> significantly would be:
>  1. A cache of connections to datanodes, so that each seek() does not
> require an open().  If we move HDFS data transfer to be RPC-based (see,
> e.g., http://issues.apache.org/jira/browse/HADOOP-4386), then this will come
> for free, since RPC already caches connections.  We hope to do this for
> Hadoop 1.0, so that we use a single transport for all Hadoop's core
> operations, to simplify security.
>  2. A local cache of read-only HDFS data, equivalent to kernel's buffer
> cache.  This might be implemented as a Lucene Directory that keeps an LRU
> cache of buffers from a wrapped filesystem, perhaps a subclass of
> RAMDirectory.
>
> With these, performance would still be slower than a local drive, but
> perhaps not so dramatically.
>
> Doug
>
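
For reference, the second suggestion in the quoted message (an LRU cache of buffers in front of a wrapped filesystem) could look roughly like the sketch below. It is an illustration only: it uses neither the Lucene Directory API nor HDFS, and the names BlockSource, LruBlockCache, readBlock and readByte are hypothetical, invented for this example. It assumes fixed-size blocks, read-only data, and single-threaded access.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for the wrapped (slow, remote) filesystem: returns one fixed-size block. */
interface BlockSource {
    byte[] readBlock(String file, long blockIndex, int blockSize);
}

/** LRU cache of fixed-size blocks in front of a BlockSource. Not thread-safe. */
final class LruBlockCache {
    private final BlockSource source;
    private final int blockSize;
    private final Map<String, byte[]> blocks;
    private long hits;
    private long misses;

    LruBlockCache(BlockSource source, int blockSize, final int maxBlocks) {
        this.source = source;
        this.blockSize = blockSize;
        // accessOrder=true keeps entries ordered from least to most recently used,
        // so the "eldest" entry is always the LRU block.
        this.blocks = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > maxBlocks;
            }
        };
    }

    /** Reads one byte at an absolute file position, loading the enclosing block on a miss. */
    byte readByte(String file, long position) {
        long blockIndex = position / blockSize;
        String key = file + "#" + blockIndex;
        byte[] block = blocks.get(key); // get() also marks the block as recently used
        if (block == null) {
            misses++;
            block = source.readBlock(file, blockIndex, blockSize);
            blocks.put(key, block); // may evict the least recently used block
        } else {
            hits++;
        }
        return block[(int) (position % blockSize)];
    }

    long hits() { return hits; }

    long misses() { return misses; }
}
```

The cache holds at most blockSize * maxBlocks bytes, so the hit-ratio concern raised in the reply comes down to how that budget compares with the part of the index that queries actually touch. The hits() and misses() counters are there to measure that on a real workload.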