Message-ID: <4513E3D4.7050906@mikemccandless.com>
Date: Fri, 22 Sep 2006 09:23:32 -0400
From: Michael McCandless
To: java-dev@lucene.apache.org
Subject: Re: Distributed Indexes, Searches and HDFS
References: <34cc3b0a0609210751w66bc8835y99c1747976186ef7@mail.gmail.com> <34cc3b0a0609211544v465b04fx88d46dc61351bec5@mail.gmail.com>

I think this is a great question ("what's the best way to really scale
up Lucene?").  I don't have a lot of experience in that area so I'll
defer to others (and I'm eager to learn myself!).  I think
understanding Solr's overall approach (whose design I believe came out
of the thread you've referenced) is also a good step here.  Even if
you can't re-use the hard links trick, you might be able to reuse its
snapshotting & index distribution protocol.

However, I have been working on some "bottom-up" improvements to
Lucene (getting native OS locking working and [separate but related]
"lock-less commits") that I think could be related to some of the
issues you're seeing with HDFS -- see below:

> > The cronjob/link solution which is quite clean, doesn't work well in
> > a windows environment.  While it's my favorite, no dice... Rats.
>
> There may be hope yet for that on Windows.  Hard links work on
> Windows, but the only problem is that you can't rename/delete any
> links when the file is open.  Michael McCandless is working on a
> patch that would eliminate all renames (and deletes can be handled
> by deferring them).

Right, with the "lock-less commits" patch we never rename a file and
also never re-use a file name (ie, making Lucene's use of the
filesystem "write once").

> 1) Indexing and Searching Directly from HDFS
>
> Indexing to HDFS is possible with a patch if we don't use CFS.
> While not ideal performance-wise, it's reliable and takes care of data
> redundancy, component failure and means that I can have cheap small
> drives instead of a large expensive NAS.  It's also quite simple to
> implement (see Nutch's indexer.FsDirectory for the Directory
> implementation)

This is very interesting!  I don't know enough about HDFS (yet!).  On
a very quick read, I like that it's a "write once" filesystem because
it's a good match to lock-less commits.

> So I would have several indexes (ie 16) and the same number of
> indexers, and a searcher for each index (possibly in the same
> process) that searches each one directly from HDFS.  One problem I'm
> having is an occasional FileNotFoundException.  (Probably locking
> related)
>
> It comes out of the Searcher when I try to do a search while things
> are being indexed.  I'd be interested to know what exactly is
> happening when this exception is thrown, maybe I can design around
> it.  (Do synchronization at the appropriate times or similar)

That exception looks disturbingly similar to the ones Lucene hits on
NFS.  See here for gory details:

    http://issues.apache.org/jira/browse/LUCENE-673

The summary of that [long] issue is that these exceptions seem to be
due to cache staleness of Lucene's "segments" file (due to how the NFS
client does caching, even on NFS V4 client/server) and not in fact due
to locking (as had been previously assumed/expected).

The good news is that the lock-less commits fix resolves this, at
least in my testing so far (ie, it makes it possible to share a single
index over NFS).

I wonder if in HDFS a similar cause is at work?  HDFS is "write once"
but the current Lucene isn't (not until we can get lock-less commits
in).  For example, it re-uses the "segments" file.
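
To make the "write once" idea above concrete, here is a minimal sketch
in Java.  It is an illustration only, not code from the lock-less
commits patch: the class name, the "segments_<generation>" naming and
the plain-text payload are all assumptions made for the example.  Each
commit goes to a name that has never been used before, and a reader
opens the highest-numbered file it can see, so no file is ever
rewritten in place.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Illustration only: hypothetical names, not Lucene's actual classes or file format. */
public class WriteOnceCommits {
    // Assumed naming scheme for this sketch: segments_0, segments_1, ...
    private static final String PREFIX = "segments_";

    /** Returns the highest generation found in dir, or -1 if there is no commit yet. */
    public static long latestGeneration(Path dir) throws IOException {
        long latest = -1;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, PREFIX + "*")) {
            for (Path f : files) {
                String suffix = f.getFileName().toString().substring(PREFIX.length());
                try {
                    latest = Math.max(latest, Long.parseLong(suffix));
                } catch (NumberFormatException ignored) {
                    // Not a commit file written by this sketch; skip it.
                }
            }
        }
        return latest;
    }

    /** Writes a commit under a new name; CREATE_NEW fails instead of overwriting. */
    public static long commit(Path dir, String segmentList) throws IOException {
        long gen = latestGeneration(dir) + 1;
        Files.write(dir.resolve(PREFIX + gen),
                    segmentList.getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
        return gen;
    }

    /** Reads the newest commit visible in the directory listing. */
    public static String readLatest(Path dir) throws IOException {
        long gen = latestGeneration(dir);
        if (gen < 0) {
            throw new IOException("no commit found in " + dir);
        }
        return new String(Files.readAllBytes(dir.resolve(PREFIX + gen)), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("write-once-demo");
        commit(dir, "_0");
        commit(dir, "_0,_1");
        System.out.println(latestGeneration(dir) + " -> " + readLatest(dir));
    }
}
```

The sketch assumes a single writer and does not delete old commits.  It
also does not address a stale directory listing on the client, which
would need a retry or similar on top.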

I think even if lock-less commits ("write once") enable sharing a
single copy of the index over remote filesystems like HDFS, NFS or
SMB/CIFS, whether that's performant enough (vs replicating copies to
local filesystems that are presumably quite a bit faster at IO, at the
expense of local storage consumed) would still be a big open question.

Mike