Message-ID: <45623F64.2040109@dragonflymc.com>
Date: Mon, 20 Nov 2006 17:51:00 -0600
From: Dennis Kubes
To: hadoop-user@lucene.apache.org
Subject: Re: lucene index on hadoop
In-Reply-To: <4561FBD2.5050206@apache.org>

I should have been more specific.
Create the indexes using MapReduce, then store them on the DFS using the indexer job. To have clusters of servers answer a single query, we have found a best practice to be splitting the index and associated databases into smaller pieces, keeping those pieces on the local file system, and fronting them with distributed search servers. A search website then uses the search servers to answer the query. An example of this setup can be found in the NutchHadoopTutorial on the Nutch wiki.

Dennis

Doug Cutting wrote:
> Dennis Kubes wrote:
>> You would build the indexes on Hadoop but then move them to local
>> file systems for searching. You wouldn't want to perform searches
>> using the DFS.
>
> Creating Lucene indexes directly in DFS would be pretty slow. Nutch
> creates them locally, then copies them to DFS to avoid this.
>
> One could create a Lucene Directory implementation optimized for
> updates, where new files are written locally and only flushed to DFS
> when the Directory is closed. When updating, Lucene creates and reads
> lots of files that might not last very long, so there's little point
> in replicating them on the network. For many applications, that
> should be considerably faster than either updating indexes directly
> in HDFS, or copying the entire index locally, modifying it, and
> copying it back.
>
> Lucene search works from HDFS-resident indexes, but it is slow,
> especially if the indexes were created on a different node than the
> one searching them. (HDFS tries to write one replica of each block
> locally on the node where it is created.)
>
> Doug
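The split-index, scatter-gather setup described above can be sketched roughly as follows. This is a hypothetical illustration only: the class and method names are made up, and each "shard" is an in-memory map standing in for a real distributed search server that would hold one index slice and be queried over RPC.

```java
import java.util.*;
import java.util.stream.*;

public class ShardedSearch {
    // one entry per matching document: doc id plus its score for the query
    record Hit(String doc, double score) {}

    // stand-in for one distributed search server holding one index slice;
    // a real deployment would make a remote call to a search server here
    static List<Hit> searchShard(Map<String, Double> shard) {
        return shard.entrySet().stream()
                .map(e -> new Hit(e.getKey(), e.getValue()))
                .collect(Collectors.toList());
    }

    // front-end: fan the query out to every shard, merge the partial
    // results, and keep the top-k hits overall
    static List<Hit> search(List<Map<String, Double>> shards, int k) {
        return shards.stream()
                .flatMap(s -> searchShard(s).stream())
                .sorted(Comparator.comparingDouble((Hit h) -> -h.score()))
                .limit(k)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Double> shard1 = Map.of("doc1", 0.9, "doc2", 0.3);
        Map<String, Double> shard2 = Map.of("doc3", 0.7);
        for (Hit h : search(List.of(shard1, shard2), 2)) {
            System.out.println(h.doc() + " " + h.score());
        }
    }
}
```

The point of the pattern is that each shard only searches its own local slice of the index, so no query ever touches the DFS.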
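Doug's proposed write-locally, flush-on-close Directory could look roughly like the sketch below. This is not Lucene's actual Directory API; the class name and methods are assumptions, and another local path stands in for DFS so the example is self-contained. The idea it demonstrates is that short-lived intermediate files live and die on local disk, and only the files still present at close time are copied to the durable store.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.stream.Stream;

/** Hypothetical stand-in for a Lucene Directory that writes locally
 *  and flushes surviving files to DFS only on close. "remote" here
 *  is just another local path standing in for DFS. */
public class LocalThenRemoteDir implements AutoCloseable {
    private final Path local;   // fast scratch space (local disk)
    private final Path remote;  // durable store (DFS in the real design)

    public LocalThenRemoteDir(Path local, Path remote) throws IOException {
        this.local = Files.createDirectories(local);
        this.remote = Files.createDirectories(remote);
    }

    /** All intermediate writes stay on the local disk. */
    public void write(String name, byte[] data) throws IOException {
        Files.write(local.resolve(name), data);
    }

    /** Short-lived files are deleted before they ever touch the network. */
    public void delete(String name) throws IOException {
        Files.deleteIfExists(local.resolve(name));
    }

    /** Only files still present at close time are copied out. */
    @Override
    public void close() throws IOException {
        try (Stream<Path> files = Files.list(local)) {
            for (Path p : (Iterable<Path>) files::iterator) {
                Files.copy(p, remote.resolve(p.getFileName().toString()),
                        StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path local = Files.createTempDirectory("local");
        Path remote = Files.createTempDirectory("remote");
        try (LocalThenRemoteDir dir = new LocalThenRemoteDir(local, remote)) {
            dir.write("segments", "seg-data".getBytes());
            dir.write("tmp.frq", "scratch".getBytes());
            dir.delete("tmp.frq");   // never replicated to the remote store
        }
        System.out.println(Files.exists(remote.resolve("segments")));
        System.out.println(Files.exists(remote.resolve("tmp.frq")));
    }
}
```

This is why the approach beats updating directly in HDFS: the many transient files an update produces never generate network traffic or replication work.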