Message-ID: <45623F64.2040109@dragonflymc.com>
Date: Mon, 20 Nov 2006 17:51:00 -0600
From: Dennis Kubes
To: hadoop-user@lucene.apache.org
Subject: Re: lucene index on hadoop
In-Reply-To: <4561FBD2.5050206@apache.org>

I should have been more specific.
Create the indexes using MapReduce, then store them on the DFS using the indexer job. To have clusters of servers answer a single query, we have found a best practice to be splitting the index and associated databases into smaller pieces, keeping those pieces on the local file system, and fronting them with distributed search servers. A search website then uses the search servers to answer the query. An example of this setup can be found in the NutchHadoopTutorial on the Nutch wiki.

Dennis

Doug Cutting wrote:
> Dennis Kubes wrote:
>> You would build the indexes on Hadoop but then move them to local
>> file systems for searching. You wouldn't want to perform searches
>> using the DFS.
>
> Creating Lucene indexes directly in DFS would be pretty slow. Nutch
> creates them locally, then copies them to DFS to avoid this.
>
> One could create a Lucene Directory implementation optimized for
> updates, where new files are written locally and only flushed to DFS
> when the Directory is closed. When updating, Lucene creates and reads
> lots of files that might not last very long, so there's little point
> in replicating them on the network. For many applications, that
> should be considerably faster than either updating indexes directly
> in HDFS, or copying the entire index locally, modifying it, and
> copying it back.
>
> Lucene search works from HDFS-resident indexes, but it is slow,
> especially if the indexes were created on a different node than the
> one searching them. (HDFS tries to write one replica of each block
> locally on the node where it is created.)
>
> Doug
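The split-index, scatter-gather setup described above can be sketched roughly as follows. This is a hypothetical illustration only: the class and method names are made up, and each "shard" is an in-memory map standing in for a real distributed search server that would hold one index slice and be queried over RPC.

```java
import java.util.*;
import java.util.stream.*;

public class ShardedSearch {
    // one entry per matching document: doc id plus its score for the query
    record Hit(String doc, double score) {}

    // stand-in for one distributed search server holding one index slice;
    // a real deployment would make a remote call to a search server here
    static List<Hit> searchShard(Map<String, Double> shard) {
        return shard.entrySet().stream()
                .map(e -> new Hit(e.getKey(), e.getValue()))
                .collect(Collectors.toList());
    }

    // front-end: fan the query out to every shard, merge the partial
    // results, and keep the top-k hits overall
    static List<Hit> search(List<Map<String, Double>> shards, int k) {
        return shards.stream()
                .flatMap(s -> searchShard(s).stream())
                .sorted(Comparator.comparingDouble((Hit h) -> -h.score()))
                .limit(k)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Double> shard1 = Map.of("doc1", 0.9, "doc2", 0.3);
        Map<String, Double> shard2 = Map.of("doc3", 0.7);
        for (Hit h : search(List.of(shard1, shard2), 2)) {
            System.out.println(h.doc() + " " + h.score());
        }
    }
}
```

The point of the pattern is that each shard only searches its own local slice of the index, so no query ever touches the DFS.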
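Doug's proposed write-locally, flush-on-close Directory could look roughly like the sketch below. This is not Lucene's actual Directory API; the class name and methods are assumptions, and another local path stands in for DFS so the example is self-contained. The idea it demonstrates is that short-lived intermediate files live and die on local disk, and only the files still present at close time are copied to the durable store.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.stream.Stream;

/** Hypothetical stand-in for a Lucene Directory that writes locally
 *  and flushes surviving files to DFS only on close. "remote" here
 *  is just another local path standing in for DFS. */
public class LocalThenRemoteDir implements AutoCloseable {
    private final Path local;   // fast scratch space (local disk)
    private final Path remote;  // durable store (DFS in the real design)

    public LocalThenRemoteDir(Path local, Path remote) throws IOException {
        this.local = Files.createDirectories(local);
        this.remote = Files.createDirectories(remote);
    }

    /** All intermediate writes stay on the local disk. */
    public void write(String name, byte[] data) throws IOException {
        Files.write(local.resolve(name), data);
    }

    /** Short-lived files are deleted before they ever touch the network. */
    public void delete(String name) throws IOException {
        Files.deleteIfExists(local.resolve(name));
    }

    /** Only files still present at close time are copied out. */
    @Override
    public void close() throws IOException {
        try (Stream<Path> files = Files.list(local)) {
            for (Path p : (Iterable<Path>) files::iterator) {
                Files.copy(p, remote.resolve(p.getFileName().toString()),
                        StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path local = Files.createTempDirectory("local");
        Path remote = Files.createTempDirectory("remote");
        try (LocalThenRemoteDir dir = new LocalThenRemoteDir(local, remote)) {
            dir.write("segments", "seg-data".getBytes());
            dir.write("tmp.frq", "scratch".getBytes());
            dir.delete("tmp.frq");   // never replicated to the remote store
        }
        System.out.println(Files.exists(remote.resolve("segments")));
        System.out.println(Files.exists(remote.resolve("tmp.frq")));
    }
}
```

This is why the approach beats updating directly in HDFS: the many transient files an update produces never generate network traffic or replication work.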