Message-ID: <440BD048.4060806@aol.com>
Date: Mon, 06 Mar 2006 11:31:44 +0530
From: Prasenjit Mukherjee
To: java-user@lucene.apache.org
Subject: Distributed Lucene..

I already have an implementation of a distributed crawler farm, where crawler instances are running on different boxes. I want to come up with a distributed indexing scheme using Lucene that takes advantage of my crawlers' distributed nature. Here is what I am thinking. The crawlers will analyze and tokenize the content of every URL (aka document) and create the following data for each URL document:

>
> ......
>

Then, based on some partitioning function, the crawlers can send a subset of the tokens (aka terms) to the appropriate indexing server. The partitioning function can be as simple as one based on the starting character of each term. Let's say we have 5 indexers; we would distribute the indexing data in the following manner:

Indexer1 - a-e
Indexer2 - f-j
Indexer3 - k-o
Indexer4 - p-t
Indexer5 - u-z

Does this make any sense? I would also like to know if there are other ways to distribute Lucene's indexing/searching.

thanks,
prasen
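Below is a minimal sketch of the character-range partitioning described above, in Java since the indexers would be Lucene-based. The class name TermRangePartitioner, the integer "indexer id", and the handling of non-alphabetic terms are illustrative assumptions, not an existing Lucene API; only the 5-way a-e/f-j/k-o/p-t/u-z split comes from the message.

// Illustrative sketch of the proposed first-character partitioning.
// Assumption: 5 indexer nodes identified by ids 0..4; terms that do not
// start with a Latin letter are routed to the last bucket.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TermRangePartitioner {

    private static final int NUM_INDEXERS = 5;

    // Maps a term to an indexer id based on its first character:
    // a-e -> 0, f-j -> 1, k-o -> 2, p-t -> 3, u-z -> 4.
    public static int indexerFor(String term) {
        char c = Character.toLowerCase(term.charAt(0));
        if (c < 'a' || c > 'z') {
            return NUM_INDEXERS - 1;              // digits, punctuation, etc.
        }
        int bucket = (c - 'a') / 5;               // five-letter ranges
        return Math.min(bucket, NUM_INDEXERS - 1); // 'z' folds into the u-z bucket
    }

    // Groups one document's terms by target indexer, so a crawler can ship
    // one batch per indexing server.
    public static Map<Integer, List<String>> partition(List<String> terms) {
        Map<Integer, List<String>> byIndexer = new TreeMap<>();
        for (String term : terms) {
            byIndexer.computeIfAbsent(indexerFor(term), k -> new ArrayList<>())
                     .add(term);
        }
        return byIndexer;
    }

    public static void main(String[] args) {
        List<String> terms = List.of("apache", "lucene", "query", "zebra", "index");
        // apache -> 0 (a-e), index -> 1 (f-j), lucene -> 2 (k-o),
        // query -> 3 (p-t), zebra -> 4 (u-z)
        System.out.println(partition(terms));
    }
}

One design note on this kind of split: first letters of terms are not uniformly distributed, so the five partitions will see uneven index sizes and load, and a multi-term query has to fan out to every indexer whose range covers one of its terms before results can be merged.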