lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Prasenjit Mukherjee <>
Subject Distributed Lucene..
Date Mon, 06 Mar 2006 06:01:44 GMT
I already have an implementation of a distributed crawler farm, where 
crawler instances are runnign on different boxes. I want to come up with 
a distributed indexing scheme using lucene and take advantage of the 
distributed nature of my crawlers' distributed nature. Here is what I am 

Crawlers will analyze and tokenize the content for every URLs(aka 
Documents) and create the following data for every url document:
<url-id,  <field1, <term-f1-t1,term-f1-t2,term-f1-t3 etc.>>   <field-2,

<term-f2-t1,term-f2-t2,term-f2-t3, >>  ...... >

And then based on some partitioning function the carwlers can send a 
subset of tokens(aka terms)  to the indexing server. The partitioning 
function can be as simple as based on the starting character of the 
terms.  Lets say if we have 5 indexers, we will distribute the indexing 
data in the following manner :

Indexer1 - a-e
Indexer2 - f-j
Indexer3 - k-o
Indexer4 - p-t
Indexer5 - u-z

Does it make any sense ? Also would like to know if there are other ways 
to distribute lucene's indexing/searching  ?


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message