lucene-java-user mailing list archives

From Nader Henein <>
Subject Re: Index Replication / Clustering
Date Sun, 26 Jun 2005 11:53:57 GMT
As far as indexing is concerned, a simple way of tracking a clustered 
system is to create autonomous indices that report to a central 
repository. Create a table in the DB with a row per document (you 
have a unique document ID, right?) and a column per server node. The 
columns act as indexing flags: P for pending, I for indexed and F for 
failed. A row is added whenever a new document is added to the 
persistent store or tagged for indexing (trigger style). Each node reads 
the persistent store using its own scheduler, collects the XML file for 
indexing, and modifies its server flag afterwards.

Alternatively, and this one is much easier but you will have issues with 
atomicity, you can just rsync the XML files to a directory on each 
clustered server and let the servers pick up the files and index 
them. Quite simply, the first is a pull architecture and the second is 
a push.
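
To make the pull variant concrete, here is a minimal sketch of the 
tracking table, using Python and an in-memory SQLite database purely for 
illustration. The table name index_status, the node names and all the 
function names are assumptions of mine, not part of our actual setup, 
and index_fn stands in for whatever fetches the XML and feeds it to 
Lucene.

```python
import sqlite3

# Hypothetical node names; each one becomes a flag column in the table.
NODES = ["node_a", "node_b"]


def create_tracking_table(conn):
    """One row per document, one flag column per node (P, I or F)."""
    cols = ", ".join(f"{n} TEXT NOT NULL DEFAULT 'P'" for n in NODES)
    conn.execute(f"CREATE TABLE index_status (doc_id TEXT PRIMARY KEY, {cols})")


def register_document(conn, doc_id):
    """Called when a document is stored or tagged for indexing.
    Every node flag starts as 'P' (pending)."""
    conn.execute("INSERT INTO index_status (doc_id) VALUES (?)", (doc_id,))


def pending_for(conn, node):
    """Document IDs that this node has not indexed yet."""
    if node not in NODES:
        raise ValueError(f"unknown node: {node}")
    rows = conn.execute(
        f"SELECT doc_id FROM index_status WHERE {node} = 'P' ORDER BY doc_id"
    )
    return [r[0] for r in rows]


def run_node_pass(conn, node, index_fn):
    """One scheduler pass for a single node: pull pending documents,
    index each one, then record 'I' (indexed) or 'F' (failed)."""
    for doc_id in pending_for(conn, node):
        try:
            index_fn(doc_id)  # placeholder for the real indexing call
            flag = "I"
        except Exception:
            flag = "F"
        conn.execute(
            f"UPDATE index_status SET {node} = ? WHERE doc_id = ?",
            (flag, doc_id),
        )
    conn.commit()


def fully_indexed(conn):
    """Document IDs that every node reports as indexed."""
    where = " AND ".join(f"{n} = 'I'" for n in NODES)
    rows = conn.execute(
        f"SELECT doc_id FROM index_status WHERE {where} ORDER BY doc_id"
    )
    return [r[0] for r in rows]
```

Each node would call run_node_pass from its own scheduler, and a query 
like fully_indexed (or its inverse, rows still holding a P or F) gives 
you a starting point for checking that the indices stay in step.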

Does that help?

Nader Henein

Stephane Bailliez wrote:

> Nader Henein wrote:
>> Our setup is quite similar to yours, but in all honesty, you will 
>> need to do some form of batching on your updates, simply because you 
>> don't want to keep the IndexWriter open all the time.
> For now, the index writer is closed after each added document. It does 
> not seem to have a major overhead compared to keeping it open; at 
> most the overhead is 2x in my tests, which is acceptable for now and on 
> par with other commercial search engines they have been using. My 
> constraint is basically that the mergeFactor must be 1, but I honestly 
> think that it will need to be relaxed when the document rate 
> increases.
> There has been no tuning yet.
> I also have a quite specific document lifecycle. Incoming documents 
> are 5-10 KB XML, from which I'm only extracting 0.5-1 KB of data to be 
> indexed. These documents NEVER change. They are not updated, nor deleted.
> They are only deleted for archiving purposes because we keep only the 
> last 6 months of data.
>> As for clustering, we went through three iterations that keep x 
>> indexes parallelized on x servers, all of this with failover and 
>> index-independent synchronization with your persistent store. There 
>> was a little discussion about this a few weeks back, and I mentioned 
>> that your biggest pain will be maintaining the integrity of parallel 
>> indexes that are updated/deleted autonomously (atomic updates and 
>> deletes), but there are ways of running iterative checks to make sure 
>> that your indices stay clean.
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
> ---


Nader S. Henein
Senior Applications Architect

