lucene-java-user mailing list archives

From Otis Gospodnetic
Subject Re: Real time indexing and distribution to lucene on separate boxes (long)
Date Thu, 11 Mar 2004 13:39:03 GMT
I like option 3.  I've done it before, and it worked well.  I dealt
with very small indices, though, and if your indices are tens or
hundreds of gigabytes, this may be hard for you.
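For illustration, here is a minimal sketch of the swap in option 3, assuming a modern JVM (java.nio.file, Java 7+), a POSIX-like filesystem, and all three directories on the same volume.  The class name and the directory layout are hypothetical, not anything from Lucene.  Note that there is a brief moment between the two renames when the live path does not exist, so a searcher should only be reopened after swap() returns.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper (not part of Lucene): promotes a freshly synced
// index directory by renaming directories instead of copying files.
final class IndexDirectorySwap {
    private IndexDirectorySwap() {}

    /**
     * @param live     directory the searcher opens (must exist)
     * @param incoming fully synced new index, on the same filesystem
     * @param retired  where the old index is parked (must not exist yet)
     */
    static void swap(Path live, Path incoming, Path retired) throws IOException {
        // ATOMIC_MOVE makes each step a plain rename; it throws instead of
        // silently falling back to a copy if the paths span filesystems.
        Files.move(live, retired, StandardCopyOption.ATOMIC_MOVE);
        Files.move(incoming, live, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

The retired directory can be deleted, or reused as the next rsync target, once no searcher still has it open.  That second copy is where the 2x disk space goes.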

Option 4: search can be performed on an index that is being modified
(update, delete, insert, optimize).  You'd just have to make sure not
to create a new IndexSearcher too frequently if your index is being
modified often.  Just replace it every X index modifications or every
X minutes, and you'll be fine.
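A rough sketch of that refresh policy, under some assumptions: the holder class below is hypothetical and not a Lucene class, T stands in for IndexSearcher, and the Supplier stands in for whatever opens a new searcher on the index directory.  The clock is passed in explicitly only to keep the example easy to test.  It also skips the time-based reopen when nothing has changed, which is my own design choice.

```java
import java.util.function.Supplier;

// Hypothetical holder (not a Lucene class). T stands in for IndexSearcher;
// the Supplier stands in for "open a new searcher on the index".
final class RefreshingHolder<T> {
    private final Supplier<T> opener;
    private final int maxModifications; // the "X index modifications"
    private final long maxAgeMillis;    // the "X minutes", in milliseconds
    private T current;
    private int modificationsSinceOpen;
    private long openedAtMillis;

    RefreshingHolder(Supplier<T> opener, int maxModifications,
                     long maxAgeMillis, long nowMillis) {
        this.opener = opener;
        this.maxModifications = maxModifications;
        this.maxAgeMillis = maxAgeMillis;
        this.current = opener.get();
        this.openedAtMillis = nowMillis;
    }

    // Call after each update, delete, insert, or optimize on the index.
    synchronized void recordModification() {
        modificationsSinceOpen++;
    }

    // Returns the current instance, reopening it first if it is stale.
    synchronized T get(long nowMillis) {
        boolean tooManyChanges = modificationsSinceOpen >= maxModifications;
        boolean tooOld = modificationsSinceOpen > 0
                && nowMillis - openedAtMillis >= maxAgeMillis;
        if (tooManyChanges || tooOld) {
            current = opener.get();
            modificationsSinceOpen = 0;
            openedAtMillis = nowMillis;
        }
        return current;
    }
}
```

In real use you would call get(System.currentTimeMillis()) before each search.  Closing the replaced searcher is left out here, since in-flight queries may still be using it.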


--- "Kevin A. Burton" wrote:
> I'm curious to find out what others are doing in this situation.
> I have two boxes... the indexer and the searcher.  The indexer is
> taking documents, indexing them, creating indexes in a RAMDirectory
> (for efficiency), and then writing these indexes to disk as we begin
> to run out of memory.  Usually these aren't very big... 15 to 100 MB
> or so.
> Obviously I'm dividing the indexing and searching onto dedicated
> boxes to improve efficiency.  The real issue though is that the
> searchers need to be live all the time as indexes are being added at
> runtime.
> In case that wasn't clear: I actually have to push out fresh indexes
> WHILE users are searching them.  Not a very easy thing to do.
> Here's my question.  What are the optimum ways to then distribute
> these index segments to the secondary searcher boxes?  I don't want
> to use the MultiSearcher because it's slow once we have too many
> indexes (see my PS).
> Here's what I'm currently thinking:
> 1. Have the indexes sync'd to the searcher as shards directly.  This
> doesn't scale as I would have to use the MultiSearcher, which is slow
> when it has too many indexes.  (And ideally we would want an
> optimized index.)
> 2. Merge everything into one index on the indexer.  Lock the
> searcher, then copy over the new index via rsync.  The problem here
> is that the searcher would need to lock up while the sync is
> happening to prevent reads on the index.  If I do this often enough
> and the system is optimized I think I would only have to block for 5
> seconds or so, but that's STILL very long.
> 3. Have two directories on the searcher.  The indexer would then sync
> to a tmp directory and then at run time swap them via a rename once
> the sync is over.  The downside here is that this will take up 2x
> disk space on the searcher.  The upside is that the box will only
> slow down while the rsync is happening.
> 4. Do a LIVE index merge on the production box.  This might be an
> interesting approach.  The major question I have is whether you can
> do an optimize/merge on an index that's currently being used.  I
> *think* it might be possible but I'm not sure.  This isn't as fast as
> performing the merge on the indexer beforehand but it does have the
> benefits of both worlds.
> If anyone has any other ideas I would be all ears...
> PS.. Random question.  The performance of the MultiSearcher is
> M log(N), correct?  Where N is the number of documents in the index
> and M is the number of indexes?  Is this about right?
> -- 
> Please reply using PGP.
>     NewsMonster -
> Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
>        AIM/YIM - sfburtonator,  Web -
> GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
>   IRC - #infoanarchy | #p2p-hackers | #newsmonster
