lucene-dev mailing list archives

From Michael McCandless
Subject Re: Realtime Search for Social Networks Collaboration
Date Fri, 19 Sep 2008 12:30:14 GMT

Jason Rutherglen wrote:

> Mike,
> The other issue that will occur that I addressed is the field caches.
> The underlying smaller IndexReaders will need to be exposed because of
> the field caching.  Currently in ocean realtime search the individual
> readers are searched on using a MultiSearcher in order to search in
> parallel and reuse the field caches. How will field caching work with
> the IndexWriter approach?  It seems like it would need a dynamically
> growing field cache array?  That is a bit tricky.  By doing in memory
> merging in ocean, the field caches last longer and do not require
> growing arrays.

First off, I think the combination of LUCENE-1231 and LUCENE-831,  
which should result in FieldCache that is "distributed" down to each  
SegmentReader and much faster to initialize, should make incrementally  
updating the FieldCache much more efficient (ie, on calling  
IndexReader.reopen, it should only be the new segments that need to  
populate their FieldCache).
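
A minimal sketch of that per-segment idea, assuming nothing about the actual patches: `SketchSegment` and `PerSegmentCacheSketch` are hypothetical stand-ins, not Lucene classes. The cache is keyed by segment, so warming a reopened view only loads the segments it has not seen before.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a segment: a name plus one int value per document.
final class SketchSegment {
    final String name;
    final int[] values;

    SketchSegment(String name, int[] values) {
        this.name = name;
        this.values = values;
    }
}

// Hypothetical per-segment cache (not the Lucene FieldCache API).
final class PerSegmentCacheSketch {
    private final Map<String, int[]> cache = new HashMap<>();
    int loads = 0; // number of segments that actually had to be loaded

    int[] get(SketchSegment seg) {
        int[] cached = cache.get(seg.name);
        if (cached == null) {
            // Stands in for the expensive step of populating a field cache.
            cached = seg.values.clone();
            cache.put(seg.name, cached);
            loads++;
        }
        return cached;
    }

    // Simulates warming after a reopen: touch every segment in the new view.
    void warm(List<SketchSegment> view) {
        for (SketchSegment seg : view) {
            get(seg);
        }
    }
}

final class PerSegmentCacheDemo {
    public static void main(String[] args) {
        PerSegmentCacheSketch cache = new PerSegmentCacheSketch();
        List<SketchSegment> view = new ArrayList<>();
        view.add(new SketchSegment("_0", new int[] {1, 2}));
        view.add(new SketchSegment("_1", new int[] {3}));
        cache.warm(view);
        System.out.println(cache.loads); // 2

        // A "reopen" that picks up one new segment loads only that segment.
        view.add(new SketchSegment("_2", new int[] {4}));
        cache.warm(view);
        System.out.println(cache.loads); // 3
    }
}
```

The cost of a reopen in this sketch is proportional to the new segments only, which is the property described above.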

Hopefully these land before real-time search, because then I have more  
API flexibility to expose column-stride fields on the in-RAM  
documents.  There is still some trickiness, because an "ordinary"  
IndexWriter would never hold the column-stride fields in RAM.  They'd  
be flushed to the Directory, immediately per document, just like  
stored fields and term vectors are today.  So, maybe, the first  
RAMReader you get from the IndexWriter would load back in these  
fields, triggering IndexWriter to add to them as documents are added  
(maybe using exponentially growing arrays as the underlying store, or,  
perhaps separate array fragments, to prevent synchronization when  
reading from them), such that subsequent reopens simply resync their  
max docID.
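
To make the "separate array fragments" option concrete, here is a rough sketch under stated assumptions: a single writer thread, int-valued fields, and a hypothetical class name `FragmentedIntColumn` (nothing here is an existing Lucene API). Filled fragments are never copied or moved; the writer publishes each new value with a volatile write of the max docID, and a reader only looks at docIDs below the max docID it captured.

```java
import java.util.Arrays;

// Hypothetical append-only column of int values stored in fixed-size fragments.
// Assumes exactly one writer thread; any number of reader threads.
final class FragmentedIntColumn {
    private static final int FRAGMENT_SIZE = 4; // tiny, for demonstration only

    // Only the small outer array of references is copied when a fragment is added.
    private volatile int[][] fragments = new int[0][];
    private volatile int maxDoc = 0;

    // Writer side: append the value for the next docID.
    void add(int value) {
        int doc = maxDoc;
        int frag = doc / FRAGMENT_SIZE;
        int[][] local = fragments;
        if (frag == local.length) {
            int[][] grown = Arrays.copyOf(local, local.length + 1);
            grown[frag] = new int[FRAGMENT_SIZE];
            fragments = grown;
            local = grown;
        }
        local[frag][doc % FRAGMENT_SIZE] = value;
        maxDoc = doc + 1; // volatile write publishes the value to readers
    }

    // Reader side: capture (or "resync") the visible max docID.
    int maxDoc() {
        return maxDoc;
    }

    // Reader side: read a value, bounded by a previously captured max docID.
    int get(int doc, int snapshotMaxDoc) {
        if (doc < 0 || doc >= snapshotMaxDoc) {
            throw new IndexOutOfBoundsException("doc " + doc);
        }
        return fragments[doc / FRAGMENT_SIZE][doc % FRAGMENT_SIZE];
    }
}

final class FragmentedIntColumnDemo {
    public static void main(String[] args) {
        FragmentedIntColumn col = new FragmentedIntColumn();
        for (int v = 10; v <= 50; v += 10) {
            col.add(v);
        }
        int snapshot = col.maxDoc();                  // reader's view: docs 0..4
        col.add(60);                                  // not visible to the old snapshot
        System.out.println(col.get(4, snapshot));     // 50
        System.out.println(col.get(5, col.maxDoc())); // 60, after resyncing
    }
}
```

A single exponentially growing array would need readers to cope with the array being replaced on growth; fragments avoid that because existing data stays where it is.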

> How do you plan to handle rapidly deleting the docs of
> the disk segments?  Can the SegmentReader clone patch be used for
> this?

I was thinking we'd flush new .del files every time a reopen is  
called, but that could very well be costly.  Instead, we can keep the  
deletes pending in the SegmentReaders we're holding open, and then go  
back to flushing on IndexWriter's normal schedule.  Reopen then must  
only "materialize" any buffered deletes by Term & Query, unless we  
decide to move up that materialization into the actual delete call,  
since we will have SegmentReaders open anyway.  I think I'm leaning  
towards that approach... best to pay the cost as you go, instead of  
aggregated cost on reopen?
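
The two options can be compared with a small sketch. Everything in it is hypothetical (`PendingDeletesSketch` is not a Lucene class), each document carries a single term, and deletes apply to every document in the segment, which is a simplification. In eager mode the delete call resolves the term to docIDs right away; in lazy mode the term is buffered and resolved on reopen. Both end with the same deleted set; they differ only in where the cost is paid.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical holder of pending deletes for one open segment.
final class PendingDeletesSketch {
    private final String[] docTerms;               // one term per doc; stands in for postings
    private final BitSet deleted = new BitSet();   // materialized deletes, kept in RAM
    private final List<String> bufferedTerms = new ArrayList<>();
    private final boolean eager;

    PendingDeletesSketch(String[] docTerms, boolean eager) {
        this.docTerms = docTerms;
        this.eager = eager;
    }

    // Eager mode pays the lookup cost here; lazy mode only records the term.
    void deleteByTerm(String term) {
        if (eager) {
            materialize(term);
        } else {
            bufferedTerms.add(term);
        }
    }

    // Lazy mode pays the aggregated cost of all buffered deletes here.
    BitSet reopen() {
        for (String term : bufferedTerms) {
            materialize(term);
        }
        bufferedTerms.clear();
        return (BitSet) deleted.clone();
    }

    int bufferedCount() {
        return bufferedTerms.size();
    }

    private void materialize(String term) {
        for (int doc = 0; doc < docTerms.length; doc++) {
            if (docTerms[doc].equals(term)) {
                deleted.set(doc);
            }
        }
    }
}

final class PendingDeletesDemo {
    public static void main(String[] args) {
        String[] terms = {"a", "b", "a", "c"};
        PendingDeletesSketch eager = new PendingDeletesSketch(terms, true);
        PendingDeletesSketch lazy = new PendingDeletesSketch(terms, false);
        eager.deleteByTerm("a");
        lazy.deleteByTerm("a");
        System.out.println(eager.bufferedCount()); // 0
        System.out.println(lazy.bufferedCount());  // 1
        System.out.println(eager.reopen());        // {0, 2}
        System.out.println(lazy.reopen());         // {0, 2}
    }
}
```

Writing the deleted bits out as .del files is left out on purpose; in the approach described above that would happen on IndexWriter's normal flush schedule rather than on every reopen.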

