incubator-lucy-dev mailing list archives

From Michael McCandless
Subject Re: real time updates
Date Sun, 15 Mar 2009 21:44:43 GMT

Marvin Humphrey wrote:

> On Sat, Mar 14, 2009 at 05:51:43AM -0400, Michael McCandless wrote:
>> Even w/ background merging, which allows new segments to be written &
>> reopened in a reader even while the big merge is running in the BG,
>> Lucene still has the challenge of warming a reader on the [large]
>> newly merged segment before using the reader "for real".
> Lucy doesn't have to worry about the warming aspect; given sufficient RAM,
> all the files in the recently written segment will still be "hot" in the
> OS file cache.
> The trick we need to master is the coordination of two concurrent write
> processes.  I think it goes something like this:
>  * The background consolidator writer grabs "consolidate.lock".  It starts
>    writing its own segment based on the state of the index at that moment.
>  * Meanwhile, an indeterminate number of consolidator-aware write processes
>    launch and complete.

So eg you could merge 2 sets of segments at once (like Lucene)?

>    These processes are forbidden from merging any files that pre-date the
>    establishment of "consolidate.lock".

Why?  It seems like it needs to merge segments created before that lock
(that's why it was launched).

>  * Once the consolidator finishes most of what it's doing, it waits to
>    obtain a write lock.  The only task left is to carry forward new
>    deletions which have been made since the establishment of
>    "consolidate.lock" against the segments which the consolidator has just
>    merged away.  It finishes that task, commits, releases "write.lock",
>    releases "consolidate.lock", then exits.

That, and update the master "segments" file to actually record the merge,
and incRef/decRef to delete files.
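
To make the three steps quoted above concrete, here is a minimal in-memory sketch. It is not Lucy or Lucene code: `ToyIndex` and all of its methods are hypothetical names, two `threading.Lock` objects stand in for "write.lock" and "consolidate.lock", segments are plain sets of doc ids assumed to be unique across the index, and taking the snapshot under the write lock is an assumption of this sketch.

```python
import threading


class ToyIndex:
    """Hypothetical in-memory index: segment name -> set of live doc ids."""

    def __init__(self, segments):
        self.segments = {name: set(docs) for name, docs in segments.items()}
        self.write_lock = threading.Lock()        # stands in for "write.lock"
        self.consolidate_lock = threading.Lock()  # stands in for "consolidate.lock"
        self.frozen = set()     # segments claimed by the consolidator
        self.late_deletes = {}  # deletions made against frozen segments

    def begin_consolidation(self):
        """Step 1: grab consolidate.lock and snapshot the current segments."""
        self.consolidate_lock.acquire()
        with self.write_lock:  # assumption: snapshot taken under write.lock
            self.frozen = set(self.segments)
            self.late_deletes = {name: set() for name in self.frozen}
            return {name: set(self.segments[name]) for name in self.frozen}

    def mergeable_segments(self):
        """Segments an ordinary writer may merge: only those not frozen."""
        return sorted(set(self.segments) - self.frozen)

    def add_segment(self, name, docs):
        """Step 2: an ordinary writer commits a new segment mid-merge."""
        with self.write_lock:
            self.segments[name] = set(docs)

    def delete_doc(self, name, doc):
        """Step 2: record deletions against frozen segments for later."""
        with self.write_lock:
            self.segments[name].discard(doc)
            if name in self.frozen:
                self.late_deletes[name].add(doc)

    def finish_consolidation(self, snapshot, merged_name):
        """Step 3: under write.lock, carry late deletions forward, commit."""
        merged = set().union(*snapshot.values())  # the long-running merge
        with self.write_lock:
            for name in self.frozen:
                merged -= self.late_deletes[name]
                del self.segments[name]  # merged-away segment is dropped
            self.segments[merged_name] = merged
            self.frozen = set()
            self.late_deletes = {}
        self.consolidate_lock.release()
```

In this sketch a deletion that arrives against `seg_2` while the consolidator is running is applied to the merged segment at commit time, and a segment written during the merge survives untouched.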

> Does that sound similar to the Lucene implementation?


But what if, while a large merge is happening, enough segments have been
written to warrant a small merge kicking off & finishing?

>> We need an incremental copy-on-write solution (eg only the "page" that's
>> changed gets copied when a new deletion arrives).  We need this for
>> changes to norms too.
> Norms, huh?  That's weird.  Do you have to do that because a field
> definition has been modified?

No, it's to handle someone calling IndexReader.setNorm, eg if they are
doing "realtime boosting".
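
As a rough illustration of the page-level copy-on-write idea quoted above (applied here to deletions only), the sketch below keeps a bit vector as a list of immutable pages. A new deletion copies just the one page it touches and shares the rest with the previous version. `PagedDeletions`, `PAGE_BITS` and the page size are hypothetical; neither Lucene nor Lucy is claimed to work this way.

```python
PAGE_BITS = 1024  # hypothetical page size, in bits


class PagedDeletions:
    """Deleted-docs bit vector stored as a list of immutable pages."""

    def __init__(self, num_docs, pages=None):
        self.num_docs = num_docs
        n_pages = (num_docs + PAGE_BITS - 1) // PAGE_BITS
        if pages is None:
            # bytes objects are immutable, so sharing one zero page is safe.
            pages = [bytes(PAGE_BITS // 8)] * n_pages
        self.pages = pages

    def is_deleted(self, doc):
        page, bit = divmod(doc, PAGE_BITS)
        return bool(self.pages[page][bit // 8] & (1 << (bit % 8)))

    def with_deletion(self, doc):
        """Return a new version; only the page holding `doc` is copied."""
        page, bit = divmod(doc, PAGE_BITS)
        new_page = bytearray(self.pages[page])  # copy of one page only
        new_page[bit // 8] |= 1 << (bit % 8)
        pages = list(self.pages)                # copies references, not data
        pages[page] = bytes(new_page)
        return PagedDeletions(self.num_docs, pages)
```

A reader holding the old version keeps seeing its own deletions, while a reopened reader gets the new version at the cost of one copied page instead of the whole vector.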

>> But then does contain all deletes for seg_2?  In which case this is just
>> like the "generation" Lucene increments & tacks on when it saves a del;
>> just a different naming scheme.
> That's right, it's just a different naming scheme.  In fact, it's
> marginally less efficient because the bit vector must be copied a little
> more often.  However, with that change, segment directories are truly
> never modified once written.  For somewhat esoteric reasons, that made it
> easier to factor a sensible DeletionsWriter out of the existing KinoSearch
> indexing code so that we could plug in alternative implementations.
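
The write-once property discussed above can be sketched generically: every save writes a complete new deletions file under a fresh generation number, nothing already written is overwritten, and a reader loads the highest generation. The `deletions-<segment>-<gen>.txt` naming, the comma-separated format and the dict standing in for a directory are all invented for this example; they are not the Lucene or Lucy/KinoSearch file layout.

```python
import re


def latest_generation(files, segment):
    """Highest generation saved for `segment`, or 0 if none exists."""
    pattern = re.compile(rf"deletions-{re.escape(segment)}-(\d+)\.txt$")
    gens = [int(m.group(1)) for m in map(pattern.match, files) if m]
    return max(gens, default=0)


def save_deletions(files, segment, deleted_docs):
    """Write a full new deletions file; older generations stay untouched."""
    gen = latest_generation(files, segment) + 1
    name = f"deletions-{segment}-{gen}.txt"  # hypothetical naming scheme
    files[name] = ",".join(str(d) for d in sorted(deleted_docs))
    return name


def load_deletions(files, segment):
    """Read the newest generation for `segment` as a set of doc ids."""
    gen = latest_generation(files, segment)
    if gen == 0:
        return set()
    text = files[f"deletions-{segment}-{gen}.txt"]
    return {int(d) for d in text.split(",") if d}
```

The cost mentioned in the quoted reply shows up directly: each save rewrites the full set of deletions for the segment, in exchange for never modifying a file once it has been written.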


