incubator-lucy-dev mailing list archives

From Marvin Humphrey
Subject Re: real time updates
Date Sun, 15 Mar 2009 16:08:25 GMT
On Sat, Mar 14, 2009 at 05:51:43AM -0400, Michael McCandless wrote:
> Even w/ background merging, which allows new segments to be written &
> reopened in a reader even while the big merge is running in the BG,
> Lucene still has the challenge of warming a reader on the [large]
> newly merged segment before using the reader "for real".  

Lucy doesn't have to worry about the warming aspect; given sufficient RAM, all
the files in the recently written segment will still be "hot" in the OS file

The trick we need to master is the coordination of two concurrent write
processes.  I think it goes something like this:
  * The background consolidator writer grabs "consolidate.lock".  It starts  
    writing its own segment based on the state of the index at that moment.
  * Meanwhile, an indeterminate number of consolidator-aware write processes
    launch and complete. These processes are forbidden from merging any files
    that pre-date the establishment of "consolidate.lock".
  * Once the consolidator finishes most of what it's doing, it waits to obtain
    a write lock.  The only task left is to carry forward new deletions which
    have been made since the establishment of "consolidate.lock" against the
    segments which the consolidator has just merged away.  It finishes that
    task, commits, releases "write.lock", releases "consolidate.lock", then

Does that sound similar to the Lucene implementation?
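
To make the hand-off above concrete, here is a minimal single-process sketch in
Python. It is only an illustration under stated assumptions, not Lucy or
KinoSearch code: ToyIndex, writer_commit and Consolidator are hypothetical
names, threading.Lock objects stand in for the "write.lock" and
"consolidate.lock" files, and deletions are tracked as plain sets of doc ids.

```python
import threading


class ToyIndex:
    """In-memory stand-in for an index: segment name -> deleted doc ids."""

    def __init__(self, segment_names):
        self.deletions = {name: set() for name in segment_names}
        self.write_lock = threading.Lock()        # plays the role of "write.lock"
        self.consolidate_lock = threading.Lock()  # plays the role of "consolidate.lock"
        self.frozen = frozenset()  # segments claimed by a running consolidator


def writer_commit(index, new_segment, deletes=(), merge=()):
    """Consolidator-aware writer: may add, delete and merge, but never
    merges segments that pre-date the consolidate lock."""
    with index.write_lock:
        if index.frozen.intersection(merge):
            raise ValueError("cannot merge segments that pre-date consolidate.lock")
        for segment, doc_id in deletes:
            index.deletions[segment].add(doc_id)
        for name in merge:
            del index.deletions[name]  # folded into new_segment
        index.deletions[new_segment] = set()


class Consolidator:
    def begin(self, index):
        """Grab the consolidate lock and snapshot the index at that moment.
        (Simplification: a real implementation would read a committed snapshot.)"""
        index.consolidate_lock.acquire()
        self.snapshot = {name: set(d) for name, d in index.deletions.items()}
        index.frozen = frozenset(self.snapshot)

    def finish(self, index, merged_name):
        """Wait for the write lock, carry forward deletions made since the
        snapshot, commit, then release both locks."""
        with index.write_lock:
            carried = set()
            for name, old in self.snapshot.items():
                new = index.deletions.pop(name) - old
                carried |= {(name, doc_id) for doc_id in new}
            index.deletions[merged_name] = carried
            index.frozen = frozenset()
        index.consolidate_lock.release()


index = ToyIndex(["seg_1", "seg_2"])
index.deletions["seg_1"].add(3)      # deleted before consolidation began
consolidator = Consolidator()
consolidator.begin(index)
writer_commit(index, "seg_3", deletes=[("seg_2", 7)])  # runs during the merge
consolidator.finish(index, "seg_4")
print(index.deletions)  # {'seg_3': set(), 'seg_4': {('seg_2', 7)}}
```

Doc 3 of seg_1 was already deleted when the snapshot was taken, so it is simply
merged away; only the deletion that arrived afterwards is carried forward.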

> We need an incremental copy-on-write solution (e.g. only the "page" that's
> changed gets copied when a new deletion arrives).  We need this for changes
> to norms too.

Norms, huh?  That's weird.  Do you have to do that because a field definition
has been modified?

> But then does contain all deletes for seg_2?  In which
> case this is just like the "generation" Lucene increments & tacks on when
> it saves a del; just a different naming scheme.

That's right, it's just a different naming scheme.  In fact, it's marginally
less efficient because the bit vector must be copied a little more often.

However, with that change, segment directories are truly never modified once
written.  For somewhat esoteric reasons, that made it easier to factor a
sensible DeletionsWriter out of the existing KinoSearch indexing code so that
we could plug in alternative implementations.  
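
For comparison with copying the whole bit vector, here is a rough Python sketch
of the page-level copy-on-write idea raised in the quoted text above. It is an
assumption-laden illustration, not an existing Lucy or Lucene class:
PagedBitVector, with_bit_set and the PAGE_BITS page size are all hypothetical.

```python
PAGE_BITS = 1024  # hypothetical page size, in bits


class PagedBitVector:
    """Bit vector whose unchanged pages are shared between versions."""

    def __init__(self, num_bits, pages=None):
        self.num_bits = num_bits
        if pages is None:
            num_pages = (num_bits + PAGE_BITS - 1) // PAGE_BITS
            # bytes objects are immutable, so sharing one zero page is safe
            pages = [bytes(PAGE_BITS // 8)] * num_pages
        self.pages = pages

    def _locate(self, bit):
        if not 0 <= bit < self.num_bits:
            raise IndexError(bit)
        page, offset = divmod(bit, PAGE_BITS)
        return page, offset // 8, 1 << (offset % 8)

    def get(self, bit):
        page, byte, mask = self._locate(bit)
        return bool(self.pages[page][byte] & mask)

    def with_bit_set(self, bit):
        """Return a new version; only the page holding `bit` is copied."""
        page, byte, mask = self._locate(bit)
        changed = bytearray(self.pages[page])  # copy of one page only
        changed[byte] |= mask
        new_pages = list(self.pages)           # copies references, not page data
        new_pages[page] = bytes(changed)
        return PagedBitVector(self.num_bits, new_pages)


old = PagedBitVector(4096)
new = old.with_bit_set(2000)   # record one new deletion
assert new.get(2000) and not old.get(2000)
assert new.pages[0] is old.pages[0]  # untouched page is shared, not copied
```

A reader holding the old version keeps seeing the old deletions, while the new
version costs one page copy instead of a copy of the full vector.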

Marvin Humphrey
