incubator-lucy-dev mailing list archives

From Marvin Humphrey <>
Subject Re: real time updates
Date Fri, 13 Mar 2009 23:03:26 GMT
On Fri, Mar 13, 2009 at 12:48:11PM -0700, Nathan Kurz wrote:
> [moved from private to lucy-dev in case others are interested]
> > KS has a slow best-case write speed, but I'm not worried about that.  The
> > problem that the Lucene folks and I are trying to address under the topic
> > heading of "real-time indexing" is *worst-case* write speed: most of the
> > time you're fine, but every once in a while you trigger a large merge and
> > you wait a loooooong time.  That problem has a different solution.
> What is the actual case that would trigger such a problem?  

The "big merge" problem is easy to trigger: repeatedly open an indexer, add a
small number of documents, and close it.  Each time you do that, a new segment
is added, and search-time performance degrades.  Every once in a while, the
segment merging algorithm decides that it needs to perform a big
consolidation, and you have to wait while it finishes.
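
Here's a toy model of that behavior, sketched in Python.  The merge policy is
hypothetical (consolidate whenever ten segments pile up in the same size tier),
not KS's actual algorithm, but it shows the shape of the problem: most commits
are cheap, and every so often one cascades into a big merge.

```python
# Toy model: each commit adds one small segment; a hypothetical tiered
# merge policy consolidates whenever `fanout` segments share a size tier.
def commit(segments, docs_per_commit=10, fanout=10):
    """Add a small segment, then merge any full size tier (may cascade)."""
    segments.append(docs_per_commit)
    merged = 0  # docs rewritten by merging during this commit
    while True:
        # Group segments into size tiers by order of magnitude.
        tiers = {}
        for size in segments:
            tiers.setdefault(len(str(size)), []).append(size)
        full = [sizes for sizes in tiers.values() if len(sizes) >= fanout]
        if not full:
            return merged
        for sizes in full:
            for s in sizes:
                segments.remove(s)
            segments.append(sum(sizes))
            merged += sum(sizes)

segments = []
costs = [commit(segments) for _ in range(1000)]
# 900 of the 1000 commits cost nothing; the worst one rewrites 11,100 docs.
print(max(costs), sum(1 for c in costs if c == 0))
```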

Solving the big-merge problem requires setting up a background consolidation
routine that allows new segments to be written while it's running.  We'll need
to make some mods to write locking, commit locking, segment naming, snapshot
file naming, and merging policies; I'm currently working on an IndexManager
class which bundles all of those concerns together.

Lucene does background merging already using a single process with multiple
threads.  I'm thinking that we'd try a background process approach instead, so
that additional write processes could be launched which add new segments while
the background merger churns.
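
The IndexManager details are still in flux, but the coordination primitive is
simple: guard the *consolidation* with an exclusive lock, while segment-adding
writers never contend for it.  A minimal Python sketch (the `try_acquire` /
`release` names and the lock-file scheme are invented for illustration):

```python
import errno, os, tempfile

def try_acquire(lock_path):
    """Atomically create the lock file; return its fd, or None if it's held."""
    try:
        return os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except OSError as e:
        if e.errno == errno.EEXIST:
            return None
        raise

def release(lock_path, fd):
    os.close(fd)
    os.unlink(lock_path)

lock = os.path.join(tempfile.mkdtemp(), "merge.lock")
fd = try_acquire(lock)            # the background merger takes the merge lock
assert try_acquire(lock) is None  # a second merger backs off...
release(lock, fd)                 # ...but writers adding segments never touch it
```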

We also have a second major problem: the amount of time it takes to load
certain caches in IndexReader from disk into process memory -- especially sort
caches.  This one we fix by 1) designing index file formats that can be
mmap'd, and 2) implementing "segment-centric sorted search", which uses
per-segment mmap'd caches rather than a per-index cache which has to be
rebuilt from scratch in process memory each time the index is modified.
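
To illustrate the mmap half of that, here's a Python sketch.  The file format
is made up (a flat array of little-endian 32-bit ordinals, one per document),
not the real KS format, but it shows the payoff: opening a segment's sort
cache is just a map, with no rebuild step and no per-open load cost.

```python
import mmap, os, struct, tempfile

def write_sort_cache(path, ords):
    """Write a segment's sort cache as packed little-endian 32-bit ordinals."""
    with open(path, "wb") as fh:
        fh.write(struct.pack("<%dI" % len(ords), *ords))

def open_sort_cache(path):
    """Map the cache into memory -- no rebuild-from-scratch on open."""
    fh = open(path, "rb")
    mem = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
    fh.close()  # the mapping stays valid after the fd is closed
    return lambda doc_id: struct.unpack_from("<I", mem, doc_id * 4)[0]

path = os.path.join(tempfile.mkdtemp(), "seg_1.ord")
write_sort_cache(path, [3, 0, 2, 1])
lookup = open_sort_cache(path)
assert lookup(2) == 2  # doc 2's sort ordinal, read straight off the page cache
```

Because each cache is per-segment, modifying the index only invalidates the
caches of the segments that actually changed.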

Lastly, we have a problem regarding writing out deletions, minor in comparison
to the other two but still worth thinking over: the size of the deletions files
we write out scales with the size of the index rather than number of deletions
we make.  If you delete a single document, but that document happens to reside
in a segment with 1 million documents, we have to write out a new bit vector
file of roughly 125 KB -- one bit per document in the segment.
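
The arithmetic, for a one-bit-per-document deletions file:

```python
def bitvec_bytes(docs_in_segment):
    """Size of a one-bit-per-document deletions file, rounded up to bytes."""
    return (docs_in_segment + 7) // 8

print(bitvec_bytes(1_000_000))  # → 125000 bytes, even for a single deletion
```

The cost is a function of the segment's size alone; deleting one document or
half the segment writes the same file.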

I thought we had that licked with the "tombstones" approach, but the
performance of iterated deletions turns out to be worse than expected, even
before we get a priority queue involved to merge multiple tombstone rows.
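
For a sense of where that overhead comes from, here's an illustrative Python
sketch (not KS's tombstone code): each hit from the match loop has to be
checked against a priority-queue merge of all the sorted tombstone rows, so
the merge machinery sits on the hot path of every search.

```python
import heapq

def undeleted_hits(hits, tombstone_rows):
    """Filter sorted doc ids in `hits` against sorted rows of deleted ids."""
    dead = heapq.merge(*tombstone_rows)  # priority-queue merge of the rows
    next_dead = next(dead, None)
    for doc in hits:
        # Advance the merged tombstone stream in lockstep with the hits.
        while next_dead is not None and next_dead < doc:
            next_dead = next(dead, None)
        if doc != next_dead:
            yield doc

rows = [[2, 9], [4], [9, 11]]  # tombstone rows from successive commits
print(list(undeleted_hits(range(12), rows)))  # → [0, 1, 3, 5, 6, 7, 8, 10]
```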

> Would the equivalent of row-level-locking allow you to
> modify the index files in real time, instead of adding addenda and
> tombstones?   I'm not necessarily suggesting that the default index
> format should do this, but that it might be worth considering whether
> a proposed API would support such a real-time format.

I absolutely agree that deletions support should be pluggable.

How does this interface look?

  package MyDelWriter;
  use base qw( Lucy::Index::DeletionsWriter );


  package MyDelReader;
  use base qw( Lucy::Index::DeletionsReader );


  package MyArchitecture;
  use base qw( Lucy::Architecture );

  sub make_deletions_writer {
    return MyDelWriter->new(@_);
  }

  sub make_deletions_reader {
    return MyDelReader->new(@_);
  }

  package MySchema;
  use base qw( Lucy::Schema );

  sub make_architecture {
    return MyArchitecture->new;
  }
That's the interface I've worked into KS over the last few weeks.  

The default deletions implementation uses BitVecDelWriter/BitVecDelReader.  It
behaves slightly differently from Lucene and old KS; new bit vector files
belong to new segments but reference old ones (so "seg_8/"
contains deletions which should be applied against segment 2).

I haven't yet tested that API to verify that an alternative implementation
works.  Perhaps a pure-Perl, hash-based MockDeletions class is in order?

Marvin Humphrey
