incubator-lucy-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marvin Humphrey <>
Subject Re: real time updates
Date Sun, 15 Mar 2009 10:33:53 GMT
On Sat, Mar 14, 2009 at 12:21:10PM -0700, Nathan Kurz wrote:
> > Every once in a while, the segment merging algorithm decides that it needs
> > to perform a big consolidation, and you have to wait while it finishes.
> Yes, but that's an artifact of current approach of adding segments rather
> than making real-time replacements.  

Perhaps we can abstract things enough to plug in a completely different
engine, but we're still going to have to solve this problem for the segmented
inverted index engine.

The very nature of an inverted index is that a single document gets broken up
and listed in many posting lists.  Updating all those posting lists is
basically impossible without rebuilding the index.  

Say you were to replace a chapter in a book; now the index at the back of the
book is out of date and needs to be updated.  You can't just swap out a couple
pages in the book's index, because references to the old chapter are
interspersed with references to all the other chapters.  You can go in with a
black marker and cross out all the references to the old chapter, but
inserting references to the new chapter is going to be very difficult.  The
only realistic approach is to tear out the whole index and start from scratch.

At least Lucene/KinoSearch/Lucy don't force you to rebuild the index from
scratch every time you make a change.  Instead they use a segmented approach,
which works pretty well, but has this annoying problem with worst-case write

> I was more wondering if there is anything inherent about the rate of change
> required that would prevent a fully incremental update from working.

In-place updating would work for deletions: we just give every segment its own
dedicated, permanent deletions bit vector file and flip bits in it.  

That can't be the default behavior because it breaks IndexReader's promise to
provide a point-in-time view of the index.  However, I think it would be fine
to accomodate users who don't need IndexReader consistency by making it
possible to plug in a QuickAndDirtyDeletionsWriter /
QuickAndDirtyDeletionsReader combo.

In-place updating could also work for subclasses of DocWriter.  Again, it
couldn't be the default implementation, but we could say, plug in a
FlatFileDocWriter that wrote to a flat file with fixed length rows; fixed
length rows are easy to update in place.

In theory, in-place updating could even work for PostingsWriter, if your
PostingsWriter didn't write an inverted index but instead wrote something else
entirely.  But again, that couldn't be the default behavior.

> > How does this interface look?
> >
> >  package MyDelWriter;
> >  use base qw( Lucy::Index::DeletionsWriter );
> >  ...
> This feels to me like it is solving the wrong problem.  There's
> nothing wrong with it, but DeletionsWriter and DeletionsReader seem
> like internal implementation details of particular type of Index.

The Architecture class is meant to help with this problem.  

I've been wrestling with exactly what to put into Architecture.  The default
type of index will a segmented inverted index, and the factory methods in
Architecture (Make_Deletions_Writer, Make_Doc_Writer, Make_Postings_Writer,
etc), are designed to accommodate that. 

Perhaps we ought to have an InvertedIndexArchitecture which serves as the
default, and have Architecture be an abstract base class which could also fit
well over an SQLite or graph-based back end.  But then what would we put in
Architecture that's common to all of those?  How do we avoid abstracting it so
thoroughly that there's nothing left?  

If you were to write an Architecture class to support pluggability, what would
it look like?

> I'd hope that the interface between a Scorer and an Index could be
> very simple,  probably just a single function to get a PostingList.

As an aside... I'd prefer that we never define a core class named "Index".
The word "index" is horrendously overloaded and thus unclear.  Same goes for
"Search".  They're fine as package names, but poor as class names.

Getting back to the point... I think IndexReader has to have Doc_Max() and
Doc_Freq().  How do we avoid those and still support TF/IDF?  It's also hard
to imagine life without Lexicon(), or support for sort caches.

Perhaps an Engine class with a smaller interface as IndexReader's parent?

> Thta PostingList would provide navigation by docID, but deletions
> would be handled internally and never be seen by the Scorer.

Actually, this is controversial.  We're hoping to see gains by applying
deletions at the Scorer_Collect() level, so PostingList wouldn't know about
them.  However, the matter is unresolved.

Marvin Humphrey

View raw message