incubator-lucy-dev mailing list archives

From Marvin Humphrey <>
Subject Re: real time updates
Date Wed, 18 Mar 2009 16:58:18 GMT
On Mon, Mar 16, 2009 at 12:09:43AM -0700, Nathan Kurz wrote:

> > Since Lucy's internal doc numbers will be ephemeral, they wouldn't work.
> > We'd need to add a primary key field.
> While I understand this is mostly true, I had assumed that it was
> possible to create some synthetic primary key, say, 'segment - doc'.

We have ephemeral, private primary keys, if you want to think about it that
way.  For a given Snapshot/PolyReader, we order the segments by age, laying
them end to end.  If we had 3 segments with exactly 100 docs each, then from
the perspective of the PolyReader, there is a document number 150 which
belongs to the second segment.

However, once the first, oldest segment in that group gets merged away, the
second segment moves leftward in the array, and what was formerly document
number 150 becomes number 50. 
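The layout described above can be sketched in a few lines. This is not Lucy's actual code, just an illustration of how a PolyReader-style view maps a "global" document number onto a (segment, local doc number) pair by laying the segments end to end in age order:

```python
def resolve(doc_num, segment_sizes):
    """Map a global doc number to (segment_index, local_doc_num)
    for segments laid end to end in age order."""
    offset = 0
    for seg, size in enumerate(segment_sizes):
        if doc_num < offset + size:
            return seg, doc_num - offset
        offset += size
    raise IndexError("doc number out of range")

# Three segments of 100 docs each: global doc 150 is local doc 50
# of the second segment.
print(resolve(150, [100, 100, 100]))   # → (1, 50)

# Once the oldest segment is merged away, that same document is
# now global doc number 50.
print(resolve(50, [100, 100]))         # → (0, 50)
```

The instability falls directly out of the offset arithmetic: remove a segment on the left and every global number to its right shrinks by that segment's size.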

Is that good enough for your purposes?  I don't think it is.  A naive external
caller can't use a Lucy/KS document number to recall a specific document over
time.

> Yes, I'm suggesting that updating each and every one of those posting lists
> is both feasible and possibly the best approach.  

That sounds kind of crazy to me -- but in my experience awesome sometimes
hides behind what I think must be crazy, so let's stumble onwards. :)

The problem is making space for the inserted postings.  Say you were to start
off a book's index with only the first chapter.  As you add chapters, that
index needs to grow.  To make room for each inserted posting, you're going to
have to shift stuff rightwards continuously in the postings file.  Ultimately,
it's a lot more work doing that incrementally than just writing out a new
file.  Even a single move is expensive.

Other data structures are optimized for insertions (e.g. B+ tree), but even so
you end up with a lot of churn.
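The cost argument can be made concrete with a toy model. Inserting one posting into a packed, sorted postings list shifts every later entry rightwards, so building up a list by repeated insertion costs O(n^2) moves in total, whereas a single rewrite (merge) touches each entry once. A minimal sketch:

```python
import bisect

def insert_posting(postings, doc_num):
    """Insert doc_num in sorted order; return how many existing
    entries had to shift rightwards to make room."""
    i = bisect.bisect_left(postings, doc_num)
    shifted = len(postings) - i
    postings.insert(i, doc_num)   # O(n) shift inside the array
    return shifted

postings = []
total_shifts = sum(insert_posting(postings, d) for d in [5, 1, 3, 2, 4])
print(postings)       # → [1, 2, 3, 4, 5]
print(total_shifts)   # → 5
```

Five insertions already force five shifts; a disk-backed postings file pays the same quadratic penalty in I/O rather than memory moves, which is why writing out a fresh file wins.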

> It only becomes impossible if the update rate exceeds the available memory
> bandwidth, which for a modern processor is mighty high.  Given a reasonable
> read/write ratio, I think it's a win, at least for the cases I'm interested
> in.  I'm not suggesting you move to such an approach now, but it's a route I
> strongly want left open until conclusively proven to be unworkable.

Well, it would be great if we could keep the basic Indexer class as lean and
mean as possible.  That way, implementing an alternative which supports update
semantics rather than just delete/add stays super simple.

> > Getting back to the point... I think IndexReader has to have Doc_Max() and
> > Doc_Freq().  How do we avoid those and still support TF/IDF?
> You are right that we need these, but it's just a question of where
> the support should go.  As a non-OO thinker, I view these as being
> data members within some standardized Posting struct.  Thus I'd guess
> that these should be part of the Posting object, accessible only
> during the Scoring phase, rather than on the IndexReader.   But likely
> I'm not understanding you here.

We need to know Doc_Freq() and Doc_Max() before scoring commences, because
they are used to calculate a float multiplier that's applied against the score
of each hit.  Rarer terms get a higher multiplier.  

Furthermore, IDF is calculated using Doc_Freq for the entire collection, which
may span several searchers in a cluster, while Scorer and Posting are local
only and so don't know enough about the whole collection.

In contrast, Term Frequency (the "TF" in TF/IDF) is per-document, and is
stored in the postings files as per your impression.
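To make the division of labor concrete, here is the classic IDF calculation in sketch form. The formula below is the well-known Lucene-style variant, not necessarily the exact one Lucy/KS uses, but it shows why Doc_Freq() and Doc_Max() must be available before scoring starts:

```python
import math

def idf(doc_freq, doc_max):
    """Classic Lucene-style inverse document frequency.  Rarer terms
    (small doc_freq relative to doc_max) get a larger multiplier.
    Lucy/KS may use a variant of this formula."""
    return 1.0 + math.log(doc_max / (doc_freq + 1))

# In a 1000-doc collection, a term appearing in 10 docs outweighs
# one appearing in 500:
print(idf(10, 1000) > idf(500, 1000))   # → True
```

Note that both inputs are collection-wide statistics: in a cluster, doc_freq and doc_max must be aggregated across all searchers before the multiplier can be computed, which is exactly why a local Scorer or Posting can't do it alone.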

> > It's also hard to imagine life without Lexicon(), or support for sort caches.
> Again, it's just a question of where these should go.  While you view
> the Lexicon as being part of the Index (and commented appropriately on
> the overloading of this term), I view them as independent.  

Without Lexicon(), sort caches, and the like, IndexReader becomes a
very generic class.  So generic that it's not really useful?  I don't know.

We're going to have to publish a class that looks like IndexReader --
Doc_Freq, Lexicon, Doc_Max, etc -- in order to support the segmented inverted
index engine.  It should probably be named "IndexReader".  :)  Do you think
there ought to be a more generic super class above that one?

> I can view cases where different inverted indexes might share a common
> lexicon in the same way that segments currently share one, perhaps even with
> the Lexicon being generated and stored by an unrelated app (say, for user
> applied tags).

Lexicon is an abstract class, and SegReader is pluggable.  If you override
Arch_Make_Lex_Reader, you can supply an implementation of LexReader that
reads from an external app.
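The shape of that plug-in point might look something like the following. This is a Python sketch of the idea only; the real Lucy API is C with host-language bindings, and every name below other than Arch_Make_Lex_Reader is hypothetical:

```python
class LexReader:
    """Abstract lexicon reader (stand-in for Lucy's LexReader)."""
    def fetch_term(self, term):
        raise NotImplementedError

class ExternalLexReader(LexReader):
    """Reads lexicon entries from some external source -- e.g. an
    unrelated app that maintains user-applied tags."""
    def __init__(self, source):
        self.source = source
    def fetch_term(self, term):
        return self.source.get(term)

class MyArchitecture:
    """Hypothetical Architecture subclass; make_lex_reader stands in
    for an Arch_Make_Lex_Reader override."""
    def make_lex_reader(self, source):
        return ExternalLexReader(source)

tags = {"lucy": [1, 7, 42]}
reader = MyArchitecture().make_lex_reader(tags)
print(reader.fetch_term("lucy"))   # → [1, 7, 42]
```

The point of the factory-method indirection is that SegReader never needs to know where the lexicon lives; it just asks the Architecture for a LexReader.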

> > If you were to write an Architecture class to support pluggability, what would
> > it look like?
> I guess that's what I'm hoping to answer.  I don't currently know.  My
> instinct is that it might be best to treat the problem as one of data
> exchange between levels rather than pluggability, but that might just
> be my poor OO intuition.

It'll probably be easier to offer a critique once I can provide links to HTML
documentation for the entire pluggability apparatus: SegWriter, SegReader,
Segment, Snapshot, DataReader, DataWriter, and various subclasses of
DataWriter/DataReader like BitVecDelWriter, HashRowWriter (the default doc
writer), etc.

I think what you will object to about the current implementation of
Architecture is that it is dedicated to plugins for SegWriter and SegReader,
and is thus segment-centric, making it insufficiently generic.  However, I'm
not sure how to write something that's more generic that would still be
useful.
Marvin Humphrey
