incubator-lucy-dev mailing list archives

From Marvin Humphrey
Subject Re: real time updates
Date Sun, 15 Mar 2009 00:41:44 GMT
On Sat, Mar 14, 2009 at 12:21:10PM -0700, Nathan Kurz wrote:
> For indexing, I'd love to see the same agnostic behaviour.  The
> Indexer calls knows only about a single function like
> UpdatePosting(docID, newPostings).

While this interface tries hard to be "agnostic" and highly abstract, it in
fact imposes a requirement that neither Lucene nor KinoSearch nor SQLite could
satisfy by default: "docID" has to be a primary key.  Since Lucy's internal
doc numbers will be ephemeral, they wouldn't work.  We'd need to add a primary
key field.
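To make the "ephemeral" point concrete, here is a toy sketch in Python. It is not Lucy or KinoSearch code; `ToyIndex`, `merge` and the `id` field are names invented for illustration. It assumes doc numbers are simply positions, so they shift once deleted docs are squeezed out, while a user-supplied key field stays stable.

```python
class ToyIndex:
    """Toy stand-in for a segmented index; doc numbers are list positions."""

    def __init__(self):
        self.docs = []        # position in this list == internal doc number
        self.deleted = set()  # doc numbers marked as deleted

    def add_doc(self, doc):
        self.docs.append(doc)
        return len(self.docs) - 1  # ephemeral doc number

    def delete(self, doc_num):
        self.deleted.add(doc_num)

    def merge(self):
        # Squeeze out deleted docs, roughly as a segment merge would.
        # Surviving docs are renumbered as a side effect.
        self.docs = [d for n, d in enumerate(self.docs)
                     if n not in self.deleted]
        self.deleted.clear()

    def doc_num_for_key(self, key):
        # Look up the current doc number via the hypothetical "id" field.
        for n, d in enumerate(self.docs):
            if n not in self.deleted and d["id"] == key:
                return n
        return None


index = ToyIndex()
index.add_doc({"id": "a", "body": "first"})
b_before = index.add_doc({"id": "b", "body": "second"})
index.delete(0)
index.merge()
b_after = index.doc_num_for_key("b")
print(b_before, b_after)
```

The doc number handed out for "b" at add time no longer identifies it after the merge, which is why an UpdatePosting(docID, ...) style call would need a separate primary key field.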

> My canonical example for this is that I want to be able to store my
> index in SQLite, and write a thin layer of interface between it and
> the rest of Lucy.  But my real desire is to substitute a custom mmap()
> solution such as the fast graph database referenced earlier.

That sounds like a fun exercise.  Let's start from a clean slate, and try to
build up interfaces for Indexer and Searchable.  Just so that it's clear that
nothing final is being decided, let's call our experimental project "Luser".

Our Luser APIs must support two engines:

  * A segmented inverted index back end.
  * An SQLite back end.

We will assume the following:

  * A Doc class that supports multiple fields.
  * An opaque Hits class.
  * A Query class that somehow compiles down to a Scorer given access
    to a Searchable.
  * Global field semantics, supported by an opaque Schema class.
  * An Engine class which does all the work.

I'm going to use a slightly simplified version of the current ".bp" syntax.

Let's start with Indexer:

  public class Luser::Index::Indexer extends Luser::Obj {
    public Indexer*
    new(Indexer *self, Schema *schema, Engine *engine);

    public void
    Add_Doc(Indexer *self, Doc *doc);

    public void
    Delete_By_Term(Indexer *self, const CharBuf *field, Obj *term);

    public void
    Commit(Indexer *self);
  }

I'm dissatisfied with that constructor, and we're leaving off Prepare_Commit,
the destructor and a bunch of other important methods... but let's just stumble
past all that.

I'd like to add this to Indexer:

    public void
    Replace_Doc(Indexer *self, const CharBuf *field, Obj *term, Doc *doc);

"Replace_Doc" is superior to "Update_Doc" because the method name informs the
user that they have to supply the entire document and not just the updated
fields.

Unfortunately, absent a primary key constraint they both suck, because it's
not clear whether you'll end up replacing many docs with one.  
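A toy illustration of the problem, in Python rather than ".bp". The `replace_doc` function and the sample docs are invented for this sketch; it assumes replace means "delete everything matching the term, then add the new doc", i.e. Delete_By_Term followed by Add_Doc.

```python
def replace_doc(docs, field, term, new_doc):
    """Delete every doc whose `field` equals `term`, then add `new_doc`.

    Returns the new doc list and the number of docs that were removed.
    """
    survivors = [d for d in docs if d.get(field) != term]
    removed = len(docs) - len(survivors)
    survivors.append(new_doc)
    return survivors, removed


docs = [
    {"author": "kurz", "title": "mmap notes"},
    {"author": "kurz", "title": "graph databases"},
    {"author": "humphrey", "title": "segments"},
]

# "author" is not unique, so a single replacement wipes out two docs.
docs, removed = replace_doc(
    docs, "author", "kurz",
    {"author": "kurz", "title": "mmap notes, revised"},
)
print(removed, len(docs))
```

Nothing in the call signature warns the user that the term matched more than one document.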

I think the only way to work in update semantics is to put them in an
IndexUpdater class that forces you to specify a primary key.  Otherwise we'll
end up fielding waaaay too many questions from confused noobs on the user
list.
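A rough Python sketch of what such a class might look like. This is only one possible shape: the constructor argument, the method names and the dict-backed storage are all assumptions made for illustration, not a proposed Luser API.

```python
class IndexUpdater:
    """Hypothetical updater that refuses to work without a primary key."""

    def __init__(self, primary_key):
        if not primary_key:
            raise ValueError("a primary key field is required")
        self.primary_key = primary_key
        self.docs = {}  # primary key value -> doc

    def add_doc(self, doc):
        key = doc[self.primary_key]
        if key in self.docs:
            raise KeyError(f"duplicate primary key: {key!r}")
        self.docs[key] = doc

    def replace_doc(self, doc):
        # At most one existing doc can share the key, so the
        # replacement is always one-for-one (or a plain add).
        key = doc[self.primary_key]
        existed = key in self.docs
        self.docs[key] = doc
        return existed


updater = IndexUpdater(primary_key="url")
updater.add_doc({"url": "/a", "title": "old"})
replaced = updater.replace_doc({"url": "/a", "title": "new"})
print(replaced, len(updater.docs))
```

Because the key is declared up front, the many-for-one case cannot arise, and a doc with a missing key fails loudly instead of silently matching nothing.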

Next up, Searchable.

  public abstract class Luser::Search::Searchable extends Luser::Obj {
    public Searchable*
    init(Searchable *self, Schema *schema);

    public Hits*
    Hits(Searchable *self, Query *query, SortSpec *sort_spec = NULL,
         u32_t offset = 0, u32_t wanted = 10);

    public abstract void
    Collect(Searchable *self, Query *query, HitCollector *collector);

    public abstract Doc*
    Fetch_Doc(Searchable *self, i32_t doc_num);
  }

Hmm.  We need to add Doc_Max and Doc_Freq, otherwise TF/IDF scoring won't
work.  However, they aren't needed by the SQLite back end.
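For reference, a small Python sketch of why the two numbers matter. The formula below is one common IDF formulation, similar to the classic Lucene one; it is assumed here for illustration and is not necessarily what Lucy would use. The function and argument names are likewise invented.

```python
import math


def idf(doc_freq, doc_max):
    """Inverse document frequency under one common formulation.

    doc_freq: number of docs containing the term (cf. Doc_Freq)
    doc_max:  total number of docs in the collection (cf. Doc_Max)
    """
    return 1.0 + math.log(doc_max / (doc_freq + 1))


rare = idf(doc_freq=9, doc_max=1000)
common = idf(doc_freq=999, doc_max=1000)
print(rare > common)
```

Both inputs are collection-wide statistics, so a Searchable that cannot report them cannot weight rare terms above common ones.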

We've run into abstraction difficulties with both Indexer and Searchable.  We
can't reduce our interfaces to a pure intersection without crippling one of
the engines.

The alternative is to define some abstract methods which wouldn't always be
needed or implemented.  However, Luser can't support every possible engine
without bloating up.  We have to make choices about who we are.
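A Python sketch of that alternative, under loud assumptions: the class names mirror the discussion but the code is invented for illustration, the SQLite table layout is made up, and doc_num is mapped onto SQLite's rowid purely for convenience. It uses only the standard library `sqlite3` and `abc` modules.

```python
import sqlite3
from abc import ABC, abstractmethod


class Searchable(ABC):
    """Base class: one required method, two optional ones."""

    @abstractmethod
    def fetch_doc(self, doc_num):
        ...

    # Optional: only engines that score with TF/IDF need to override these.
    def doc_max(self):
        raise NotImplementedError("doc_max not supported by this engine")

    def doc_freq(self, field, term):
        raise NotImplementedError("doc_freq not supported by this engine")


class SQLiteSearchable(Searchable):
    """Hypothetical SQLite back end that skips the TF/IDF statistics."""

    def __init__(self, conn):
        self.conn = conn

    def fetch_doc(self, doc_num):
        row = self.conn.execute(
            "SELECT title FROM docs WHERE rowid = ?", (doc_num,)
        ).fetchone()
        return None if row is None else {"title": row[0]}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (title TEXT)")
conn.execute("INSERT INTO docs (title) VALUES (?)", ("real time updates",))
searcher = SQLiteSearchable(conn)
doc = searcher.fetch_doc(1)
print(doc)
```

The cost is visible in the base class: every optional method is one more thing each engine author has to read, understand and consciously decline to implement.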

Marvin Humphrey
