incubator-lucy-dev mailing list archives

From Marvin Humphrey
Subject Pluggable IndexReader (was "real time updates")
Date Sun, 22 Mar 2009 18:51:23 GMT
I wrote:

> Without Lexicon(), sort caches, and the like, IndexReader becomes a
> very generic class.  So generic that it's not really useful?  I don't know.
> We're going to have to publish a class that looks like IndexReader --
> Doc_Freq, Lexicon, Doc_Max, etc -- in order to support the segmented inverted
> index engine.  It should probably be named "IndexReader".  :)  Do you think
> there ought to be a more generic super class above that one?

How about stripping down IndexReader and turning SegReader into a bucket of

Remove Lexicon(), PostingList(), Deletions(), Doc_Freq(), and most other
methods from IndexReader.  The DocReader, PostingsReader, DeletionsReader,
LexReader, and TermVectorsReader members of SegReader, which implement those
methods, all get the heave-ho.

What's left:

   Segment  *segment;
   Snapshot *snapshot;
   Schema   *schema;
   Folder   *folder;
   Hash     *components; /* <------ !!! */

Here's the new sequence for fetching a PostingList:

  PostingsReader *postings_reader 
    = SegReader_Fetch_Component(seg_reader, "postings");
  PostingList *plist = postings_reader
    ? PostReader_Posting_List(postings_reader, field, term)
    : NULL;

Doc_Freq() would also require an extra step:

  LexReader *lex_reader
    = SegReader_Fetch_Component(seg_reader, "lexicon");
  u32_t doc_freq = lex_reader
    ? LexReader_Doc_Freq(lex_reader, field, term)
    : 0;
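
To make the lookup-then-fallback pattern above concrete, here is a minimal,
self-contained sketch.  Everything in it is hypothetical: the sketch_ names,
the SketchSegReader and SketchLexReader types, and the fixed-size linear
table standing in for the Hash.  It is not the actual SegReader code, only
an illustration of fetching a component by name and degrading gracefully
when that component is absent.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_COMPONENTS 8

/* Hypothetical stand-in for a lexicon component; it just stores one
 * canned doc_freq value instead of reading an index. */
typedef struct SketchLexReader {
    unsigned doc_freq;
} SketchLexReader;

typedef struct SketchComponent {
    const char *name;
    void       *reader;
} SketchComponent;

/* Hypothetical stripped-down SegReader: a bucket of named components.
 * A small linear table is used here in place of a real hash. */
typedef struct SketchSegReader {
    SketchComponent components[SKETCH_MAX_COMPONENTS];
    size_t          num_components;
} SketchSegReader;

/* Add a component under `name`.  Returns 1 on success, 0 if the table is
 * full.  If a name is registered twice, lookups find the first entry. */
int
sketch_register_component(SketchSegReader *self, const char *name,
                          void *reader) {
    assert(self != NULL && name != NULL);
    if (self->num_components >= SKETCH_MAX_COMPONENTS) { return 0; }
    self->components[self->num_components].name   = name;
    self->components[self->num_components].reader = reader;
    self->num_components++;
    return 1;
}

/* Return the component registered under `name`, or NULL if none. */
void*
sketch_fetch_component(const SketchSegReader *self, const char *name) {
    size_t i;
    assert(self != NULL && name != NULL);
    for (i = 0; i < self->num_components; i++) {
        if (strcmp(self->components[i].name, name) == 0) {
            return self->components[i].reader;
        }
    }
    return NULL;
}

/* The "extra step": fetch the lexicon component first, and report a
 * doc_freq of 0 when the segment has no such component. */
unsigned
sketch_doc_freq(const SketchSegReader *self) {
    SketchLexReader *lex_reader = sketch_fetch_component(self, "lexicon");
    return lex_reader ? lex_reader->doc_freq : 0;
}
```

A caller would zero num_components, register whatever components the segment
actually has, and then call sketch_doc_freq(); a reader with no "lexicon"
entry simply reports 0 rather than failing.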

Going this direction would make IndexReader less user-friendly but more

I think we're going to need something along the lines of Fetch_Component()
anyway.  I can't think of another way to add arbitrary components.

Say the user has supplied an RTreeQuery to Searcher:

  Hits *hits = Searcher_Hits(searcher, rtree_query);

At some point, the RTreeQuery is going to need data supplied by an RTreeReader
so that it can compile an RTreeScorer.  However, IndexReader doesn't define an
interface for accessing R-tree data.  

In theory, the LucyX::RTree package could supply a
LucyX::RTree::RTreeIndexReader class which extends IndexReader -- but that's
cumbersome and won't work well if you want to plug in more than one component.
In contrast, the hash-based Fetch_Component() scheme allows us to extend
SegReader without subclassing anything other than Architecture.
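
As a rough illustration of that last point, the sketch below models an
"architecture" as a plain function that registers components on a reader.
This is an assumption made for the sake of a small example, not the actual
Architecture API: the plug_ names, PlugSegReader, PlugLexReader, and
PlugRTreeReader are all hypothetical.  The extended architecture reuses the
core one and adds an "rtree" component, and the reader type itself never
changes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PLUG_MAX_COMPONENTS 8

/* Hypothetical reader: parallel arrays of names and component pointers. */
typedef struct PlugSegReader {
    const char *names[PLUG_MAX_COMPONENTS];
    void       *readers[PLUG_MAX_COMPONENTS];
    size_t      count;
} PlugSegReader;

/* An "architecture" is modeled as a function that populates a reader. */
typedef void (*PlugArchitecture)(PlugSegReader *reader);

/* Hypothetical components; their contents do not matter for the demo. */
typedef struct PlugLexReader   { int placeholder; } PlugLexReader;
typedef struct PlugRTreeReader { int placeholder; } PlugRTreeReader;

static PlugLexReader   plug_core_lexicon;
static PlugRTreeReader plug_rtree;

/* Register a component under `name`. */
void
plug_add(PlugSegReader *reader, const char *name, void *component) {
    assert(reader != NULL && reader->count < PLUG_MAX_COMPONENTS);
    reader->names[reader->count]   = name;
    reader->readers[reader->count] = component;
    reader->count++;
}

/* Return the component registered under `name`, or NULL if none. */
void*
plug_fetch(const PlugSegReader *reader, const char *name) {
    size_t i;
    for (i = 0; i < reader->count; i++) {
        if (strcmp(reader->names[i], name) == 0) {
            return reader->readers[i];
        }
    }
    return NULL;
}

/* Core architecture: registers only the stock component. */
void
plug_core_architecture(PlugSegReader *reader) {
    plug_add(reader, "lexicon", &plug_core_lexicon);
}

/* Extended architecture: keeps the core components, then adds "rtree".
 * This plays the role of the one subclass the scheme requires. */
void
plug_rtree_architecture(PlugSegReader *reader) {
    plug_core_architecture(reader);
    plug_add(reader, "rtree", &plug_rtree);
}

/* Open a reader by letting the chosen architecture fill it in. */
void
plug_open(PlugSegReader *reader, PlugArchitecture arch) {
    reader->count = 0;
    arch(reader);
}
```

A reader opened with plug_core_architecture returns NULL for "rtree", while
one opened with plug_rtree_architecture returns both components, so a query
that needs R-tree data can check for the component and bail out if the index
was not built with it.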

Strictly speaking, we don't *have* to strip down IndexReader if we add
Fetch_Component().  Having core classes interact with IndexReader exclusively
through the Fetch_Component interface wouldn't be a problem, but that
interface is awkward for users who might want to interact with IndexReader
directly.  But how many people really need that, and what's their profile?
The majority of users will just go through the Searcher interface; only
power-users need IndexReader, and they should be able to handle going through
Fetch_Component() to deal with the plugins directly.

There are details to work out regarding conflating data from multiple
segments, but that's the gist.

Marvin Humphrey
