lucene-dev mailing list archives

From Michael McCandless
Subject Re: Pluggable IndexReader (was 2.9/3.0 plan & Java 1.5)
Date Mon, 15 Dec 2008 12:04:08 GMT

Marvin Humphrey wrote:

> I have a bunch of file format changes to push through, and I'm hoping to
> implement them using pluggable modules.  For instance, I'd like to be able to
> swap out bit-vector-based deletions for tombstone-based deletions, just by
> overriding a method or two.

I think Lucene should also aim for this (swappability of index codecs).
LUCENE-1458 is a step towards that, specifically for postings.  The
tombstone approach for deletions sounds compelling too, though first
we need to fix the API to switch to iterator only and stop calling
in document(docID).
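
To make the "iterator only" idea concrete, here is a minimal sketch of how
bit-vector and tombstone deletions could sit behind one interface. Every name
in it (DeletedDocs, BitVectorDeletions, TombstoneDeletions, LiveDocs) is
hypothetical and invented for illustration; none of this is an existing
Lucene API, and a real tombstone implementation would read from files rather
than an in-memory set.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.TreeSet;

// Hypothetical iterator-only view of deleted docs (not a Lucene class).
interface DeletedDocs {
    /** Returns deleted doc IDs in increasing order. */
    Iterator<Integer> iterator();
}

// Deletions backed by a bit vector, exposed only through an iterator.
final class BitVectorDeletions implements DeletedDocs {
    private final BitSet bits = new BitSet();

    void delete(int docID) { bits.set(docID); }

    public Iterator<Integer> iterator() {
        return new Iterator<Integer>() {
            private int next = bits.nextSetBit(0);
            public boolean hasNext() { return next >= 0; }
            public Integer next() {
                if (next < 0) throw new NoSuchElementException();
                int current = next;
                next = bits.nextSetBit(current + 1);
                return current;
            }
        };
    }
}

// Deletions backed by a sorted set of tombstones (in memory, for the sketch).
final class TombstoneDeletions implements DeletedDocs {
    private final TreeSet<Integer> tombstones = new TreeSet<>();

    void delete(int docID) { tombstones.add(docID); }

    public Iterator<Integer> iterator() { return tombstones.iterator(); }
}

final class LiveDocs {
    // Walks doc IDs and the deletion iterator in lockstep; there is no
    // random-access "is this doc deleted?" call anywhere.
    static List<Integer> liveDocs(int maxDoc, DeletedDocs deleted) {
        List<Integer> live = new ArrayList<>();
        Iterator<Integer> it = deleted.iterator();
        int nextDeleted = it.hasNext() ? it.next() : Integer.MAX_VALUE;
        for (int doc = 0; doc < maxDoc; doc++) {
            if (doc == nextDeleted) {
                nextDeleted = it.hasNext() ? it.next() : Integer.MAX_VALUE;
            } else {
                live.add(doc);
            }
        }
        return live;
    }
}
```

Because the consumer only ever advances an iterator, either implementation
can be swapped in without touching the code that skips deleted docs.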

PFOR and pulsing are other recent examples where, if we had swappability,
people could more easily explore.

> Jason Rutherglen:
>> Decoupling IndexReader for 3.0 would be great.  This includes making
>> public SegmentReader, MultiSegmentReader.
> I definitely think that IndexReader can and should be made more pluggable.  Is
> exposing per-segment sub-readers a definite win, though?  Does it make sense
> to leave open the door to index components which don't operate on segments?
> Or even to eliminate SegmentReader entirely and have sub-components of
> IndexReader manage collation?
> I've been thinking about this with regard to tombstone-based deletions, where
> you can't know everything about a segment unless you've opened up other
> segments.

These are good points: it may be exposing too much if we fully expose
SegmentReader now, since some components (deletion tombstones) may
want to skip that API and operate directly on lower-level files.
Though, with LUCENE-1483 we are moving to executing scoring &
collection per-segment.
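
As a rough illustration of what per-segment collection means, the sketch
below visits one segment at a time and hands the collector a doc base so it
can map segment-local doc IDs back to index-wide ones. The classes (Segment,
TopDocCollector, PerSegmentSearch) and their methods are hypothetical
stand-ins for this example, not the API from LUCENE-1483.

```java
import java.util.List;

// Hypothetical segment: local doc IDs 0..n-1, one precomputed score each.
final class Segment {
    final float[] scores;
    Segment(float... scores) { this.scores = scores; }
}

// Hypothetical collector that keeps only the single best-scoring doc.
final class TopDocCollector {
    private int docBase;
    private int bestDoc = -1;
    private float bestScore = Float.NEGATIVE_INFINITY;

    // Called once before each segment is visited.
    void setNextSegment(int docBase) { this.docBase = docBase; }

    // Called with a segment-local doc ID; rebased to an index-wide ID here.
    void collect(int localDoc, float score) {
        if (score > bestScore) {
            bestScore = score;
            bestDoc = docBase + localDoc;
        }
    }

    int bestDoc() { return bestDoc; }
}

final class PerSegmentSearch {
    static int search(List<Segment> segments) {
        TopDocCollector collector = new TopDocCollector();
        int docBase = 0;
        for (Segment segment : segments) {
            collector.setNextSegment(docBase);
            for (int doc = 0; doc < segment.scores.length; doc++) {
                collector.collect(doc, segment.scores[doc]);
            }
            docBase += segment.scores.length;
        }
        return collector.bestDoc();
    }
}
```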

>> A constructor like new SegmentReader(TermsDictionary termDictionary,
>> TermPostings termPostings, ColumnStrideFields csd, DocIdBitSet deletedDocs);
> You end up with a proliferation of constructors that way.  Term vectors?
> Arbitrary auxiliary components such as an R-tree component supporting
> geographic search?
> My original proposal to clean this up involved an "IndexComponent" class.
> However, when I started implementing it, I ended up with a slew of new classes
> with only two factory methods each.
> We could possibly move those factory methods up into Schema, but I'm
> reluctant to dirty it up, since it's a major public class in KS (as I
> anticipate it will be in Lucy) and major public classes should be as simple
> as possible.
> So, how about an IndexArchitecture or IndexPlan class?
>  class MyArchitecture extends IndexArchitecture {
>    public PostingsWriter PostingsWriter() {
>      return new PForDeltaPostingsWriter();
>    }
>    public PostingsReader PostingsReader() {
>      return new PForDeltaPostingsReader();
>    }
>    public DeletionsWriter DeletionsWriter() {
>      return new TombstoneWriter();
>    }
>    public DeletionsReader DeletionsReader() {
>      return new TombstoneReader();
>    }
>  }
> Lucene:
>  IndexWriter writer = new IndexWriter("/path/to/index",
>    new StandardAnalyzer(), new MyArchitecture());
> Lucy with Java bindings:
>  class MySchema extends Schema {
>    public MySchema() {
>      initField("title", "text");
>      initField("content", "text");
>    }
>    public IndexArchitecture indexArchitecture() {
>      return new MyArchitecture();
>    }
>    public Analyzer analyzer() {
>      return new PolyAnalyzer("en");
>    }
>  }
>  IndexWriter writer = new IndexWriter("/path/to/index"));

I think this is a reasonable approach.  I might name it IndexCodec(s)
though, and I agree conceptually it's orthogonal to a "schema".
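
A minimal sketch of that shape, assuming a base class with overridable
factory methods and sensible defaults, might look like the following. The
names (IndexCodec, DefaultCodec, TombstoneCodec, SketchWriter) and the
one-method PostingsWriter/DeletionsWriter interfaces are placeholders made up
for this example; they are not Lucene or Lucy classes.

```java
// Hypothetical component interfaces; format() just names the on-disk format.
interface PostingsWriter { String format(); }
interface DeletionsWriter { String format(); }

// Hypothetical codec: one factory method per swappable index component.
abstract class IndexCodec {
    PostingsWriter postingsWriter() { return () -> "standard-postings"; }
    DeletionsWriter deletionsWriter() { return () -> "bit-vector"; }
}

// Uses every default.
final class DefaultCodec extends IndexCodec {}

// Swaps in tombstone deletions by overriding a single method.
final class TombstoneCodec extends IndexCodec {
    @Override
    DeletionsWriter deletionsWriter() { return () -> "tombstone"; }
}

// Stand-in for a writer that takes its components from the codec.
final class SketchWriter {
    private final IndexCodec codec;

    SketchWriter(IndexCodec codec) { this.codec = codec; }

    String describe() {
        return codec.postingsWriter().format() + "+" + codec.deletionsWriter().format();
    }
}
```

The point of the defaults is that a new component type adds one method to
the base class instead of another constructor argument everywhere.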

>> Decouple rollback, commit, IndexDeletionPolicy from DirectoryIndexReader
>> into a class like SegmentsVersionSystem which could act as the controller
>> for reopen types of methods.  There could be a SegmentVersionSystem that
>> manages the versioning of a single segment.
> I like it. :)
> Sometimes you want to change up the merge policy for different writers against
> the same index.  How does that fit into your plan?
> My thought is that merge-policies would be application-specific rather than
> index-specific.

This one I'm a little hazy on.  It would be nice to have a single
source for IndexWriter & IndexReader-acting-as-writer to share this
logic, but then we are [very, very slowly] migrating towards
IndexWriter being the only thing that writes to an index so it seems
like eventually it's OK if this logic is managed via the IndexWriter.

