lucene-dev mailing list archives

From Doug Cutting <>
Subject Re: Payloads
Date Wed, 20 Dec 2006 18:43:03 GMT
Michael Busch wrote:
 > Some weeks ago I started working on an improved design which I would
 > like to propose now. The new design simplifies the API extensions (the
 > Field API remains unchanged) and uses less disk space in most use cases.
 > Now there are only two classes that get new methods:
 > - Token.setPayload()
 >  Use this method to add arbitrary metadata to a Token in the form of a
 > byte[] array.
 > - TermPositions.getPayload()
 >  Use this method to retrieve the payload of a term occurrence.
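To make the proposed API pair concrete, here is a tiny standalone mock of the Token side. This is not the actual patch -- the class body, offsets, and accessor names below are invented for illustration; only the setPayload()/getPayload() shape comes from the proposal:

```java
// Hypothetical, self-contained mock of the proposed Token API.
// Not Lucene's actual class; it only shows the shape of the new methods.
class Token {
    private final String termText;
    private final int startOffset, endOffset;
    private byte[] payload;  // proposed: arbitrary per-occurrence metadata

    Token(String termText, int startOffset, int endOffset) {
        this.termText = termText;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }

    // proposed addition: attach metadata while tokenizing
    public void setPayload(byte[] payload) { this.payload = payload; }

    // at search time the same bytes would come back via
    // TermPositions.getPayload() for each term occurrence
    public byte[] getPayload() { return payload; }

    public String termText() { return termText; }
    public int startOffset() { return startOffset; }
    public int endOffset() { return endOffset; }
}
```

A token filter in an analysis chain would call setPayload() per token; the indexer would then carry those bytes into the postings.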


This sounds like very good work.  The back-compatibility of this 
approach is great.  But we should also consider this in the broader 
context of index-format flexibility.

Three general approaches have been proposed.  They are not exclusive.

1. Make the index format extensible by adding user-implementable reader 
and writer interfaces for postings.

2. Add a richer set of standard index formats, including things like 
compressed fields, no-positions, per-position weights, etc.

3. Provide hooks for including arbitrary binary data.

Your proposal is of type (3).  LUCENE-662 is a (1).  Approaches of type 
(2) are most friendly to non-Java implementations, since the semantics 
of each variation are well-defined.
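For instance, a type (2) per-position-weight format would pin the byte-level semantics down exactly, e.g. a one-byte linear quantization of a weight in [0,1]. The encoding below is purely illustrative -- it is not an existing Lucene format:

```java
// Sketch: a per-position weight squeezed into a one-byte posting payload,
// the kind of well-defined semantics a type (2) standard format would fix.
// The linear quantization here is invented for illustration only.
class WeightPayload {
    // Map a weight in [0, 1] to a single byte (clamping out-of-range input).
    public static byte encode(float weight) {
        if (weight < 0f) weight = 0f;
        if (weight > 1f) weight = 1f;
        return (byte) Math.round(weight * 255f);
    }

    // Recover an approximate weight from the stored byte.
    public static float decode(byte b) {
        return (b & 0xFF) / 255f;
    }
}
```

Because every implementation, Java or not, would agree on exactly these bytes, a non-Java port could read the postings without running any user code.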

I don't see a reason not to pursue all three, but in a coordinated 
manner.  In particular, we don't want to add a feature of type (3) that 
would make it harder to add type (1) APIs.  It would thus be best if we 
had a rough specification of type (1) and type (2).  A proposal of type 
(2) is at:

But I'm not sure that we yet have any proposed designs for an extensible 
posting API.  (Is anyone aware of one?)  This payload proposal can 
probably be easily incorporated into such a design, but I would have 
more confidence if we had one.  I guess I should attempt one!

Here's a very rough, sketchy, first draft of a type (1) proposal.


interface PostingFormat {
   PostingInverter getInverter(FieldInfo, Segment, Directory);
   PostingReader getReader(FieldInfo, Segment, Directory);
   PostingWriter getWriter(FieldInfo, Segment, Directory);
}

interface PostingPointer {} ???

interface DictionaryFormat {
   DictionaryWriter getWriter(FieldInfo, Segment, Directory);
   DictionaryReader getReader(FieldInfo, Segment, Directory);
}

IndexWriter#addDocument(Document doc)
   loop over doc.fields
     call PostingFormat#getInverter(FieldInfo, Segment, Directory)
       to create a PostingInverter
     if field is analyzed
       call Analyzer#tokenStream() to get TokenStream
       loop over tokens
         PostingInverter#collectToken(Token, Field);

   call DictionaryFormat#getWriter(FieldInfo, Segment, Directory)
     to create a DictionaryWriter
   Iterator<Term> terms = PostingInverter#getTerms();
   loop over terms
     PostingPointer p = PostingInverter#getPointer();
     DictionaryWriter#addTerm(term, p);

SegmentMerger#merge()
   call DictionaryFormat#getWriter(FieldInfo, Segment, Directory)
     to create a DictionaryWriter for the merged segment
   loop over fields
     call PostingFormat#getWriter(FieldInfo, Segment, Directory)
       to create a PostingWriter
     loop over segments
       call DictionaryFormat#getReader(FieldInfo, Segment, Directory)
         to create a DictionaryReader
       call PostingFormat#getReader(FieldInfo, Segment, Directory)
         to create a PostingReader
       loop over the DictionaryReader's terms
         PostingPointer p = PostingWriter#getPointer();
         DictionaryWriter#addTerm(Term, p);
         loop over docs
           int doc = PostingReader#readPostings();
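To sanity-check the shape of the reader/writer pair, here is a toy in-memory rendering. The List-backed "file", the pointer-as-offset, and the method bodies are all invented for illustration; none of this is real Lucene code, it only demonstrates that the pointer/seek contract is implementable:

```java
import java.util.ArrayList;
import java.util.List;

// Toy in-memory rendering of the PostingWriter/PostingReader sketch.
// A "posting" is stored as an int[]: the doc number followed by positions.
class InMemoryPostings {
    // Opaque handle a DictionaryWriter would store next to each term.
    public static class PostingPointer {
        final int offset;
        PostingPointer(int offset) { this.offset = offset; }
    }

    public static class PostingWriter {
        private final List<int[]> file = new ArrayList<>();

        // Marks where the postings for the next term will begin.
        public PostingPointer getPointer() {
            return new PostingPointer(file.size());
        }

        public void writePosting(int doc, int[] positions) {
            int[] record = new int[positions.length + 1];
            record[0] = doc;
            System.arraycopy(positions, 0, record, 1, positions.length);
            file.add(record);
        }

        public PostingReader openReader() { return new PostingReader(file); }
    }

    public static class PostingReader {
        private final List<int[]> file;
        private int next;

        PostingReader(List<int[]> file) { this.file = file; }

        // Position the reader at the postings a dictionary pointer refers to.
        public void seek(PostingPointer p) { next = p.offset; }

        // Returns the next record: doc number, then its positions.
        public int[] readPosting() { return file.get(next++); }
    }
}
```

A user-supplied format would swap these bodies for real file I/O; the point is only that the dictionary never needs to understand the pointer it stores.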

So the question is, does something like this conflict with your 
proposal?  Should Term and/or Token be extensible?  If so, what should 
their interfaces look like?

