lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Modification of positional information encoding
Date Wed, 15 Oct 2008 09:33:21 GMT

Renaud Delbru wrote:

> Hi Michael,
>
> Michael McCandless wrote:
>> Also, this issue was just opened:
>>
>>
>>   https://issues.apache.org/jira/browse/LUCENE-1419
>>
>> which would make it possible for classes in the same package  
>> (oal.index) to use their own indexing chain.  With that fix, if you  
>> make your own classes in oal.index package, and perhaps subclass  
>> the above classes, you could then create your own indexing chain  
>> for indexing?  If you take that approach, please report back so we  
>> can learn how to improve Lucene for these very advanced  
>> customizations!
>>
> My first impression is that, to customize the postings list, it would
> be handy to have an abstract FreqProxTermsWriter class that separates
> segment creation from the serialisation of term information.
> This class would implement the generic logic for flushing and
> appending postings, but would delegate to subclasses the way
> doc + freq and prox + payload info are written.
>
> A first idea would be to have the following abstract methods:
> - writeMinState: called by appendPostings; defines how to
> serialise one FreqProxFieldMergeState
> - writeDocFreq: called by writeMinState; defines how to
> serialise docs and freqs
> - writeProx: called by writeMinState; defines how to serialise
> positions and payloads
>
> I think the other parts of FreqProxTermsWriter can stay generic.
> What do you think?

I agree: let's decouple the "codec" (how to write terms/freq/prox)  
from the other mechanics in FreqProxTermsWriter.

I don't think FreqProxFieldMergeState should be visible to that codec,
though.  That class is used internally by FreqProxTermsWriter to
manage the multiple threads that have accumulated postings data.

I think the codec API could look something like this:

   newField(...)
     startTerm(...)
       startDocument(...)
         addPosition(...)
       endDocument(...)
     endTerm(...)

We would then make a codec that matches today's index file format, but
allow others (you) to swap in a new codec.  All of this would be
experimental & private to oal.index for starters.
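
To make the nesting above concrete, here is a rough Java sketch of what such a codec interface might look like. Everything in it is hypothetical: PostingsCodec, LoggingCodec, CodecDemo and all of the parameter lists are assumptions for illustration only (the "(...)" above leaves the arguments open), and none of these types exist in Lucene.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface; the method names follow the call nesting
// field > term > document > position. All parameters are assumptions.
interface PostingsCodec {
    void newField(String fieldName);
    void startTerm(String termText);
    void startDocument(int docID, int termFreq);
    void addPosition(int position, byte[] payload); // payload may be null
    void endDocument();
    void endTerm();
}

// Toy codec that records the event stream as strings. A real codec
// would serialise docs, freqs, positions and payloads to index files.
class LoggingCodec implements PostingsCodec {
    final List<String> events = new ArrayList<>();

    public void newField(String fieldName) { events.add("field:" + fieldName); }
    public void startTerm(String termText) { events.add("term:" + termText); }
    public void startDocument(int docID, int termFreq) {
        events.add("doc:" + docID + " freq:" + termFreq);
    }
    public void addPosition(int position, byte[] payload) {
        events.add("pos:" + position);
    }
    public void endDocument() { events.add("endDoc"); }
    public void endTerm() { events.add("endTerm"); }
}

// Hypothetical driver standing in for the generic flushing logic: it
// feeds one term that occurs twice in a single document.
class CodecDemo {
    static void feed(PostingsCodec codec) {
        codec.newField("body");
        codec.startTerm("lucene");
        codec.startDocument(7, 2);
        codec.addPosition(3, null);
        codec.addPosition(11, null);
        codec.endDocument();
        codec.endTerm();
    }
}
```

The point of the sketch is only that the driver never sees how the codec encodes anything, so a different PostingsCodec implementation could be swapped in without touching the flushing logic.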

Mike
