lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <>
Subject Re: Modification of positional information encoding
Date Mon, 13 Oct 2008 14:53:57 GMT

Renaud Delbru wrote:

> Hi,
> We are trying to modify the positional encoding of a term occurrence  
> for experimentation purposes. One solution we adopt is to use  
> payloads to sotre our own positional information encoding, but with  
> this solution, it becomes difficult to measure the increase or  
> decrease of index size. It is why we would like to directly change  
> the positional encoding.
> I have seen that Michael McCandless recently refactored the  
> DocumentsWriter into a flexible indexer chain (see LUCENE-1301). By  
> analysing the code, I have noticed that only two classes should be  
> modified when writing a document:
> When adding a document with IndexWriter.addDocument
> - FieldInvertState (increment and store positions)
> - FreqProxTermsWriterPerField.writeProx
> When flushing documents:
> - FreqProxTermsWriterPerField.appendPostings
> I have noticed that only one class should be modified when reading a  
> document:
> - SegmentTermPositions.nextPositions, and
> - SegmentTermPositions.readDeltaPositions
> Could a member of the Lucene team approve my modifications ? Do I  
> forget to modify some classes ?

This looks right, though you would also need to modify SegmentMerger  
to read & write your new format when merging segments.

Another thing you could do is grep for "omitTf" which should touch  
exactly the same places you need to touch.

It'd be awesome to get to the point where this read & write logic is  
captured in a single "codec" that's cleanly shared in all these places  
("flexible indexing") but we are not quit there yet...

Out of curiosity, what change are you planning on trying?

> Another question, since the lucene core classes are kind of close,  
> what is the best way to implement these modifications ? Make a  
> branch of lucene, and add my new classes to the lucene package  
> org.apache.lucene.index ? Or do a more elegant solution is possible ?

For starters (to try things out) I would just make local modifications  
with a lucene source checkout (via svn).

Also, this issue was just opened:

which would make it possible for classes in the same package  
(oal.index) to use their own indexing chain.  With that fix, if you  
make your own classes in oal.index package, and perhaps subclass the  
above classes, you could then create your own indexing chain for  
indexing?  If you take that approach, please report back so we can  
learn how to improve Lucene for these very advanced customizations!


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message