lucene-java-user mailing list archives

From Alex vB <>
Subject Re: Implementing indexing of Versioned Document Collections
Date Tue, 16 Nov 2010 11:35:47 GMT

Hello Pulkit,

thank you for your answer, and excuse my late reply. I am currently
working on the payload stuff and have implemented my own Analyzer and
TokenFilter for adding custom payloads. As far as I understand, I can add a
payload for every term occurrence and write it into the posting list. My
posting list now looks like this:

car -> DocID 1 [Payload 1], DocID 2 [Payload 2], ..., DocID N [Payload N]
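
For reference, a stripped-down sketch of what the filter does (Lucene
3.0-style API; VersionPayloadFilter and buildPayloadFor() are made-up
names, the latter standing in for my BitSet-to-byte[] lookup):

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.index.Payload;

// Attaches a payload to every token that passes through the filter.
public final class VersionPayloadFilter extends TokenFilter {
  private final TermAttribute termAtt = addAttribute(TermAttribute.class);
  private final PayloadAttribute payAtt = addAttribute(PayloadAttribute.class);

  public VersionPayloadFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // Placeholder for my own lookup from term to version BitSet bytes.
    byte[] bytes = buildPayloadFor(termAtt.term());
    payAtt.setPayload(new Payload(bytes));
    return true;
  }

  private byte[] buildPayloadFor(String term) {
    return new byte[] { 1 }; // placeholder only
  }
}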

Each payload is a BitSet derived from the versions of a document. I must
admit that the index is getting really big at the moment, because each
payload adds around 8 to 16 bytes, so I still have to find a good
compression for the bit vectors.
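
To make the payload bytes concrete, a minimal serialization could look
like this (purely illustrative; it only avoids trailing zero bytes, so it
is clearly not yet the compression I am looking for):

import java.util.BitSet;

// Pack a version BitSet into the smallest byte[] that still holds every
// set bit. BitSet.length() is the index of the highest set bit plus one,
// so trailing zero bytes are never emitted.
static byte[] toPayloadBytes(BitSet versions) {
  byte[] out = new byte[(versions.length() + 7) / 8];
  for (int i = versions.nextSetBit(0); i >= 0; i = versions.nextSetBit(i + 1)) {
    out[i >> 3] |= (byte) (1 << (i & 7));
  }
  return out;
}
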
Furthermore, I always get the error
org.apache.lucene.index.CorruptIndexException: checksum mismatch in segments
file when I use my own Analyzer. If I comment out the checksum test,
everything works fine, and even Luke does not report an error. Any ideas?

Another problem is the bit vector creation during tokenization. I run
through all versions during the tokenizing step to build my bit vectors
(stored in a HashMap), so they are only completely built after the last
field has been analyzed (I added every Wikipedia version as a separate
field). Therefore I would need to attach the payloads after the tokenizing
step. Is this possible? And what happens if I add a payload for a term and
later add another payload for the same term: is it overwritten or appended?
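
One way I could imagine doing this is a genuine two-pass approach, sketched
below with made-up names: collectVersion() would run once per version
before anything is indexed, and the payload filter in a second pass would
then only look the finished BitSets up instead of computing them on the fly.

import java.io.IOException;
import java.io.StringReader;
import java.util.BitSet;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// Pass 1 (sketch): consume the tokens of one version only to fill the
// term -> BitSet map; nothing is written to the index yet.
static void collectVersion(Analyzer analyzer, Map<String, BitSet> bitsByTerm,
                           String versionText, int versionNumber) throws IOException {
  TokenStream ts = analyzer.tokenStream("content", new StringReader(versionText));
  TermAttribute termAtt = ts.addAttribute(TermAttribute.class);
  ts.reset();
  while (ts.incrementToken()) {
    BitSet bits = bitsByTerm.get(termAtt.term());
    if (bits == null) {
      bits = new BitSet();
      bitsByTerm.put(termAtt.term(), bits);
    }
    bits.set(versionNumber);
  }
  ts.close();
}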
