lucene-dev mailing list archives

From Marvin Humphrey
Subject Re: bytecount as prefix
Date Sat, 06 May 2006 20:17:35 GMT

No progress yet.

I think my next move is to do what I did when trying to get KinoSearch
to write Lucene-compatible indexes: 

1) Generate an optimized split-file format Lucene index from a 
   pathological test corpus.
2) Hack KinoSearch so that it ought to produce an index which is 
   identical to the Lucene-generated index except for the segments file
   (which has a timestamp).  This involves overriding
   the segment-naming routine, setting the termIndexInterval to 128, and
   thwarting the attempts of CompoundFileWriter to merge the index
   files.  Also, it's tricky to get multiple fields to match up
   number-wise, so I generally just use one...  Then generate the
   KinoSearch index.
3) Run a script which performs a byte-by-byte comparison of each index
   file and reports the first byte where something differs.
4) Dive in with a hexdumper.  Calculate VInts mentally.  Memorize the
   data formats for each index file.  Think like a TermInfosWriter.
   Twiddle the test corpus so that it produces the smallest possible 
   index while still exposing differences.
5) Consume many aspirin.
6) Tweak 'n' repeat until the indexes are identical.
7) Tweak 'n' repeat until identical searches produce identical results.
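
The comparison in step 3 could be sketched roughly as follows. This is an illustrative Python version, not the script actually used; the function names, the chunk size, and the choice to skip a file literally named "segments" are assumptions made for the example.

```python
import os

def first_difference(path_a, path_b, chunk_size=65536):
    """Return the offset of the first differing byte, or None if identical."""
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        return offset + i
                # One file is a strict prefix of the other: the first
                # difference is where the shorter file ends.
                return offset + min(len(a), len(b))
            if not a:
                return None  # both files hit EOF with no difference
            offset += len(a)

def compare_indexes(dir_a, dir_b, skip=("segments",)):
    """Map each index file name to its first differing offset.

    Files in `skip` are ignored (assumed here: the timestamped segments
    file). A file present in only one directory is reported as "missing".
    """
    report = {}
    names = sorted(set(os.listdir(dir_a)) | set(os.listdir(dir_b)))
    for name in names:
        if name in skip:
            continue
        pa, pb = os.path.join(dir_a, name), os.path.join(dir_b, name)
        if not (os.path.isfile(pa) and os.path.isfile(pb)):
            report[name] = "missing"
        else:
            report[name] = first_difference(pa, pb)
    return report
```

Calling compare_indexes() on the two index directories returns None for every file that matches, so the remaining entries give the offsets to open in the hexdumper.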

The only differences will be that this time KinoSearch will provide the
authoritative index, since it already uses bytecounts (I'll use version
0.05, since the current version 0.10 has changes to .fdt) and that I
won't be able to use Luke to verify the search results.
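
For readers following along with a hexdump, here is a minimal Python sketch of the VInt encoding (seven data bits per byte, low-order group first, high bit set when more bytes follow) and of a string written with a byte count as its prefix. The helper names are hypothetical, and the comparison with a count of UTF-16 code units is an assumption about what the alternative prefix in this thread measures; it is not taken from either code base.

```python
def encode_vint(n):
    """Encode a non-negative integer as a VInt."""
    if n < 0:
        raise ValueError("VInt cannot encode negative values")
    out = bytearray()
    while n > 0x7F:
        out.append((n & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        n >>= 7
    out.append(n)                      # final byte, continuation bit clear
    return bytes(out)

def decode_vint(buf, pos=0):
    """Decode a VInt from buf starting at pos; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        if not (b & 0x80):
            return value, pos
        shift += 7

def write_string_bytecount(s):
    """Serialize s as a VInt byte count followed by its UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_vint(len(data)) + data

def utf16_units(s):
    """Number of UTF-16 code units in s (what a char-count prefix would hold)."""
    return len(s.encode("utf-16-le")) // 2

text = "caf\u00e9"                     # four characters, five UTF-8 bytes
record = write_string_bytecount(text)
length, start = decode_vint(record)
print(length, utf16_units(text))       # prints "5 4"
print(record[start:start + length].decode("utf-8") == text)  # prints "True"
```

With a byte-count prefix a reader can skip a string by advancing `length` bytes without decoding it; with a character-count prefix the two numbers diverge as soon as the text contains non-ASCII characters, which is the kind of mismatch a byte-by-byte comparison exposes.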

Maybe some version of the pathological test corpus and the sample index
should be provided as a help for implementers.

Marvin Humphrey
Rectangular Research
