lucene-dev mailing list archives

From Marvin Humphrey <>
Subject Re: Hacking Luke for bytecount-based strings
Date Wed, 17 May 2006 17:49:28 GMT

On May 16, 2006, at 11:58 PM, Paul Elschot wrote:
> Try to invoke luke with a lucene jar of your choice on the
> classpath before luke itself:
> java -cp lucene-core-1.9-rc1-dev.jar:lukeall.jar org.getopt.luke.Luke

I tried this on an index built with KinoSearch 0.05, which pre-dates
the addition of term vectors to .fdt.  After working around a
SecurityException by putting the individual component jars on the
classpath rather than lukeall.jar, the results were clear: Luke
powered by the patched library worked; Luke powered by straight-up
Lucene did not.

The source material was stuff from Wikipedia, which contains a bunch  
of invalid UTF-8.  KinoSearch doesn't care about that, so it's in  
there in the index.  No problems.  :)
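As an aside (not from the original mail), here's a sketch of how one might test whether stored bytes are well-formed UTF-8 on a modern JVM, using a CharsetDecoder in REPORT mode; the class and method names are made up for illustration:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Returns true if the byte sequence is well-formed UTF-8.
    // REPORT makes the decoder throw instead of silently replacing
    // malformed input with U+FFFD.
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] good = "résumé".getBytes(StandardCharsets.UTF_8);
        // 0xC3 is a lead byte expecting a continuation byte; 0x28 isn't one.
        byte[] bad = {(byte) 0xC3, (byte) 0x28};
        System.out.println(isValidUtf8(good)); // true
        System.out.println(isValidUtf8(bad));  // false
    }
}
```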

What I'd like to do is augment my existing patch by making it
possible to specify a particular encoding, both for Lucene and Luke.
Searches will continue to work regardless, because the patched
TermBuffer compares raw bytes.  (A comparison based on
Term.compareTo() would likely fail, because raw bytes from another
encoding, decoded as if they were UTF-8, won't generally sort the
same way.)  That way, say, a Russian user who had built a KinoSearch
index using KOI8-R (assuming I revert the .fdt change) could specify
KOI8-R and have Luke display the correct characters.  Ideally, you'd
want to store the index's encoding in the index itself, but Lucene
doesn't have a place for that, so I need to patch both Luke and Lucene.
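To make the compareTo point concrete, here's a hedged sketch (this is not code from the patch; compareBytes is a hypothetical stand-in for a raw-byte comparison like the patched TermBuffer's) showing that for KOI8-R, raw byte order and Unicode order can disagree:

```java
import java.nio.charset.Charset;

public class ByteVsStringOrder {
    // Unsigned lexicographic comparison of raw bytes.
    public static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (cmp != 0) return cmp;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        Charset koi8r = Charset.forName("KOI8-R");
        // KOI8-R does not lay Cyrillic out alphabetically:
        // 'ц' encodes as 0xC3 and 'д' as 0xC4, so the raw bytes say
        // 'ц' < 'д', while Unicode code points say 'д' (U+0434) < 'ц' (U+0446).
        byte[] tse = "ц".getBytes(koi8r);
        byte[] de  = "д".getBytes(koi8r);
        System.out.println(compareBytes(tse, de) < 0); // true: byte order
        System.out.println("ц".compareTo("д") < 0);    // false: String order disagrees
    }
}
```

So a consistent raw-byte comparison keeps term ordering stable no matter what encoding the bytes happen to be in, which is why searches keep working.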

I wonder how Lucene would perform with my patch applied if the  
indexer were spec'd to use Latin1 rather than UTF-8...  patches to  
the segment merging apparatus would be required...

Marvin Humphrey
Rectangular Research
