lucene-dev mailing list archives

From Marvin Humphrey <>
Subject Re: storing term text internally as byte array and bytecount as prefix, etc.
Date Wed, 03 May 2006 23:48:03 GMT

On May 1, 2006, at 7:33 PM, Chuck Williams wrote:
 > Could someone summarize succinctly why it is considered a
 > major issue that Lucene uses the Java modified UTF-8
 > encoding within its index rather than the standard UTF-8
 > encoding.  Is the only concern compatibility with index
 > formats in other Lucene variants?

I originally raised a stink about "Modified UTF-8" because at the  
time I was embroiled in an effort to implement the Lucene file  
format, and the
Lucene File Formats document claimed to be using "UTF-8", straight  
up.  It was most unpleasant to discover that if my app read legal  
UTF-8, Lucene-generated indexes would cause it to crash from time to  
time, and that if it wrote legal UTF-8, the indexes it generated  
would cause Lucene to crash from time to time.
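The divergence is easy to reproduce from Java itself: DataOutputStream.writeUTF emits the "Modified UTF-8" that Lucene inherited, while String.getBytes with the UTF-8 charset emits the standard encoding. For any supplementary-plane character the two disagree -- standard UTF-8 uses one 4-byte sequence, while Modified UTF-8 encodes each half of the surrogate pair as a separate 3-byte sequence, which a conforming UTF-8 decoder must reject. A minimal sketch (class name is mine):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class ModifiedUtf8Demo {
    public static void main(String[] args) throws IOException {
        // U+1D11E MUSICAL SYMBOL G CLEF, a supplementary-plane character
        String clef = new String(Character.toChars(0x1D11E));

        // Standard UTF-8: one 4-byte sequence (F0 9D 84 9E)
        byte[] standard = clef.getBytes(StandardCharsets.UTF_8);

        // Modified UTF-8 (what DataOutputStream.writeUTF emits): the UTF-16
        // surrogate pair becomes two 3-byte sequences -- 6 bytes, after a
        // 2-byte length prefix
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        new DataOutputStream(baos).writeUTF(clef);
        byte[] modified = baos.toByteArray();

        System.out.println(standard.length);      // 4
        System.out.println(modified.length - 2);  // 6
    }
}
```

A strict UTF-8 reader handed those 6 bytes sees two illegal sequences in the surrogate range U+D800-U+DFFF, and a strict UTF-8 writer produces 4 bytes that a Modified-UTF-8 reader never expects -- hence the crashes in both directions.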

More problematic than the "Modified UTF-8" actually, is the  
definition of a Lucene String.   According to the File Formats  
document, "Lucene writes strings as a VInt representing the length,  
followed by the character data."  The word "length" is ambiguous in  
that context, and at first I took it to mean either length in Unicode  
code points or bytes.  It was a nasty shock to discover that it was  
actually Java chars.  Bizarre and painful contortions were suddenly  
required for encoding/decoding a term dictionary which would  
otherwise have been completely unnecessary.
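To make the ambiguity concrete: the three candidate readings of "length" disagree as soon as a supplementary-plane character appears in a term. A small sketch (class name is mine):

```java
import java.nio.charset.StandardCharsets;

public class StringLengthDemo {
    public static void main(String[] args) {
        // "a" followed by U+1D11E, a character outside the Basic
        // Multilingual Plane (stored as a surrogate pair in Java)
        String s = "a" + new String(Character.toChars(0x1D11E));

        int javaChars  = s.length();                       // UTF-16 code units
        int codePoints = s.codePointCount(0, s.length());  // Unicode code points
        int utf8Bytes  = s.getBytes(StandardCharsets.UTF_8).length;

        // Three different "lengths" for the same string
        System.out.println(javaChars + " " + codePoints + " " + utf8Bytes);  // 3 2 5
    }
}
```

A non-Java implementation that guessed "length" meant code points or bytes would write a VInt of 2 or 5 where Lucene expects 3 -- and every byte after that point in the stream is misinterpreted.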

I used to think that the Lucene file format might serve as "the TIFF  
of inverted indexes".  My perspective on this has changed.  Lucene's  
file format is just beastly difficult to implement from scratch, and  
anything short of full implementation guarantees occasional "Read  
past EOF" errors on interchange.  Personally, I would assess the file  
format as the secondary expression of a beautiful algorithmic  
design.  Ease of interchange and ease of implementation do not seem  
to have been primary design considerations -- which is perfectly  
reasonable, if true, but perhaps then it should not aspire to serve  
as a vehicle for interchange.  As was asserted in the recent thread  
on ACID compliance, the indexes produced by a full-text indexer are  
not meant to serve as primary document storage.  It's common to need  
to move a TIFF or a text file from system to system.  It's not common  
to need to move a derived index.

Compatibility has its advantages.  It was pretty nice to be able to  
browse KinoSearch-generated indexes using Luke, once I managed to  
achieve compatibility for all-ascii source material.  But holy crow,  
was it tough to debug those indexes.  No human-readable components.
No fixed block sizes.  No facilities for resyncing a stream once it
gets out of sync.  All that on top of the "Modified UTF-8" and the
String length problem.

At this point I think the suggestion of turning the File Formats  
document from an ostensible spec into a piece of ordinary  
documentation is a worthy one.  FWIW, I've pretty much given up on  
the idea of making KinoSearch and Lucene file-format-compatible.  In  
my weaker moments I imagine that I might sell the Lucene community on  
the changes that would be necessary.  Then I remember that many of  
you live in a world where "Modified UTF-8" isn't an abomination.  ;)

Marvin Humphrey
Rectangular Research

