lucene-dev mailing list archives

From Doug Cutting
Subject Re: storing term text internally as byte array and bytecount as prefix, etc.
Date Fri, 05 May 2006 15:15:05 GMT
Marvin Humphrey wrote:
> More problematic than the "Modified UTF-8" actually, is the definition 
> of a Lucene String.   According to the File Formats document, "Lucene 
> writes strings as a VInt representing the length, followed by the 
> character data."  The word "length" is ambiguous in that context, and at 
> first I took it to mean either length in Unicode code points or bytes.  
> It was a nasty shock to discover that it was actually Java chars.  
> Bizarre and painful contortions were suddenly required for 
> encoding/decoding a term dictionary which would otherwise have been 
> completely unnecessary.

Yes, this should be corrected.  The problem is that "length" refers to 
the length of the Java string, but that is not explicit.  Moreover, as 
you have pointed out, that is a bad choice for non-Java implementations.
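
To make the ambiguity concrete, here is a minimal Java sketch of the three things "length" could mean for one string. The class and method names are invented for illustration and are not part of Lucene; the sample term is "a" followed by U+1D11E, a character outside the 16-bit range.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not Lucene code: three candidate meanings of "length".
public class StringLengths {
    // Length in Java chars (UTF-16 code units): what String.length() returns.
    public static int charLength(String s) {
        return s.length();
    }

    // Length in Unicode code points.
    public static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    // Length in bytes when encoded as standard UTF-8.
    public static int utf8ByteLength(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // 'a' followed by U+1D11E, written as a surrogate pair.
        String term = "a\uD834\uDD1E";
        System.out.println("Java chars:  " + charLength(term));      // 3
        System.out.println("code points: " + codePointLength(term)); // 2
        System.out.println("UTF-8 bytes: " + utf8ByteLength(term));  // 5
    }
}
```

A reader in another language that sees the prefix 3 cannot tell how many bytes to consume without decoding the character data and counting UTF-16 code units as it goes.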

> Ease of 
> interchange and ease of implementation do not seem to have been primary 
> design considerations -- which is perfectly reasonable, if true, but 
> perhaps then it should not aspire to serve as a vehicle for 
> interchange.

The index format document was written years after Lucene was written, 
after Lucene had already been ported to other languages.  It seemed like 
a good idea to document what folks were porting.  Ease of interchange 
and implementation were not primary considerations when Lucene was 
developed.  That said, at the time Lucene was first written (1997), 
Unicode was only 16-bit and there was no discrepancy between Java's 
modified encoding and UTF-8.
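
A small sketch of where the two encodings part ways once characters beyond 16 bits exist. It uses the JDK's DataOutputStream.writeUTF as a stand-in for Java's modified UTF-8; that is an assumption made for illustration, not Lucene's own writer code, and the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class: compares modified UTF-8 with standard UTF-8.
public class ModifiedUtf8Demo {
    // Bytes produced by DataOutputStream.writeUTF, minus its 2-byte length prefix.
    public static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(s);
            out.flush();
            byte[] all = buf.toByteArray();
            return Arrays.copyOfRange(all, 2, all.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Bytes produced by the standard UTF-8 charset.
    public static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String bmp = "\u00E9";          // fits in 16 bits
        String beyond = "\uD834\uDD1E"; // U+1D11E, needs a surrogate pair

        // Same bytes for the 16-bit character.
        System.out.println(Arrays.equals(modifiedUtf8(bmp), standardUtf8(bmp))); // true

        // Modified UTF-8 encodes each surrogate separately (3 bytes each);
        // standard UTF-8 encodes the code point once (4 bytes).
        System.out.println(modifiedUtf8(beyond).length); // 6
        System.out.println(standardUtf8(beyond).length); // 4
    }
}
```

A decoder that expects standard UTF-8 will typically reject the six-byte form, since encoded surrogates are not valid there.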

> At this point I think the suggestion of turning the File Formats 
> document from an ostensible spec into a piece of ordinary documentation 
> is a worthy one.  FWIW, I've pretty much given up on the idea of making 
> KinoSearch and Lucene file-format-compatible.  In my weaker moments I 
> imagine that I might sell the Lucene community on the changes that would 
> be necessary.

Please do.  But suggestions without working patches are not always acted 
on.  Most of us are busy with other projects, and only advance Lucene 
when we have a need, or someone provides a patch.  Ideally we need to 
find someone who *needs* an index format that's easily interchangeable 
between Java and other languages to push this forward.

> Then I remember that many of you live in a world where 
> "Modified UTF-8" isn't an abomination.  ;)

Modified UTF-8 is not anyone's choice.  It's simply what's used by Java. 
  What are we supposed to do, picket Sun?  If we move to make Lucene's 
file format an interchange format, then we must clearly move beyond it.

