lucene-dev mailing list archives

From DM Smith
Subject Re: Lucene does NOT use UTF-8.
Date Tue, 30 Aug 2005 17:36:31 GMT

Ken Krugler wrote:

>> I think the VInt should be the number of bytes to be stored using the
>> UTF-8 encoding.
>> It is trivial to use the String methods identified before to do the
>> conversion. The String(char[]) allocates a new char array.
>> For performance, you can use the actual Charset encoding classes,
>> avoiding all of the lookups performed by the String class.
> Regardless of what underlying support is used, if you want to write 
> out the VInt value as UTF-8 bytes versus Java chars, the Java String 
> has to either be converted to UTF-8 in memory first, or pre-scanned. 
> The first is a memory hit, and the second is a performance hit. I 
> don't know the extent of either, but it's there.
> Note that since the VInt is a variable size, you can't write out the 
> bytes first and then fill in the correct value later.

Sure you can. Do a "tell" to get the position. Write any number. Write 
the text. Do another "tell" to note the position. Based on the 
difference between the two "tells", you have the length. Rewind to the 
first "tell" and write out the number. Then advance to the end.
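A rough sketch of that seek-back approach in Java, using RandomAccessFile, where getFilePointer() plays the role of "tell" and seek() does the rewind. Everything named here (BackpatchLength, writePaddedVInt, the 5-byte width) is made up for illustration and is not Lucene's API. It also assumes the placeholder is a VInt padded to a fixed 5 bytes, so the final value occupies exactly the space that was reserved; a normal variable-width VInt would not guarantee that.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

class BackpatchLength {
    // Assumption: always reserve 5 bytes, enough for any non-negative int as a VInt.
    static final int PADDED_VINT_BYTES = 5;

    // Writes value as a VInt padded to exactly 5 bytes: 7 data bits per byte,
    // continuation bit (0x80) set on every byte except the last.
    static void writePaddedVInt(RandomAccessFile out, int value) throws IOException {
        for (int i = 0; i < PADDED_VINT_BYTES - 1; i++) {
            out.writeByte((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.writeByte(value & 0x7F);
    }

    // Ordinary VInt decoding loop; it also reads the padded form written above.
    static int readVInt(RandomAccessFile in) throws IOException {
        byte b = in.readByte();
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.readByte();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    // Writes [length][UTF-8 bytes] without knowing the byte length up front.
    static int writeString(RandomAccessFile out, String s) throws IOException {
        long lengthPos = out.getFilePointer();   // first "tell"
        writePaddedVInt(out, 0);                 // write any number as a placeholder
        long textPos = out.getFilePointer();

        // Encode in small chunks so the whole string is never held as bytes.
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        CharBuffer chars = CharBuffer.wrap(s);
        ByteBuffer chunk = ByteBuffer.allocate(64);
        CoderResult result;
        do {
            result = encoder.encode(chars, chunk, true);
            if (result.isError()) {
                result.throwException();         // e.g. an unpaired surrogate
            }
            out.write(chunk.array(), 0, chunk.position());
            chunk.clear();
        } while (result.isOverflow());
        encoder.flush(chunk);                    // writes nothing for UTF-8
        out.write(chunk.array(), 0, chunk.position());

        long endPos = out.getFilePointer();      // second "tell"
        int byteLength = (int) (endPos - textPos);
        out.seek(lengthPos);                     // rewind to the first "tell"
        writePaddedVInt(out, byteLength);        // fill in the real length
        out.seek(endPos);                        // advance back to the end
        return byteLength;
    }
}
```

The cost is an extra pair of seeks per string plus a few wasted bytes for short strings, which is part of why this is only a sketch of what is possible.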

I am not recommending this, but it can be done.

There may be other ways.
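For comparison, the two options Ken describes above for knowing the byte count before writing (convert in memory, or pre-scan) might look roughly like this. The class and method names are hypothetical, the count is for standard UTF-8 rather than Java's modified UTF-8, and the pre-scan assumes the string has no unpaired surrogates.

```java
import java.nio.charset.StandardCharsets;

class Utf8Length {
    // Option 1: convert in memory first. Simple, but allocates a byte array
    // the size of the encoded string (the "memory hit").
    static int byEncoding(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Option 2: pre-scan the chars. No allocation, but an extra pass over
    // the string before anything is written (the "performance hit").
    static int byPreScan(String s) {
        int bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                bytes += 1;                      // ASCII
            } else if (c < 0x800) {
                bytes += 2;
            } else if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                bytes += 4;                      // surrogate pair: one code point
                i++;                             // skip the low surrogate
            } else {
                bytes += 3;                      // rest of the BMP
            }
        }
        return bytes;
    }
}
```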
