lucene-dev mailing list archives

From Ken Krugler
Subject RE: Lucene does NOT use UTF-8.
Date Tue, 30 Aug 2005 16:54:23 GMT
>I think the VInt should be the number of bytes to be stored using the UTF-8 encoding.
>It is trivial to use the String methods identified before to do the
>conversion. The String(char[]) allocates a new char array.
>For performance, you can use the actual CharSet encoding classes - avoiding
>all of the lookups performed by the String class.

Regardless of what underlying support is used, if you want the VInt 
to hold the number of UTF-8 bytes rather than the number of Java 
chars, the Java String has to either be converted to UTF-8 in memory 
first, or pre-scanned to count the bytes. The first is a memory hit, 
and the second is a performance hit. I don't know how large either 
cost is, but it's there.
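
As a rough illustration of those two options, here is a minimal sketch. The class and method names are hypothetical and not taken from Lucene, it assumes a current JDK (for StandardCharsets), and it assumes well-formed strings with no unpaired surrogates.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: two ways to learn the UTF-8 byte count of a String.
class Utf8LengthSketch {

    // Option 1: convert in memory first. Simple, but allocates a byte[] (the memory hit).
    static int lengthByEncoding(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Option 2: pre-scan the chars and count bytes. No allocation, but an extra
    // pass over the string before anything is written (the performance hit).
    static int lengthByPreScan(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                count += 1;                       // ASCII
            } else if (c < 0x800) {
                count += 2;                       // two-byte sequence
            } else if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                count += 4;                       // surrogate pair = one 4-byte sequence
                i++;                              // skip the low surrogate
            } else {
                count += 3;                       // rest of the BMP
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String sample = "h\u00E9llo";
        System.out.println(lengthByEncoding(sample) + " " + lengthByPreScan(sample));
    }
}
```

Either way, the count has to be known before the length prefix can be written.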

Note that since the VInt has a variable size, you can't write out the 
string bytes first and then go back and fill in the correct length later.
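
To make that concrete, here is a small sketch of a VInt writer (seven data bits per byte, low-order bits first, high bit set when more bytes follow). The class and helper names are hypothetical. The point is only that the prefix takes one byte for lengths up to 127 and more for longer strings, so there is no fixed-width slot to reserve ahead of time.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch of a variable-length int prefix.
class VIntSketch {

    // Write 7 bits at a time, low-order first; the high bit marks "more bytes follow".
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Number of bytes the VInt for this value occupies.
    static int vIntSize(int value) {
        int size = 1;
        while ((value & ~0x7F) != 0) {
            value >>>= 7;
            size++;
        }
        return size;
    }

    public static void main(String[] args) {
        // The prefix width changes with the length being stored.
        for (int length : new int[] {127, 128, 16383, 16384}) {
            System.out.println(length + " -> " + vIntSize(length) + " byte(s)");
        }
    }
}
```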

-- Ken

>-----Original Message-----
>From: Doug Cutting
>Sent: Monday, August 29, 2005 4:24 PM
>Subject: Re: Lucene does NOT use UTF-8.
>Ken Krugler wrote:
>>  The remaining issue is dealing with old-format indexes.
>I think that revving the version number on the segments file would be a
>good start.  This file must be read before any others.  Its current
>version is -1 and would become -2.  (All positive values are version 0,
>for back-compatibility.)  Implementations can be modified to pass the
>version around if they wish to be back-compatible, or they can simply
>throw exceptions for old format indexes.
>I would argue that the length written be the number of characters in the
>string, rather than the number of bytes written, since that can minimize
>string memory allocations.
>>  I'm going to take this off-list now [ ... ]
>Please don't.  It's better to have a record of the discussion.
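
For reference, the versioning scheme Doug describes above could be sketched roughly as follows. The constant and method names are hypothetical, and treating a leading zero the same as the positive values is an assumption. Only the values -1 and -2, and the rule that positive values mean version 0, come from his message.

```java
// Hypothetical sketch of interpreting the first int of the segments file.
class SegmentsVersionSketch {
    static final int FORMAT_CURRENT = -1;   // current version, per the message
    static final int FORMAT_PROPOSED = -2;  // proposed version for the new string format

    // Map the first int to a version: positive values are all "version 0".
    // Assumption: zero is treated like the positive values.
    static int normalizedVersion(int firstInt) {
        return firstInt >= 0 ? 0 : firstInt;
    }

    // True if strings in this index would use the new format.
    // Throws for versions newer than this sketch knows about.
    static boolean usesNewStringFormat(int firstInt) {
        int version = normalizedVersion(firstInt);
        if (version < FORMAT_PROPOSED) {
            throw new IllegalStateException("Unknown index format: " + version);
        }
        return version == FORMAT_PROPOSED;
    }

    public static void main(String[] args) {
        System.out.println(usesNewStringFormat(FORMAT_CURRENT));
        System.out.println(usesNewStringFormat(FORMAT_PROPOSED));
    }
}
```

An implementation that doesn't want to be back-compatible could instead throw for anything other than the proposed version.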

Ken Krugler
TransPac Software, Inc.
+1 530-470-9200
