lucene-dev mailing list archives

From Ken Krugler
Subject Re: Lucene does NOT use UTF-8.
Date Tue, 30 Aug 2005 15:41:19 GMT
>Ken Krugler wrote:
>>The remaining issue is dealing with old-format indexes.
>I think that revving the version number on the segments file would 
>be a good start.  This file must be read before any others.  Its 
>current version is -1 and would become -2.  (All positive values are 
>version 0, for back-compatibility.)  Implementations can be modified 
>to pass the version around if they wish to be back-compatible, or 
>they can simply throw exceptions for old format indexes.
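
As a rough illustration of the version check described above, here is a minimal sketch. It assumes the first int of the segments file has already been read; the class, enum, and method names are hypothetical and not Lucene's actual API, and treating zero the same as positive values is also an assumption.

```java
// Hypothetical sketch, not Lucene's real code: classify the first int
// read from the segments file into a format version.
class SegmentsFormat {
    enum Format { LEGACY, CURRENT, UTF8 }

    static final int FORMAT_CURRENT = -1; // current version, per the thread
    static final int FORMAT_UTF8 = -2;    // proposed version for correct UTF-8

    static Format classify(int firstInt) {
        if (firstInt >= 0) {
            // Non-negative values are treated as version 0 for back-compatibility.
            return Format.LEGACY;
        }
        if (firstInt == FORMAT_CURRENT) {
            return Format.CURRENT;
        }
        if (firstInt == FORMAT_UTF8) {
            return Format.UTF8;
        }
        // An implementation that does not know this format can simply refuse it.
        throw new IllegalArgumentException("Unknown segments format: " + firstInt);
    }
}
```

An implementation that chooses not to be back-compatible could throw for anything other than the newest format instead of returning LEGACY or CURRENT.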

After looking at it a bit more, I think there's no problem with having 
the new code read both UTF-8 and Java modified UTF-8, and always 
write correct UTF-8. So the only compatibility issue would be new 
Lucene indexes with non-BMP characters being processed by older versions 
of Lucene (or ports that weren't updated).
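
To make the difference concrete, here is a minimal sketch in plain Java (not Lucene code) that encodes a non-BMP character both ways and decodes either form with one lenient reader. The class and method names are hypothetical, and the decoder assumes well-formed input and does no validation.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch: standard UTF-8 vs. Java modified UTF-8.
class Utf8Compat {
    // Standard UTF-8: a non-BMP code point becomes one 4-byte sequence.
    static byte[] encodeStandard(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Java modified UTF-8 as produced by DataOutputStream.writeUTF:
    // each surrogate is encoded separately as 3 bytes.
    static byte[] encodeModified(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeUTF(s);
            byte[] all = buf.toByteArray();
            // writeUTF prepends a 2-byte length; drop it here.
            return Arrays.copyOfRange(all, 2, all.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Lenient decoder: accepts 4-byte sequences (standard UTF-8) as well as
    // surrogates encoded as two 3-byte sequences (modified UTF-8).
    // Assumes well-formed input; truncated or invalid bytes are not handled.
    static String decodeLenient(byte[] b) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < b.length) {
            int c = b[i] & 0xFF;
            if (c < 0x80) {
                sb.append((char) c);
                i += 1;
            } else if ((c & 0xE0) == 0xC0) {
                sb.append((char) (((c & 0x1F) << 6) | (b[i + 1] & 0x3F)));
                i += 2;
            } else if ((c & 0xF0) == 0xE0) {
                // Covers BMP characters and individually encoded surrogates.
                sb.append((char) (((c & 0x0F) << 12)
                        | ((b[i + 1] & 0x3F) << 6)
                        | (b[i + 2] & 0x3F)));
                i += 3;
            } else {
                int cp = ((c & 0x07) << 18)
                        | ((b[i + 1] & 0x3F) << 12)
                        | ((b[i + 2] & 0x3F) << 6)
                        | (b[i + 3] & 0x3F);
                sb.appendCodePoint(cp);
                i += 4;
            }
        }
        return sb.toString();
    }
}
```

For U+1D11E (the Java string "\uD834\uDD1E"), encodeStandard yields 4 bytes and encodeModified yields 6, and decodeLenient returns the same string from either. An older reader that only understands the 3-byte form is the case that breaks.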

>I would argue that the length written be the number of characters in 
>the string, rather than the number of bytes written, since that can 
>minimize string memory allocations.

Agreed, though just to clarify, it's the number of UTF-16 code units 
(Java chars), not the number of Unicode code points (Unicode 
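
A small sketch of the three candidate lengths for the same string, using only standard Java String methods (the class and method names are hypothetical):

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: three ways to count the "length" of a string.
class LengthUnits {
    // UTF-16 code units (Java chars): the count proposed for the length prefix.
    static int utf16Units(String s) {
        return s.length();
    }

    // Unicode code points: a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Bytes in standard UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

For "a\uD834\uDD1E" (the letter a followed by U+1D11E) these give 3 code units, 2 code points, and 5 bytes. Knowing the code unit count up front lets a reader allocate a char[] of exactly the right size before decoding.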

>>I'm going to take this off-list now [ ... ]
>Please don't.  It's better to have a record of the discussion.

No problem. I was worried that the discussion Marvin & I were having 
was turning into a two-person IM chat via email.

-- Ken
Ken Krugler
TransPac Software, Inc.
+1 530-470-9200
