From Ken Krugler
Subject Re: Lucene does NOT use UTF-8
Date Tue, 30 Aug 2005 17:50:40 GMT
>Daniel Naber wrote:
>>On Monday 29 August 2005 19:56, Ken Krugler wrote:
>>>"Lucene writes strings as a VInt representing the length of the
>>>string in Java chars (UTF-16 code units), followed by the character
>>But wouldn't UTF-16 mean 2 bytes per character? That doesn't seem 
>>to be the case.
>UTF-16 is a fixed 2 byte/char representation.

I hate to keep beating this horse, but I want to emphasize that it's 
2 bytes per Java char (or UTF-16 code unit), not Unicode character 
(code point).
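
To make the distinction concrete, here is a minimal Java sketch. The class and method names are made up for illustration and are not anything in Lucene; it assumes Java 5 or later, where String.codePointCount() is available. It compares the code unit count with the code point count for a character outside the BMP.

```java
public class CodeUnitCount {
    // Number of UTF-16 code units, i.e. what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1D11E (MUSICAL SYMBOL G CLEF) is outside the BMP, so Java
        // stores it as the surrogate pair D834 DD1E.
        String clef = "\uD834\uDD1E";
        System.out.println("code units:  " + codeUnits(clef));
        System.out.println("code points: " + codePoints(clef));
        // Each code unit is 2 bytes in memory, so this one character
        // occupies 4 bytes of char data.
        System.out.println("char bytes:  " + codeUnits(clef) * 2);
    }
}
```

For plain BMP text the two counts agree, which is probably why the difference is so easy to overlook.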

>But one cannot equate the character count with the byte count. Each 
>Java char is 2 bytes. I think all that is being said is that the 
>VInt is equal to str.length() as java gives it.
>On an unrelated project we are determining whether we should use a 
>denormalized (letter followed by an accent) or a normalized form 
>(letter with accents) of accented characters as we present the text 
>to a GUI. We have found that font support varies but appears to be 
>better for denormalized. This is not an issue for storage, as it can 
>be transformed before it goes to screen. However, it is useful to 
>know which form it is in.
>The reason I mention this is that I seem to remember that the length 
>of the java string varies with the representation.

String.length() returns the number of Java chars, each of which is a 
UTF-16 code unit. If you normalize text, then yes, that can change the 
number of code units and thus the length of the string, but so can any 
kind of text munging (e.g. replacement) operation on characters in 
the string.
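
As a rough illustration of the normalization case, the sketch below uses java.text.Normalizer, which is an assumption on my part since it only exists in Java 6 and later. The class and method names are hypothetical. It shows the same accented letter giving different String.length() values in composed (NFC) and decomposed (NFD) form.

```java
import java.text.Normalizer;

public class NormalizationLength {
    // Length in UTF-16 code units after composing (letter with accent).
    static int composedLength(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC).length();
    }

    // Length in UTF-16 code units after decomposing (letter followed by accent).
    static int decomposedLength(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).length();
    }

    public static void main(String[] args) {
        // U+00E9 is LATIN SMALL LETTER E WITH ACUTE; its decomposition is
        // 'e' followed by U+0301 COMBINING ACUTE ACCENT.
        String eAcute = "\u00E9";
        System.out.println("NFC length: " + composedLength(eAcute));
        System.out.println("NFD length: " + decomposedLength(eAcute));
    }
}
```

Both forms display as a single accented letter, so whichever form you store, the length on disk says nothing reliable about what the user sees.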

>So then the count would not be the number of glyphs that the user 
>sees. Please correct me if I am wrong.

All kinds of m-to-n mappings (both at the layout engine level, and using 
font tables) are possible when going from Unicode characters to 
display glyphs. Plus zero-width left-kerning glyphs would also alter 
the relationship between # of visual "characters" and backing store 

-- Ken
Ken Krugler
TransPac Software, Inc.
+1 530-470-9200

