lucene-dev mailing list archives

From DM Smith <dmsmith...@gmail.com>
Subject Re: Lucene does NOT use UTF-8
Date Tue, 30 Aug 2005 17:28:12 GMT
Daniel Naber wrote:

>On Monday 29 August 2005 19:56, Ken Krugler wrote:
>
>>"Lucene writes strings as a VInt representing the length of the
>>string in Java chars (UTF-16 code units), followed by the character
>>data."
>
>But wouldn't UTF-16 mean 2 bytes per character? That doesn't seem to be
>the case.
UTF-16 is a 2-byte-per-code-unit representation, and each Java char is
one such code unit. But one cannot equate the character count with the
byte count: characters outside the Basic Multilingual Plane are stored
as two code units (a surrogate pair). I think all that is being said is
that the VInt is equal to str.length() as Java gives it, which counts
code units, not characters.
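
A minimal sketch of the distinction, assuming a Java 5 or later JVM
(String.codePointCount was added in Java 5); the class name is just for
illustration:

    public class CodeUnitCount {
        public static void main(String[] args) {
            // U+1D50A (MATHEMATICAL FRAKTUR CAPITAL G) lies outside
            // the BMP, so Java stores it as a surrogate pair: two
            // chars, one code point.
            String s = "\uD835\uDD0A";
            // 2 -- UTF-16 code units, i.e. what str.length() reports
            // and what the VInt would record
            System.out.println(s.length());
            // 1 -- actual Unicode code points
            System.out.println(s.codePointCount(0, s.length()));
        }
    }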

On an unrelated project we are determining whether we should use a
decomposed form (a base letter followed by combining accents, Unicode
NFD) or a composed form (a precomposed letter with its accents, NFC)
for accented characters as we present the text to a GUI. We have found
that font support varies, but appears to be better for the decomposed
form. This is not an issue for storage, as the text can be transformed
before it goes to the screen. However, it is useful to know which form
it is in.

The reason I mention this is that the length of a Java string varies
with the representation: "é", for instance, is one char in composed
form but two in decomposed form. So the count would not be the number
of glyphs that the user sees. Please correct me if I am wrong.
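
A quick way to see this, assuming java.text.Normalizer is available
(it shipped in Java 6; ICU4J provides the same normalization for
earlier JVMs):

    import java.text.Normalizer;

    public class NormalizationLength {
        public static void main(String[] args) {
            // e-acute as one precomposed char (NFC)
            String composed = "\u00E9";
            String decomposed =
                Normalizer.normalize(composed, Normalizer.Form.NFD);

            System.out.println(composed.length());   // 1
            System.out.println(decomposed.length()); // 2: 'e' + U+0301
            // Same glyph on screen, different char counts -- so a char
            // count (and hence the VInt) is not a glyph count.
        }
    }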

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

