lucene-dev mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: Lucene does NOT use UTF-8
Date Tue, 30 Aug 2005 16:59:37 GMT
>On Monday 29 August 2005 19:56, Ken Krugler wrote:
>
>>  "Lucene writes strings as a VInt representing the length of the
>>  string in Java chars (UTF-16 code units), followed by the character
>>  data."
>
>But wouldn't UTF-16 mean 2 bytes per character?

Yes, UTF-16 means two bytes per code unit. A Unicode character (code 
point) is encoded as either one or two UTF-16 code units.
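
A small sketch of the one-or-two code unit point, in case it helps. It assumes a Java 5.0 or later runtime, since Character.toChars(int) isn't available in 1.4, and the class and method names are just made up for illustration:

```java
final class CodeUnitDemo {
    // Number of UTF-16 code units (Java chars) needed to encode one code point.
    static int utf16Units(int codePoint) {
        return Character.toChars(codePoint).length;
    }

    public static void main(String[] args) {
        // U+0041 LATIN CAPITAL LETTER A is in the BMP: one code unit.
        System.out.println(utf16Units(0x0041));   // prints 1
        // U+1D11E MUSICAL SYMBOL G CLEF is supplementary: a surrogate pair.
        System.out.println(utf16Units(0x1D11E));  // prints 2
    }
}
```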

>That doesn't seem to be the
>case.

The case where? You mean in what actually gets written out?

String.length() is the length in terms of Java chars, which means 
UTF-16 code units (well, sort of...see below). Looking at the code, 
IndexOutput.writeString() calls writeVInt() with the string length.
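
To make the length prefix concrete, here is a rough sketch of what that amounts to. This is not the Lucene source: VIntLengthSketch, encodeVInt() and lengthPrefix() are hypothetical names, and the encoder just follows the usual VInt layout (seven data bits per byte, low-order bits first, high bit set on every byte but the last). It only covers the prefix, not the character data that follows it.

```java
import java.io.ByteArrayOutputStream;

final class VIntLengthSketch {
    // Encode a non-negative int as a VInt: 7 data bits per byte,
    // low-order group first, high bit set when more bytes follow.
    static byte[] encodeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // The prefix is based on String.length(), i.e. the number of
    // Java chars (UTF-16 code units), not code points and not bytes.
    static byte[] lengthPrefix(String s) {
        return encodeVInt(s.length());
    }

    public static void main(String[] args) {
        // 'a' followed by U+1D11E written as a surrogate pair.
        String s = "a\uD834\uDD1E";
        System.out.println(s.length());               // prints 3
        System.out.println(lengthPrefix(s).length);   // prints 1
        System.out.println(lengthPrefix(s)[0]);       // prints 3
    }
}
```

So a string holding two Unicode characters can get a length prefix of 3 when one of them is outside the BMP.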

One related note. Java 1.4 supports Unicode 3.0, while Java 5.0 
supports Unicode 4.0. It was in Unicode 3.1 that supplementary 
characters (code points > U+FFFF, i.e. outside of the BMP) were added, 
and the UTF-16 encoding formalized.

So I think the issue of non-BMP characters is currently a bit 
esoteric for Lucene, since I'm guessing there are other places in the 
code (e.g. JDK calls used by Lucene) where non-BMP characters won't 
be properly handled. Though some quick tests indicate that there is 
some knowledge of surrogate pairs in 1.4 (e.g. converting a String 
with surrogate pairs to UTF-8 does the right thing).
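
A quick test along those lines might look something like the sketch below. It is written against a current JDK (StandardCharsets needs Java 7 or later; on 1.4 you would call getBytes("UTF-8") instead), and the class and method names are mine, purely for illustration. "The right thing" here means the surrogate pair comes out as one four-byte UTF-8 sequence rather than two three-byte sequences.

```java
import java.nio.charset.StandardCharsets;

final class SurrogateUtf8Demo {
    // Convert a String to UTF-8 using the JDK's own encoder.
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Format bytes as space-separated uppercase hex for display.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF as a UTF-16 surrogate pair.
        String clef = "\uD834\uDD1E";
        System.out.println(clef.length());          // prints 2
        System.out.println(toHex(toUtf8(clef)));    // prints F0 9D 84 9E
    }
}
```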

-- Ken
-- 
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200


