lucene-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: Lucene does NOT use UTF-8.
Date Sat, 27 Aug 2005 14:08:46 GMT
On Aug 26, 2005, at 10:14 PM, jian chen wrote:

> It seems to me that in theory, Lucene storage code could use true
> UTF-8 to store terms. Maybe it is just a legacy issue that the
> modified UTF-8 is used?

It's not a matter of a simple switch.  The VInt count at the head of
a Lucene string is not the number of Unicode code points the string
contains.  It's the number of Java chars necessary to contain that
string.  Code points above the BMP require two Java chars, since they
must be represented by surrogate pairs, and modified UTF-8 encodes
each surrogate on its own.  In legal UTF-8, the same code point must
be represented by one character (a single four-byte sequence).
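
To make the difference between the counts concrete, here is a minimal
sketch in Java.  The class and method names are made up for
illustration, and the modified UTF-8 size is measured with the JDK's
DataOutputStream.writeUTF (minus its two-byte length prefix), which is
an assumption standing in for Lucene's own writer, not Lucene code.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class StringCounts {
    // Number of Java chars (UTF-16 code units); the kind of count
    // the message says Lucene writes as the VInt prefix.
    static int javaChars(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Byte length in legal (standard) UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Byte length in the JDK's modified UTF-8, where each surrogate
    // is encoded separately. writeUTF prepends a 2-byte length,
    // which is subtracted here.
    static int modifiedUtf8Bytes(String s) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return bytes.size() - 2;
    }

    public static void main(String[] args) {
        // 'a' followed by U+1D11E (MUSICAL SYMBOL G CLEF), above the BMP.
        String s = "a\uD834\uDD1E";
        System.out.println("Java chars:      " + javaChars(s));         // 3
        System.out.println("code points:     " + codePoints(s));        // 2
        System.out.println("UTF-8 bytes:     " + utf8Bytes(s));         // 5
        System.out.println("mod-UTF-8 bytes: " + modifiedUtf8Bytes(s)); // 7
    }
}
```

For this string the char count (3) and the code point count (2)
disagree, which is exactly the case where the two conventions for the
prefix diverge.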

If Plucene counts the number of legal UTF-8 characters and assigns
that number as the VInt at the front of a string, then when Java
Lucene decodes the string it will allocate an array of char that is
too small to hold any string containing code points above the BMP.
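
The sizing problem can be sketched as follows.  This is not Lucene's
reader; fitsInPrefixedBuffer is a hypothetical helper that only mimics
the one step at issue, namely allocating a char array from the prefix
count and then trying to fill it with the decoded string.

```java
import java.nio.charset.StandardCharsets;

public class PrefixMismatch {
    // Hypothetical stand-in for a reader: allocate char[prefixCount],
    // then copy the decoded chars into it. Returns false when the
    // buffer sized from the prefix cannot hold the string.
    static boolean fitsInPrefixedBuffer(byte[] utf8, int prefixCount) {
        String decoded = new String(utf8, StandardCharsets.UTF_8);
        char[] buf = new char[prefixCount];
        try {
            decoded.getChars(0, decoded.length(), buf, 0);
            return true;
        } catch (IndexOutOfBoundsException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 'a' followed by U+1D11E, which needs a surrogate pair in Java.
        String s = "a\uD834\uDD1E";
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);

        // Prefix written as a count of code points (2): too small.
        int codePointCount = s.codePointCount(0, s.length());
        System.out.println(fitsInPrefixedBuffer(utf8, codePointCount)); // false

        // Prefix written as a count of Java chars (3): fits.
        System.out.println(fitsInPrefixedBuffer(utf8, s.length()));     // true
    }
}
```

A writer that wants to stay compatible would therefore have to emit
the UTF-16 code unit count as the prefix, whatever its native notion
of "character" is.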

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

