lucene-dev mailing list archives

From: Marvin Humphrey
Subject: Re: Lucene does NOT use UTF-8.
Date: Sat, 27 Aug 2005 14:08:46 GMT

On Aug 26, 2005, at 10:14 PM, jian chen wrote:

> It seems to me that in theory, Lucene storage code could use true
> UTF-8 to store terms. Maybe it is just a legacy issue that the
> modified UTF-8 is used?

It's not a matter of a simple switch.  The VInt count at the head of
a Lucene string is not the number of Unicode code points the string
contains.  It's the number of Java chars necessary to contain that
string.  Code points above the BMP require 2 Java chars, since they
must be represented by surrogate pairs.  In legal UTF-8, the same
code point must be represented as one character (a single four-byte
sequence), not as two separately encoded surrogates.
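
To make the difference between the counts concrete, here is a minimal
Java sketch. The class and method names are made up for illustration
and none of this is Lucene code; it only uses standard String methods
to show that one string can have three different "lengths".

```java
import java.nio.charset.StandardCharsets;

public class LengthCounts {
    // Number of Java chars (UTF-16 code units) needed to hold the string.
    static int javaChars(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes in the standard (legal) UTF-8 encoding.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // U+1D11E (MUSICAL SYMBOL G CLEF) lies above the BMP,
        // so Java stores it as a surrogate pair.
        String s = "a" + new String(Character.toChars(0x1D11E));
        System.out.println(javaChars(s));   // 3
        System.out.println(codePoints(s));  // 2
        System.out.println(utf8Bytes(s));   // 5
    }
}
```

The first number is the one that matters for the VInt count described
above; the second is what a code-point-counting writer would produce.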

If Plucene counts the number of legal UTF-8 characters and assigns
that number as the VInt at the front of a string, then for any string
containing code points above the BMP, Java Lucene will allocate an
array of char which is too small to hold the string when it decodes
it.
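
The failure mode can be sketched as follows. This is a hypothetical
reader, not Lucene's actual decoding code: it assumes only that the
reader allocates a char array of the declared size before filling it,
and fitsInBuffer is a name invented for this example.

```java
import java.nio.charset.StandardCharsets;

public class CountMismatch {
    // Hypothetical reader: allocate `declaredCount` chars up front, then
    // try to copy the decoded string into that buffer.
    static boolean fitsInBuffer(String s, int declaredCount) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        char[] buf = new char[declaredCount];
        String decoded = new String(utf8, StandardCharsets.UTF_8);
        try {
            // Throws if the buffer has fewer slots than the string has chars.
            decoded.getChars(0, decoded.length(), buf, 0);
            return true;
        } catch (IndexOutOfBoundsException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String s = "a" + new String(Character.toChars(0x1D11E));

        // Writer that counted code points: buffer is one char short.
        int codePointCount = s.codePointCount(0, s.length());
        System.out.println(fitsInBuffer(s, codePointCount)); // false

        // Writer that counted Java chars: buffer is exactly large enough.
        System.out.println(fitsInBuffer(s, s.length()));     // true
    }
}
```

For strings that stay within the BMP the two counts agree, so the
problem only shows up once a supplementary character appears.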

Marvin Humphrey
Rectangular Research
