lucene-dev mailing list archives

From jian chen <>
Subject Re: Lucene does NOT use UTF-8.
Date Sat, 27 Aug 2005 21:42:31 GMT
Hi, Ken,

Thanks for your email. You are right, I meant to propose that Lucene 
switch to true UTF-8, rather than working around this issue by 
fixing the resulting problems elsewhere. 

Also, conforming to standards like UTF-8 will make the code easier for new 
developers to pick up.

Just my 2 cents.



On 8/27/05, Ken Krugler <> wrote:
> >On Aug 26, 2005, at 10:14 PM, jian chen wrote:
> >
> >>It seems to me that in theory, Lucene storage code could use true UTF-8 to
> >>store terms. Maybe it is just a legacy issue that the modified UTF-8 is
> >>used?
> The use of 0xC0 0x80 to encode a U+0000 Unicode code point is an
> aspect of Java serialization of character streams. Java uses what
> they call "a modified version of UTF-8", though that's a really bad
> way to describe it. It's a different Unicode encoding, one that
> resembles UTF-8, but that's it.
> >It's not a matter of a simple switch. The VInt count at the head of
> >a Lucene string is not the number of Unicode code points the string
> >contains. It's the number of Java chars necessary to contain that
> >string. Code points above the BMP require 2 Java chars, since they
> >must be represented by surrogate pairs. The same code point must be
> >represented by one character in legal UTF-8.
> >
> >If Plucene counts the number of legal UTF-8 characters and assigns
> >that number as the VInt at the front of a string, when Java Lucene
> >decodes the string it will allocate an array of char which is too
> >small to hold the string.
> I think Jian was proposing that Lucene switch to using a true UTF-8
> encoding, which would make things a bit cleaner. And probably easier
> than changing all references to CESU-8 :)
> And yes, given that the integer count is the number of UTF-16 code
> units required to represent the string, your code will need to do a
> bit more processing when calculating the character count, but that's
> a one-liner, right?
> -- Ken
> --
> Ken Krugler
> TransPac Software, Inc.
> +1 530-470-9200
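
To make the two differences discussed above concrete, here is a minimal Java
sketch. It does not call any Lucene code: it uses the JDK's
DataOutputStream.writeUTF as a stand-in for the "modified UTF-8" encoding, and
the class and method names (Utf8Lengths, standardUtf8, modifiedUtf8,
utf16Units, codePoints) are made up for illustration. It compares byte lengths
for U+0000 and for one code point above the BMP (U+10400, chosen arbitrarily),
and shows that the Java char count differs from the code point count.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class for illustration; not part of Lucene.
class Utf8Lengths {
    // Bytes of s in standard UTF-8.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Bytes of s in Java's "modified UTF-8", as written by
    // DataOutputStream.writeUTF, with its 2-byte length prefix removed.
    static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeUTF(s);
            byte[] all = buf.toByteArray();
            return Arrays.copyOfRange(all, 2, all.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Number of UTF-16 code units (Java chars) in s.
    static int utf16Units(String s) {
        return s.length();
    }

    // Number of Unicode code points in s.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String nul = "\0";                                       // U+0000
        String supp = new String(Character.toChars(0x10400));    // above the BMP

        System.out.println(standardUtf8(nul).length);   // 1 (byte 0x00)
        System.out.println(modifiedUtf8(nul).length);   // 2 (bytes 0xC0 0x80)
        System.out.println(standardUtf8(supp).length);  // 4 (one 4-byte sequence)
        System.out.println(modifiedUtf8(supp).length);  // 6 (two 3-byte surrogates)
        System.out.println(utf16Units(supp));           // 2 (surrogate pair)
        System.out.println(codePoints(supp));           // 1
    }
}
```

A reader that writes a code point count where a Java char count is expected
would, for the second string, write 1 instead of 2, which is the too-small
array problem described in the quoted message.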