lucene-dev mailing list archives

From Tom White <tom.e.wh...@gmail.com>
Subject Re: Lucene does NOT use UTF-8
Date Tue, 30 Aug 2005 21:59:54 GMT
On 8/30/05, Ken Krugler <kkrugler_lists@transpac.com> wrote:
> 
> >Daniel Naber wrote:
> >
> >>On Monday 29 August 2005 19:56, Ken Krugler wrote:
> >>
> >>>"Lucene writes strings as a VInt representing the length of the
> >>>string in Java chars (UTF-16 code units), followed by the character
> >>>data."
> >>>
> >>>
> >>But wouldn't UTF-16 mean 2 bytes per character? That doesn't seem
> >>to be the case.
> >>
> >UTF-16 is a fixed 2 byte/char representation.
> 
> I hate to keep beating this horse, but I want to emphasize that it's
> 2 bytes per Java char (or UTF-16 code unit), not Unicode character
> (code point).


There's more horse-beating on Java and Unicode 4.0 in this blog entry:
http://weblogs.java.net/blog/joconner/archive/2005/08/how_long_is_you.html
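
To make the distinction Ken draws above concrete, here is a minimal sketch (not Lucene code) that compares the three counts for a BMP character and a supplementary one. The class and method names are made up for illustration, and it assumes a JDK with String.codePointCount (Java 5) and StandardCharsets (Java 7):

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only; not part of Lucene.
public class CharCount {
    // Number of Java chars (UTF-16 code units); this is what String.length() returns.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode characters (code points).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes when encoded as UTF-16 (big-endian, no byte order mark).
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String bmp = "A"; // U+0041, inside the Basic Multilingual Plane
        // U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP, so it needs a surrogate pair
        String supplementary = new String(Character.toChars(0x1D11E));

        for (String s : new String[] { bmp, supplementary }) {
            System.out.println(codeUnits(s) + " code unit(s), "
                    + codePoints(s) + " code point(s), "
                    + utf16Bytes(s) + " byte(s) in UTF-16");
        }
    }
}
```

For "A" all three counts line up at 2 bytes per char and per character. For U+1D11E the string has two Java chars and four UTF-16 bytes but is still a single code point, so "2 bytes per char" holds while "2 bytes per character" does not.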
