lucene-dev mailing list archives

From nel...@monkey.org (Nelson Minar)
Subject Re: Re:Added comments to InputStream and OutputStream
Date Fri, 12 Oct 2001 15:01:58 GMT
>Unicode is 16 bits.  UTF-8 needs 1 byte for a 7-bit character (ASCII),
>2 bytes for an 11-bit character (including ISO-8859-1), and 3 bytes for
>a 16-bit character.

This is partly true. Unicode itself is independent of any encoding. I
believe ISO 10646 defines a code space of up to 2^31 positions, although
the current plan is to use only code points up to U+10FFFF, which is
somewhere between 2^20 and 2^21 characters. (2^16 characters was the
old Unicode design - dropped when it became clear that 2^16 was not
enough, Chinese alone being a large part of the problem.)

Unicode needs to be encoded somehow as a sequence of code units. UTF-8
encodes it as a sequence of 8 bit units (bytes): 1, 2, or 3 for a
character that fits in 16 bits, and 4 for a character beyond that.
UTF-16 encodes it as a sequence of 16 bit words: 1 or 2. UTF-32 encodes
it as a sequence of 32 bit words, always 1 per character.
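
To make the unit counts concrete, here is a minimal Java sketch that
reports how many code units one character takes in each encoding. The
class and method names (EncodingLengths, utf8Bytes, and so on) are made
up for illustration and have nothing to do with Lucene's code; it also
assumes a JDK that provides java.nio.charset.StandardCharsets.

```java
import java.nio.charset.StandardCharsets;

public class EncodingLengths {
    // Number of 8-bit units (bytes) the string occupies in UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of 16-bit units in UTF-16; Java strings are UTF-16
    // internally, so this is simply String.length().
    static int utf16Units(String s) {
        return s.length();
    }

    // Number of code points, which equals the number of 32-bit
    // units the string would occupy in UTF-32.
    static int utf32Units(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",             // U+0041, ASCII
            "\u00e9",        // U+00E9, in ISO-8859-1 but not ASCII
            "\u20ac",        // U+20AC, needs more than 11 bits
            "\uD834\uDD1E"   // U+1D11E, outside the 16-bit range
        };
        for (String s : samples) {
            System.out.printf("U+%04X utf8=%d utf16=%d utf32=%d%n",
                    s.codePointAt(0),
                    utf8Bytes(s), utf16Units(s), utf32Units(s));
        }
    }
}
```

The last sample is written as a surrogate pair because a Java char is a
single 16 bit unit, so any character above U+FFFF takes two of them.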

UTF-8 is the most common encoding. It handles ASCII easily (fits in 1
byte, unchanged); the rest of ISO-Latin-1 takes 2 bytes.

Unicode is cool - if you want to learn more, see
  http://www.unicode.org/
  http://www.unicode.org/unicode/faq/utf_bom.html


I'm a bit confused about this discussion; Java does a great job of
hiding character encodings from you. Is Lucene turning byte arrays
into character arrays somewhere?
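
For reference, this is the kind of byte-to-char conversion in question,
sketched with plain JDK calls. It is only an illustration of why the
encoding matters at that boundary, not a description of what Lucene
does; the class and method names (DecodeBytes, decode) are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeBytes {
    // Decode raw bytes into chars using an explicitly named encoding.
    // Without an explicit charset, new String(bytes) falls back to the
    // platform default, which is where surprises tend to come from.
    static char[] decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset).toCharArray();
    }

    public static void main(String[] args) {
        // The two bytes that UTF-8 uses for U+00E9.
        byte[] bytes = { (byte) 0xC3, (byte) 0xA9 };

        // Read as UTF-8, the two bytes form one char.
        char[] asUtf8 = decode(bytes, StandardCharsets.UTF_8);

        // Read as ISO-8859-1, each byte becomes its own char.
        char[] asLatin1 = decode(bytes, StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8 chars: " + asUtf8.length);
        System.out.println("ISO-8859-1 chars: " + asLatin1.length);
    }
}
```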

                                                     nelson@monkey.org
.       .      .     .    .   .  . . http://www.media.mit.edu/~nelson/
