lucene-dev mailing list archives

From Doug Cutting <DCutt...@grandcentral.com>
Subject RE: Re: Added comments to InputStream and OutputStream
Date Fri, 12 Oct 2001 16:28:46 GMT
> From: nelson@monkey.org [mailto:nelson@monkey.org]
> 
> > Unicode is 16 bits.
> 
> Unicode is currently defined as having up to 2^31 positions, although
> the current plan is for somewhere between 2^20 and 2^21 characters.
> (2^16 characters was the old Unicode standard - dropped when someone
> pointed out that Chinese alone has more than 2^16 characters).

More importantly, Java characters have 16 bits:
 
http://java.sun.com/docs/books/jls/second_edition/html/typesValues.doc.html#9151
So Lucene need only be concerned with storing 16-bit characters.
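
To make the 16-bit point concrete, here is a minimal sketch, not Lucene's
actual InputStream/OutputStream code. The class name CharWidth and the helper
writeChars are made up for illustration, and Character.SIZE assumes a JDK
recent enough to define that constant. It writes each Java char of a string
as one 16-bit value and counts the resulting bytes:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class CharWidth {
    // Hypothetical helper: writes every Java char of s as one 16-bit
    // big-endian value. Illustration only, not Lucene's file format.
    public static byte[] writeChars(String s) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (int i = 0; i < s.length(); i++) {
                out.writeChar(s.charAt(i)); // always exactly two bytes
            }
        } catch (IOException e) {
            // An in-memory stream should not fail; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(Character.SIZE);              // bits per char
        System.out.println(writeChars("Lucene").length); // 6 chars * 2 bytes
    }
}
```

Six chars come out as twelve bytes, so a store that handles arbitrary
16-bit values can round-trip any Java String.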

Also, to my understanding, the Chinese issue is not as simple as you
describe.  There are experts who think that Chinese, Japanese and Korean
together have fewer than 2^16 characters, and that what folks wish to have
encoded as separate characters are better thought of as different typefaces.
The problem became a political one, and Unicode was enlarged beyond 16 bits,
but characters at or above 2^16 are extensions, not a part of the core,
16-bit Unicode.  But I am not an expert, and this is off topic...

Doug
