lucene-dev mailing list archives

From Doug Cutting
Subject RE: Re:Added comments to InputStream and OutputStream
Date Fri, 12 Oct 2001 15:36:13 GMT
> From: []
> I'm a bit confused about this discussion; Java does a great job of
> hiding character encodings from you. Is Lucene turning byte arrays
> into character arrays somewhere?

Lucene needs to intermix binary and character data in its index files, and
needs to do so very efficiently.  Java's built-in methods are not quite
suitable.  String.getBytes() allocates several byte arrays for every string
processed.  RandomAccessFile's readUTF() and writeUTF() methods are not any
better.

So far as I know, there isn't a standard, efficient, public Java API for
converting chars to bytes and vice versa.  I suppose Lucene could use, but will every Java implementation have this class?

So Lucene has its own implementation of UTF8 encoding and decoding in and  This isn't really that
bad.  Lucene only needs to support a single character encoding in indexes,
and UTF8 is not complex, nor is it likely to change.
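
To give a sense of how little code this takes, here is a minimal sketch of a
hand-rolled encoder and decoder that work on caller-supplied buffers, so no
arrays are allocated per string.  This is an illustration only, not Lucene's
actual code: the class name Utf8Sketch and the encode/decode signatures are
made up for this example.  It assumes each Java char is encoded on its own in
one to three bytes, so surrogate pairs are not combined into a four-byte
sequence, and the decoder assumes well-formed input.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of Lucene or the JDK.
public class Utf8Sketch {

    // Encodes chars[0..len) into out, starting at index 0, and returns the
    // number of bytes written. The caller must supply room for 3 * len bytes.
    static int encode(char[] chars, int len, byte[] out) {
        int n = 0;
        for (int i = 0; i < len; i++) {
            int c = chars[i];
            if (c < 0x80) {                       // 7 bits: one byte
                out[n++] = (byte) c;
            } else if (c < 0x800) {               // 11 bits: two bytes
                out[n++] = (byte) (0xC0 | (c >> 6));
                out[n++] = (byte) (0x80 | (c & 0x3F));
            } else {                              // 16 bits: three bytes
                out[n++] = (byte) (0xE0 | (c >> 12));
                out[n++] = (byte) (0x80 | ((c >> 6) & 0x3F));
                out[n++] = (byte) (0x80 | (c & 0x3F));
            }
        }
        return n;
    }

    // Decodes in[0..byteLen) into out and returns the number of chars written.
    // Assumes the input was produced by encode() above; no validation is done.
    static int decode(byte[] in, int byteLen, char[] out) {
        int n = 0;
        int i = 0;
        while (i < byteLen) {
            int b = in[i++] & 0xFF;
            if ((b & 0x80) == 0) {                // one-byte sequence
                out[n++] = (char) b;
            } else if ((b & 0xE0) == 0xC0) {      // two-byte sequence
                out[n++] = (char) (((b & 0x1F) << 6) | (in[i++] & 0x3F));
            } else {                              // three-byte sequence
                out[n++] = (char) (((b & 0x0F) << 12)
                        | ((in[i++] & 0x3F) << 6)
                        | (in[i++] & 0x3F));
            }
        }
        return n;
    }

    public static void main(String[] args) {
        String s = "h\u00e9llo \u20ac";           // mixes 1-, 2- and 3-byte chars
        char[] chars = s.toCharArray();
        byte[] buf = new byte[chars.length * 3];  // reusable across many strings
        int n = encode(chars, chars.length, buf);

        // For text without surrogate pairs the bytes match standard UTF-8.
        byte[] expected = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(Arrays.copyOf(buf, n), expected));

        char[] back = new char[chars.length];
        int m = decode(buf, n, back);
        System.out.println(new String(back, 0, m).equals(s));
    }
}
```

Because both buffers belong to the caller, they can be reused for every
string in a file, which is the property String.getBytes() lacks.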

