lucene-dev mailing list archives

From Yonik Seeley <>
Subject Re: Lucene does NOT use UTF-8.
Date Tue, 30 Aug 2005 15:09:31 GMT
The temporary char[] buffer is cached per InputStream instance, so the extra 
memory allocation shouldn't be a big deal. One could also use 
new String(byte[], offset, len, "UTF-8"), which creates a char[] that is used 
directly by the string instead of being copied. It remains to be seen how 
fast the native Java charset converter is, though.
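
For illustration, here is a minimal sketch of decoding a slice of a byte buffer with that constructor. The class and method names are made up for this example and are not Lucene code; it uses the Charset overload with StandardCharsets.UTF_8 (a later JDK convenience) rather than the "UTF-8" name form so that no checked exception has to be handled.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Lucene.
class Utf8SliceDecode {
    // Decode `len` bytes of `buf`, starting at `offset`, as UTF-8.
    static String decode(byte[] buf, int offset, int len) {
        return new String(buf, offset, len, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two leading bytes of unrelated data, then "héllo" (6 bytes in UTF-8).
        byte[] buf = "xxh\u00e9llo".getBytes(StandardCharsets.UTF_8);
        String s = decode(buf, 2, buf.length - 2);
        System.out.println(s.length() + " chars from " + (buf.length - 2) + " bytes");
    }
}
```

Whether this beats a hand-written decoding loop would still have to be measured.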

I like the idea of the length being the number of bytes... it encapsulates 
the content in case you want to rapidly skip over it (or rapidly copy it). 
It's more future-proof w.r.t. alternate encodings (or binary), and if it had 
been the number of bytes from the start, it wouldn't have to be changed now.
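
A rough sketch of what a byte-length prefix buys you: a reader can skip a string without decoding it. Everything here is an assumption made for illustration. The class and method names are hypothetical, and the length is written as a fixed 4-byte int via DataOutputStream rather than in Lucene's actual on-disk format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical format: 4-byte length in BYTES, followed by the UTF-8 bytes.
class ByteLengthPrefix {
    static void writeString(DataOutputStream out, String s) throws IOException {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(utf8.length); // number of bytes, not number of chars
        out.write(utf8);
    }

    // Skip one string without decoding it: the prefix says how far to jump.
    static void skipString(DataInputStream in) throws IOException {
        int remaining = in.readInt();
        while (remaining > 0) {
            int n = in.skipBytes(remaining);
            if (n <= 0) throw new EOFException("truncated string");
            remaining -= n;
        }
    }

    static String readString(DataInputStream in) throws IOException {
        byte[] utf8 = new byte[in.readInt()];
        in.readFully(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Write two strings, skip the first, and return the second.
    static String secondOf(String first, String second) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            writeString(out, first);
            writeString(out, second);
            out.flush();
            DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
            skipString(in);
            return readString(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(secondOf("h\u00e9llo", "world"));
    }
}
```

With a character-count prefix, skipString would have to inspect every byte to find where the string ends, since a UTF-8 character can occupy more than one byte.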


On 8/29/05, Doug Cutting <> wrote:

> I would argue that the length written be the number of characters in the
> string, rather than the number of bytes written, since that can minimize
> string memory allocations.
