lucene-dev mailing list archives

From "Robert Engels" <reng...@ix.netcom.com>
Subject RE: Lucene does NOT use UTF-8.
Date Mon, 29 Aug 2005 23:37:40 GMT
I think the VInt should be the number of bytes to be stored using the UTF-8
encoding.

It is trivial to use the String methods identified before to do the
conversion. The String(char[]) constructor allocates a new char array.

For performance, you can use the actual Charset encoding classes - avoiding
all of the lookups performed by the String class.
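
A minimal sketch of the two routes mentioned above, assuming the prefix is a
VInt holding the UTF-8 byte count. The class and method names are hypothetical
and are not Lucene's API. The VInt layout (7 data bits per byte, high bit as a
continuation flag) is written out by hand here.

```java
import java.io.ByteArrayOutputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

class Utf8LengthPrefix {
    // Variable-length int: 7 data bits per byte, high bit set means "more bytes follow".
    static void writeVInt(ByteArrayOutputStream out, int i) {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
    }

    // Simple route: let String do the conversion, then prefix the byte count.
    static byte[] encodeWithString(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, utf8.length);
        out.write(utf8, 0, utf8.length);
        return out.toByteArray();
    }

    // Encoder route: the caller keeps one CharsetEncoder and reuses it across calls.
    static byte[] encodeWithEncoder(CharsetEncoder encoder, CharSequence s) {
        try {
            ByteBuffer utf8 = encoder.encode(CharBuffer.wrap(s));
            byte[] bytes = new byte[utf8.remaining()];
            utf8.get(bytes);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            writeVInt(out, bytes.length);
            out.write(bytes, 0, bytes.length);
            return out.toByteArray();
        } catch (CharacterCodingException e) {
            // Malformed input such as an unpaired surrogate.
            throw new UncheckedIOException(e);
        }
    }
}
```

Both methods produce the same bytes for well-formed input. They differ on
malformed input: String.getBytes substitutes a replacement byte, while the
encoder route as written reports an error.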

-----Original Message-----
From: Doug Cutting [mailto:cutting@apache.org]
Sent: Monday, August 29, 2005 4:24 PM
To: java-dev@lucene.apache.org
Subject: Re: Lucene does NOT use UTF-8.


Ken Krugler wrote:
> The remaining issue is dealing with old-format indexes.

I think that revving the version number on the segments file would be a
good start.  This file must be read before any others.  Its current
version is -1 and would become -2.  (All positive values are version 0,
for back-compatibility.)  Implementations can be modified to pass the
version around if they wish to be back-compatible, or they can simply
throw exceptions for old format indexes.
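
A rough sketch of that version check, assuming the format marker is the first
int read from the segments file. The class, constant, and method names are
hypothetical. Treating a first value of zero as legacy is also an assumption,
since the message only mentions positive values.

```java
class SegmentsFormat {
    // Current on-disk version according to the message above.
    static final int FORMAT_V1 = -1;
    // Proposed next version; the constant name is hypothetical.
    static final int FORMAT_V2 = -2;

    // Maps the first int of the segments file to a format version.
    // Negative values are explicit versions; anything else is legacy "version 0".
    static int formatOf(int firstInt) {
        return firstInt < 0 ? firstInt : 0;
    }

    // Strict option: simply refuse anything that is not the new format.
    static void requireNewFormat(int firstInt) {
        int format = formatOf(firstInt);
        if (format != FORMAT_V2) {
            throw new IllegalStateException("Unsupported index format: " + format);
        }
    }

    // Back-compatible option: pass the version around and branch on it later.
    static boolean usesNewStringEncoding(int firstInt) {
        return formatOf(firstInt) <= FORMAT_V2;
    }
}
```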

I would argue that the length written be the number of characters in the
string, rather than the number of bytes written, since that can minimize
string memory allocations.
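
To illustrate the allocation argument, here is a sketch of a reader that is
told the character count up front, so it can decode straight into a buffer
that is already known to be large enough. It assumes standard UTF-8 bytes and
a count measured in Java chars (UTF-16 code units). The names are hypothetical
and there is no validation of malformed input.

```java
class CharCountReader {
    // Decodes exactly charCount UTF-16 code units from UTF-8 bytes in buf,
    // starting at offset, into dest[0..charCount). Returns the bytes consumed.
    static int readChars(byte[] buf, int offset, char[] dest, int charCount) {
        int pos = offset;
        int i = 0;
        while (i < charCount) {
            int b = buf[pos++] & 0xFF;
            if (b < 0x80) {
                // 1-byte sequence (ASCII)
                dest[i++] = (char) b;
            } else if ((b & 0xE0) == 0xC0) {
                // 2-byte sequence
                dest[i++] = (char) (((b & 0x1F) << 6) | (buf[pos++] & 0x3F));
            } else if ((b & 0xF0) == 0xE0) {
                // 3-byte sequence
                dest[i++] = (char) (((b & 0x0F) << 12)
                        | ((buf[pos++] & 0x3F) << 6)
                        | (buf[pos++] & 0x3F));
            } else {
                // 4-byte sequence: one code point becomes two chars (a surrogate pair)
                int cp = ((b & 0x07) << 18)
                        | ((buf[pos++] & 0x3F) << 12)
                        | ((buf[pos++] & 0x3F) << 6)
                        | (buf[pos++] & 0x3F);
                dest[i++] = Character.highSurrogate(cp);
                dest[i++] = Character.lowSurrogate(cp);
            }
        }
        return pos - offset;
    }
}
```

With a byte count instead, the reader does not know how many chars will result
until it has decoded the bytes, so it has to size the buffer for the worst
case or scan the bytes first.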

> I'm going to take this off-list now [ ... ]

Please don't.  It's better to have a record of the discussion.

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

