Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-dev@lucene.apache.org
Received-SPF: neutral (asf.osuosl.org: local policy)
DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws;
  s=dk20050327; d=ix.netcom.com;
  b=ahTZExlS1mMvvJvLM5ARWrG4GNYO0lZjV/VPEGh9sQNonhSCqPBdsLGDgnPTIOFk;
  h=Received:Reply-To:From:To:Subject:Date:Message-ID:MIME-Version:Content-Type:Content-Transfer-Encoding:X-Priority:X-MSMail-Priority:X-Mailer:In-Reply-To:X-MimeOLE:Importance:X-ELNK-Trace:X-Originating-IP;
Reply-To: <rengels@ix.netcom.com>
From: "Robert Engels" <rengels@ix.netcom.com>
To: <java-dev@lucene.apache.org>,
	<rengels@ix.netcom.com>
Subject: RE: Lucene does NOT use UTF-8.
Date: Tue, 30 Aug 2005 12:46:25 -0500
Message-ID: <LMENLAOACIBLMOIILNNNEEBOFEAA.rengels@ix.netcom.com>
MIME-Version: 1.0
Content-Type: text/plain;
	charset="US-ASCII"
Content-Transfer-Encoding: 7bit
In-Reply-To: <LMENLAOACIBLMOIILNNNMEBNFEAA.rengels@ix.netcom.com>
Importance: Normal

At bit more clarity...

Using CharBuffer and ByteBuffer allows for easy reuse and expansion. You
also need to use the CharSetDecoder class as well.

-----Original Message-----
From: Robert Engels [mailto:rengels@ix.netcom.com]
Sent: Tuesday, August 30, 2005 12:40 PM
To: java-dev@lucene.apache.org
Subject: RE: Lucene does NOT use UTF-8.


I think you guys are WAY overcomplicating things, or you just don't know
enough about the Java class libraries.

If you use the java.nio.charset.CharsetEncoder class, then you can reuse the
byte[] array, and then it is a simple write of the length, and a blast copy
of the required number of bytes to the OutputStream (which will either fit
or expand its byte[]). You can perform all of this WITHOUT creating new
byte[] or char[] (as long as the existing one is large enough to fit the
encoded/decoded data).

There is no need to use any sort of file position mark/reset stuff.

R


-----Original Message-----
From: Ken Krugler [mailto:kkrugler@transpac.com]
Sent: Tuesday, August 30, 2005 11:54 AM
To: java-dev@lucene.apache.org
Subject: RE: Lucene does NOT use UTF-8.


>I think the VInt should be the numbers of bytes to be stored using the
UTF-8
>encoding.
>
>It is trivial to use the String methods identified before to do the
>conversion. The String(char[]) allocates a new char array.
>
>For performance, you can use the actual CharSet encoding classes - avoiding
>all of the lookups performed by the String class.

Regardless of what underlying support is used, if you want to write
out the VInt value as UTF-8 bytes versus Java chars, the Java String
has to either be converted to UTF-8 in memory first, or pre-scanned.
The first is a memory hit, and the second is a performance hit. I
don't know the extent of either, but it's there.

Note that since the VInt is a variable size, you can't write out the
bytes first and then fill in the correct value later.

-- Ken


>-----Original Message-----
>From: Doug Cutting [mailto:cutting@apache.org]
>Sent: Monday, August 29, 2005 4:24 PM
>To: java-dev@lucene.apache.org
>Subject: Re: Lucene does NOT use UTF-8.
>
>
>Ken Krugler wrote:
>>  The remaining issue is dealing with old-format indexes.
>
>I think that revving the version number on the segments file would be a
>good start.  This file must be read before any others.  Its current
>version is -1 and would become -2.  (All positive values are version 0,
>for back-compatibility.)  Implementations can be modified to pass the
>version around if they wish to be back-compatible, or they can simply
>throw exceptions for old format indexes.
>
>I would argue that the length written be the number of characters in the
>string, rather than the number of bytes written, since that can minimize
>string memory allocations.
>
>>  I'm going to take this off-list now [ ... ]
>
>Please don't.  It's better to have a record of the discussion.
>
>Doug


--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org