From: "Robert Engels"
Reply-To: java-dev@lucene.apache.org
Subject: RE: Lucene does NOT use UTF-8.
Date: Tue, 30 Aug 2005 12:46:25 -0500

A bit more clarity...

Using CharBuffer and ByteBuffer allows for easy reuse and expansion. You also need to use the CharsetDecoder class.

-----Original Message-----
From: Robert Engels [mailto:rengels@ix.netcom.com]
Sent: Tuesday, August 30, 2005 12:40 PM
To: java-dev@lucene.apache.org
Subject: RE: Lucene does NOT use UTF-8.

I think you guys are WAY overcomplicating things, or you just don't know enough about the Java class libraries.

If you use the java.nio.charset.CharsetEncoder class, then you can reuse the byte[] array, and then it is a simple write of the length, and a blast copy of the required number of bytes to the OutputStream (which will either fit or expand its byte[]). You can perform all of this WITHOUT creating new byte[] or char[] (as long as the existing one is large enough to fit the encoded/decoded data).

There is no need to use any sort of file position mark/reset stuff.

R

-----Original Message-----
From: Ken Krugler [mailto:kkrugler@transpac.com]
Sent: Tuesday, August 30, 2005 11:54 AM
To: java-dev@lucene.apache.org
Subject: RE: Lucene does NOT use UTF-8.

>I think the VInt should be the number of bytes to be stored using the UTF-8
>encoding.
>
>It is trivial to use the String methods identified before to do the
>conversion. The String(char[]) allocates a new char array.
>
>For performance, you can use the actual Charset encoding classes - avoiding
>all of the lookups performed by the String class.

Regardless of what underlying support is used, if you want the VInt value to be the number of UTF-8 bytes rather than the number of Java chars, the Java String has to either be converted to UTF-8 in memory first, or pre-scanned. The first is a memory hit, and the second is a performance hit. I don't know the extent of either, but it's there.

Note that since the VInt is a variable size, you can't write out the bytes first and then fill in the correct value later.

-- Ken

>-----Original Message-----
>From: Doug Cutting [mailto:cutting@apache.org]
>Sent: Monday, August 29, 2005 4:24 PM
>To: java-dev@lucene.apache.org
>Subject: Re: Lucene does NOT use UTF-8.
>
>
>Ken Krugler wrote:
>> The remaining issue is dealing with old-format indexes.
>
>I think that revving the version number on the segments file would be a
>good start. This file must be read before any others. Its current
>version is -1 and would become -2. (All positive values are version 0,
>for back-compatibility.) Implementations can be modified to pass the
>version around if they wish to be back-compatible, or they can simply
>throw exceptions for old format indexes.
>
>I would argue that the length written be the number of characters in the
>string, rather than the number of bytes written, since that can minimize
>string memory allocations.
>
>> I'm going to take this off-list now [ ... ]
>
>Please don't. It's better to have a record of the discussion.
>
>Doug

--
Ken Krugler
TransPac Software, Inc.
+1 530-470-9200

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
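
For readers of the archive, the buffer-reuse approach discussed in this thread might look roughly like the sketch below. It is only an illustration, not Lucene code: the class name Utf8StringCodec and its methods are made up for this example, it assumes a modern JDK (StandardCharsets did not exist in 2005), and it assumes the VInt prefix holds the UTF-8 byte count, which is one of the two options being debated. The string is encoded into a reused ByteBuffer first, so the byte count is known before anything is written and no mark/reset on the output is needed. CharBuffer.wrap(String) still creates a small wrapper object, and reading back as a String necessarily allocates that String.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not a Lucene class): VInt byte count + UTF-8 bytes, with reused buffers. */
final class Utf8StringCodec {
    private final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    private ByteBuffer bytes = ByteBuffer.allocate(64); // reused, grown only when too small
    private CharBuffer chars = CharBuffer.allocate(64); // reused, grown only when too small

    /** Encodes s into the reused buffer, then writes the byte count as a VInt and the bytes. */
    void writeString(OutputStream out, String s) throws IOException {
        int worstCase = (int) Math.ceil(s.length() * (double) encoder.maxBytesPerChar());
        if (bytes.capacity() < worstCase) {
            bytes = ByteBuffer.allocate(worstCase);
        }
        bytes.clear();
        encoder.reset();
        CoderResult r = encoder.encode(CharBuffer.wrap(s), bytes, true);
        if (!r.isUnderflow()) r.throwException(); // malformed input such as a lone surrogate
        r = encoder.flush(bytes);
        if (!r.isUnderflow()) r.throwException();
        bytes.flip();
        writeVInt(out, bytes.remaining());                 // length is known before writing
        out.write(bytes.array(), 0, bytes.remaining());    // single bulk copy
    }

    /** Reads a VInt byte count, then decodes that many UTF-8 bytes via the reused buffers. */
    String readString(InputStream in) throws IOException {
        int byteLen = readVInt(in);
        if (bytes.capacity() < byteLen) bytes = ByteBuffer.allocate(byteLen);
        // UTF-8 never decodes to more Java chars than it has bytes.
        if (chars.capacity() < byteLen) chars = CharBuffer.allocate(byteLen);
        bytes.clear();
        int off = 0;
        while (off < byteLen) {
            int n = in.read(bytes.array(), off, byteLen - off);
            if (n < 0) throw new EOFException();
            off += n;
        }
        bytes.limit(byteLen);
        chars.clear();
        decoder.reset();
        CoderResult r = decoder.decode(bytes, chars, true);
        if (!r.isUnderflow()) r.throwException();
        r = decoder.flush(chars);
        if (!r.isUnderflow()) r.throwException();
        chars.flip();
        return chars.toString();
    }

    /** Variable-length int: 7 bits per byte, low-order group first, high bit means "more follows". */
    static void writeVInt(OutputStream out, int i) throws IOException {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
    }

    static int readVInt(InputStream in) throws IOException {
        int shift = 0, value = 0, b;
        do {
            b = in.read();
            if (b < 0) throw new EOFException();
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }
}
```

The trade-off Ken raises still applies: the encoded bytes sit in memory before the length can be written. The sketch only shows that this memory can be one long-lived buffer rather than a fresh array per string.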