From: Yonik Seeley
To: java-dev@lucene.apache.org
Date: Tue, 30 Aug 2005 12:27:42 -0400
Subject: Re: Lucene does NOT use UTF-8.

> How will the difference impact String memory allocations? Looking at the
> String code, I can't see where it would make an impact.

This is from Lucene InputStream:

    public final String readString() throws IOException {
      int length = readVInt();
      if (chars == null || length > chars.length)
        chars = new char[length];
      readChars(chars, 0, length);
      return new String(chars, 0, length);
    }

If you know the length in bytes, you still have to allocate that many chars
(even though the number of chars may be less than the number of bytes). Not
a big deal IMHO.

A bigger pain is on the writing side, where you can't stream things because
you don't know what the length is going to be (in either bytes *or* UTF-8
chars).

So it turns out that Java's 16 bit chars were just a waste... it's still a
multibyte format *and* it takes up more space. UTF-8 would have been nice -
no conversions necessary.

-Yonik
Now hiring -- http://tinyurl.com/7m67g
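
To illustrate the two points above (the char count can be smaller than the byte count, and a byte-length prefix cannot be written until the string has been encoded), here is a minimal sketch. It assumes standard UTF-8 via `StandardCharsets.UTF_8` rather than the modified UTF-8 that Lucene writes, and a fixed 4-byte prefix rather than a VInt. The class and method names (`LengthPrefixSketch`, `utf8Length`, `writeByteLengthPrefixed`) are hypothetical and not part of Lucene.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class LengthPrefixSketch {

    // Number of bytes that standard UTF-8 needs for s.
    // Assumption: standard UTF-8, not Lucene's modified UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Hypothetical writer: a 4-byte big-endian byte-length prefix followed by
    // the UTF-8 bytes. The whole string is encoded first, because the prefix
    // cannot be emitted until the encoded length is known.
    static byte[] writeByteLengthPrefixed(String s) {
        byte[] encoded = s.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(4 + encoded.length)
                .putInt(encoded.length)
                .put(encoded)
                .array();
    }

    public static void main(String[] args) {
        String s = "na\u00EFve \u20AC"; // "naive" with a diaeresis, a space, and a euro sign
        System.out.println(s.length());                        // 7 UTF-16 chars
        System.out.println(utf8Length(s));                     // 10 UTF-8 bytes
        System.out.println(writeByteLengthPrefixed(s).length); // 14 = 4-byte prefix + 10
    }
}
```

A reader that is handed the byte length (10 here) and allocates that many chars over-allocates slightly, since only 7 are needed. The writer has the harder job: it must either encode into a buffer first, as above, or make a separate pass over the string to compute the length.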