Date: Wed, 23 May 2007 00:29:16 +0200
From: Knut Anders Hatlen
Subject: Re: Modified UTF-8 or UTF-16 for temporary Clobs?
To: derby-dev@db.apache.org

Kristian Waagan writes:

> Hello,
>
> In my work on DERBY-2646, I have stumbled upon some issues that can
> greatly affect the performance of accessing Clobs, especially updating
> them.

[....]

> To summarize my view on this...
>
>
> Pros, UTF-8 : more space efficient for US-ASCII, same as used by store
> Pros, UTF-16: direct mapping between char/byte pos (easier logic)
>
> Cons, UTF-8 : requires "counting"/decoding to find byte position
> Cons, UTF-16: space overhead for US-ASCII, must be converted when/if
>               Clob goes back into the database
>
> I'm sure there are other aspects, and I would like some opinions and
> feedback on what to do. My two current alternatives on the table are
> using the naive counting technique, or changing to UTF-16. The former
> requires the least code changes.

Please correct me if I got it wrong, but based on what you wrote above,
it seems like we now have the following situation:

To allow updates of a Clob at random positions (that is, with
Clob.setString()), we create a copy of the Clob in a temporary file.
However, we need to read the temporary file sequentially from the
beginning for each operation in order to find the correct byte position.
So we only have sequential access to the file that is supposed to give
us random access to the Clob.

If the purpose of the temporary file is to give random access to the
Clob, then I definitely think UTF-16 is a better choice than UTF-8. I'm
not sure how important the space overhead for 7-bit ASCII is, as long as
there is zero or negative overhead for all non-ASCII characters.

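
To make the difference in lookup cost concrete, here is a minimal sketch
of the two ways of finding a byte position. It is not Derby code: the
class ClobOffsets and its methods are hypothetical names made up for this
example, it works on an in-memory byte array rather than a temporary
file, and it uses 0-based character positions (counted in Java chars)
instead of the 1-based positions of the Clob API.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical illustration only; none of these names exist in Derby.
public class ClobOffsets {

    /** Encodes a string as modified UTF-8 using DataOutputStream.writeUTF. */
    static byte[] encodeModifiedUtf8(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(s); // 2-byte length prefix, then modified UTF-8
            out.flush();
            byte[] all = buf.toByteArray();
            return Arrays.copyOfRange(all, 2, all.length); // drop the prefix
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Modified UTF-8: the byte offset of a char can only be found by
     * decoding every lead byte before it, so the cost grows with charPos.
     */
    static long utf8ByteOffset(byte[] encoded, long charPos) {
        int offset = 0;
        for (long seen = 0; seen < charPos; seen++) {
            if (offset >= encoded.length) {
                throw new IndexOutOfBoundsException("charPos past end of data");
            }
            int lead = encoded[offset] & 0xFF;
            if ((lead & 0x80) == 0x00) {
                offset += 1;          // 0xxxxxxx: U+0001..U+007F
            } else if ((lead & 0xE0) == 0xC0) {
                offset += 2;          // 110xxxxx: U+0080..U+07FF and U+0000
            } else if ((lead & 0xF0) == 0xE0) {
                offset += 3;          // 1110xxxx: U+0800..U+FFFF, incl. surrogates
            } else {
                throw new IllegalArgumentException("not modified UTF-8");
            }
        }
        return offset;
    }

    /** UTF-16: every Java char is exactly two bytes, so no scan is needed. */
    static long utf16ByteOffset(long charPos) {
        return 2L * charPos;
    }

    public static void main(String[] args) {
        byte[] encoded = encodeModifiedUtf8("ab\u20acc");
        System.out.println(utf8ByteOffset(encoded, 3)); // scans three chars
        System.out.println(utf16ByteOffset(3));         // constant time
    }
}
```

With a real temporary file, the first method corresponds to reading from
the start of the file for every operation, while the second could be
turned directly into a seek.
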
Space and performance considerations aside, the simpler relation between
byte positions and character positions in UTF-16 would probably make it
easier to write bug-free code. Since all chars are treated equally, we
wouldn't have to come up with a great number of tests covering all
possible combinations of single-byte chars and multi-byte chars, and it
would therefore be easier to gain confidence in the correctness of the
code.

-- 
Knut Anders