Return-Path: Delivered-To: apmail-db-derby-dev-archive@www.apache.org Received: (qmail 35263 invoked from network); 22 May 2007 23:03:53 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.2) by minotaur.apache.org with SMTP; 22 May 2007 23:03:53 -0000 Received: (qmail 65510 invoked by uid 500); 22 May 2007 23:03:58 -0000 Delivered-To: apmail-db-derby-dev-archive@db.apache.org Received: (qmail 65479 invoked by uid 500); 22 May 2007 23:03:58 -0000 Mailing-List: contact derby-dev-help@db.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: Delivered-To: mailing list derby-dev@db.apache.org Received: (qmail 65470 invoked by uid 99); 22 May 2007 23:03:58 -0000 Received: from herse.apache.org (HELO herse.apache.org) (140.211.11.133) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 22 May 2007 16:03:58 -0700 X-ASF-Spam-Status: No, hits=0.0 required=10.0 tests=UNPARSEABLE_RELAY X-Spam-Check-By: apache.org Received-SPF: pass (herse.apache.org: local policy) Received: from [192.18.1.36] (HELO gmp-ea-fw-1.sun.com) (192.18.1.36) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 22 May 2007 16:03:51 -0700 Received: from d1-emea-10.sun.com ([192.18.2.120]) by gmp-ea-fw-1.sun.com (8.13.6+Sun/8.12.9) with ESMTP id l4MN3TZ8014281 for ; Tue, 22 May 2007 23:03:29 GMT Received: from conversion-daemon.d1-emea-10.sun.com by d1-emea-10.sun.com (Sun Java System Messaging Server 6.2-6.01 (built Apr 3 2006)) id <0JIG00M01T9DT000@d1-emea-10.sun.com> (original mail from Kristian.Waagan@Sun.COM) for derby-dev@db.apache.org; Wed, 23 May 2007 00:03:29 +0100 (BST) Received: from [129.159.112.188] by d1-emea-10.sun.com (Sun Java System Messaging Server 6.2-6.01 (built Apr 3 2006)) with ESMTPSA id <0JIG001C3TDTEQH5@d1-emea-10.sun.com> for derby-dev@db.apache.org; Wed, 23 May 2007 00:03:29 +0100 (BST) Date: Wed, 23 May 2007 01:03:29 +0200 From: Kristian Waagan Subject: Re: Modified UTF-8 or UTF-16 for temporary Clobs? In-reply-to: Sender: Kristian.Waagan@Sun.COM To: derby-dev@db.apache.org Message-id: <465376C1.2060701@Sun.com> Organization: Sun Microsystems Inc. MIME-version: 1.0 Content-type: text/plain; format=flowed; charset=ISO-8859-1 Content-transfer-encoding: 7BIT References: <46533864.8030706@Sun.com> User-Agent: Mozilla/5.0 (X11; U; SunOS i86pc; no-NO; rv:1.8.1.3) Gecko/20070419 Thunderbird/2.0.0.0 Mnenhy/0.7.5.0 X-Virus-Checked: Checked by ClamAV on apache.org Knut Anders Hatlen wrote: > Kristian Waagan writes: > >> Hello, >> >> In my work on DERBY-2646, I have stumbled upon some issues that can >> greatly affect the performance of accessing Clobs, especially updating >> them. > > [....] > >> To summarize my view on this... >> >> >> Pros, UTF-8 : more space efficient for US-ASCII, same as used by store >> Pros, UTF-16: direct mapping between char/byte pos (easier logic) >> >> Cons, UTF-8 : requires "counting"/decoding to find byte position >> Cons, UTF-16: space overhead for US-ASCII, must be converted when/if >> Clob goes back into the database >> >> I'm sure there are other aspects, and I would like some opinions and >> feedback on what to do. My two current alternatives on the table are >> using the naive counting technique, or changing to UTF-16. The former >> requires the least code changes. > > Please correct me if I got it wrong, but based on what you wrote above, > it seems like we now have the following situation: > > To allow updates of a Clob at random positions (that is, with > Clob.setString()), we create a copy of the Clob in a temporary > file. However, we need to read the temporary file sequentially from the > beginning for each operation in order to find the correct byte > position. So we only have sequential access to the file that is supposed > to give us random access to the Clob. > Your description of the current situation is generally correct, but not quite accurate. We don't start at the beginning of the temporary file when trying to find the byte position for a character position. The method doing the lookup is supplied with a "position hint", which is a byte position. However, this will not always work and it is a bug in the current implementation. Even though we are able to locate *the next* character after the hint position, and thus its byte position, we do now know which character position it has. This bug was causing the UTFDataFormatException that has been observed when using characters that are encoded with more than one byte in UTF-8. To me it seems like the method is not used the way it was intended to be used. Looking more at it, it seems the charPos is relative to the byte/hint position. I will be rewriting it anyway, and it is no secret that working with UTF-8 is more complex than UTF-16 when it comes to mapping character positions to byte positions. -- Kristian [ snip - argumentation for using UTF-16 ]