Mailing-List: contact derby-dev-help@db.apache.org; run by ezmlm
Precedence: bulk
Reply-To: <derby-dev@db.apache.org>
Received-SPF: pass (herse.apache.org: local policy)
Date: Wed, 23 May 2007 01:03:29 +0200
From: Kristian Waagan <Kristian.Waagan@Sun.COM>
Subject: Re: Modified UTF-8 or UTF-16 for temporary Clobs?
In-reply-to: <x7lkfgh4qr.fsf@Sun.COM>
Sender: Kristian.Waagan@Sun.COM
To: derby-dev@db.apache.org
Message-id: <465376C1.2060701@Sun.com>
Organization: Sun Microsystems Inc.
MIME-version: 1.0
Content-type: text/plain; format=flowed; charset=ISO-8859-1
Content-transfer-encoding: 7BIT
References: <46533864.8030706@Sun.com> <x7lkfgh4qr.fsf@Sun.COM>
User-Agent: Mozilla/5.0 (X11; U; SunOS i86pc; no-NO;
 rv:1.8.1.3) Gecko/20070419
 Thunderbird/2.0.0.0 Mnenhy/0.7.5.0

Knut Anders Hatlen wrote:
> Kristian Waagan <Kristian.Waagan@Sun.COM> writes:
> 
>> Hello,
>>
>> In my work on DERBY-2646, I have stumbled upon some issues that can
>> greatly affect the performance of accessing Clobs, especially updating
>> them.
> 
> [....]
> 
>> To summarize my view on this...
>>
>>
>> Pros, UTF-8 : more space efficient for US-ASCII, same as used by store
>> Pros, UTF-16: direct mapping between char/byte pos (easier logic)
>>
>> Cons, UTF-8 : requires "counting"/decoding to find byte position
>> Cons, UTF-16: space overhead for US-ASCII, must be converted when/if
>> Clob goes back into the database
>>
>> I'm sure there are other aspects, and I would like some opinions and
>> feedback on what to do. My two current alternatives on the table are
>> using the naive counting technique, or changing to UTF-16. The former
>> requires the least code changes.
> 
> Please correct me if I got it wrong, but based on what you wrote above,
> it seems like we now have the following situation:
> 
> To allow updates of a Clob at random positions (that is, with
> Clob.setString()), we create a copy of the Clob in a temporary
> file. However, we need to read the temporary file sequentially from the
> beginning for each operation in order to find the correct byte
> position. So we only have sequential access to the file that is supposed
> to give us random access to the Clob.
> 

Your description of the current situation is generally correct, but not 
quite accurate. We don't start at the beginning of the temporary file 
when trying to find the byte position for a character position.
The method doing the lookup is supplied with a "position hint", which is 
a byte position. However, this will not always work and it is a bug in 
the current implementation.

Even though we are able to locate *the next* character after the hint 
position, and thus its byte position, we do now know which character 
position it has.

This bug was causing the UTFDataFormatException that has been observed 
when using characters that are encoded with more than one byte in UTF-8.


To me it seems like the method is not used the way it was intended to be 
used. Looking more at it, it seems the charPos is relative to the 
byte/hint position. I will be rewriting it anyway, and it is no secret 
that working with UTF-8 is more complex than UTF-16 when it comes to 
mapping character positions to byte positions.


-- 
Kristian

[ snip - argumentation for using UTF-16 ]