db-derby-dev mailing list archives

From Kristian Waagan <Kristian.Waa...@Sun.COM>
Subject Re: Modified UTF-8 or UTF-16 for temporary Clobs?
Date Tue, 22 May 2007 23:06:33 GMT
Kristian Waagan wrote:
> Knut Anders Hatlen wrote:
>> Kristian Waagan <Kristian.Waagan@Sun.COM> writes:
>>> Hello,
>>> In my work on DERBY-2646, I have stumbled upon some issues that can
>>> greatly affect the performance of accessing Clobs, especially updating
>>> them.
>> [....]
>>> To summarize my view on this...
>>> Pros, UTF-8 : more space efficient for US-ASCII, same as used by store
>>> Pros, UTF-16: direct mapping between char/byte pos (easier logic)
>>> Cons, UTF-8 : requires "counting"/decoding to find byte position
>>> Cons, UTF-16: space overhead for US-ASCII, must be converted when/if
>>> Clob goes back into the database
>>> I'm sure there are other aspects, and I would like some opinions and
>>> feedback on what to do. My two current alternatives on the table are
>>> using the naive counting technique, or changing to UTF-16. The former
>>> requires the least code changes.
>> Please correct me if I got it wrong, but based on what you wrote above,
>> it seems like we now have the following situation:
>> To allow updates of a Clob at random positions (that is, with
>> Clob.setString()), we create a copy of the Clob in a temporary
>> file. However, we need to read the temporary file sequentially from the
>> beginning for each operation in order to find the correct byte
>> position. So we only have sequential access to the file that is supposed
>> to give us random access to the Clob.
> Your description of the current situation is generally correct, but not 
> quite accurate. We don't start at the beginning of the temporary file 
> when trying to find the byte position for a character position.
> The method doing the lookup is supplied with a "position hint", which is 
> a byte position. However, this will not always work and it is a bug in 
> the current implementation.
> Even though we are able to locate *the next* character after the hint 
> position, and thus its byte position, we do now know which character 
> position it has.

Doh, too late...

It should be "we do *not* know".

Sorry for the typo/mindslip and the noise.

> This bug was causing the UTFDataFormatException that has been observed 
> when using characters that are encoded with more than one byte in UTF-8.
> To me it seems like the method is not used the way it was intended to be 
> used. Looking more at it, it seems the charPos is relative to the 
> byte/hint position. I will be rewriting it anyway, and it is no secret 
> that working with UTF-8 is more complex than UTF-16 when it comes to 
> mapping character positions to byte positions.
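
To make the difference in position mapping concrete, here is a minimal sketch in Java. It is not the Derby implementation: the class and method names are hypothetical, and it assumes the temporary data is a plain byte array rather than a file. It shows three things: the UTF-16 mapping is a multiplication, the modified UTF-8 mapping needs the naive counting described above, and a byte "hint" can be moved forward to the next character boundary without telling us which character position that boundary corresponds to.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical illustration only; none of these names come from Derby.
final class ClobPositionSketch {

    /** Encodes s as modified UTF-8, dropping the 2-byte length prefix that writeUTF emits. */
    static byte[] toModifiedUtf8(String s) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            new DataOutputStream(bytes).writeUTF(s);
            byte[] all = bytes.toByteArray();
            return Arrays.copyOfRange(all, 2, all.length);
        } catch (IOException e) {
            // writeUTF rejects strings whose encoding exceeds 65535 bytes.
            throw new UncheckedIOException(e);
        }
    }

    /** UTF-16: every Java char occupies exactly two bytes, so no decoding is needed. */
    static long utf16BytePos(long charPos) {
        return charPos * 2;
    }

    /** Number of bytes in the modified UTF-8 sequence that starts with the given lead byte. */
    static int encodedLength(byte lead) {
        int b = lead & 0xFF;
        if (b < 0x80) return 1;            // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
        throw new IllegalArgumentException("not at a character boundary");
    }

    /** Modified UTF-8, naive counting: walk from byte 0, skipping charPos characters. */
    static int utf8BytePos(byte[] data, int charPos) {
        int bytePos = 0;
        for (int i = 0; i < charPos; i++) {
            bytePos += encodedLength(data[bytePos]);
        }
        return bytePos;
    }

    /**
     * Moves a byte hint forward to the next character boundary by skipping
     * continuation bytes (10xxxxxx). The result is a valid byte position,
     * but nothing here says which character position it has.
     */
    static int nextCharStart(byte[] data, int hint) {
        int pos = hint;
        while (pos < data.length && (data[pos] & 0xC0) == 0x80) {
            pos++;
        }
        return pos;
    }

    public static void main(String[] args) {
        // 'a' (1 byte), U+00E6 (2 bytes), U+20AC (3 bytes), 'b' (1 byte)
        byte[] data = toModifiedUtf8("a\u00E6\u20ACb");
        System.out.println("UTF-8 byte pos of char 2:  " + utf8BytePos(data, 2));
        System.out.println("UTF-16 byte pos of char 2: " + utf16BytePos(2));
        System.out.println("Boundary at or after byte 4: " + nextCharStart(data, 4));
    }
}
```

For the four-character sample string, character 2 starts at byte 3 in modified UTF-8 but at byte 4 in UTF-16. Starting from byte hint 4, the next boundary is byte 6, and only counting from a known position reveals that this is character 3.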
