db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Kristian Waagan <Kristian.Waa...@Sun.COM>
Subject Modified UTF-8 or UTF-16 for temporary Clobs?
Date Tue, 22 May 2007 18:37:24 GMT

In my work on DERBY-2646, I have stumbled upon some issues that can 
greatly affect the performance of accessing Clobs, especially updating them.

Currently Clobs are stored on disk in the modified UTF-8 encoding. This 
uses one to three bytes to represent a single character. Since the 
number of bytes per character varies, there is no easy way to calculate 
the byte position from the character position, or vice versa. The naive, 
and maybe even the only feasible way, is to start decoding the bytes 
from the start of the Clob.

Note that the storage I speak of is the temporary storage of Clob 
copies. This is initiated when the user attempts to modify the Clob. I 
am not considering the case where the Clob is stored in the database itself.

Obviously, reading the Clob from the start every time you need to 
reposition is not very efficient. One optimization is to keep track of 
the "current position", but it might not help that much (depending on 
access pattern). This requires full knowledge about update actions, 
including on the various streams/writers.
Another option is storing the Clob in UTF-16. This would allow direct 
mapping between byte and character positions, as far as I have 
understood (I had brief contact with Dag and Bernt offline), even in the 
case of surrogate pairs.

However, using UTF-16 imposes space overhead when operating on Clobs 
with US-ASCII characters, in fact the overhead is 100% (each character 
represented by two bytes instead of one). For some other languages 
(and/or character sets), using UTF-16 reduces the space requirements 
(two versus three bytes).

To summarize my view on this...

Pros, UTF-8 : more space efficient for US-ASCII, same as used by store
Pros, UTF-16: direct mapping between char/byte pos (easier logic)

Cons, UTF-8 : requires "counting"/decoding to find byte position
Cons, UTF-16: space overhead for US-ASCII, must be converted when/if 
Clob goes back into the database

I'm sure there are other aspects, and I would like some opinions and 
feedback on what to do. My two current alternatives on the table are 
using the naive counting technique, or changing to UTF-16. The former 
requires the least code changes.

To bound the scope of potential changes, I do plan to get this done for 


View raw message