db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mike Matrigali <mikem_...@sbcglobal.net>
Subject Re: Modified UTF-8 or UTF-16 for temporary Clobs?
Date Tue, 22 May 2007 23:12:35 GMT


Kristian Waagan wrote:
> Hello,
> 
> In my work on DERBY-2646, I have stumbled upon some issues that can 
> greatly affect the performance of accessing Clobs, especially updating 
> them.
> 
> Currently Clobs are stored on disk in the modified UTF-8 encoding. This 
> uses one to three bytes to represent a single character. Since the 
> number of bytes per character varies, there is no easy way to calculate 
> the byte position from the character position, or vice versa. The naive, 
> and maybe even the only feasible way, is to start decoding the bytes 
> from the start of the Clob.
> 
> Note that the storage I speak of is the temporary storage of Clob 
> copies. This is initiated when the user attempts to modify the Clob. I 
> am not considering the case where the Clob is stored in the database 
> itself.
> 
> Obviously, reading the Clob from the start every time you need to 
> reposition is not very efficient. One optimization is to keep track of 
> the "current position", but it might not help that much (depending on 
> access pattern). This requires full knowledge about update actions, 
> including on the various streams/writers.
> Another option is storing the Clob in UTF-16. This would allow direct 
> mapping between byte and character positions, as far as I have 
> understood (I had brief contact with Dag and Bernt offline), even in the 
> case of surrogate pairs.
> 
> However, using UTF-16 imposes space overhead when operating on Clobs 
> with US-ASCII characters, in fact the overhead is 100% (each character 
> represented by two bytes instead of one). For some other languages 
> (and/or character sets), using UTF-16 reduces the space requirements 
> (two versus three bytes).
> 
> 
> 
> To summarize my view on this...
> 
> 
> Pros, UTF-8 : more space efficient for US-ASCII, same as used by store
> Pros, UTF-16: direct mapping between char/byte pos (easier logic)
> 
> Cons, UTF-8 : requires "counting"/decoding to find byte position
> Cons, UTF-16: space overhead for US-ASCII, must be converted when/if 
> Clob goes back into the database
Can you describe more in what situations you are proposing to use UTF-16
vs. UTF-8.  I know that there is a lot of performance overhead in
converting from one to the other, and I know in the past Derby had often
even converted back and forth through bad code.  Are the changes you
are proposing going to affect the non-update case.

It would be nice if the following happens:
1) INSERT
    From whatever input (stream, clob, string ...) we convert it once
    to the modified UTF-8 format that store uses on disk.  In the case
    of stream we should read it only once and never flow it to object
    or disk before getting it into store.

2) SELECT
    Should be converted once from modified utf-8 into whatever format
    is requested by the select with no intermediate object or disk
    copies.


What is the expected usage pattern for an update on a clob that uses
these "temporary" clobs?  What is the usual input format, what is the
usual output format?  Do you expect more than one update usually?  Does
an update have to rewrite the end of the file on and shrink or expand of
a middle of the clob?
> 
> I'm sure there are other aspects, and I would like some opinions and 
> feedback on what to do. My two current alternatives on the table are 
> using the naive counting technique, or changing to UTF-16. The former 
> requires the least code changes.
> 
> 
> To bound the scope of potential changes, I do plan to get this done for 
> 10.3...
> 
> 
> 
> thanks,


Mime
View raw message