db-derby-dev mailing list archives

From Kristian Waagan <Kristian.Waa...@Sun.COM>
Subject Re: Modified UTF-8 or UTF-16 for temporary Clobs?
Date Wed, 23 May 2007 08:13:01 GMT
Mike Matrigali wrote:
> Kristian Waagan wrote:
>> Hello,
>> In my work on DERBY-2646, I have stumbled upon some issues that can 
>> greatly affect the performance of accessing Clobs, especially updating 
>> them.
>> Currently Clobs are stored on disk in the modified UTF-8 encoding. 
>> This uses one to three bytes to represent a single character. Since 
>> the number of bytes per character varies, there is no easy way to 
>> calculate the byte position from the character position, or vice 
>> versa. The naive, and maybe even the only feasible way, is to start 
>> decoding the bytes from the start of the Clob.
>> Note that the storage I speak of is the temporary storage of Clob 
>> copies. This is initiated when the user attempts to modify the Clob. I 
>> am not considering the case where the Clob is stored in the database 
>> itself.
>> Obviously, reading the Clob from the start every time you need to 
>> reposition is not very efficient. One optimization is to keep track of 
>> the "current position", but it might not help that much (depending on 
>> access pattern). This requires full knowledge about update actions, 
>> including on the various streams/writers.
>> Another option is storing the Clob in UTF-16. This would allow direct 
>> mapping between byte and character positions, as far as I have 
>> understood (I had brief contact with Dag and Bernt offline), even in 
>> the case of surrogate pairs.
>> However, using UTF-16 imposes space overhead when operating on Clobs 
>> with US-ASCII characters, in fact the overhead is 100% (each character 
>> represented by two bytes instead of one). For some other languages 
>> (and/or character sets), using UTF-16 reduces the space requirements 
>> (two versus three bytes).
>> To summarize my view on this...
>> Pros, UTF-8 : more space efficient for US-ASCII, same as used by store
>> Pros, UTF-16: direct mapping between char/byte pos (easier logic)
>> Cons, UTF-8 : requires "counting"/decoding to find byte position
>> Cons, UTF-16: space overhead for US-ASCII, must be converted when/if 
>> Clob goes back into the database
> Can you describe more in what situations you are proposing to use UTF-16
> vs. UTF-8.  I know that there is a lot of performance overhead in
> converting from one to the other, and I know in the past Derby had often
> even converted back and forth through bad code.  Are the changes you
> are proposing going to affect the non-update case?

Hi Mike,

Briefly stated, the encoding issue comes into play when the first update 
to the clob is issued. After that, everything goes through the temporary 
copy, which resides in memory or on disk depending on its size.
Non-update cases are not affected.
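To make the counting problem concrete, here is a minimal sketch of mapping a 1-based character position to a byte offset under the modified UTF-8 length rules. The class and method names are mine for illustration, not Derby's actual code:

```java
public class Utf8Offset {
    // Bytes used by one char in Java's modified UTF-8 encoding:
    // \u0001-\u007F take 1 byte, \u0000 and \u0080-\u07FF take 2, the rest 3.
    static int byteLen(char c) {
        if (c >= '\u0001' && c <= '\u007F') return 1;
        if (c <= '\u07FF') return 2;
        return 3;
    }

    // Byte offset of the 1-based character position 'pos'. Because the
    // per-char byte length varies, we must scan from the start of the clob.
    static long byteOffset(String s, int pos) {
        long off = 0;
        for (int i = 0; i < pos - 1; i++) {
            off += byteLen(s.charAt(i));
        }
        return off;
    }

    public static void main(String[] args) {
        String s = "abc\u00e6\u00f8\u20ac"; // 1-, 2- and 3-byte characters
        System.out.println(byteOffset(s, 4)); // 3 (after "abc")
        System.out.println(byteOffset(s, 6)); // 7 (1+1+1 + 2+2)
    }
}
```

With UTF-16 the same mapping would be direct, 2 * (pos - 1), even in the presence of surrogate pairs, since Java character positions count UTF-16 code units.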

If the user never issues a modification operation (setString, 
setCharacterStream, setAsciiStream, truncate), only streams from store 
will be used to fetch the required data.

> It would be nice if the following happens:
>    From whatever input (stream, clob, string ...) we convert it once
>    to the modified UTF-8 format that store uses on disk.  In the case
>    of stream we should read it only once and never flow it to object
>    or disk before getting it into store.

For insertions of new clobs (or other appropriate data types) through 
PreparedStatement, this does happen - although I haven't checked if 
using setAsciiStream causes a byte-char-byte conversion or not.

Note that there are two different types of setCharacterStream methods;
  A) PreparedStatement.setCharacterStream(column, Reader)
  B) Writer writer = Clob.setCharacterStream(pos)

Using B, a temporary clob will be created and the contents will spill to 
disk when the size threshold is reached.

>    Should be converted once from modified utf-8 into whatever format
>    is requested by the select with no intermediate object or disk
>    copies.

I believe this is also the current state. Again, a temporary clob is 
used only after an update.

> What is the expected usage pattern for an update on a clob that uses
> these "temporary" clobs?

Use of temporary clobs is triggered by updates to the clob. If clobs 
are updated, the number of updates per transaction is entirely 
application dependent.

> What is the usual input format, what is the
> usual output format?  

The input formats are those of String, Reader and InputStream.
The output formats are those of String, Writer and OutputStream.

> Do you expect more than one update usually?  

I don't know...

> Does an update have to rewrite the end of the file on a shrink or
> expand of the middle of the clob?

Expansion and shrinking are certainly necessary when using UTF-8.
When you do a setString(pos, str), you basically overwrite a range of 
existing characters with the characters in the insertion string. If the 
characters replaced are not represented with the same number of bytes as 
the inserted characters, the byte array on disk (or in memory) must be 
expanded or shrunk accordingly.
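A small sketch of the byte-length arithmetic involved (the class name is illustrative; standard UTF-8 via getBytes is used for simplicity, which differs from modified UTF-8 only for NUL and supplementary characters):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Delta {
    // Byte-length change when the characters 'replaced' are overwritten by
    // 'inserted' in a UTF-8 encoded copy. A non-zero result means the byte
    // array on disk (or in memory) must grow or shrink by that amount.
    static int delta(String replaced, String inserted) {
        return inserted.getBytes(StandardCharsets.UTF_8).length
             - replaced.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Same character count, different byte count: must expand by 2 bytes.
        System.out.println(delta("ab", "\u00e6\u00f8")); // 2
        // Pure ASCII over pure ASCII: no resize needed.
        System.out.println(delta("ab", "cd")); // 0
    }
}
```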

If we had used an encoding with a fixed number of bytes per char in 
store, we could have gotten away with skipping to the right position and 
then just streaming the new value into store.
As I see it, this is not possible when using UTF-8 in store.

There is no functionality for inserting a new string without overwriting 
existing content, but appending to the value is possible. You can also 
truncate the clob.

Last, calling Connection.createClob() will always result in a temporary 
clob being created.
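Since Connection.createClob() needs a live connection, the JDK's javax.sql.rowset.serial.SerialClob can serve as a rough stand-in to illustrate the positional overwrite and truncate operations described above; it is of course not Derby's temporary clob implementation:

```java
import javax.sql.rowset.serial.SerialClob;

public class ClobDemo {
    public static void main(String[] args) throws Exception {
        // SerialClob used purely as a stand-in Clob implementation (no DB).
        SerialClob clob = new SerialClob("Hello World".toCharArray());

        // setString(pos, str) overwrites a range of existing characters,
        // starting at the 1-based position 'pos'.
        clob.setString(7, "Derby");
        System.out.println(clob.getSubString(1, (int) clob.length()));

        // truncate(len) discards everything after the first 'len' characters.
        clob.truncate(5);
        System.out.println(clob.getSubString(1, (int) clob.length()));
    }
}
```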


>> I'm sure there are other aspects, and I would like some opinions and 
>> feedback on what to do. My two current alternatives on the table are 
>> using the naive counting technique, or changing to UTF-16. The former 
>> requires the least code changes.
>> To bound the scope of potential changes, I do plan to get this done 
>> for 10.3...
>> thanks,
