Hello,
In my work on DERBY-2646, I have stumbled upon some issues that can
greatly affect the performance of accessing Clobs, especially updating them.
Currently Clobs are stored on disk in the modified UTF-8 encoding. This
uses one to three bytes to represent a single character. Since the
number of bytes per character varies, there is no easy way to calculate
the byte position from the character position, or vice versa. The naive,
and maybe even the only feasible way, is to start decoding the bytes
from the start of the Clob.
Note that the storage I speak of is the temporary storage of Clob
copies. This is initiated when the user attempts to modify the Clob. I
am not considering the case where the Clob is stored in the database itself.
Obviously, reading the Clob from the start every time you need to
reposition is not very efficient. One optimization is to keep track of
the "current position", but it might not help that much (depending on
access pattern). This requires full knowledge about update actions,
including on the various streams/writers.
Another option is storing the Clob in UTF-16. This would allow direct
mapping between byte and character positions, as far as I have
understood (I had brief contact with Dag and Bernt offline), even in the
case of surrogate pairs.
However, using UTF-16 imposes space overhead when operating on Clobs
with US-ASCII characters, in fact the overhead is 100% (each character
represented by two bytes instead of one). For some other languages
(and/or character sets), using UTF-16 reduces the space requirements
(two versus three bytes).
To summarize my view on this...
Pros, UTF-8 : more space efficient for US-ASCII, same as used by store
Pros, UTF-16: direct mapping between char/byte pos (easier logic)
Cons, UTF-8 : requires "counting"/decoding to find byte position
Cons, UTF-16: space overhead for US-ASCII, must be converted when/if
Clob goes back into the database
I'm sure there are other aspects, and I would like some opinions and
feedback on what to do. My two current alternatives on the table are
using the naive counting technique, or changing to UTF-16. The former
requires the least code changes.
To bound the scope of potential changes, I do plan to get this done for
10.3...
thanks,
--
Kristian
|