Date: Wed, 23 May 2007 00:29:16 +0200
From: Knut Anders Hatlen <Knut.Hatlen@Sun.COM>
Subject: Re: Modified UTF-8 or UTF-16 for temporary Clobs?
To: derby-dev@db.apache.org

Kristian Waagan <Kristian.Waagan@Sun.COM> writes:

> Hello,
>
> In my work on DERBY-2646, I have stumbled upon some issues that can
> greatly affect the performance of accessing Clobs, especially updating
> them.

[....]

> To summarize my view on this...
>
>
> Pros, UTF-8 : more space efficient for US-ASCII, same as used by store
> Pros, UTF-16: direct mapping between char/byte pos (easier logic)
>
> Cons, UTF-8 : requires "counting"/decoding to find byte position
> Cons, UTF-16: space overhead for US-ASCII, must be converted when/if
> Clob goes back into the database
>
> I'm sure there are other aspects, and I would like some opinions and
> feedback on what to do. My two current alternatives on the table are
> using the naive counting technique, or changing to UTF-16. The former
> requires the least code changes.

Please correct me if I got it wrong, but based on what you wrote above,
it seems like we now have the following situation:

To allow updates of a Clob at random positions (that is, with
Clob.setString()), we create a copy of the Clob in a temporary
file. However, we need to read the temporary file sequentially from the
beginning for each operation in order to find the correct byte
position. So we only have sequential access to the file that is supposed
to give us random access to the Clob.
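
To illustrate the difference, here is a minimal sketch (not Derby code;
the class and method names are made up for this example, and it assumes
the file holds Java's modified UTF-8 with no header bytes). Positions
are 0-based here, whereas the JDBC Clob methods use 1-based positions.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical helper, for illustration only.
final class ClobOffsetSketch {

    /** Encodes a short string as modified UTF-8 (length prefix stripped). */
    static byte[] encode(String s) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            out.writeUTF(s); // writes a 2-byte length, then modified UTF-8
            byte[] all = bos.toByteArray();
            byte[] body = new byte[all.length - 2];
            System.arraycopy(all, 2, body, 0, body.length);
            return body;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    /**
     * Modified UTF-8: the byte offset of a char can only be found by
     * walking the data from the start, one encoded char at a time.
     */
    static long utf8ByteOffset(byte[] data, long charPos) {
        int bytePos = 0;
        for (long c = 0; c < charPos; c++) {
            int b = data[bytePos] & 0xFF;
            if (b < 0x80) {
                bytePos += 1;           // 0xxxxxxx: one-byte char
            } else if ((b & 0xE0) == 0xC0) {
                bytePos += 2;           // 110xxxxx: two-byte char
            } else {
                bytePos += 3;           // 1110xxxx: three-byte char
            }
        }
        return bytePos;
    }

    /** UTF-16: every Java char is two bytes, so the offset is computed. */
    static long utf16ByteOffset(long charPos) {
        return charPos * 2;
    }
}
```

The first method does work proportional to the char position on every
call; the second is a constant-time multiplication.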

If the purpose of the temporary file is to give random access to the
Clob, then I definitely think UTF-16 is a better choice than UTF-8. I'm
not sure how important the space overhead for 7-bit ASCII is, as long as
UTF-16 needs the same number of bytes as UTF-8, or fewer, for all
non-ASCII characters.
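
A rough way to compare the two sizes for a given string, assuming the
byte counts of Java's modified UTF-8 (1, 2 or 3 bytes per char, with
U+0000 taking 2 bytes). The class name is hypothetical and this is only
a sketch, not Derby code.

```java
// Hypothetical helper, for illustration only.
final class ClobSizeSketch {

    /** Number of bytes the string occupies in modified UTF-8. */
    static int modifiedUtf8Length(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x0001 && c <= 0x007F) {
                n += 1;     // 7-bit ASCII except NUL
            } else if (c <= 0x07FF) {
                n += 2;     // U+0000 and U+0080..U+07FF
            } else {
                n += 3;     // everything else, including each surrogate
            }
        }
        return n;
    }

    /** Number of bytes the string occupies in UTF-16 (no byte order mark). */
    static int utf16Length(String s) {
        return 2 * s.length();
    }
}
```

For "abc" this gives 3 bytes against 6; for a single char in the range
U+0080..U+07FF both give 2; for a char at U+0800 or above it gives 3
against 2.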

Space and performance considerations aside, the simpler relation between
byte positions and character positions in UTF-16 would probably make it
easier to write bug-free code. Since every char occupies exactly two
bytes, we wouldn't need a large number of tests covering all possible
combinations of single-byte and multi-byte chars, and it would
therefore be easier to gain confidence in the correctness of the code.

-- 
Knut Anders