db-derby-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Kristian Waagan <Kristian.Waa...@Sun.COM>
Subject Re: Overhead for varchar and clob
Date Tue, 30 Jun 2009 10:55:49 GMT
Donald McLean wrote:
> I'm working on something where I need to store strings of
> indeterminate length and I'm curious about the difference in storage
> costs for varchar vs clob. I'm not going to make my decision based on
> the difference - disk space is too cheap for it to be that big of a
> deal, but I don't like making decisions without better information.
> If some kind soul who is familiar with the inner workings of these two
> types could talk about them a bit, I would be grateful.

Hi Donald,

I'm giving you some high-level information here, as I don't know the 
details regarding overhead of overflow pages etc down in the store.

If we ignore the store page layout metadata (others may know the 
details), a cost which you may have to pay for both types depending on 
the length, we have:
 a) a content header (mandatory)
 b) the encoded content
 c) a content end marker (depends on a)

For VARCHAR and CLOB, the content encoding is the same; modified UTF-8. 
This means that depending on your content, each character requires one 
to three bytes (six bytes for some rather special/exotic characters - 
surrogate pairs). For instance, ASCII requires one byte per character, 
whereas the CJK (Chinese Japanese Korean) characters require three bytes 
per character.

The content header differs somewhat between the two types. VARCHAR uses 
a two byte header, where as CLOB uses a five byte header from 10.5. If 
the value is non-NULL, the header is always present and specifies a 
length for the value (either in bytes or characters). Note that the 
length may be unknown or too big to represent in the header, in which 
case a content end marker is required.

The content end marker is a byte sequence of three bytes, used to mark 
the end of the value in cases where Derby doesn't know the length of the 
value or the length is too big to represent in the header. The former is 
currently the case when the "lengthless overrides" of JDBC 4 are used to 
insert data into the database. These methods are convenience methods, 
and the JDBC / Java SE JavaDoc states that in some cases they may be 
less effective than the methods taking a length argument.

Other relevant questions regarding the choice of data type may be:
 - How often are the fields used in "everyday processing"?
 - What's the ratio of values fitting inside a VARCHAR versus those that 
have to be stored as a CLOB?
 - When the values are used, how are they used?

Finally, as I wrote in an earlier post in a different thread, the code 
for the CLOB data type is more complex than for VARCHAR. If you know you 
can manage with VARCHAR, I would recommend using it. If you need to 
store large(r) amounts of text, and typically access them though 
streams, then use CLOB. Using CLOBs also require a little extra care 
with respect to transactions / auto-commit, as the life-time of a CLOB 
is until the end of the transaction.

As an indication of the performance penalty for small CLOBs (100 
characters in this case), the single record select tests (found in the 
code repository) show a normalized throughput of 1.00 when using VARCHAR 
and a throughput of around 0.77 when using CLOB [1].
FYI, the difference was a lot larger in 10.4.

I can't give you data on how the page utilization will be affected.

Hope this helps,

[1] The load was CPU bound.

> Donald
> -------------------------------------------------------------
> Family photographs are a critical legacy for
> ourselves and our descendants. Protect that
> legacy with a digital backup and recovery plan.

View raw message