db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Kristian Waagan <Kristian.Waa...@Sun.COM>
Subject Storing length information for CLOB on disk
Date Thu, 28 Aug 2008 09:58:10 GMT

It is apparent that we would benefit from extending the length 
information stored about CLOBs in the Derby on-disk format.

Currently we set aside two bytes at the head of the data stream for 
storing the byte length of the CLOB. This is not sufficient, and can 
only be used for small CLOBs. For larger CLOBs we have to read through 
all the data to find the length.

I see three major steps in this process:
  a) Determine which information we want to store.
  b) Determine how and when the information is obtained and stored.
  c) Specify the actual on-disk format.

Regarding a), we have a few options:
  1) Store byte length only.
  2) Store character length only.
  3) Store both byte length and character length.
  4) Could also add distribution data for 1, 2, and 3 byte representations?

 From a JDBC API point of view, the character length is the most 
relevant one. Internally, Derby might benefit from knowing the byte 
length as well, since it can't be easily calculated from the character 
length. The reason for this is that we encode data using the modified 
UTF-8 format, where a single character can take up from one to three bytes.

Note that there are three aspects of finding the character length from 
the raw bytes:
  - reading all the raw bytes
  - decoding the raw bytes
  - potential page cache "pollution"

The length often has to be obtained to validate user input (positions 
and lengths), which means Derby has to do a lot of unnecessary work.

I'm bringing up b) because JDBC has methods for inserting data without 
specifying the length. Throwing out some ideas:
  - obtain required information on the fly, update immediately after 
insert is done
  - insert data without length information, then
     * update on first subsequent request
     * use a background task to update information
     * add a "maintenance routine" for updating the information

I'm asking people to share their opinions on this matter, and I'll try 
to answer any questions you might have to take the discussion forwards.


View raw message