db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mike Matrigali <mikem_...@sbcglobal.net>
Subject Re: Derby architecture/design documents
Date Tue, 01 Feb 2005 19:21:12 GMT
answers/comments embedded below


Dibyendu Majumdar wrote:

> Jean,
> 
> Attached is an initial document on Derby page formats. I used forrest 0.6,
> so it should be easy to integrate.
> 
> Is it possible to put this up without publishing it - so that others can
> review and comment on it?
> 
> BTW, it is not complete ....
> 
> Regards
> 
> Dibyendu
> 
> 
> ------------------------------------------------------------------------
> 
> <?xml version="1.0"?>
> <!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd">
> <document> 
>   <header> 
>     <title>Derby On Disk Page Format</title>
>     <abstract>This document describes the storage format of Derby disk pages. 
>     </abstract>
>   </header>
>   <body>
>     <section id="introduction"> 
>       <title> Inroduction </title>
>       <p>Derby stores table and index data in Containers, which currently map 
>         to files in the 
>         <code>seg0</code>
>         directory of the database. Data is stored in pages within the container.</p>
>       <fixme author="Dibyendu Majumdar"> Do all containers map to a single file,
or does each container map 
>         to a file? </fixme>
In the current derby implementation there is a a 1 to 1 mapping of
containers to files.  2 containers never map to a single file and 1
container never maps to multiple containers.
>       <p>A page contains a set of records, which can be accessed by "slot", which

>         defines the order of the records on the page, or by "id" which defines 
>         the identity of the records on the page. Clients access records by both 
>         slot and id, depending on their needs.</p>
>       <p>There are two types of pages - Raw Stored Pages which hold data, and 
>         Raw Stored Alloc Pages which hold page allocation information.</p>
I usually describe the system as having 3 types of pages, though the
Header Page is actually just a specialized version of the Alloc Page.
It looks like you include this later - up to you on how best to describe
the distinction.  I know I always draw the following kind of picture

--------
| header |
----------
| data    |
-----------
| data    |
-----------
| ...     |
-----------
| alloc    |
-----------
| data |

Header Page - This is currently always page 0 of the container.  It
contains information that raw store needs to maintain about the
container once per container, it is currently implemented as an Alloc
Page which "borrows" space from the alloc page for it's information.
The original decision was that we did not want to waste a whole page for
hdr information, so we used part of it and then put the 1st allocation
map on the 2nd half of it.  See AllocPage.java for info about layout and
borrowing.

Allocation Page - After page 0, all subsequent Allocation pages only
have allocation bit maps.

Data page - as you describe
>       <p>A Table or a BTree index provides a row-based access mechanism (row-based

>         access interface is known as conglomerate). Rows are mapped to records 
>         in pages, in case of a table, a single row can span multiple records in 
>         multiple pages.</p>
>     </section>
>     <section id="storedpage"> 
>       <title>Data Page Format</title>
>       <p>A data page is broken into five sections. 
>         <img src="page-format.png" alt=""/>
>       </p>
>       <section id="formatid"> 
>         <title>Format Id </title>
>         <p> The formatId is a 4 bytes array, it contains the format Id of this

>           page. The possible values are RAW_STORE_STORED_PAGE or RAW_STORE_ALLOC_PAGE.</p>
>       </section>
>       <section id="pageheader"> 
>         <title> Page Header </title>
>         <p> The page header is a fixed size, 56 bytes. </p>
>           <table> 
>             <tr> 
>               <th>Size</th>
>               <th>Type</th>
>               <th>Description</th>
>             </tr>
>             <tr> 
>               <td>1 byte</td>
>               <td>boolean</td>
>               <td>is page an overflow page</td>
>             </tr>
>             <tr> 
>               <td>1 byte</td>
>               <td>byte</td>
>               <td><p>page status is either VALID_PAGE or INVALID_PAGE(a field

>                   maintained in base page)</p>
>                 <p>page goes thru the following transition: 
>                   <br/>
>                   VALID_PAGE &lt;-&gt; deallocated page -&gt; free page &lt;-&gt;

>                   VALID_PAGE</p>
>                 <p>deallocated and free page are both INVALID_PAGE as far as BasePage

>                   is concerned. 
>                   <br/>
>                   When a page is deallocated, it transitioned from VALID_PAGE 
>                   to INVALID_PAGE. 
>                   <br/>
>                   When a page is allocated, it trnasitioned from INVALID_PAGE 
>                   to VALID_PAGE.</p></td>
>             </tr>
>             <tr> 
>               <td>8 bytes</td>
>               <td>long</td>
>               <td>pageVersion (a field maintained in base page)</td>
>             </tr>
>             <tr> 
>               <td>2 bytes</td>
>               <td>unsigned short</td>
>               <td>number of slots in slot offset table</td>
>             </tr>
>             <tr> 
>               <td>4 bytes</td>
>               <td>integer</td>
>               <td>next record identifier</td>
>             </tr>
>             <tr> 
>               <td>4 bytes</td>
>               <td>integer</td>
>               <td>generation number of this page (Future Use)</td>
>             </tr>
>             <tr> 
>               <td>4 bytes</td>
>               <td>integer</td>
>               <td>previous generation of this page (Future Use)</td>
>             </tr>
>             <tr> 
>               <td>8 bytes</td>
>               <td>bipLocation</td>
>               <td>the location of the beforeimage page (Future Use)</td>
>             </tr>
>             <tr> 
>               <td>2 bytes</td>
>               <td>unsigned short</td>
>               <td>number of deleted rows on page. (new release 2.0)</td>
>             </tr>
>             <tr> 
>               <td>2 bytes</td>
>               <td>unsigned short</td>
>               <td>% of the page to keep free for updates</td>
>             </tr>
>             <tr> 
>               <td>2 bytes</td>
>               <td>short</td>
>               <td>spare for future use</td>
>             </tr>
>             <tr> 
>               <td>4 bytes</td>
>               <td>long</td>
>               <td>spare for future use (encryption uses to write random bytes 
>                 here).</td>
>             </tr>
>             <tr> 
>               <td>8 bytes</td>
>               <td>long</td>
>               <td>spare for future use</td>
>             </tr>
>             <tr> 
>               <td>8 bytes</td>
>               <td>long</td>
>               <td>spare for future use</td>
>             </tr>
>           </table>
>           <note>Spare space is guaranteed to be writen with "0", so that future

>             use of field should not either not use "0" as a valid data item or 
>             pick 0 as a valid default value so that on the fly upgrade can assume 
>             that 0 means field was never assigned. </note>
>         
>       </section>
>       <section id="records"> 
>         <title> Records </title>
>         
>           <p>The records section contains zero or more records. Each record starts

>             with a Record Header</p>
>           <table> 
>             <caption>Record Header</caption>
>             <tr> 
>               <th>Type</th>
>               <th>Description</th>
>             </tr>
>             <tr> 
>               <td>1 byte</td>
>               <td> <p>Status bits for the record header</p>
>                 <table> 
>                   <tr> 
>                     <td>RECORD_INITIAL</td>
>                     <td>used when record header is first initialized</td>
>                   </tr>
>                   <tr> 
>                     <td>RECORD_DELETED</td>
>                     <td>used to indicate the record has been deleted</td>
>                   </tr>
>                   <tr> 
>                     <td>RECORD_OVERFLOW</td>
>                     <td>used to indicate the record has been overflowed, it will

>                       point to the overflow page and ID</td>
>                   </tr>
>                   <tr> 
>                     <td>RECORD_HAS_FIRST_FIELD</td>
>                     <td>used to indicate that firstField is stored will be stored.

>                       When RECORD_OVERFLOW and RECORD_HAS_FIRST_FIELD both are 
>                       set, part of record is on the page, the record header also 
>                       stores the overflow point to the next part of the record.</td>
>                   </tr>
>                   <tr> 
>                     <td>RECORD_VALID_MASK</td>
>                     <td>A mask of valid bits that can be set currently, such that

>                       the following assert can be made: </td>
>                   </tr>
>                 </table></td>
>             </tr>
>             <tr> 
>               <td>compressed int</td>
>               <td>record identifier</td>
>             </tr>
>             <tr> 
>               <td>compressed long</td>
>               <td>overflow page only if RECORD_OVERFLOW is set</td>
>             </tr>
>             <tr> 
>               <td>compressed int</td>
>               <td>overflow id only if RECORD_OVERFLOW is set</td>
>             </tr>
>             <tr> 
>               <td>compressed int</td>
>               <td>first field only if RECORD_HAS_FIRST_FIELD is set - otherwise

>                 0</td>
>             </tr>
>             <tr> 
>               <td>compressed int</td>
>               <td>number of fields in this portion - only if RECORD_OVERFLOW is

>                 false OR RECORD_HAS_FIRST_FIELD is true - otherwise 0</td>
>             </tr>
>           </table>
>           <note label="Long Rows"> A row is long if all of it's columns can't fit
on a single page. 
>             When storing a long row, the segment of the row which fits on the 
>             page is left there, and a pointer column is added at the end of the 
>             row. It points to another row in the same container on a different 
>             page. That row will contain the next set of columns and a continuation 
>             pointer if necessary. The overflow portion will be on an "overflow" 
>             page, and that page may have overflow portions of other rows on it 
>             (unlike overflow columns). </note>
>           <p>The Record Header is followed by one or more fields. Each field contains

>             a Field Header and optional Field Data.</p>
>           <table> 
>             <caption>Stored Field Header Format</caption>
>             <tr> 
>               <td>status</td>
>               <td> <p> The status is 1 byte, it indicates the state of the
field. 
>                   A FieldHeader can be in the following states: </p>
>                   <table> 
>                     <tr> 
>                       <td>NULL</td>
>                       <td>if the field is NULL, no field data length is stored</td>
>                     </tr>
>                     <tr> 
>                       <td>OVERFLOW</td>
>                       <td>indicates the field has been overflowed to another page.

>                         overflow page and overflow ID is stored at the end of 
>                         the user data. field data length must be a number greater 
>                         or equal to 0, indicating the length of the field that 
>                         is stored on the current page. The format looks like this: 
>                         <img src="field-header-overflow.png" alt=""/>
>                         overflowPage will be written as compressed long, overflowId 
>                         will be written as compressed Int</td>
>                     </tr>
>                     <tr> 
>                       <td>NONEXISTENT</td>
>                       <td>the field no longer exists, e.g. column has been dropped

>                         during an alter table</td>
>                     </tr>
>                     <tr> 
>                       <td>EXTENSIBLE</td>
>                       <td>the field is of user defined data type. The field may

>                         be tagged.</td>
>                     </tr>
>                     <tr> 
>                       <td>TAGGED</td>
>                       <td>the field is TAGGED if and only if it is EXTENSIBLE.</td>
>                     </tr>
>                     <tr> 
>                       <td>FIXED</td>
>                       <td>the field is FIXED if and only if it is used in the 
>                         log records for version 1.2 and higher.</td>
>                     </tr>
>                   </table>
>                 </td>
>             </tr>
>             <tr> 
>               <td>fieldDataLength</td>
>               <td> The fieldDataLength is only set if the field is not NULL. It

>                 is the length of the field that is stored on the current page. 
>                 The fieldDataLength is a variable length CompressedInt. </td>
>             </tr>
>             <tr> 
>               <td>fieldData</td>
>               <td><p> Overflow page and overflow id are stored as field data.
If 
>                 the overflow bit in status is set, the field data is the overflow 
>                 information. When the overflow bit is not set in status, then, 
>                 fieldData is the actually user data for the field. That means, 
>                 field header consists only field status, and field data length. 
>                 <br/>
>                 A non-overflow field: 
>                 <br/> <img src="field-header-non-overflow.png" alt=""/> <br/>
>                 An overflow field: 
>                 <br/> <img src="field-header-overflow.png" alt=""/> <br/>
<strong>overflowPage 
>                   and overflowID</strong> <br/>
>                 The overflowPage is a variable length CompressedLong, overflowID 
>                 is a variable Length CompressedInt. They are only stored when 
>                 the field state is OVERFLOW. And they are not stored in the field 
>                 header. Instead, they are stored at the end of the field data. 
>                 The reason we do that is to save a copy if the field has to overflow.
</p>
>               </td>
>             </tr>
>           </table>
>           <note label="Long Columns"> A column is long if it can't fit on a single
page. A long column 
>             is marked as long in the base row, and it's field contains a pointer 
>             to a chain of other rows in the same container with contain the data 
>             of the row. Each of the subsequent rows is on a page to itself. Each 
>             subsquent row, except for the last piece has 2 columns, the first 
>             is the next segment of the row and the second is the pointer to the 
>             the following segment. The last segment only has the data segment. 
>           </note>
>         
>       </section>
>       <section id="slottable"> 
>         <title>Slot Offset Table</title>
>         <p>The slot offset table is a table of 6 or 12 bytes per record, depending

>           on the pageSize being less or greater than 64K: </p>
>           <table> 
>             <caption>Slot Table Record</caption>
>             <tr> 
>               <th>Size</th>
>               <th>Content</th>
>             </tr>
>             <tr> 
>               <td>2 bytes (unsigned short) or 4 bytes (int)</td>
>               <td>page offset for the record that is assigned to the slot</td>
>             </tr>
>             <tr> 
>               <td>2 bytes (unsigned short) or 4 bytes (int)</td>
>               <td>the length of the record on this page.</td>
>             </tr>
>             <tr> 
>               <td>2 bytes (unsigned short) or 4 bytes (int)</td>
>               <td>the length of the reserved number of bytes for this record on

>                 this page.</td>
>             </tr>
>           </table>
> 		  <p>
>           First slot is slot 0. The slot table grows backwards. Slots are never 
>           left empty. </p>
>       </section>
>       <section id="checksum"> 
>         <title>Checksum</title>
>         <p>8 bytes of a java.util.zip.CRC32 checksum of the entire's page contents

>           without the 8 bytes representing the checksum.</p>
>       </section>
> 	</section>
>     <section id="allocpage"> 
>       <title>Allocation Page</title>
>       <p> An allocation page of the file container extends a normal Stored page,

>         with the exception that a hunk of space may be 'borrowed' by the file 
>         container to store the file header.</p>
>       <p> The borrowed space is not visible to the alloc page even though it is

>         present in the page data array. It is accessed directly by the FileContainer.

>         Any change made to the borrowed space is not managed or seen by the allocation

>         page.</p>
>       <p> The reason for having this borrowed space is so that the container header

>         does not need to have a page of its own. </p>
>       <p> 
>         <strong>Page Format</strong>
>         <br/>
>         An allocation page extends a stored page, the on disk format is different 
>         from a stored page in that N bytes are 'borrowed' by the container and 
>         the page header of an allocation page will be slightly bigger than a normal 
>         stored page. This N bytes are stored between the page header and the record 
>         space.</p>
>       <p> The reason why this N bytes can't simply be a row is because it needs

>         to be statically accessible by the container object to avoid a chicken 
>         and egg problem of the container object needing to instantiate an alloc 
>         page object before it can be objectified, and an alloc page object needing 
>         to instantiate a container object before it can be objectified. So this 
>         N bytes must be stored outside of the normal record interface yet it must 
>         be settable because only the first alloc page has this borrowed space. 
>         Other (non-first) alloc page have N == 0. 
>         <br/>
>         <img src="alloc-page.png" alt=""/>
>       </p>
>     </section>
>   </body>
>   <footer> 
>     <legal>This is a legal notice, so it is 
>       <strong>important</strong>
>       .</legal>
>   </footer>
> </document>

Mime
View raw message