Return-Path: Mailing-List: contact derby-commits-help@db.apache.org; run by ezmlm Delivered-To: mailing list derby-commits@db.apache.org Received: (qmail 27726 invoked by uid 500); 2 Feb 2005 03:29:29 -0000 Delivered-To: apmail-incubator-derby-cvs@incubator.apache.org Received: (qmail 27708 invoked by uid 99); 2 Feb 2005 03:29:28 -0000 X-ASF-Spam-Status: No, hits=-9.8 required=10.0 tests=ALL_TRUSTED,NO_REAL_NAME X-Spam-Check-By: apache.org Received: from minotaur.apache.org (HELO minotaur.apache.org) (209.237.227.194) by apache.org (qpsmtpd/0.28) with SMTP; Tue, 01 Feb 2005 19:29:27 -0800 Received: (qmail 87284 invoked by uid 65534); 2 Feb 2005 03:29:25 -0000 Message-ID: <20050202032925.87283.qmail@minotaur.apache.org> Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Mailer: svnmailer-1.0.0-dev Date: Wed, 02 Feb 2005 03:29:25 -0000 Subject: svn commit: r149475 - in incubator/derby/site/trunk: build/site/papers/container-format.png build/site/papers/pageformats.html src/documentation/content/xdocs/papers/container-format.aart src/documentation/content/xdocs/papers/pageformats.xml To: derby-cvs@incubator.apache.org From: jta@apache.org X-Virus-Checked: Checked Author: jta Date: Tue Feb 1 19:29:24 2005 New Revision: 149475 URL: http://svn.apache.org/viewcvs?view=3Drev&rev=3D149475 Log: Committed changes to papers/pageformats by Dibyendu Majumdar . Added: incubator/derby/site/trunk/build/site/papers/container-format.png (wi= th props) incubator/derby/site/trunk/src/documentation/content/xdocs/papers/conta= iner-format.aart (with props) Modified: incubator/derby/site/trunk/build/site/papers/pageformats.html incubator/derby/site/trunk/src/documentation/content/xdocs/papers/pagef= ormats.xml Added: incubator/derby/site/trunk/build/site/papers/container-format.png URL: http://svn.apache.org/viewcvs/incubator/derby/site/trunk/build/site/pa= pers/container-format.png?view=3Dauto&rev=3D149475 =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D Binary file - no diff available. Propchange: incubator/derby/site/trunk/build/site/papers/container-format.p= ng ---------------------------------------------------------------------------= --- svn:mime-type =3D application/octet-stream Modified: incubator/derby/site/trunk/build/site/papers/pageformats.html URL: http://svn.apache.org/viewcvs/incubator/derby/site/trunk/build/site/pa= pers/pageformats.html?view=3Ddiff&r1=3D149474&r2=3D149475 =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D --- incubator/derby/site/trunk/build/site/papers/pageformats.html (original) +++ incubator/derby/site/trunk/build/site/papers/pageformats.html Tue Feb = 1 19:29:24 2005 @@ -188,10 +188,15 @@

Derby On Disk Page Format

This document describes the storage format of Derb= y disk pages.=20 + This is a work-in-progress derived from Javadoc comments and=20 + from explanations Mike Matrigali posted to the Derby lists.=20 + Please post questions, comments, and corrections to=20 + derby-dev@db.apache.org. +
- - =20

Introduction

Derby stores table and index data in Containers, which currently map=20 - to files in the=20 - seg0 - directory of the database. Data is stored in pages within the cont= ainer.

-
-
Fixme (Dibyendu Majumdar)
-
Do all containers map to a single file, or does ea= ch container map=20 - to a file?
-
+ + to files in the seg0 + + directory of the database. In the current Derby implementation the= re is a a 1 to 1 mapping of + + containers to files. Two containers never map to a single file and= 1 + + container never maps to multiple containers.

+

+ + Data is stored in pages within the container.

A page contains a set of records, which can be accessed by "slot", whic= h=20 + defines the order of the records on the page, or by "id" which def= ines=20 + the identity of the records on the page. Clients access records by= both=20 + slot and id, depending on their needs.

-

There are two types of pages - Raw Stored Pages which hold data, and=20 - Raw Stored Alloc Pages which hold page allocation information.

A Table or a BTree index provides a row-based access mechanism (row-bas= ed=20 + access interface is known as conglomerate). Rows are mapped to rec= ords=20 - in pages, in case of a table, a single row can span multiple recor= ds in=20 + + in data pages; in case of a table, a single row can span multiple = records in=20 + multiple pages.

+

A container can have three types of pages:

+
    + =20 +
  • Header Page - which is just a specialized version of the Alloc Page. + =20 +
  • Data Pages which hold data, and
  • + =20 +
  • Alloc Pages which hold page allocation information. An Alloc page is a= specialized verion of the Data page.
  • + =20 +
+

The container can be visualised as:
+3D""

+

+ +Header Page is currently always page 0 of the container. It + +contains information that raw store needs to maintain about the + +container once per container, and is currently implemented as an Alloc + +Page which "borrows" space from the alloc page for it's information. + +The original decision was that the designers did not want to waste a whole= page for + +header information, so a part of the page was used and the first allocation + +map was put on the second half of it. See AllocPa= ge.java for info about layout and + +borrowing. + +

+

+ =20 + Allocation Page - After page 0, all subsequent= Allocation pages only + +have allocation bit maps. + +

=20 - +

Data Page Format

A data page is broken into five sections.=20 - 3D"" -

- + + 3D""

+

Format Id

The formatId is a 4 bytes array, it contains the format Id of this=20 + page. The possible values are RAW_STORE_STORED_PAGE or RAW_STORE= _ALLOC_PAGE.

- +

Page Header

The page header is a fixed size, 56 bytes.

-=20 +
+ =20 + =20 -=20 - =20 - - - =20 + + + =20 + =20 + =20 -=20 - =20 - - - =20 + + + =20 + =20 + =20 -=20 - =20 - - + =20 + + =20 + =20 + =20 -=20 - =20 - - - =20 + + + =20 + =20 + =20 -=20 - =20 - - - =20 + + + =20 + =20 + =20 -=20 - =20 - - - =20 + + + =20 + =20 + =20 -=20 - =20 - - - =20 + + + =20 + =20 + =20 -=20 - =20 - - - =20 + + + =20 + =20 + =20 -=20 - =20 - - - =20 + + + =20 + =20 + =20 -=20 - =20 - - - =20 + + + =20 + =20 + =20 -=20 - =20 - - - =20 + + + =20 + =20 + =20 -=20 - =20 - - - =20 + + + =20 + =20 + =20 -=20 - =20 - - + - =20 + =20 + =20 + =20 -=20 - =20 - - - =20 + + + =20 + =20 + =20 -=20 - =20 - - - =20 - + + =20 + + =20
SizeTypeDescriptionTypeDescription
1 bytebooleanis page an overflow pagebooleanis page an overflow page
1 bytebyte + byte + =20

page status is either VALID_PAGE or INVALID_PAGE(a field=20 + maintained in base page)

- =20 + =20

page goes thru the following transition:=20 +
+ VALID_PAGE <-> deallocated page -> free page &l= t;->=20 + VALID_PAGE

- =20 + =20

deallocated and free page are both INVALID_PAGE as far as BasePage=20 + is concerned.=20 +
+ When a page is deallocated, it transitioned from VALID_P= AGE=20 + to INVALID_PAGE.=20 +
+ When a page is allocated, it trnasitioned from INVALID_P= AGE=20 + to VALID_PAGE.

-
8 byteslongpageVersion (a field maintai= ned in base page)longpageVersion (a field maintaine= d in base page)
2 bytesunsigned shortnumber of slots in slot offs= et tableunsigned shortnumber of slots in slot offset= table
4 bytesintegernext record identifierintegernext record identifier
4 bytesintegergeneration number of this pa= ge (Future Use)integergeneration number of this page= (Future Use)
4 bytesintegerprevious generation of this = page (Future Use)integerprevious generation of this pa= ge (Future Use)
8 bytesbipLocationthe location of the beforeim= age page (Future Use)bipLocationthe location of the beforeimag= e page (Future Use)
2 bytesunsigned shortnumber of deleted rows on pa= ge. (new release 2.0)unsigned shortnumber of deleted rows on page= . (new release 2.0)
2 bytesunsigned short% of the page to keep free f= or updatesunsigned short% of the page to keep free for= updates
2 bytesshortspare for future useshortspare for future use
4 byteslongspare for future use (encryp= tion uses to write random bytes=20 + longspare for future use (encrypti= on uses to write random bytes=20 + here).
8 byteslongspare for future uselongspare for future use
8 byteslongspare for future use
longspare for future use
Note
Spare space is guaranteed to be writen with "0", so= that future=20 + use of field should not either not use "0" as a valid data ite= m or=20 + pick 0 as a valid default value so that on the fly upgrade can= assume=20 + that 0 means field was never assigned.
- +

Records

The records section contains zero or more records. Each record starts=20 + with a Record Header

-=20 - =20 +
+ =20 + =20 + =20 -=20 - =20 - - =20 + + =20 + =20 + =20 -=20 - =20 - =20 + + =20 + =20 + =20 -=20 - =20 - - =20 + + =20 + =20 + =20 -=20 - =20 - - =20 + + =20 + =20 + =20 -=20 - =20 - - =20 + + =20 + =20 + =20 -=20 - =20 - - =20 + =20 + =20 + =20 -=20 - =20 - - =20 - =20 + + =20
Record Header
TypeDescriptionDescription
1 byte=20 + + =20

Status bits for the record header

+ =20 + =20 -
=20 + =20 -=20 - =20 - - =20 + + =20 + =20 + =20 -=20 - =20 - - =20 + + =20 + =20 + =20 -=20 - =20 - - =20 + =20 + =20 + =20 -=20 - =20 - - =20 + =20 + =20 + =20 -=20 - =20 - - =20 - =20 + + =20
RECORD_INITIALused when record heade= r is first initializedused when record header = is first initialized
RECORD_DELETEDused to indicate the r= ecord has been deletedused to indicate the rec= ord has been deleted
RECORD_OVERFLOWused to indicate the r= ecord has been overflowed, it will=20 + used to indicate the rec= ord has been overflowed, it will=20 + point to the overflow page and ID
RECORD_HAS_FIRST_FIELDused to indicate that = firstField is stored will be stored.=20 + used to indicate that fi= rstField is stored will be stored.=20 + When RECORD_OVERFLOW and RECORD_HAS_FIRST_FIELD both= are=20 + set, part of record is on the page, the record heade= r also=20 + stores the overflow point to the next part of the re= cord.
RECORD_VALID_MASKA mask of valid bits t= hat can be set currently, such that=20 + A mask of valid bits tha= t can be set currently, such that=20 + the following assert can be made:
-
compressed intrecord identifierrecord identifier
compressed longoverflow page only if RECORD= _OVERFLOW is setoverflow page only if RECORD_O= VERFLOW is set
compressed intoverflow id only if RECORD_O= VERFLOW is setoverflow id only if RECORD_OVE= RFLOW is set
compressed intfirst field only if RECORD_H= AS_FIRST_FIELD is set - otherwise=20 + first field only if RECORD_HAS= _FIRST_FIELD is set - otherwise=20 + 0
compressed intnumber of fields in this por= tion - only if RECORD_OVERFLOW is=20 + number of fields in this porti= on - only if RECORD_OVERFLOW is=20 + false OR RECORD_HAS_FIRST_FIELD is true - otherwise 0
Long Rows
A row is long if all of it's columns can't fit on = a single page.=20 + When storing a long row, the segment of the row which fits on = the=20 + page is left there, and a pointer column is added at the end o= f the=20 + row. It points to another row in the same container on a diffe= rent=20 + page. That row will contain the next set of columns and a cont= inuation=20 + pointer if necessary. The overflow portion will be on an "over= flow"=20 + page, and that page may have overflow portions of other rows o= n it=20 + (unlike overflow columns).

The Record Header is followed by one or more fields. Each field contain= s=20 + a Field Header and optional Field Data.

-=20 - =20 +
+ =20 + =20 + =20 -=20 - =20 - =20 + + =20 + =20 + =20 -=20 - =20 - + =20 + =20 -=20 - =20 - =20 - + =20 + + =20
Stored Field Header Format
status=20 + + =20

The status is 1 byte, it indicates the state of the field.=20 + A FieldHeader can be in the following states:

+ =20 + + =20 + =20 -
=20 - =20 -=20 - =20 - - =20 + + =20 - =20 -=20 - =20 + =20 + + =20 - - =20 + =20 - =20 -=20 - =20 + =20 + + =20 - - =20 + =20 - =20 -=20 - =20 + =20 + + =20 - - =20 + =20 - =20 -=20 - =20 + =20 + + =20 - - =20 + + =20 - =20 -=20 - =20 + =20 + + =20 - - =20 + =20 - =20 + =20
NULLif the field is NULL= , no field data length is storedif the field is NULL, no= field data length is stored
OVERFLOWindicates the field = has been overflowed to another page.=20 + indicates the field has = been overflowed to another page.=20 + overflow page and overflow ID is stored at the end= of=20 + the user data. field data length must be a number = greater=20 + or equal to 0, indicating the length of the field = that=20 + is stored on the current page. The format looks li= ke this:=20 + 3D"" + overflowPage will be written as compressed long, o= verflowId=20 + will be written as compressed Int
NONEXISTENTthe field no longer = exists, e.g. column has been dropped=20 + the field no longer exis= ts, e.g. column has been dropped=20 + during an alter table
EXTENSIBLEthe field is of user= defined data type. The field may=20 + the field is of user def= ined data type. The field may=20 + be tagged.
TAGGEDthe field is TAGGED = if and only if it is EXTENSIBLE.the field is TAGGED if a= nd only if it is EXTENSIBLE.
FIXEDthe field is FIXED i= f and only if it is used in the=20 + the field is FIXED if an= d only if it is used in the=20 + log records for version 1.2 and higher.
- =20 -
fieldDataLength The fieldDataLength is only= set if the field is not NULL. It=20 + The fieldDataLength is only s= et if the field is not NULL. It=20 + is the length of the field that is stored on the current p= age.=20 + The fieldDataLength is a variable length CompressedInt. - =20 + =20
fieldData + + =20

Overflow page and overflow id are stored as field data. If=20 + the overflow bit in status is set, the field data is the o= verflow=20 + information. When the overflow bit is not set in status, t= hen,=20 + fieldData is the actually user data for the field. That me= ans,=20 + field header consists only field status, and field data le= ngth.=20 +
+ A non-overflow field:=20 -
=20 -3D""
+ +
+3D""
+ An overflow field:=20 -
=20 -3D""
=20 + +
+3D""
overflowPage=20 - and overflowID=20 + + and overflowID
+ The overflowPage is a variable length CompressedLong, over= flowID=20 + is a variable Length CompressedInt. They are only stored w= hen=20 + the field state is OVERFLOW. And they are not stored in th= e field=20 + header. Instead, they are stored at the end of the field d= ata.=20 + The reason we do that is to save a copy if the field has t= o overflow.

- =20 -
Long Columns
A column is long if it can't fit on a single page.= A long column=20 + is marked as long in the base row, and it's field contains a p= ointer=20 + to a chain of other rows in the same container with contain th= e data=20 + of the row. Each of the subsequent rows is on a page to itself= . Each=20 + subsquent row, except for the last piece has 2 columns, the fi= rst=20 + is the next segment of the row and the second is the pointer t= o the=20 + the following segment. The last segment only has the data segm= ent.=20 +
- +

Slot Offset Table

The slot offset table is a table of 6 or 12 bytes per record, depending=20 + on the pageSize being less or greater than 64K:

-=20 - =20 +
+ =20 + =20 + =20 -=20 - =20 - - =20 + + =20 + =20 + =20 -=20 - =20 - =20 + + =20 + =20 + =20 -=20 - =20 - =20 + + =20 + =20 + =20 -=20 - =20 - =20 - =20 + + =20
Slot Table Record
SizeContentContent
2 bytes (unsigned short) or 4 bytes (int)<= /td> - page offset for the record t= hat is assigned to the slotpage offset for the record tha= t is assigned to the slot
2 bytes (unsigned short) or 4 bytes (int)<= /td> - the length of the record on = this page.the length of the record on th= is page.
2 bytes (unsigned short) or 4 bytes (int)<= /td> - the length of the reserved n= umber of bytes for this record on=20 + the length of the reserved num= ber of bytes for this record on=20 + this page.

+ First slot is slot 0. The slot table grows backwards. Slots are = never=20 + left empty.

- +

Checksum

8 bytes of a java.util.zip.CRC32 checksum of the entire's page contents=20 + without the 8 bytes representing the checksum.

=20 - +

Allocation Page

An allocation page of the file container extends a normal Stored page,=20 + with the exception that a hunk of space may be 'borrowed' by the f= ile=20 + container to store the file header.

The borrowed space is not visible to the alloc page even though it is=20 + present in the page data array. It is accessed directly by the Fil= eContainer.=20 + Any change made to the borrowed space is not managed or seen by th= e allocation=20 + page.

The reason for having this borrowed space is so that the container hea= der=20 + does not need to have a page of its own.

-

=20 +

=20 Page Format =20
+ An allocation page extends a stored page, the on disk format is di= fferent=20 + from a stored page in that N bytes are 'borrowed' by the container= and=20 + the page header of an allocation page will be slightly bigger than= a normal=20 + stored page. This N bytes are stored between the page header and t= he record=20 + space.

The reason why this N bytes can't simply be a row is because it needs=20 + to be statically accessible by the container object to avoid a chi= cken=20 + and egg problem of the container object needing to instantiate an = alloc=20 + page object before it can be objectified, and an alloc page object= needing=20 + to instantiate a container object before it can be objectified. So= this=20 + N bytes must be stored outside of the normal record interface yet = it must=20 + be settable because only the first alloc page has this borrowed sp= ace.=20 + Other (non-first) alloc page have N =3D=3D 0.=20 +
+3D""

+

+ + N is a byte that indicates the size of the borrowed space. Once an alloc + + page is initialized, the value of N cannot change. + +

+

+ + The maximum space that can be borrowed by the container is 256 bytes. + +

+

+ + The allocation pages are of the same page size as any other pages in the + + container. The first allocation page of the FileContainer starts at the + + first physical byte of the container. Subsequent allocation pages are + + chained via the nextAllocPageOffset. Each allocation page is expected to + + manage at least 1000 user pages (for 1K page size) so this chaining may n= ot + + be a severe performance hit. The logical -> physical mapping of an + + allocation page is stored in the previous allocation page. The container + + object will need to maintain this mapping.

+

+ + The following fields are stored in the page header: + +

+ + =20 + + =20 + + =20 + + + =20 + + =20 + + =20 + + + =20 + + =20 + + =20 + + + =20 + + =20 + + =20 + + + =20 + + =20 + + =20 + + + =20 + + =20 + + =20 + + + =20 + + =20 + + =20 + + + =20 + + =20 + + =20 + + + =20 + + =20 + + =20 + + + =20 + =20 -3D"" + + =20 + + + =20 + + =20 + + =20 + + + =20 + + =20 + + =20 + + + =20 + + =20 +
+ + Format of Alloc Page + +
+ + Type + + + + Description + +
+ + int + + + + FormatId + +
StoredPageHeadersee Stor= ed Page Header
longnextAllocPageNumber - the next a= llocation page's number
longnextAllocPageOffset - the file o= ffset of the next allocation page
longreserved1 - reserved for future = usage
longreserved2 - reserved for future = usage
longreserved3 - reserved for future = usage
longreserved4 - reserved for future = usage
byteN - the size of the borrowed con= tainer info
byte[N]containerInfo - the content of t= he borrowed container info
AllocExtentThe one and only extent on this = alloc page.
+

+ + The allocation page contains allocation extent rows. In this first cut + + implementation, there is only 1 allocation extent row per allocation page. + +

+

+ + The allocation extent row is an externalizable object and is directly + + written on to the page by the alloc page. In other words, it will not be + + converted in to a storeableRow. This is to cut down overhead, enhance + + performance and gives more control of the size and layout of the allocati= on + + extent row to the alloc page. + +

+ +

+ + Alloc Page detailed implementation notes

+

+ + Create Container - an embryonic allocation page is formatted on disk by a + + special static function to avoid instantiating a full AllocPage object. + + This embryonic allocation has enough information that it can find the + + file header and not much else. Then the allocation page is properly + + initialized by creating the first extent. + +

+

+ + Open Container - A static AllocPage method will be used to read off the + + container information directly from disk. Even if + + the first alloc page (page 0) is already in the page cache, it will not be + + used because cleaning the alloc page will introduce a deadlock if the + + container is not in the container cache. Long term, the first alloc page + + should probably live in the container cache rather than in the page cache. + +

+

+ + Get Page - The first alloc page (page 0) will be read into the page cache. + + Continue to follow the alloc page chain until the alloc page that manages + + the specified page is found. From the alloc page, the physical offset of + + the specified page is located. + +

+

+ + Cleaning alloc page - the alloc page is written out the same way any page + + is written out. The container object will provide a call back to the all= oc + + page to write the current version of the container object back into the + + borrowed space before the alloc page itself is written out. + +

+

+ + Cleaning the container object - get the the first alloc page, dirty it and + + clean it (which will cause it to call the container object to write itself + + out into the borrowed space). The versioning of the container is + + independent of the versioning of the alloc page. The container version is + + stored inside the borrowed space and is opaque to the alloc page. + +

+

For the fields in an allocation extent row.

+
+ =20 + +

Allocation Extent

+
+

+ + An allocation extent row manages the page status of page in the extent. + + AllocExtent is externalizable and is written to the AllocPage directly, + + without being converted to a row first. + +

+ + =20 + + =20 + + =20 + + + =20 + + =20 + + =20 + + + =20 + + =20 + + =20 + + + =20 + + =20 + + =20 + + + =20 + + =20 + + =20 + + + =20 + + =20 + + =20 + + + =20 + + =20 + + =20 + + + =20 + + =20 + + =20 + + + =20 + + =20 + + =20 + + + =20 + + =20 + + =20 + + + =20 + + =20 + + =20 + + + =20 + + =20 + + =20 + + + =20 + + =20 +
Format of Allocation Extent
TypeDescription
longextentOffset - the begin physica= l byte offset of the first page of this extent
longextentStart - the first logical = page mananged by this extent.
longextentEnd - the last page this e= xtent can ever hope to manage.
intextentLength - the number of pag= es allocated in this extent
int + =20 +

extentStatus - status bits for the whole extent. + +
HAS_DEALLOCATED - most likely, this extent has a deallocated=20 + + page somewhere. If !HAD_DEALLOCATED, the extent ha= s no deallocated page. + +
HAS_FREE - most likely, this extent has a free page somewhere. + + If !HAS_FREE, there is no free page in the extent. + +
ALL_FREE - most likely, this extent only has free pages, good=20 + + candidate for shrinking the file. + + If !ALL_FREE, the extent is not all free. + +
HAS_UNFILLED_PAGES - most likely, this extent has unfilled pages. + + if !HAS_UNFILLED_PAGES, all pages are filled. + +
KEEP_UNFILLED_PAGES - this extent keeps track of unfilled pages + + (post v1.3). If not set, this extent has no notion of + + unfilled page and has no unFilledPage bitmap. + +
NO_DEALLOC_PAGE_MAP - this extents do not have a dealloc and a + + free page bit maps. Prior to 2.0, there are 2 bit + + maps, a deallocate page bit map and a free page bit + + map. Cloudscape 2.0 and later merged the dealloc page + + bit map into the free page bit map. + +
RETIRED - this extent contains only 'retired' pages, never use=20 + + any page from this extent. The pages don't actual= ly=20 + + exist, i.e., it maps to nothing (physicalOffset is=20 + + garbage). The purpose of this extent is to blot o= ut a=20 + + range of logical page numbers that no longer exist= s=20 + + for this container. Use this to reuse a physical = page + + when a logical page has exhausted all recordId or = for + + logical pages that has been shrunk out. + +

+ =20 +
intpreAllocLength - the number of p= ages that have been preallocated
intreserved1
longreserved2 - reserved for future = use
longreserved3 - reserved for future = use
FreePages(bit)Bitmap of free pages. Bit[i] is = ON if page is free for immediate (re)use.
unFilledPages(bit)Bitmap of pages that have free s= pace. Bit[i] is ON if page is likely to be < 1/2 full.
+

+ + org.apache.derby.iapi.services.io.FormatableBitSet is used to store the = bit map. =20 + + FormatableBitSet is an externalizable class. + +

+

+ + A page can have the following logical state: + +
Free - a page that is free to be used + +
Valid - a page that is currently in use + +

+

+ + There is another type of transitional pages which pages that have been + + allocated on disk but has not yet been used. These pages are Free. + +

+

+ + Bit[K] freePages + + Bit[i] is ON iff page i maybe free for reuse. User must get the + + dealloc page lock on the free page to make sure the transaction. + +

+

+ + K is the size of the bit array, it must be >=3D length. +

=20 Added: incubator/derby/site/trunk/src/documentation/content/xdocs/papers/co= ntainer-format.aart URL: http://svn.apache.org/viewcvs/incubator/derby/site/trunk/src/documenta= tion/content/xdocs/papers/container-format.aart?view=3Dauto&rev=3D149475 =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D --- incubator/derby/site/trunk/src/documentation/content/xdocs/papers/conta= iner-format.aart (added) +++ incubator/derby/site/trunk/src/documentation/content/xdocs/papers/conta= iner-format.aart Tue Feb 1 19:29:24 2005 @@ -0,0 +1,13 @@ ++--------+ +| header | ++--------+ +| data | ++--------+ +| data | ++--------+ +| ... | ++--------+ +| alloc | ++--------+ +| data | ++--------+ Propchange: incubator/derby/site/trunk/src/documentation/content/xdocs/pape= rs/container-format.aart ---------------------------------------------------------------------------= --- svn:eol-style =3D native Modified: incubator/derby/site/trunk/src/documentation/content/xdocs/papers= /pageformats.xml URL: http://svn.apache.org/viewcvs/incubator/derby/site/trunk/src/documenta= tion/content/xdocs/papers/pageformats.xml?view=3Ddiff&r1=3D149474&r2=3D1494= 75 =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D --- incubator/derby/site/trunk/src/documentation/content/xdocs/papers/pagef= ormats.xml (original) +++ incubator/derby/site/trunk/src/documentation/content/xdocs/papers/pagef= ormats.xml Tue Feb 1 19:29:24 2005 @@ -1,372 +1,855 @@ -=20 -
=20 + +
Derby On Disk Page Format This document describes the storage format of Derby disk pag= es.=20 + This is a work-in-progress derived from Javadoc comments and=20 + from explanations Mike Matrigali posted to the Derby lists.=20 + Please post questions, comments, and corrections to=20 + derby-dev@db.apache.org. +
- - -
=20 +
Introduction

Derby stores table and index data in Containers, which currently = map=20 - to files in the=20 - seg0 - directory of the database. Data is stored in pages within the cont= ainer.

- Do all containers map to a sing= le file, or does each container map=20 - to a file? + + to files in the seg0 + + directory of the database. In the current Derby implementation the= re is a a 1 to 1 mapping of + + containers to files. Two containers never map to a single file and= 1 + + container never maps to multiple containers.

+

+ + Data is stored in pages within the container.

A page contains a set of records, which can be accessed by "slot"= , which=20 + defines the order of the records on the page, or by "id" which def= ines=20 + the identity of the records on the page. Clients access records by= both=20 + slot and id, depending on their needs.

-

There are two types of pages - Raw Stored Pages which hold data, = and=20 - Raw Stored Alloc Pages which hold page allocation information.

A Table or a BTree index provides a row-based access mechanism (r= ow-based=20 + access interface is known as conglomerate). Rows are mapped to rec= ords=20 - in pages, in case of a table, a single row can span multiple recor= ds in=20 + + in data pages; in case of a table, a single row can span multiple = records in=20 + multiple pages.

+

A container can have three types of pages:

+
    +
  • Header Page - which is just a specialized version of the Alloc= Page.
  • +
  • Data Pages which hold data, and
  • +
  • Alloc Pages which hold page allocation information. An Alloc p= age is a specialized verion of the Data page.
  • +
+

The container can be visualised as:
3D""

+

+ +Header Page is currently always page 0 of the container. It + +contains information that raw store needs to maintain about the + +container once per container, and is currently implemented as an Alloc + +Page which "borrows" space from the alloc page for it's information. + +The original decision was that the designers did not want to waste a whole= page for + +header information, so a part of the page was used and the first allocation + +map was put on the second half of it. See AllocPage.java for = info about layout and + +borrowing. + +

+

+ Allocation Page - After page 0, all su= bsequent Allocation pages only + +have allocation bit maps. + +

-
=20 +
Data Page Format

A data page is broken into five sections.=20 - 3D""/ -

-
=20 + + 3D""

+
Format Id

The formatId is a 4 bytes array, it contains the format Id of = this=20 + page. The possible values are RAW_STORE_STORED_PAGE or RAW_STORE= _ALLOC_PAGE.

-
=20 +
Page Header

The page header is a fixed size, 56 bytes.

- =20 - =20 - - - - - =20 - - - - - =20 - - - + + + + + + + + + + +
SizeTypeDescription
1 bytebooleanis page an overflow page
1 bytebyte

page status is either VALID_PAGE or INVALID_PAGE(a fi= eld=20 + + + + + + + + + + + + + + + - - =20 - - - - - =20 - - - - - =20 - - - - - =20 - - - - - =20 - - - - - =20 - - - - - =20 - - - - - =20 - - - - - =20 - - - - - =20 - - - + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + - - =20 - - - - - =20 - - - - -
SizeTypeDescription
1 bytebooleanis page an overflow page
1 bytebyte +

page status is either VALID_PAGE or INVALID_PAGE(a field=20 + maintained in base page)

-

page goes thru the following transition:=20 +

page goes thru the following transition:=20 +
+ VALID_PAGE <-> deallocated page -> free page &l= t;->=20 + VALID_PAGE

-

deallocated and free page are both INVALID_PAGE as far = as BasePage=20 +

deallocated and free page are both INVALID_PAGE as far as= BasePage=20 + is concerned.=20 +
+ When a page is deallocated, it transitioned from VALID_P= AGE=20 + to INVALID_PAGE.=20 +
+ When a page is allocated, it trnasitioned from INVALID_P= AGE=20 - to VALID_PAGE.

8 byteslongpageVersion (a field maintained in base page)
2 bytesunsigned shortnumber of slots in slot offset table
4 bytesintegernext record identifier
4 bytesintegergeneration number of this page (Future Use)
4 bytesintegerprevious generation of this page (Future Use)
8 bytesbipLocationthe location of the beforeimage page (Future Use)
2 bytesunsigned shortnumber of deleted rows on page. (new release 2.0)
2 bytesunsigned short% of the page to keep free for updates
2 bytesshortspare for future use
4 byteslongspare for future use (encryption uses to write random by= tes=20 + + to VALID_PAGE.

+
8 byteslongpageVersion (a field maintained in base page)
2 bytesunsigned shortnumber of slots in slot offset table
4 bytesintegernext record identifier
4 bytesintegergeneration number of this page (Future Use)
4 bytesintegerprevious generation of this page (Future Use)
8 bytesbipLocationthe location of the beforeimage page (Future Use)
2 bytesunsigned shortnumber of deleted rows on page. (new release 2.0)
2 bytesunsigned short% of the page to keep free for updates
2 bytesshortspare for future use
4 byteslongspare for future use (encryption uses to write random byte= s=20 + here).
8 byteslongspare for future use
8 byteslongspare for future use
- Spare space is guaranteed to be writen with "0", so that f= uture=20 +

8 byteslongspare for future use
8 byteslongspare for future use
+ Spare space is guaranteed to be writen with "0", so that fut= ure=20 + use of field should not either not use "0" as a valid data ite= m or=20 + pick 0 as a valid default value so that on the fly upgrade can= assume=20 + that 0 means field was never assigned. - =20
-
=20 +
Records - =20 -

The records section contains zero or more records. Each recor= d starts=20 +

The records section contains zero or more records. Each record = starts=20 + with a Record Header

- =20 - - =20 - - - - =20 - - +
Record Header
TypeDescription
1 byte

Status bits for the record header

- =20 - =20 - - - - =20 - - - - =20 - - + + + + + + + + + + + + + + + + - - =20 - - + + + - -
RECORD_INITIALused when record header is first initialized
RECORD_DELETEDused to indicate the record has been deleted
RECORD_OVERFLOWused to indicate the record has been overflowed, i= t will=20 + + + + + + + + + - - =20 - - - - =20 - - - - =20 - - - - =20 - - +
Record Header
TypeDescription
1 byte +

Status bits for the record header

+ + + + + + + + + + + + - - =20 - - + + + - - =20 - - + + + - -
RECORD_INITIALused when record header is first initialized
RECORD_DELETEDused to indicate the record has been deleted
RECORD_OVERFLOWused to indicate the record has been overflowed, it = will=20 + point to the overflow page and ID
RECORD_HAS_FIRST_FIELDused to indicate that firstField is stored will be= stored.=20 +
RECORD_HAS_FIRST_FIELDused to indicate that firstField is stored will be s= tored.=20 + When RECORD_OVERFLOW and RECORD_HAS_FIRST_FIELD both= are=20 + set, part of record is on the page, the record heade= r also=20 + stores the overflow point to the next part of the re= cord.
RECORD_VALID_MASKA mask of valid bits that can be set currently, su= ch that=20 +
RECORD_VALID_MASKA mask of valid bits that can be set currently, such= that=20 + the following assert can be made:
compressed intrecord identifier
compressed longoverflow page only if RECORD_OVERFLOW is set
compressed intoverflow id only if RECORD_OVERFLOW is set
compressed intfirst field only if RECORD_HAS_FIRST_FIELD is set - othe= rwise=20 +
+
compressed intrecord identifier
compressed longoverflow page only if RECORD_OVERFLOW is set
compressed intoverflow id only if RECORD_OVERFLOW is set
compressed intfirst field only if RECORD_HAS_FIRST_FIELD is set - otherw= ise=20 + 0
compressed intnumber of fields in this portion - only if RECORD_OVERFL= OW is=20 +
compressed intnumber of fields in this portion - only if RECORD_OVERFLOW= is=20 + false OR RECORD_HAS_FIRST_FIELD is true - otherwise 0
- A row is long if all of it's columns = can't fit on a single page.=20 +
+ A row is long if all of it's columns ca= n't fit on a single page.=20 + When storing a long row, the segment of the row which fits on = the=20 + page is left there, and a pointer column is added at the end o= f the=20 + row. It points to another row in the same container on a diffe= rent=20 + page. That row will contain the next set of columns and a cont= inuation=20 + pointer if necessary. The overflow portion will be on an "over= flow"=20 + page, and that page may have overflow portions of other rows o= n it=20 + (unlike overflow columns). -

The Record Header is followed by one or more fields. Each fie= ld contains=20 +

The Record Header is followed by one or more fields. Each field= contains=20 + a Field Header and optional Field Data.

- =20 - - =20 - - + +
Stored Field Header Format
status

The status is 1 byte, it indicates the state of the= field.=20 + + + + + + + + + - =20 - - + + + - -
Stored Field Header Format
status +

The status is 1 byte, it indicates the state of the fiel= d=2E=20 + A FieldHeader can be in the following states:

- =20 - =20 - - - - =20 - - - - =20 - - +
NULLif the field is NULL, no field data length is st= ored
OVERFLOWindicates the field has been overflowed to anoth= er page.=20 + + + + + + + + - - =20 - - + + + - - =20 - - + + + - - =20 - - - - =20 - - + + + + + + + - -
NULLif the field is NULL, no field data length is stored=
OVERFLOWindicates the field has been overflowed to another p= age.=20 + overflow page and overflow ID is stored at the end= of=20 + the user data. field data length must be a number = greater=20 + or equal to 0, indicating the length of the field = that=20 + is stored on the current page. The format looks li= ke this:=20 - 3D""/ + + 3D"" + overflowPage will be written as compressed long, o= verflowId=20 + will be written as compressed Int
NONEXISTENTthe field no longer exists, e.g. column has been= dropped=20 +
NONEXISTENTthe field no longer exists, e.g. column has been dro= pped=20 + during an alter table
EXTENSIBLEthe field is of user defined data type. The fiel= d may=20 +
EXTENSIBLEthe field is of user defined data type. The field ma= y=20 + be tagged.
TAGGEDthe field is TAGGED if and only if it is EXTENSI= BLE.
FIXEDthe field is FIXED if and only if it is used in = the=20 +
TAGGEDthe field is TAGGED if and only if it is EXTENSIBLE.=
FIXEDthe field is FIXED if and only if it is used in the=20 + log records for version 1.2 and higher.
-
fieldDataLength The fieldDataLength is only set if the field is not NUL= L=2E It=20 +
+
fieldDataLength The fieldDataLength is only set if the field is not NULL.= It=20 + is the length of the field that is stored on the current p= age.=20 + The fieldDataLength is a variable length CompressedInt. -
fieldData

Overflow page and overflow id are stored as field da= ta. If=20 +

fieldData +

Overflow page and overflow id are stored as field data. = If=20 + the overflow bit in status is set, the field data is the o= verflow=20 + information. When the overflow bit is not set in status, t= hen,=20 + fieldData is the actually user data for the field. That me= ans,=20 + field header consists only field status, and field data le= ngth.=20 +
+ A non-overflow field:=20 -
3D""/=
+ +
3D""=
+ An overflow field:=20 -
3D""/ overflowPage=20 - and overflowID
+ +
3D""
overflowPage=20 + + and overflowID
+ The overflowPage is a variable length CompressedLong, over= flowID=20 + is a variable Length CompressedInt. They are only stored w= hen=20 + the field state is OVERFLOW. And they are not stored in th= e field=20 + header. Instead, they are stored at the end of the field d= ata.=20 + The reason we do that is to save a copy if the field has t= o overflow.

-
- A column is long if it can't fit o= n a single page. A long column=20 +

+ A column is long if it can't fit on = a single page. A long column=20 + is marked as long in the base row, and it's field contains a p= ointer=20 + to a chain of other rows in the same container with contain th= e data=20 + of the row. Each of the subsequent rows is on a page to itself= . Each=20 + subsquent row, except for the last piece has 2 columns, the fi= rst=20 + is the next segment of the row and the second is the pointer t= o the=20 + the following segment. The last segment only has the data segm= ent.=20 + - =20
-
=20 +
Slot Offset Table

The slot offset table is a table of 6 or 12 bytes per record, d= epending=20 + on the pageSize being less or greater than 64K:

- =20 - - =20 - - - - =20 - - - =20 - - - - =20 - - +
Slot Table Record
SizeContent
2 bytes (unsigned short) or 4 bytes (int)page offset for the record that is assigned to the slot<= /td> -
2 bytes (unsigned short) or 4 bytes (int)the length of the record on this page.
2 bytes (unsigned short) or 4 bytes (int)the length of the reserved number of bytes for this reco= rd on=20 + + + + + + + + + + + + + + + + - -
Slot Table Record
SizeContent
2 bytes (unsigned short) or 4 bytes (int)page offset for the record that is assigned to the slot +
2 bytes (unsigned short) or 4 bytes (int)the length of the record on this page.
2 bytes (unsigned short) or 4 bytes (int)the length of the reserved number of bytes for this record= on=20 + this page.
-

+

+

+ First slot is slot 0. The slot table grows backwards. Slots are = never=20 + left empty.

-
=20 +
Checksum

8 bytes of a java.util.zip.CRC32 checksum of the entire's page = contents=20 + without the 8 bytes representing the checksum.

-
-
=20 +
+
Allocation Page

An allocation page of the file container extends a normal Stored= page,=20 + with the exception that a hunk of space may be 'borrowed' by the f= ile=20 + container to store the file header.

The borrowed space is not visible to the alloc page even though = it is=20 + present in the page data array. It is accessed directly by the Fil= eContainer.=20 + Any change made to the borrowed space is not managed or seen by th= e allocation=20 + page.

The reason for having this borrowed space is so that the contain= er header=20 + does not need to have a page of its own.

-

=20 +

Page Format
+ An allocation page extends a stored page, the on disk format is di= fferent=20 + from a stored page in that N bytes are 'borrowed' by the container= and=20 + the page header of an allocation page will be slightly bigger than= a normal=20 + stored page. This N bytes are stored between the page header and t= he record=20 + space.

The reason why this N bytes can't simply be a row is because it = needs=20 + to be statically accessible by the container object to avoid a chi= cken=20 + and egg problem of the container object needing to instantiate an = alloc=20 + page object before it can be objectified, and an alloc page object= needing=20 + to instantiate a container object before it can be objectified. So= this=20 + N bytes must be stored outside of the normal record interface yet = it must=20 + be settable because only the first alloc page has this borrowed sp= ace.=20 + Other (non-first) alloc page have N =3D=3D 0.=20 -
- 3D""/ + +
3D""

+

+ + N is a byte that indicates the size of the borrowed space. Once an alloc + + page is initialized, the value of N cannot change. + +

+

+ + The maximum space that can be borrowed by the container is 256 bytes. + +

+

+ + The allocation pages are of the same page size as any other pages in the + + container. The first allocation page of the FileContainer starts at the + + first physical byte of the container. Subsequent allocation pages are + + chained via the nextAllocPageOffset. Each allocation page is expected to + + manage at least 1000 user pages (for 1K page size) so this chaining may n= ot + + be a severe performance hit. The logical -> physical mapping of an + + allocation page is stored in the previous allocation page. The container + + object will need to maintain this mapping.

+

+ + The following fields are stored in the page header: + +

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + Format of Alloc Page + +
+ + Type + + + + Description + +
+ + int + + + + FormatId + +
StoredPageHeadersee Stored Page Header
longnextAllocPageNumber - the next allocation page's number
longnextAllocPageOffset - the file offset of the next allocation= page
longreserved1 - reserved for future usage
longreserved2 - reserved for future usage
longreserved3 - reserved for future usage
longreserved4 - reserved for future usage
byteN - the size of the borrowed container info
byte[N]containerInfo - the content of the borrowed container info +
AllocExtentThe one and only extent on this alloc page.
+

+ + The allocation page contains allocation extent rows. In this first cut + + implementation, there is only 1 allocation extent row per allocation page. + +

+

+ + The allocation extent row is an externalizable object and is directly + + written on to the page by the alloc page. In other words, it will not be + + converted in to a storeableRow. This is to cut down overhead, enhance + + performance and gives more control of the size and layout of the allocati= on + + extent row to the alloc page. + +

+
+ + + Alloc Page detailed implementation notes +

+ + Create Container - an embryonic allocation page is formatted on disk by a + + special static function to avoid instantiating a full AllocPage object. + + This embryonic allocation has enough information that it can find the + + file header and not much else. Then the allocation page is properly + + initialized by creating the first extent. + +

+

+ + Open Container - A static AllocPage method will be used to read off the + + container information directly from disk. Even if + + the first alloc page (page 0) is already in the page cache, it will not be + + used because cleaning the alloc page will introduce a deadlock if the + + container is not in the container cache. Long term, the first alloc page + + should probably live in the container cache rather than in the page cache. + +

+

+ + Get Page - The first alloc page (page 0) will be read into the page cache. + + Continue to follow the alloc page chain until the alloc page that manages + + the specified page is found. From the alloc page, the physical offset of + + the specified page is located. + +

+

+ + Cleaning alloc page - the alloc page is written out the same way any page + + is written out. The container object will provide a call back to the all= oc + + page to write the current version of the container object back into the + + borrowed space before the alloc page itself is written out. + +

+

+ + Cleaning the container object - get the the first alloc page, dirty it and + + clean it (which will cause it to call the container object to write itself + + out into the borrowed space). The versioning of the container is + + independent of the versioning of the alloc page. The container version is + + stored inside the borrowed space and is opaque to the alloc page. + +

+

For the fields in an allocation extent row.

+
+
+
+ Allocation Extent +

+ + An allocation extent row manages the page status of page in the extent. + + AllocExtent is externalizable and is written to the AllocPage directly, + + without being converted to a row first. + +

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Format of Allocation Extent
TypeDescription
longextentOffset - the begin physical byte offset of the first p= age of this extent
longextentStart - the first logical page mananged by this extent= .
longextentEnd - the last page this extent can ever hope to manag= e=2E
intextentLength - the number of pages allocated in this extent<= /td> +
int +

extentStatus - status bits for the whole extent. + +
HAS_DEALLOCATED - most likely, this extent has a deallocated=20 + + page somewhere. If !HAD_DEALLOCATED, the extent ha= s no deallocated page. + +
HAS_FREE - most likely, this extent has a free page somewhere. + + If !HAS_FREE, there is no free page in the extent. + +
ALL_FREE - most likely, this extent only has free pages, good=20 + + candidate for shrinking the file. + + If !ALL_FREE, the extent is not all free. + +
HAS_UNFILLED_PAGES - most likely, this extent has unfilled pages. + + if !HAS_UNFILLED_PAGES, all pages are filled. + +
KEEP_UNFILLED_PAGES - this extent keeps track of unfilled pages + + (post v1.3). If not set, this extent has no notion of + + unfilled page and has no unFilledPage bitmap. + +
NO_DEALLOC_PAGE_MAP - this extents do not have a dealloc and a + + free page bit maps. Prior to 2.0, there are 2 bit + + maps, a deallocate page bit map and a free page bit + + map. Cloudscape 2.0 and later merged the dealloc page + + bit map into the free page bit map. + +
RETIRED - this extent contains only 'retired' pages, never use=20 + + any page from this extent. The pages don't actual= ly=20 + + exist, i.e., it maps to nothing (physicalOffset is=20 + + garbage). The purpose of this extent is to blot o= ut a=20 + + range of logical page numbers that no longer exist= s=20 + + for this container. Use this to reuse a physical = page + + when a logical page has exhausted all recordId or = for + + logical pages that has been shrunk out. + +

+
intpreAllocLength - the number of pages that have been prealloc= ated
intreserved1
longreserved2 - reserved for future use
longreserved3 - reserved for future use
FreePages(bit)Bitmap of free pages. Bit[i] is ON if page is free for immed= iate (re)use.
unFilledPages(bit)Bitmap of pages that have free space. Bit[i] is ON if page i= s likely to be < 1/2 full.
+

+ + org.apache.derby.iapi.services.io.FormatableBitSet is used to store the = bit map. =20 + + FormatableBitSet is an externalizable class. + +

+

+ + A page can have the following logical state: + +
Free - a page that is free to be used + +
Valid - a page that is currently in use + +

+

+ + There is another type of transitional pages which pages that have been + + allocated on disk but has not yet been used. These pages are Free. + +

+

+ + Bit[K] freePages + + Bit[i] is ON iff page i maybe free for reuse. User must get the + + dealloc page lock on the free page to make sure the transaction. + +

+

+ + K is the size of the bit array, it must be >=3D length. +

-