From derby-dev-return-2003-apmail-db-derby-dev-archive=db.apache.org@db.apache.org Wed Feb 02 00:52:55 2005 Return-Path: Delivered-To: apmail-db-derby-dev-archive@www.apache.org Received: (qmail 13207 invoked from network); 2 Feb 2005 00:52:54 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (209.237.227.199) by minotaur-2.apache.org with SMTP; 2 Feb 2005 00:52:54 -0000 Received: (qmail 56013 invoked by uid 500); 2 Feb 2005 00:52:54 -0000 Delivered-To: apmail-db-derby-dev-archive@db.apache.org Received: (qmail 55784 invoked by uid 500); 2 Feb 2005 00:52:53 -0000 Mailing-List: contact derby-dev-help@db.apache.org; run by ezmlm Precedence: bulk list-help: list-unsubscribe: list-post: List-Id: Reply-To: "Derby Development" Delivered-To: mailing list derby-dev@db.apache.org Received: (qmail 55771 invoked by uid 99); 2 Feb 2005 00:52:53 -0000 X-ASF-Spam-Status: No, hits=0.0 required=10.0 tests= X-Spam-Check-By: apache.org Received-SPF: pass (hermes.apache.org: local policy) Received: from anchor-post-34.mail.demon.net (HELO anchor-post-34.mail.demon.net) (194.217.242.92) by apache.org (qpsmtpd/0.28) with ESMTP; Tue, 01 Feb 2005 16:52:51 -0800 Received: from mazumdar.demon.co.uk ([80.177.27.104] helo=pavilion) by anchor-post-34.mail.demon.net with smtp (Exim 4.42) id 1Cw8lD-000Iwj-Do for derby-dev@db.apache.org; Wed, 02 Feb 2005 00:52:48 +0000 Message-ID: <002901c508c1$977eb120$0200000a@lan> From: "Dibyendu Majumdar" To: "Derby Development" References: <3dee901305011721231508061c@mail.gmail.com> <41ED5411.7070302@bristowhill.com> <001201c4fd9b$2eb50530$1401a8c0@rpws002> <41EEAD7C.7020108@debrunners.com> <001801c4fe68$e5118100$0200000a@lan> <41F17533.7090700@bristowhill.com> <002a01c50004$91634b40$0200000a@lan> <41F18C91.1080706@bristowhill.com> <001001c50338$13595280$0200000a@lan> <41F6E08B.7090404@bristowhill.com> <000401c50340$88521c40$0200000a@lan> <41F6F0AC.5060705@bristowhill.com> <41F6F794.5060607@bristowhill.com> <002101c50732$7b9a5bc0$0200000a@lan> <41FE5F3D.5000308@bristowhill.com> <001701c507c2$ce128fe0$0200000a@lan> <41FE8158.8060605@bristowhill.com> <003501c507c8$815f7ea0$0200000a@lan> <41FE91BD.4030001@bristowhill.com> <41FEBFD0.5010201@bristowhill.com> Subject: Re: Derby architecture/design documents Date: Wed, 2 Feb 2005 00:53:21 -0000 MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="----=_NextPart_000_0026_01C508C1.9753F7A0" X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 6.00.2800.1409 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1409 X-Virus-Checked: Checked X-Spam-Rating: minotaur-2.apache.org 1.6.2 0/1000/N This is a multi-part message in MIME format. ------=_NextPart_000_0026_01C508C1.9753F7A0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit From: "Jean T. Anderson" > Dibyendu Majumdar's doc on "Derby On Disk Page Format" is now live at > http://incubator.apache.org/derby/papers/pageformats.html. Jean, Attached is an updated version and an additional aart file. All, I am going to work on the following areas in the next few days: a) Logging/recovery b) BTree notes I will post stuff as I progress. Regards Dibyendu ------=_NextPart_000_0026_01C508C1.9753F7A0 Content-Type: text/xml; name="pageformats.xml" Content-Transfer-Encoding: quoted-printable Content-Disposition: attachment; filename="pageformats.xml"
Derby On Disk Page Format This document describes the storage format of Derby disk = pages.=20 This is a work-in-progress derived from Javadoc comments and=20 from explanations Mike Matrigali posted to the Derby lists.=20 Please post questions, comments, and corrections to=20 derby-dev@db.apache.org.
Introduction

Derby stores table and index data in Containers, which = currently map=20 to files in the seg0 directory of the database. In the current Derby implementation = there is a a 1 to 1 mapping of containers to files. Two containers never map to a single file = and 1 container never maps to multiple containers.

Data is stored in pages within the container.

A page contains a set of records, which can be accessed by = "slot", which=20 defines the order of the records on the page, or by "id" which = defines=20 the identity of the records on the page. Clients access records = by both=20 slot and id, depending on their needs.

A Table or a BTree index provides a row-based access mechanism = (row-based=20 access interface is known as conglomerate). Rows are mapped to = records=20 in data pages; in case of a table, a single row can span = multiple records in=20 multiple pages.

A container can have three types of pages:

  • Header Page - which is just a specialized version of the = Alloc Page.
  • Data Pages which hold data, and
  • Alloc Pages which hold page allocation information. An Alloc = page is a specialized verion of the Data page.

The container can be visualised as:
3D""

Header Page is currently always page 0 of the container. It contains information that raw store needs to maintain about the container once per container, and is currently implemented as an Alloc Page which "borrows" space from the alloc page for it's information. The original decision was that the designers did not want to waste a = whole page for header information, so a part of the page was used and the first = allocation map was put on the second half of it. See AllocPage.java = for info about layout and borrowing.

Allocation Page - After page 0, all = subsequent Allocation pages only have allocation bit maps.

Data Page Format

A data page is broken into five sections.=20 3D""

Format Id

The formatId is a 4 bytes array, it contains the format Id = of this=20 page. The possible values are RAW_STORE_STORED_PAGE or = RAW_STORE_ALLOC_PAGE.

Page Header

The page header is a fixed size, 56 bytes.

Size Type Description
1 byte boolean is page an overflow page
1 byte byte

page status is either VALID_PAGE or INVALID_PAGE(a = field=20 maintained in base page)

page goes thru the following transition:=20
VALID_PAGE <-> deallocated page -> free page = <->=20 VALID_PAGE

deallocated and free page are both INVALID_PAGE as far = as BasePage=20 is concerned.=20
When a page is deallocated, it transitioned from = VALID_PAGE=20 to INVALID_PAGE.=20
When a page is allocated, it trnasitioned from = INVALID_PAGE=20 to VALID_PAGE.

8 bytes long pageVersion (a field maintained in base page)
2 bytes unsigned short number of slots in slot offset table
4 bytes integer next record identifier
4 bytes integer generation number of this page (Future Use)
4 bytes integer previous generation of this page (Future Use)
8 bytes bipLocation the location of the beforeimage page (Future Use)
2 bytes unsigned short number of deleted rows on page. (new release 2.0)
2 bytes unsigned short % of the page to keep free for updates
2 bytes short spare for future use
4 bytes long spare for future use (encryption uses to write random = bytes=20 here).
8 bytes long spare for future use
8 bytes long spare for future use
Spare space is guaranteed to be writen with "0", so that = future=20 use of field should not either not use "0" as a valid data = item or=20 pick 0 as a valid default value so that on the fly upgrade = can assume=20 that 0 means field was never assigned.
Records

The records section contains zero or more records. Each = record starts=20 with a Record Header

Record Header
Type Description
1 byte

Status bits for the record header

RECORD_INITIAL used when record header is first initialized
RECORD_DELETED used to indicate the record has been deleted
RECORD_OVERFLOW used to indicate the record has been overflowed, = it will=20 point to the overflow page and ID
RECORD_HAS_FIRST_FIELD used to indicate that firstField is stored will be = stored.=20 When RECORD_OVERFLOW and RECORD_HAS_FIRST_FIELD = both are=20 set, part of record is on the page, the record = header also=20 stores the overflow point to the next part of the = record.
RECORD_VALID_MASK A mask of valid bits that can be set currently, = such that=20 the following assert can be made:
compressed int record identifier
compressed long overflow page only if RECORD_OVERFLOW is set
compressed int overflow id only if RECORD_OVERFLOW is set
compressed int first field only if RECORD_HAS_FIRST_FIELD is set - = otherwise=20 0
compressed int number of fields in this portion - only if = RECORD_OVERFLOW is=20 false OR RECORD_HAS_FIRST_FIELD is true - otherwise = 0
A row is long if all of it's columns = can't fit on a single page.=20 When storing a long row, the segment of the row which fits = on the=20 page is left there, and a pointer column is added at the end = of the=20 row. It points to another row in the same container on a = different=20 page. That row will contain the next set of columns and a = continuation=20 pointer if necessary. The overflow portion will be on an = "overflow"=20 page, and that page may have overflow portions of other rows = on it=20 (unlike overflow columns).

The Record Header is followed by one or more fields. Each = field contains=20 a Field Header and optional Field Data.

Stored Field Header Format
status

The status is 1 byte, it indicates the state of the = field.=20 A FieldHeader can be in the following states:

NULL if the field is NULL, no field data length is = stored
OVERFLOW indicates the field has been overflowed to another = page.=20 overflow page and overflow ID is stored at the = end of=20 the user data. field data length must be a = number greater=20 or equal to 0, indicating the length of the = field that=20 is stored on the current page. The format looks = like this:=20 3D"" overflowPage will be written as compressed long, = overflowId=20 will be written as compressed Int
NONEXISTENT the field no longer exists, e.g. column has been = dropped=20 during an alter table
EXTENSIBLE the field is of user defined data type. The field = may=20 be tagged.
TAGGED the field is TAGGED if and only if it is = EXTENSIBLE.
FIXED the field is FIXED if and only if it is used in = the=20 log records for version 1.2 and higher.
fieldDataLength The fieldDataLength is only set if the field is not = NULL. It=20 is the length of the field that is stored on the current = page.=20 The fieldDataLength is a variable length CompressedInt. =
fieldData

Overflow page and overflow id are stored as field = data. If=20 the overflow bit in status is set, the field data is the = overflow=20 information. When the overflow bit is not set in status, = then,=20 fieldData is the actually user data for the field. That = means,=20 field header consists only field status, and field data = length.=20
A non-overflow field:=20
3D""
An overflow field:=20
3D""
overflowPage=20 and overflowID
The overflowPage is a variable length CompressedLong, = overflowID=20 is a variable Length CompressedInt. They are only stored = when=20 the field state is OVERFLOW. And they are not stored in = the field=20 header. Instead, they are stored at the end of the field = data.=20 The reason we do that is to save a copy if the field has = to overflow.

A column is long if it can't fit = on a single page. A long column=20 is marked as long in the base row, and it's field contains a = pointer=20 to a chain of other rows in the same container with contain = the data=20 of the row. Each of the subsequent rows is on a page to = itself. Each=20 subsquent row, except for the last piece has 2 columns, the = first=20 is the next segment of the row and the second is the pointer = to the=20 the following segment. The last segment only has the data = segment.=20
Slot Offset Table

The slot offset table is a table of 6 or 12 bytes per record, = depending=20 on the pageSize being less or greater than 64K:

Slot Table Record
Size Content
2 bytes (unsigned short) or 4 bytes (int) page offset for the record that is assigned to the = slot
2 bytes (unsigned short) or 4 bytes (int) the length of the record on this page.
2 bytes (unsigned short) or 4 bytes (int) the length of the reserved number of bytes for this = record on=20 this page.

First slot is slot 0. The slot table grows backwards. Slots = are never=20 left empty.

Checksum

8 bytes of a java.util.zip.CRC32 checksum of the entire's = page contents=20 without the 8 bytes representing the checksum.

Allocation Page

An allocation page of the file container extends a normal = Stored page,=20 with the exception that a hunk of space may be 'borrowed' by the = file=20 container to store the file header.

The borrowed space is not visible to the alloc page even = though it is=20 present in the page data array. It is accessed directly by the = FileContainer.=20 Any change made to the borrowed space is not managed or seen by = the allocation=20 page.

The reason for having this borrowed space is so that the = container header=20 does not need to have a page of its own.

Page Format
An allocation page extends a stored page, the on disk format is = different=20 from a stored page in that N bytes are 'borrowed' by the = container and=20 the page header of an allocation page will be slightly bigger = than a normal=20 stored page. This N bytes are stored between the page header and = the record=20 space.

The reason why this N bytes can't simply be a row is because = it needs=20 to be statically accessible by the container object to avoid a = chicken=20 and egg problem of the container object needing to instantiate = an alloc=20 page object before it can be objectified, and an alloc page = object needing=20 to instantiate a container object before it can be objectified. = So this=20 N bytes must be stored outside of the normal record interface = yet it must=20 be settable because only the first alloc page has this borrowed = space.=20 Other (non-first) alloc page have N =3D=3D 0.=20
3D""

N is a byte that indicates the size of the borrowed space. Once an = alloc page is initialized, the value of N cannot change.

The maximum space that can be borrowed by the container is 256 bytes.

The allocation pages are of the same page size as any other pages in = the container. The first allocation page of the FileContainer starts at the first physical byte of the container. Subsequent allocation pages are chained via the nextAllocPageOffset. Each allocation page is expected = to manage at least 1000 user pages (for 1K page size) so this chaining may = not be a severe performance hit. The logical -> physical mapping of an allocation page is stored in the previous allocation page. The = container object will need to maintain this mapping.

The following fields are stored in the page header:

Format of Alloc Page
Type Description
int FormatId
StoredPageHeader see Stored Page Header
long nextAllocPageNumber - the next allocation page's = number
long nextAllocPageOffset - the file offset of the next = allocation page
long reserved1 - reserved for future usage
long reserved2 - reserved for future usage
long reserved3 - reserved for future usage
long reserved4 - reserved for future usage
byte N - the size of the borrowed container info
byte[N] containerInfo - the content of the borrowed container = info
AllocExtent The one and only extent on this alloc page.

The allocation page contains allocation extent rows. In this first cut implementation, there is only 1 allocation extent row per allocation = page.

The allocation extent row is an externalizable object and is directly written on to the page by the alloc page. In other words, it will not = be converted in to a storeableRow. This is to cut down overhead, enhance performance and gives more control of the size and layout of the = allocation extent row to the alloc page.

Alloc Page detailed implementation notes

Create Container - an embryonic allocation page is formatted on disk by = a special static function to avoid instantiating a full AllocPage object. This embryonic allocation has enough information that it can find the file header and not much else. Then the allocation page is properly initialized by creating the first extent.

Open Container - A static AllocPage method will be used to read off the container information directly from disk. Even if the first alloc page (page 0) is already in the page cache, it will not = be used because cleaning the alloc page will introduce a deadlock if the container is not in the container cache. Long term, the first alloc = page should probably live in the container cache rather than in the page = cache.

Get Page - The first alloc page (page 0) will be read into the page = cache. Continue to follow the alloc page chain until the alloc page that = manages the specified page is found. From the alloc page, the physical offset = of the specified page is located.

Cleaning alloc page - the alloc page is written out the same way any = page is written out. The container object will provide a call back to the = alloc page to write the current version of the container object back into the borrowed space before the alloc page itself is written out.

Cleaning the container object - get the the first alloc page, dirty it = and clean it (which will cause it to call the container object to write = itself out into the borrowed space). The versioning of the container is independent of the versioning of the alloc page. The container version = is stored inside the borrowed space and is opaque to the alloc page.

For the fields in an allocation extent row.

Allocation Extent

An allocation extent row manages the page status of page in the extent. AllocExtent is externalizable and is written to the AllocPage directly, without being converted to a row first.

Format of Allocation Extent
Type Description
long extentOffset - the begin physical byte offset of the first = page of this extent
long extentStart - the first logical page mananged by this = extent.
long extentEnd - the last page this extent can ever hope to = manage.
int extentLength - the number of pages allocated in this = extent
int

extentStatus - status bits for the whole extent.
HAS_DEALLOCATED - most likely, this extent has a deallocated=20 page somewhere. If !HAD_DEALLOCATED, the extent = has no deallocated page.
HAS_FREE - most likely, this extent has a free page somewhere. If !HAS_FREE, there is no free page in the extent.
ALL_FREE - most likely, this extent only has free pages, good=20 candidate for shrinking the file. If !ALL_FREE, the extent is not all free.
HAS_UNFILLED_PAGES - most likely, this extent has unfilled = pages. if !HAS_UNFILLED_PAGES, all pages are filled.
KEEP_UNFILLED_PAGES - this extent keeps track of unfilled pages (post v1.3). If not set, this extent has no notion of unfilled page and has no unFilledPage bitmap.
NO_DEALLOC_PAGE_MAP - this extents do not have a dealloc and a free page bit maps. Prior to 2.0, there are 2 bit maps, a deallocate page bit map and a free page bit map. Cloudscape 2.0 and later merged the dealloc page bit map into the free page bit map.
RETIRED - this extent contains only 'retired' pages, never use=20 any page from this extent. The pages don't = actually=20 exist, i.e., it maps to nothing (physicalOffset = is=20 garbage). The purpose of this extent is to blot = out a=20 range of logical page numbers that no longer = exists=20 for this container. Use this to reuse a = physical page when a logical page has exhausted all recordId = or for logical pages that has been shrunk out.

int preAllocLength - the number of pages that have been = preallocated
int reserved1
long reserved2 - reserved for future use
long reserved3 - reserved for future use
FreePages(bit) Bitmap of free pages. Bit[i] is ON if page is free for = immediate (re)use.
unFilledPages(bit) Bitmap of pages that have free space. Bit[i] is ON if page = is likely to be < 1/2 full.

org.apache.derby.iapi.services.io.FormatableBitSet is used to store = the bit map. =20 FormatableBitSet is an externalizable class.

A page can have the following logical state:
Free - a page that is free to be used
Valid - a page that is currently in use

There is another type of transitional pages which pages that have been allocated on disk but has not yet been used. These pages are Free.

Bit[K] freePages Bit[i] is ON iff page i maybe free for reuse. User must get the dealloc page lock on the free page to make sure the transaction.

K is the size of the bit array, it must be >=3D length.

------=_NextPart_000_0026_01C508C1.9753F7A0 Content-Type: application/octet-stream; name="container-format.aart" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename="container-format.aart" +--------+ | header | +--------+ | data | +--------+ | data | +--------+ | ... | +--------+ | alloc | +--------+ | data | +--------+ ------=_NextPart_000_0026_01C508C1.9753F7A0--