Added: websites/production/db/content/derby/papers/pageformats.html ============================================================================== --- websites/production/db/content/derby/papers/pageformats.html (added) +++ websites/production/db/content/derby/papers/pageformats.html Wed Dec 19 18:20:21 2012 @@ -0,0 +1,1440 @@ + + + + + + + +Derby On Disk Page Format + + + + + + + + + +
+ +
+apache > db +
+ +
+ + + + +
+Apache DB Project +
+ + + + +
+
+
+
+ +
+ + +
+ +
+ +   +
+ + + + + +
+
Font size: +   +   +   +
+

Derby On Disk Page Format

+
This document describes the storage format of Derby disk pages. + + This is a work-in-progress derived from Javadoc comments and + + from explanations Mike Matrigali posted to the Derby lists. + + Please post questions, comments, and corrections to + + derby-dev@db.apache.org. + +
+ + + +

Introduction

+
+

Derby stores table and index data in Containers, which currently map + + to files in the seg0 + + directory of the database. In the current Derby implementation there is a 1 to 1 mapping of + + containers to files. Two containers never map to a single file and 1 + + container never maps to multiple files.

+

+ + Data is stored in pages within the container.

+

A page contains a set of records, which can be accessed by "slot", which + + defines the order of the records on the page, or by "id" which defines + + the identity of the records on the page. Clients access records by both + + slot and id, depending on their needs.

+

A Table or a BTree index provides a row-based access mechanism (row-based + + access interface is known as conglomerate). Rows are mapped to records + + in data pages; in case of a table, a single row can span multiple records in + + multiple pages.

+

A container can have three types of pages:

+
    + +
  • Header Page - which is just a specialized version of the Alloc Page.
  • + +
  • Data Pages which hold data, and
  • + +
  • Alloc Pages which hold page allocation information. An Alloc page is a specialized verion of the Data page.
  • + +
+

The container can be visualised as:
+

+

+ +Header Page is currently always page 0 of the container. It + +contains information that raw store needs to maintain about the + +container once per container, and is currently implemented as an Alloc + +Page which "borrows" space from the alloc page for it's information. + +The original decision was that the designers did not want to waste a whole page for + +header information, so a part of the page was used and the first allocation + +map was put on the second half of it. See AllocPage.java for info about layout and + +borrowing. + +

+

+ + Allocation Page - After page 0, all subsequent Allocation pages only + +have allocation bit maps. + +

+
+ + +

Data Page Format

+
+

A data page is broken into five sections. + +

+ +

Format Id

+

The formatId is a 4 bytes array, it contains the format Id of this + + page. The possible values are RAW_STORE_STORED_PAGE or RAW_STORE_ALLOC_PAGE.

+ +

Page Header

+

The page header is a fixed size, 56 bytes.

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
SizeTypeDescription
1 bytebooleanis page an overflow page
1 bytebyte + +

page status is either VALID_PAGE or INVALID_PAGE(a field + + maintained in base page)

+ +

page goes thru the following transition: + +
+ + VALID_PAGE <-> deallocated page -> free page <-> + + VALID_PAGE

+ +

deallocated and free page are both INVALID_PAGE as far as BasePage + + is concerned. + +
+ + When a page is deallocated, it transitioned from VALID_PAGE + + to INVALID_PAGE. + +
+ + When a page is allocated, it trnasitioned from INVALID_PAGE + + to VALID_PAGE.

+ +
8 byteslongpageVersion (a field maintained in base page)
2 bytesunsigned shortnumber of slots in slot offset table
4 bytesintegernext record identifier
4 bytesintegergeneration number of this page (Future Use)
4 bytesintegerprevious generation of this page (Future Use)
8 bytesbipLocationthe location of the beforeimage page (Future Use)
2 bytesunsigned shortnumber of deleted rows on page. (new release 2.0)
2 bytesshortspare for future use
4 bytesintegerspare for future use (encryption uses to write random bytes + + here).
8 byteslongspare for future use
8 byteslongspare for future use
+
+
Note
+
Spare space is guaranteed to be writen with "0", so that future + + use of field should not either not use "0" as a valid data item or + + pick 0 as a valid default value so that on the fly upgrade can assume + + that 0 means field was never assigned.
+
+ +

Records

+

The records section contains zero or more records. Each record starts + + with a Record Header

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Record Header
TypeDescription
1 byte + +

Status bits for the record header

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
RECORD_DELETEDused to indicate the record has been deleted
RECORD_OVERFLOWused to indicate the record has been overflowed, it will + + point to the overflow page and ID
RECORD_HAS_FIRST_FIELDused to indicate that firstField is stored will be stored. + + When RECORD_OVERFLOW and RECORD_HAS_FIRST_FIELD both are + + set, part of record is on the page, the record header also + + stores the overflow point to the next part of the record.
RECORD_VALID_MASKA mask of valid bits that can be set currently, such that + + the following assert can be made:
+ +
compressed intrecord identifier
compressed longoverflow page only if RECORD_OVERFLOW is set
compressed intoverflow id only if RECORD_OVERFLOW is set
compressed intfirst field only if RECORD_HAS_FIRST_FIELD is set - otherwise + + 0
compressed intnumber of fields in this portion - only if RECORD_OVERFLOW is + + false OR RECORD_HAS_FIRST_FIELD is true - otherwise 0
+
+
Long Rows
+
A row is long if all of it's columns can't fit on a single page. + + When storing a long row, the segment of the row which fits on the + + page is left there, and a pointer column is added at the end of the + + row. It points to another row in the same container on a different + + page. That row will contain the next set of columns and a continuation + + pointer if necessary. The overflow portion will be on an "overflow" + + page, and that page may have overflow portions of other rows on it + + (unlike overflow columns).
+
+

The Record Header is followed by one or more fields. Each field contains + + a Field Header and optional Field Data.

+ + + + + + + + + + + + + + + + + + + + + + + + + +
Stored Field Header Format
status + +

The status is 1 byte, it indicates the state of the field. + + A FieldHeader can be in the following states:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
NULLif the field is NULL, no field data length is stored
OVERFLOWindicates the field has been overflowed to another page. + + overflow page and overflow ID is stored at the end of + + the user data. field data length must be a number greater + + or equal to 0, indicating the length of the field that + + is stored on the current page. The format looks like this: + + + + overflowPage will be written as compressed long, overflowId + + will be written as compressed Int
NONEXISTENTthe field no longer exists, e.g. column has been dropped + + during an alter table
EXTENSIBLEthe field is of user defined data type. The field may + + be tagged.
TAGGEDthe field is TAGGED if and only if it is EXTENSIBLE.
FIXEDthe field is FIXED if and only if it is used in the + + log records for version 1.2 and higher.
+ +
fieldDataLength The fieldDataLength is only set if the field is not NULL. It + + is the length of the field that is stored on the current page. + + The fieldDataLength is a variable length CompressedInt.
fieldData + +

Overflow page and overflow id are stored as field data. If + + the overflow bit in status is set, the field data is the overflow + + information. When the overflow bit is not set in status, then, + + fieldData is the actually user data for the field. That means, + + field header consists only field status, and field data length. + +
+ + A non-overflow field: + +
+
+ + An overflow field: + +
+
+overflowPage + + and overflowID +
+ + The overflowPage is a variable length CompressedLong, overflowID + + is a variable Length CompressedInt. They are only stored when + + the field state is OVERFLOW. And they are not stored in the field + + header. Instead, they are stored at the end of the field data. + + The reason we do that is to save a copy if the field has to overflow.

+ +
+
+
Long Columns
+
A column is long if it can't fit on a single page. A long column + + is marked as long in the base row, and it's field contains a pointer + + to a chain of other rows in the same container with contain the data + + of the row. Each of the subsequent rows is on a page to itself. Each + + subsquent row, except for the last piece has 2 columns, the first + + is the next segment of the row and the second is the pointer to the + + the following segment. The last segment only has the data segment. + +
+
+ +

Slot Offset Table

+

The slot offset table is a table of 6 or 12 bytes per record, depending + + on the pageSize being less or greater than 64K:

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Slot Table Record
SizeContent
2 bytes (unsigned short) or 4 bytes (int)page offset for the record that is assigned to the slot
2 bytes (unsigned short) or 4 bytes (int)the length of the record on this page.
2 bytes (unsigned short) or 4 bytes (int)the length of the reserved number of bytes for this record on + + this page.
+

+ + First slot is slot 0. The slot table grows backwards. Slots are never + + left empty.

+ +

Checksum

+

8 bytes of a java.util.zip.CRC32 checksum of the entire's page contents + + without the 8 bytes representing the checksum.

+
+ + +

Allocation Page

+
+

An allocation page of the file container extends a normal Stored page, + + with the exception that a hunk of space may be 'borrowed' by the file + + container to store the file header.

+

The borrowed space is not visible to the alloc page even though it is + + present in the page data array. It is accessed directly by the FileContainer. + + Any change made to the borrowed space is not managed or seen by the allocation + + page.

+

The reason for having this borrowed space is so that the container header + + does not need to have a page of its own.

+

+ +Page Format + +
+ + An allocation page extends a stored page, the on disk format is different + + from a stored page in that N bytes are 'borrowed' by the container and + + the page header of an allocation page will be slightly bigger than a normal + + stored page. This N bytes are stored between the page header and the record + + space.

+

The reason why this N bytes can't simply be a row is because it needs + + to be statically accessible by the container object to avoid a chicken + + and egg problem of the container object needing to instantiate an alloc + + page object before it can be objectified, and an alloc page object needing + + to instantiate a container object before it can be objectified. So this + + N bytes must be stored outside of the normal record interface yet it must + + be settable because only the first alloc page has this borrowed space. + + Other (non-first) alloc page have N == 0. + +
+

+

+ + N is a byte that indicates the size of the borrowed space. Once an alloc + + page is initialized, the value of N cannot change. + +

+

+ + The maximum space that can be borrowed by the container is 256 bytes. + +

+

+ + The allocation pages are of the same page size as any other pages in the + + container. The first allocation page of the FileContainer starts at the + + first physical byte of the container. Subsequent allocation pages are + + chained via the nextAllocPageOffset. Each allocation page is expected to + + manage at least 1000 user pages (for 1K page size) so this chaining may not + + be a severe performance hit. The logical -> physical mapping of an + + allocation page is stored in the previous allocation page. The container + + object will need to maintain this mapping.

+

+ + The following fields are stored in the page header: + +

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + Format of Alloc Page + +
+ + Type + + + + Description + +
+ + int + + + + FormatId (Although 4 bytes are allocated, this uses only the first 2 bytes. Next 2 bytes are unused.) + +
StoredPageHeadersee Stored Page Header
longnextAllocPageNumber - the next allocation page's number
longnextAllocPageOffset - the file offset of the next allocation page
longreserved1 - reserved for future usage
longreserved2 - reserved for future usage
longreserved3 - reserved for future usage
longreserved4 - reserved for future usage
byteN - the size of the borrowed container info
byte[N]containerInfo - the content of the borrowed container info
AllocExtentThe one and only extent on this alloc page.
+

+ + The allocation page contains allocation extent rows. In this first cut + + implementation, there is only 1 allocation extent row per allocation page. + +

+

+ + The allocation extent row is an externalizable object and is directly + + written on to the page by the alloc page. In other words, it will not be + + converted in to a storeableRow. This is to cut down overhead, enhance + + performance and gives more control of the size and layout of the allocation + + extent row to the alloc page. + +

+ +

+ + Alloc Page detailed implementation notes

+

+ + Create Container - an embryonic allocation page is formatted on disk by a + + special static function to avoid instantiating a full AllocPage object. + + This embryonic allocation has enough information that it can find the + + file header and not much else. Then the allocation page is properly + + initialized by creating the first extent. + +

+

+ + Open Container - A static AllocPage method will be used to read off the + + container information directly from disk. Even if + + the first alloc page (page 0) is already in the page cache, it will not be + + used because cleaning the alloc page will introduce a deadlock if the + + container is not in the container cache. Long term, the first alloc page + + should probably live in the container cache rather than in the page cache. + +

+

+ + Get Page - The first alloc page (page 0) will be read into the page cache. + + Continue to follow the alloc page chain until the alloc page that manages + + the specified page is found. From the alloc page, the physical offset of + + the specified page is located. + +

+

+ + Cleaning alloc page - the alloc page is written out the same way any page + + is written out. The container object will provide a call back to the alloc + + page to write the current version of the container object back into the + + borrowed space before the alloc page itself is written out. + +

+

+ + Cleaning the container object - get the the first alloc page, dirty it and + + clean it (which will cause it to call the container object to write itself + + out into the borrowed space). The versioning of the container is + + independent of the versioning of the alloc page. The container version is + + stored inside the borrowed space and is opaque to the alloc page. + +

+

For the fields in an allocation extent row.

+
+ + +

Allocation Extent

+
+

+ + An allocation extent row manages the page status of page in the extent. + + AllocExtent is externalizable and is written to the AllocPage directly, + + without being converted to a row first. + +

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Format of Allocation Extent
TypeDescription
longextentOffset - the begin physical byte offset of the first page of this extent
longextentStart - the first logical page mananged by this extent.
longextentEnd - the last page this extent can ever hope to manage.
intextentLength - the number of pages allocated in this extent
int + +

extentStatus - status bits for the whole extent. + +
HAS_DEALLOCATED - most likely, this extent has a deallocated + + page somewhere. If !HAD_DEALLOCATED, the extent has no deallocated page. + +
HAS_FREE - most likely, this extent has a free page somewhere. + + If !HAS_FREE, there is no free page in the extent. + +
ALL_FREE - most likely, this extent only has free pages, good + + candidate for shrinking the file. + + If !ALL_FREE, the extent is not all free. + +
HAS_UNFILLED_PAGES - most likely, this extent has unfilled pages. + + if !HAS_UNFILLED_PAGES, all pages are filled. + +
KEEP_UNFILLED_PAGES - this extent keeps track of unfilled pages + + (post v1.3). If not set, this extent has no notion of + + unfilled page and has no unFilledPage bitmap. + +
NO_DEALLOC_PAGE_MAP - this extents do not have a dealloc and a + + free page bit maps. Prior to 2.0, there are 2 bit + + maps, a deallocate page bit map and a free page bit + + map. Cloudscape 2.0 and later merged the dealloc page + + bit map into the free page bit map. + +
RETIRED - this extent contains only 'retired' pages, never use + + any page from this extent. The pages don't actually + + exist, i.e., it maps to nothing (physicalOffset is + + garbage). The purpose of this extent is to blot out a + + range of logical page numbers that no longer exists + + for this container. Use this to reuse a physical page + + when a logical page has exhausted all recordId or for + + logical pages that has been shrunk out. + +

+ +
intpreAllocLength - the number of pages that have been preallocated
intreserved1
longreserved2 - reserved for future use
longreserved3 - reserved for future use
FreePages(bit)Bitmap of free pages. Bit[i] is ON if page is free for immediate (re)use.
unFilledPages(bit)Bitmap of pages that have free space. Bit[i] is ON if page is likely to be < 1/2 full.
+

+ + org.apache.derby.iapi.services.io.FormatableBitSet is used to store the bit map. + + FormatableBitSet is an externalizable class. + +

+

+ + A page can have the following logical state: + +
Free - a page that is free to be used + +
Valid - a page that is currently in use + +

+

+ + There is another type of transitional pages which pages that have been + + allocated on disk but has not yet been used. These pages are Free. + +

+

+ + Bit[K] freePages + + Bit[i] is ON iff page i maybe free for reuse. User must get the + + dealloc page lock on the free page to make sure the transaction. + +

+

+ + K is the size of the bit array, it must be >= length. + +

+
+ +
+ +
 
+
+ + + Added: websites/production/db/content/derby/papers/recovery.html ============================================================================== --- websites/production/db/content/derby/papers/recovery.html (added) +++ websites/production/db/content/derby/papers/recovery.html Wed Dec 19 18:20:21 2012 @@ -0,0 +1,918 @@ + + + + + + + +Derby Logging and Recovery + + + + + + + + + +
+ +
+apache > db +
+ +
+ + + + +
+Apache DB Project +
+ + + + +
+
+
+
+ +
+ + +
+ +
+ +   +
+ + + + + +
+
Font size: +   +   +   +
+

Derby Logging and Recovery

+
This document describes how Derby implements logging and recovery. + This is a work-in-progress derived from Javadoc comments and from explanations + Mike Matrigali and others posted to the Derby lists. Please post questions, + comments, and corrections to derby-dev@db.apache.org.
+ + + +

Introduction

+
+

Derby transaction logging and recovery is based upon the ARIES algorithm.

+
+ + +

ARIES - An Overview

+
+

Following is a brief description of the main principles behind ARIES.

+

Firstly, in ARIES, changes always take the system forward. That is to say, +even transaction rollbacks are treated as if they are updates to the system. +This is counter-inituitive to what the user thinks, because when a user asks for a +transaction to be rolled back, they assume that the system is going back +to a previous state of affairs. However, from the perspective of ARIES, there +is no such thing as going back. For example, if a transaction changes A to B +and then rolls back, ARIES treats the rollback as simply an update that +changes B to A. The forward change from A to B (redo) and the reversal of B to +A (undo) are both recorded as updates to the system. Changes during normal +operations are recorded as Redo-Undo log records. As the name implies, these +log records can be 'redone' in case of a system crash, or 'undone' in case a +rollback is required. Changes made during rollbacks, however, are recorded as +Redo-only log records. These log records are called Compensation Log Records +(CLRs). The reason these are redo only is that by definition a rollback does +not need to be undone, whereas normal updates need to be undone if the +transaction decides to rollback. +

+

The second basic principle of ARIES is that during recovery, history + is repeated. This can be explained as follows.

+

When a system crashes, there would be some transactions that have completed + (committed or aborted), and others that are still active. The WAL protocol + ensures that changes made by completed transactions have been recorded + in the Log. Changes made by incomplete transactions may also be present + in the Log, because Log Records are created in the same order as the changes + are made by the system.

+

During recovery, ARIES initially replays the Log to the bring the system + back to a state close to that when the crash occurred. This means that + ARIES replays the effects of not only those transactions that committed + or aborted, but also those that were active at the time of the crash. + Having brought the system to this state, ARIES then identifies transactions + that were incomplete, and rolls them back. The basic idea is to repeat + the entire history upto the point of crash, and then undo failed transactions.

+

This approach has the advantage that during the redo phase, changes can + be replayed at a fairly low level, for example, the level of a disk page. + ARIES calls this page oriented redo. This feature is significant because + it means that until the redo phase is over, the system does not need to + know about higher level data structures such as Indexes. Only during the + undo phase, when incomplete transactions are being rolled back, does the + system need to know about high level data structures.

+
+ + +

Features of ARIES

+
+

ARIES includes a number of optimisations to reduce the amount of work + required during normal operations and recovery.

+

One optimisation is to avoid application of log records unnecessarily. + The LSN of the most recently generated log record is stored in each disk + page. This is known as the PageLsn. The PageLsn allows ARIES to determine + during the redo phase, whether the changes represented by a log record + have been applied to the page or not.

+

ARIES chains log records for transactions in such a way that those records + that are no longer necessary, are skipped during recovery. For example, + if a transaction changed A to B, and then rolled back, generating a log + record for changing B to A, then during recovery, ARIES would automatically + skip the log record that represents the change from A to B. This is made + possible by maintaining a UndoLsn pointer in every Log Record. The UndoLsn + normally points to the previous log record generated by the transaction. + However, in log records generated during Rollback (known as Compensation + Log Records), the UndoLsn is made to point to the Log record preceding + the one that is being undone. To take an example, let us assume that a + transaction generated log record 1, containing change from A to B, then + log record 2 containing change from B to C. At this point the transaction + decides to rollback the change from B to C. It therefore generates a new + log record 3, containing a change from C to B. The UndoLsn of this log + record is made to point at log record 1, instead of log record 2. When + following the UndoLsn chain, ARIES would skip log record 2.

+

ARIES also supports efficient checkpoints. During a checkpoint, it is + not necessary to flush all database pages to disk. Instead ARIES records + a list of dirty buffer pages along with their RecoveryLsn(s). The RecoveryLsn + of a page is the LSN of the earliest log record that represents a change + to the page since it was read from disk. By using this list, ARIES is + able to determine during recovery, where to start replaying the Log.

+

ARIES supports nested top-level action concept whereby part of a transaction + can be commited even if the transaction aborts. This is useful for situations + where a structural change should not be undone even if the transaction + aborts. Nested top level actions are implemented using Dummy Compensation + Log Records - and make use of the ability to skip logs records using the + UndoLsn pointer as described previously.

+
+ + +

References

+
+
    + +
  1. +

    For a full description of ARIES, please see + Mohan, C., Haderle, D., Lindsay, B., Pirahesh, H., Schwarz, P. ARIES: + A Transaction Recovery Method Supporting Fine-Granularity Locking and + Partial Rollbacks Using Write-Ahead Logging, ACM Transactions on Database + Systems, Vol. 17, No. 1, March 1992, pp94-162. + A version of this document is freely available as + IBM Research + Report RJ6649.

    + +
  2. + +
  3. +

    A good description of Write Ahead Logging, and how a log is typically + implemented, can be found in + + Transaction + Processing: Concepts and Techniques + , by Jim Gray and Andreas Reuter, 1993, Morgan Kaufmann Publishers + .

    + +
  4. + +
+
+ + +

Derby implementation of ARIES

+
+

I shall only describe how Derby differs from standard ARIES implementation. + Therefore, for a full understanding of the logging and recovery mechanisms + in Derby, it is necessary to consult above mentioned papers on ARIES.

+

Derby uses Log Sequence Numbers to identify Log records. In Derby terminology, + LSNs are called LogInstants. LogCounter is an implementation of LogInstant.

+

Although Derby uses LogInstant, it does not save this with the page data. + Instead, a page version number is stored. The page version number is also + stored in the log records associated with the page. During recovery (redo), + Derby uses the page version to determine whether the page needs redo or + not. Here is a comment on the rationale behind this:

+

+ +Mike Matrigali: + +
+ Am going to defer on page version vs. LSN question, but at least mention + some guesses, not researched info. You are right bout what exists. I spoke + with some colleagues and the best we can come up with is that the implementor + wanted to separate the page and the log, in case we ever did a different + log format. I will try to do some more research here. I also vaguely remember + the implementor mentioning if we ever wanted to implement the LSN on the + page, we had space to do so. It may simply have been easier to code the + page versions, since in the current system the LSN is the address in the + log file (which and it may not be available when the code wants to write + it on the page). +
+ As you say in derby all the log records are associated with a page, and + thus have a page version number. That page version number in the log is + compared with the page version number of the page during redo to determine + if the log record needs to be applied. This also has helped us in the + past to catch some bugs as we can sanity check during redo that the page + is progressing along the expected path, ie. it is a bug during redo to + be applying a page version 10 log record to page that is at page version + 8. I haven't seen this sanity check in many years, but was useful when + the product was first coded.

+

Derby does not write the dirty pages list within a Checkpoint record. + Instead, during checkpoint, Derby flushes all database pages to + disk. The redo Low Water Mark (redoLWM) is set to the current LSN when the + checkpoint starts. The undo Low Water Mark (undoLWM) is set to the + starting LSN of the oldest active transaction. At restart, Derby replays + the log from redoLWM or undoLWM whichever is earlier. For a good description + of concepts behind the checkpoint method used by Derby, and the use of redo/undo Low + Water Marks, please refer to TPCT book (Section 11.3).

+

Derby uses 'internal' transactions instead of nested top-level actions + to separate structural changes from normal operations. Internal transactions + have the property that they are always page-oriented and do not require + logical undo, ie, undo is always physical. Also, during recovery, incomplete + internal transactions + are undone before any regular transactions. In ARIES, no special processing + is required to handle this, as nested top-level actions are automatically + handled as part of normal redo, and are skipped during undo unless they + are incomplete, in which case they are undone.

+

ARIES uses three passes during recovery. The first pass is the analysis + pass when ARIES collects information and determines where redo must start. + This is followed by the redo pass, and then by the undo pass. Derby omits + the analysis pass as this is not required due to the way checkpoints are + done.

+
+ + +

Derby recovery process

+
+

Implemented in org.apache.derby.impl.store.raw.log.LogToFile.recover() +

+

Following is a high level description of Derby recovery process in Derby.

+

In this implementation, the log is a stream of log records stored in + one or more flat files. Recovery is done in 2 passes: redo and undo.

+
+ +
Redo pass
+ +
In the redo pass, reconstruct the state of the rawstore by repeating + exactly what happened before as recorded in the log.
+ +
Undo pass
+ +
In the undo pass, all incomplete transactions are rolled back in + the order from the most recently started to the oldest.
+ +
+
+ + +

Recovery Redo pass

+
+

Implemented in org.apache.derby.impl.store.raw.log.FileLogger.redo() +

+

The log stream is scanned from the beginning (or + from the undo low water mark of a checkpoint) forward until the end. + The purpose of the redo pass is to repeat history, i.e, to repeat + exactly the same set of changes the rawStore went thru right before it + stopped. With each log record that is encountered in the redo pass:

+
    + +
  1. if it isFirst(), then the transaction factory is called upon to + create a new transaction object.
  2. + +
  3. if it needsRedo(), its doMe() is called (if it is a compensation + operation, then the undoable operation needs to be created first + before the doMe is called).
  4. + +
  5. if it isComplete(), then the transaction object is closed.
  6. + +
+
+ + +

Recovery Undo pass

+
+

Implemented in org.apache.derby.impl.store.raw.xact.XactFactory.rollbackAllTransactions() +

+

Rollback all active transactions that has updated the raw store. + Transactions are rolled back in the following order:

+
    + +
  1. Internal transactions in reversed beginXact chronological order
  2. + +
  3. all other transactions in reversed beginXact chronological order
  4. + +
+
+ + +

Checkpoints

+
+

Implemented in org.apache.derby.impl.store.raw.log.LogToFile.checkpoint() +

+

Only one checkpoint is to be taking place at any given time.

+

The steps of a checkpoint are:

+
    + +
  1. +

    Switch to a new log file if possible.

    + +
      + +
    1. Freeze the log (for the transition to a new log file)
    2. + +
    3. Flush current log file
    4. + +
    5. Create and flush the new log file (with file number 1 higher + than the previous log file). The new log file becomes the + current log file.
    6. + +
    7. Unfreeze the log
    8. + +
    + +
  2. + +
  3. Start checkpoint transaction
  4. + +
  5. +

    Gather interesting information about the rawStore:

    + +
      + +
    1. The current log instant (redoLWM)
    2. + +
    3. The earliest active transaction begin tran log record instant (undoLWM) + , all the truncation LWM set by clients of raw store + (replication)
    4. + +
    + +
  6. + +
  7. Clean the buffer cache
  8. + +
  9. Log the next checkpoint log record, which contains (repPoint, + undoLWM, redoLWM) and commit checkpoint transaction.
  10. + +
  11. Synchronously write the control file containing the next checkpoint + log record log instant
  12. + +
  13. The new checkpoint becomes the current checkpoint. Somewhere near + the beginning of each log file should be a checkpoint log record (not + guarenteed to be there)
  14. + +
  15. See if the log can be truncated
  16. + +
+

The earliest useful log record is determined by the repPoint and the + undoLWM, whichever is earlier.

+

Every log file whose log file number is smaller than the earliest useful + log record's log file number can be deleted.

+

Transactions can be at the following states w/r to a checkpoint - + consider the log as a continous stream and not as series of log files for + the sake of clarity:
+ + +

+
+|(BT)-------(ET)| marks the begin and end of a transaction.
+.                          checkpoint started
+.       |__undoLWM          |
+.       V                   |___redoLWM
+.                           |___TruncationLWM
+.                           |
+.                           V
+1 |-----------------|
+2       |--------------------------------|
+3           |-------|
+4               |--------------------------------------(end of log)
+5                                       |-^-|
+.                                   Checkpoint Log Record
+---A--->|<-------B--------->|<-------------C-----------
+
+

+ There are only 3 periods of interest :
+ A) before undoLWM, B) between undo and redo LWM, C) after redoLWM. +

+

+ Transaction 1 started in A and terminates in B.
+ During redo, we should only see log records and endXact from this + transaction in the first phase (between undoLWM and redoLWM). No + beginXact log record for this transaction will be seen. +

+

+ Transaction 2 started in B (right on the undoLWM) and terminated in C. +
+ Any transaction that terminates in C must have a beginXact at or after + undoLWM. In other words, no transaction can span A, B and C. During redo, + we will see beginXact, other log records and endXact for this + transaction. +

+

+ Transaction 3 started in B and ended in B.
+ During redo, we will see beginXact, other log records and endXact for + this transaction. +

+

+ Transaction 4 begins in B and never ends.
+ During redo, we will see beginXact, other log records. In undo, this + loser transaction will be rolled back. +

+

+ Transaction 5 is the transaction taking the checkpoint.
+ The checkpoint action started way back in time but the checkpoint log + record is only written after the buffer cache has been flushed. +

+

+ Note that if any time elapse between taking the undoLWM and the redoLWM, + then it will create a 4th period of interest. +

+
+ + +

Derby Logging Overview

+
+

A loggable action in Derby is redoable. If the action implements Undoable interface, then it is also + undoable. When an undoable action is rolled back, it must generate a Compensation log which represents + the action necessary to repeat the undo. +

+

Normally a logged action is rolled back on the same page that it was originally applied to. This is + called physical or physiological undo. If the undo needs to be applied to a different page (such as due to + a page split in a BTree), then it is called + a Logical Undo. In Derby, BTree inserts and deletes require logical undo.

+

When performing a loggable action, Derby follows this sequence:

+
    + +
  1. Convert the action into a corresponding log operation. Most BTree and Heap operations are + translated to Page level actions - ie - the action involves updating one or more pages. For example, + a single Heap row insert may be translated to inserts on several pages. Each page insert + will be a separate loggable action.
  2. + +
  3. Generate the log data that describes the page level action.
  4. + +
  5. Perform the action after it has been logged. Also, the action is + performed using the logged data, in the same way as it would be performed during recovery. + In other words, the logged data is used both for normal operations as well as for repeating + history. This has the advantage that the recovery execution path is the same as the execution + path during normal execution.
  6. + +
  7. If a transaction is being rolled back, first the loggable action is asked to generate + the corresponding undo (Compensation) log data. This is then logged, and after that it is performed. + As described before, a Compensation action is only redoable, because by definition, an undo + action does not need to be undone. +
  8. + +
+
+ + +

Loggable Interface Hierarchy

+
+
    + +
  • interface org.apache.derby.iapi.store.raw.Loggable + +
      + +
    • interface org.apache.derby.iapi.store.raw.Compensation +
    • + +
    • interface org.apache.derby.iapi.store.raw.Undoable + + + +
    • + +
    + +
  • + +
+
+ + +

Container Log Operations Hierarchy

+
+ +
+ + +

Transaction Management Log Operations Hierarchy

+
+
    + +
  • class org.apache.derby.impl.store.raw.xact.BeginXact (implements org.apache.derby.iapi.store.raw.Loggable)
  • + +
  • class org.apache.derby.impl.store.raw.xact.EndXact (implements org.apache.derby.iapi.store.raw.Loggable)
  • + +
  • class org.apache.derby.impl.store.raw.log.CheckpointOperation (implements org.apache.derby.iapi.store.raw.Loggable)
  • + +
+
+ + +

Page Level Log Operations Hierarchy

+
+ +
+ +
+ +
 
+
+ + +