db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mike Matrigali <mikem_...@sbcglobal.net>
Subject Re: Derby architecture/design documents
Date Fri, 21 Jan 2005 19:37:19 GMT
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Ok, that's a lot of questions - not going to get to all, but here's a
start.  This posting is going to try to answer the btree questions.
I'll do the recovery questions next.  The compare/contrast with ARIES
papers will take some time, may not get to that right away.

Your suggestion of starting at high level, comparing/contrasting with
published research should work well with the btree and recovery design.
As you stated both of these areas in derby are based on the ARIES
algorithms.
I will schedule some time in the future to look carefully at those
papers and produce the compare/contrast for those 2 areas.

Here are some answers off the top of my head for the btree questions,
that area is most familar to me:

Btree's are based on ARIES.  The btree is a tree of pages, linked to
one another in the standard way.  Derby uses latches on pages to get
exclusive access to the page while reading or writing the page (Derby
only uses exclusive latches, no shared latches are used).  In order
to prevent deadlocks latches requested while holding other latches are
always requested top/down and left to right.  Btree splits are always
left to right.  If for any reason the code needs to break this protocol
then it will first request the latch NOWAIT and if it can't get the
latch it will release all current latches and wait for the latch it is
trying to get, and then after obtaining it go about getting the other
latches it needs for the particular operation.  While traversing down
the tree Derby may hold 2 latches: one on parent and one on child.  It
then continues doing "ladder" latching down the tree releasing the
highest node when it has successfully got a new lower node latch.
Latches are short term, only held while reading/modifying the page,
never held while an I/O is happening.  Structure modifications are all
isolated from other operations through the use of latches.

Derby uses data only locking for it's logical row level locking.  All
isolation level implementation is done using logical locks (Derby has
no non-locking isolation support like versioning).  A quick rundown of
how locks are used to implement isolation level follows, for this
discussion I will use the jdbc isolation notation.  In data only locking
the key for all locks whether obtained in the btree or in the base table
is the address of the base table row.  In the btree case the address of
the base table row is always the last column of the leaf index entry.   :
serializable: Derby uses "previous key" locking to prevent phantoms from
~              being inserted into a range being accessed by a scan.  A
~               	      scan always locks the index row just "previous" to
the  			      first entry that satisfies the scan condition and then
all 	      index entries going forward until the end of the scan.  			
~    All row locks held until end of transaction.  Inserts also
~              get previous key locks, which will block if the existing 		
~              row is locked.

repeatable read:  Same as serializable except that no previous key
~              	      locking is done.

read committed: write locks held until end of transaction.  Read locks
~               are released once the query in question no longer needs
~               them.

read uncommitted: no row locks are gotten.  The code still gets table
level intent locks to prevent concurrent ddl during the query.

In both the btree and the heap, delete are first executed by marking
a "deleted" bit.  This is to insure space on the page for abort, since
row level locking will allow other rows on the page to be modified
conncurrently with transaction executing the delete.  The system uses
a background daemon to schedule work after commit to reclaim the space
of the deleted rows.  A row marked deleted can be "purged" if one can
obtain a lock on it (if it was an uncommitted delete then the
transaction doing the commit would still have an exclusive lock on the
row).



Dibyendu Majumdar wrote:
| "Mike Matrigali" wrote:
|
|
|>I am happy to answer any specific questions about the store module you
|>have, or can find someone to answer them if I don't know.  I
|>think questions should be directed to the mail list so that everyone
|>can share in the discussion.  If I miss something feel free to ping
|>me to get an answer.  I am an IBM employee still working on the derby
|>codeline and was one of the original developers at Cloudscape when it
|>first began as a startup.
|
|
| Thank you very much, this will really help me to get started.
|
|
|>Is there any model/standard in apache or opensource that we could follow
|>or at least look at for a format for this kind of information.
|
|
| I have some thoughts on how I want to do this - these are given below.
|
|
|>I personally really like design information embedded in the actual code
|>it relates to, rather than having separate documents.
|
|
| I think that well documented code is always a good idea. And there is an
| argument that documentation closer to the code is more likely to be kept
| up-to-date. It is also a great help to anyone fixing the code.
|
| However, separate documents are also very useful. Specially because
you can
| illustrate concepts and design with nice diagrams and flow charts. It also
| enables people to grasp the design more easily, as things can be explained
| top-down rather than jumping straight to implementation details.
|
|
|>Is there any more specific area you are looking at, rather than just
|>store.
|
|
|>I usually divide the store into the following, some overlap:
|>o store module interface (java protocol used by other modules to
|>interact with store, this is actually fairly well documented in the
|>javadoc already available)
|>o access methods (btree/heap existing, future might be gist/rtree/...)
|>o recovery (both restart and crash recovery, logging, xa recovery,  ...)
|>o transactions (locking, isolation levels, xa transactions)
|>o data (actual physical storage of rows/columns on pages, page
|>management, cache  - the format of rows/columns/pages is reasonably
|>documented in the code - I think in the javadoc, but might just be in
|>comments)
|>o misc (sorting, encryption, ...)
|
|
| Well, I shall be documenting the whole of it., but will start with a
| particular item, and then move on to other items.
|
| I think that documentation should be at two levels.
|
| The more useful level is where ideas and concepts are explained. I am sure
| that Cloudscape was built upon ideas and concepts that had been invented
| before by others, and perhaps built its own innovations upon those
| foundations.
| So I want to explore the origins of those ideas, where did the ideas come
| from? How did Cloudscape implementation use those ideas? I'd like to give
| some examples to illustrate what I mean.
|
| If we take the transaction manager for a start. As far as I understand,
| Derby uses the ARIES algorithm for write-ahead logging. What I'd like to
| understand and document is how the Derby implementation of ARIES is
| different from the published algorithms. Does Derby use nested top-level
| actions or sub transactions to manage internal atomic operations? I think
| I saw a comment in the code somehere that Derby does not yet
| implement the transaction list in checkpoint records. I assume that Derby
| does page-oriented redo during recovery redo pass, and then
| page-oriented or logical undo during the undo pass.
|
| For the BTREE implementation, I'd like to understand whether Derby
| implements ARIES/IM or some other algorithm. How does it handle
| structural consistency of the tree during concurrent updates? What are
| the locking/latching implications of the algorithm used? Are accesses to
| the entire tree or subtree serialised during structural modifications?
| How is repeatable read ensured? Does Derby use data locking row or
| index row locking? Does Derby delete rows logically first, and
| physically afterwards? How does it determine that a deleted row can
| be physically removed, ie, the transaction that deleted the row has
| committed?
|
| I understand that Derby uses Log Actions to perform both normal
| operations and recovery operations. Thus a key insert is first encoded as
| a log operation, logged, and then executed. This is very neat.
|
| I also get the impression by reading the code that a BTREE is
| implemented on top of a regular table like structure. This is also
| very neat as it resuses all the code for handling rows, locking, etc.
|
| These are a few examples, and I could list several more. What would be
| very useful is if you and other IBM developers could identify the
algorithms
| that Derby uses. Perhaps you could provide references to published
| research papers, books, and patents. I could start by writing a high level
| overview of the concepts used by Derby, and gradually build up the
| detail by reading  the code, and by asking questions in this forum.
|
| The second level of documentation can be built from the comments
| in the code. This is the implementation level where I can describe how
| Derby actually implements something, what the file formats are, how the
| write ahead log is implementated, what a conglomerate is, what a
| record id and how it is different from slot id, etc.
|
| In terms of starting this level of documentation, I would first like to
| concentrate on the "raw" module, which is I suspect is at the heart
| of all other modules.
|
| Thanks and Regards
|
| Dibyendu
|
|
|
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (MingW32)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFB8VntEpeslyHqPs0RAvamAKCz+Krfmgtb7d2e6MrU68Zg1jgSBACg6lbM
Ft/o3N++pWYHO7HzEZZzXQ0=
=l20l
-----END PGP SIGNATURE-----

Mime
View raw message