db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Philip Bohannon" <bohan...@lucent.com>
Subject RE: Derby architecture/design documents
Date Fri, 21 Jan 2005 20:02:24 GMT

<delurk>
Wow, super help.

(Hi, my name is Philip Bohannon.  I work at Bell Labs and have built and
debugged some multi-level concurrency control & recovery schemes in the
past.  Thanks to Mike for peppering you with so many coherent
questions...)

I am adding a couple of questions and comments below.  One general comment
is that it might help to add an occasional pointer to the code; that is,
mention which class/methods implement the algorithm you are outlining.
With an overview like this, it will be vastly more pleasant perusing
around!


> -----Original Message-----
> From: Mike Matrigali [mailto:mikem_app@sbcglobal.net]
> Sent: Friday, January 21, 2005 2:37 PM
> To: Derby Development
> Subject: Re: Derby architecture/design documents
>
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Ok, that's a lot of questions - not going to get to all, but here's a
> start.  This posting is going to try to answer the btree questions.
> I'll do the recovery questions next.  The compare/contrast with ARIES
> papers will take some time, may not get to that right away.
>
> Your suggestion of starting at high level, comparing/contrasting with
> published research should work well with the btree and recovery design.
> As you stated both of these areas in derby are based on the ARIES
> algorithms.
> I will schedule some time in the future to look carefully at those
> papers and produce the compare/contrast for those 2 areas.
>
> Here are some answers off the top of my head for the btree questions,
> that area is most familar to me:
>
> Btree's are based on ARIES.  The btree is a tree of pages, linked to
> one another in the standard way.  Derby uses latches on pages to get
> exclusive access to the page while reading or writing the page (Derby
> only uses exclusive latches, no shared latches are used).  In order
> to prevent deadlocks latches requested while holding other latches are
> always requested top/down and left to right.  Btree splits are always
> left to right.  If for any reason the code needs to break this protocol
> then it will first request the latch NOWAIT and if it can't get the
> latch it will release all current latches and wait for the latch it is
> trying to get, and then after obtaining it go about getting the other
> latches it needs for the particular operation.  While traversing down
> the tree Derby may hold 2 latches: one on parent and one on child.  It
> then continues doing "ladder" latching down the tree releasing the
> highest node when it has successfully got a new lower node latch.
> Latches are short term, only held while reading/modifying the page,
> never held while an I/O is happening.  Structure modifications are all
> isolated from other operations through the use of latches.

The classic problem here is a split taking place during a search.  Say  a
page C contains 50-100, and C's parent in the tree is P, which has an
entry for 50 and 101, with the entry for 50 pointing to C.

Now an inserter comes in, inserts 51 and fills up C (or fills up C's
child, causes a split, and that fills up C).  So C splits into C and C',
which is added to the right of C.

Since the inserter can't get a lock on P while holding a lock on C
(deadlock), it releases the lock on P.  At this point, a reader comes down
the tree looking for 99, looks at P and gets a pointer to C not C'. Now
inserter starts waiting for a lock on P.  When reader gets C, 99 has been
moved off to C'.

It would be interesting to know how Derby handles this (for example, [Yao]
proposes having a pointer from C to C', and I forget what Aries IM does at
the moment, but I think it starts over at the top if you fail to find a
key at the rightmost point in the page...).

>
> Derby uses data only locking for it's logical row level locking.  All
> isolation level implementation is done using logical locks (Derby has
> no non-locking isolation support like versioning).  A quick rundown of
> how locks are used to implement isolation level follows, for this
> discussion I will use the jdbc isolation notation.  In data only locking
> the key for all locks whether obtained in the btree or in the base table
> is the address of the base table row.  In the btree case the address of
> the base table row is always the last column of the leaf index entry.
:
> serializable: Derby uses "previous key" locking to prevent phantoms from
> ~              being inserted into a range being accessed by a scan.  A
> ~               	      scan always locks the index row just "previous"
to
> the  			      first entry that satisfies the scan condition and then
> all 	      index entries going forward until the end of the scan.
> ~    All row locks held until end of transaction.  Inserts also
> ~              get previous key locks, which will block if the existing
> ~              row is locked.
>
> repeatable read:  Same as serializable except that no previous key
> ~              	      locking is done.
>
> read committed: write locks held until end of transaction.  Read locks
> ~               are released once the query in question no longer needs
> ~               them.
>
> read uncommitted: no row locks are gotten.  The code still gets table
> level intent locks to prevent concurrent ddl during the query.
>
> In both the btree and the heap, delete are first executed by marking
> a "deleted" bit.  This is to insure space on the page for abort, since
> row level locking will allow other rows on the page to be modified
> conncurrently with transaction executing the delete.  The system uses
> a background daemon to schedule work after commit to reclaim the space
> of the deleted rows.  A row marked deleted can be "purged" if one can
> obtain a lock on it (if it was an uncommitted delete then the
> transaction doing the commit would still have an exclusive lock on the
> row).

One small note: at recovery time usually such locks are not held.  Thus it
would be required not to restart this garbage collector until all aborted
transactions are rolled back.  And of course, the garbage collector would
need to get the lock, then re-check the bit to see if it is still 1.

Of course, I have no reason to think Derby doesn't handle such cases, but
asking the questions will at least help me learn :).

</delurk>

Philip Bohannon
Computing Sciences Research Center
Lucent Technologies -- Bell Laboratories

>
>
>
> Dibyendu Majumdar wrote:
> | "Mike Matrigali" wrote:
> |
> |
> |>I am happy to answer any specific questions about the store module you
> |>have, or can find someone to answer them if I don't know.  I
> |>think questions should be directed to the mail list so that everyone
> |>can share in the discussion.  If I miss something feel free to ping
> |>me to get an answer.  I am an IBM employee still working on the derby
> |>codeline and was one of the original developers at Cloudscape when it
> |>first began as a startup.
> |
> |
> | Thank you very much, this will really help me to get started.
> |
> |
> |>Is there any model/standard in apache or opensource that we could
follow
> |>or at least look at for a format for this kind of information.
> |
> |
> | I have some thoughts on how I want to do this - these are given below.
> |
> |
> |>I personally really like design information embedded in the actual
code
> |>it relates to, rather than having separate documents.
> |
> |
> | I think that well documented code is always a good idea. And there is
an
> | argument that documentation closer to the code is more likely to be
kept
> | up-to-date. It is also a great help to anyone fixing the code.
> |
> | However, separate documents are also very useful. Specially because
> you can
> | illustrate concepts and design with nice diagrams and flow charts. It
also
> | enables people to grasp the design more easily, as things can be
explained
> | top-down rather than jumping straight to implementation details.
> |
> |
> |>Is there any more specific area you are looking at, rather than just
> |>store.
> |
> |
> |>I usually divide the store into the following, some overlap:
> |>o store module interface (java protocol used by other modules to
> |>interact with store, this is actually fairly well documented in the
> |>javadoc already available)
> |>o access methods (btree/heap existing, future might be gist/rtree/...)
> |>o recovery (both restart and crash recovery, logging, xa recovery,
...)
> |>o transactions (locking, isolation levels, xa transactions)
> |>o data (actual physical storage of rows/columns on pages, page
> |>management, cache  - the format of rows/columns/pages is reasonably
> |>documented in the code - I think in the javadoc, but might just be in
> |>comments)
> |>o misc (sorting, encryption, ...)
> |
> |
> | Well, I shall be documenting the whole of it., but will start with a
> | particular item, and then move on to other items.
> |
> | I think that documentation should be at two levels.
> |
> | The more useful level is where ideas and concepts are explained. I am
sure
> | that Cloudscape was built upon ideas and concepts that had been
invented
> | before by others, and perhaps built its own innovations upon those
> | foundations.
> | So I want to explore the origins of those ideas, where did the ideas
come
> | from? How did Cloudscape implementation use those ideas? I'd like to
give
> | some examples to illustrate what I mean.
> |
> | If we take the transaction manager for a start. As far as I
understand,
> | Derby uses the ARIES algorithm for write-ahead logging. What I'd like
to
> | understand and document is how the Derby implementation of ARIES is
> | different from the published algorithms. Does Derby use nested
top-level
> | actions or sub transactions to manage internal atomic operations? I
think
> | I saw a comment in the code somehere that Derby does not yet
> | implement the transaction list in checkpoint records. I assume that
Derby
> | does page-oriented redo during recovery redo pass, and then
> | page-oriented or logical undo during the undo pass.
> |
> | For the BTREE implementation, I'd like to understand whether Derby
> | implements ARIES/IM or some other algorithm. How does it handle
> | structural consistency of the tree during concurrent updates? What are
> | the locking/latching implications of the algorithm used? Are accesses
to
> | the entire tree or subtree serialised during structural modifications?
> | How is repeatable read ensured? Does Derby use data locking row or
> | index row locking? Does Derby delete rows logically first, and
> | physically afterwards? How does it determine that a deleted row can
> | be physically removed, ie, the transaction that deleted the row has
> | committed?
> |
> | I understand that Derby uses Log Actions to perform both normal
> | operations and recovery operations. Thus a key insert is first encoded
as
> | a log operation, logged, and then executed. This is very neat.
> |
> | I also get the impression by reading the code that a BTREE is
> | implemented on top of a regular table like structure. This is also
> | very neat as it resuses all the code for handling rows, locking, etc.
> |
> | These are a few examples, and I could list several more. What would be
> | very useful is if you and other IBM developers could identify the
> algorithms
> | that Derby uses. Perhaps you could provide references to published
> | research papers, books, and patents. I could start by writing a high
level
> | overview of the concepts used by Derby, and gradually build up the
> | detail by reading  the code, and by asking questions in this forum.
> |
> | The second level of documentation can be built from the comments
> | in the code. This is the implementation level where I can describe how
> | Derby actually implements something, what the file formats are, how
the
> | write ahead log is implementated, what a conglomerate is, what a
> | record id and how it is different from slot id, etc.
> |
> | In terms of starting this level of documentation, I would first like
to
> | concentrate on the "raw" module, which is I suspect is at the heart
> | of all other modules.
> |
> | Thanks and Regards
> |
> | Dibyendu
> |
> |
> |
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.2.5 (MingW32)
> Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org
>
> iD8DBQFB8VntEpeslyHqPs0RAvamAKCz+Krfmgtb7d2e6MrU68Zg1jgSBACg6lbM
> Ft/o3N++pWYHO7HzEZZzXQ0=
> =l20l
> -----END PGP SIGNATURE-----
>

Mime
View raw message