Author: buildbot Date: Sun Sep 22 05:03:02 2013 New Revision: 879404 Log: Staging update by buildbot for directory Added: websites/staging/directory/trunk/content/mavibot/user-guide/2.1-logical-structure.html websites/staging/directory/trunk/content/mavibot/user-guide/2.2-physical-storage.html Removed: websites/staging/directory/trunk/content/mavibot/user-guide/2.1-file-format.html Modified: websites/staging/directory/trunk/content/ (props changed) Propchange: websites/staging/directory/trunk/content/ ------------------------------------------------------------------------------ --- cms:source-revision (original) +++ cms:source-revision Sun Sep 22 05:03:02 2013 @@ -1 +1 @@ -1525247 +1525320 Added: websites/staging/directory/trunk/content/mavibot/user-guide/2.1-logical-structure.html ============================================================================== --- websites/staging/directory/trunk/content/mavibot/user-guide/2.1-logical-structure.html (added) +++ websites/staging/directory/trunk/content/mavibot/user-guide/2.1-logical-structure.html Sun Sep 22 05:03:02 2013 @@ -0,0 +1,194 @@ + + + + + 2.1 - Logical Structure — Apache Directory + + + + + + + + + + + + +
+ +
+
+ + + +
+
+ + + + + +

2.1 - Logical Structure

+

Special BTrees

+

We have two special BTrees we use to manage the revisions and the copied pages. We will explain what they are good for

+

Revision tree

+

We need to keep a track of each active revision, so that a search can work with a specific revision. The idea is that when a search starts, it uses the latest revision, but as some modification can occur while teh search is bieng processed, some new revisions will be added. In some case, we may want to keep a revision active for quite a long time.

+

So we store the active revisions in a dedicated BTree.

+

As we may have many BTrees, we have to use a key which is a combinaison of the BTree name and its revision. So the revision BTree manage the revisions of all the managed BTrees.

+

When a revision is not anymore used, we can remove it from the revision BTree.

+

This BTree is not a MVCC BTree, like the other ones. In other words, we only keep the latest revision of this BTree (ie, all the modified pages are immediately freed)

+

Copied pages BTree

+

Once we create a new revision, the pages we copied are not anymore in use except if the revisions they are associated with are still in use. The problem is that we can't discard those pages and move them to the free list until the associated revision is free.

+

We use a dedicated BTree to keep a track of the copied pages, which will be reclaimed and moved to the free pages list once the associated revision will be released.

+

Managing the free pages

+

We have a mechanism to manage the PageIO that are not anymore in use. This is a linked list in which the free pages are added. If we need some page, we first look into this list, and get back as many PageIOs as we need - until we reach the end of this list. If we free some page, we add them at the end of the free list.

+

We always free a logical page, which may be stored into many PageIOs. The good thing is that those PageIOs are already linked, so we just need to make the last free PageIO to point on the first freed PageIO, and to move the pointer to the last free page to the last PageIO used to store the logical page.

+ + + + + +
+
+
+ +
+ + Added: websites/staging/directory/trunk/content/mavibot/user-guide/2.2-physical-storage.html ============================================================================== --- websites/staging/directory/trunk/content/mavibot/user-guide/2.2-physical-storage.html (added) +++ websites/staging/directory/trunk/content/mavibot/user-guide/2.2-physical-storage.html Sun Sep 22 05:03:02 2013 @@ -0,0 +1,259 @@ + + + + + 2.2 - Physical storage — Apache Directory + + + + + + + + + + + + +
+ +
+
+ + + +
+
+ + + + + +

2.2 - Physical storage

+

When associated with a RecordManager, Mavibot stores all the Btrees in one single file, which is split in many physical pages, all having the same size.

+
+

Note +Currently, the choice was to use one single size for all the pages, regardless the data we store into them. The rationnal is to +get close to the OS page size (frequently 512 bytes or 4096 bytes). This is not necessarily the best choice though, let's say +it's something we might want to change later.

+
+

General file structure

+

The file we use to store the data is a plain binary file, used to store all the BTrees. We can store many btrees in one single file.

+

This file is considered as a fileSystem, with fixed size 'pages' (a page is an array of bytes). The page size is arbitrary fixed when teh RecordManager is created, and we will store every logical data n those physical pages, which will require to spread the logical data in many pages in most of the cases.

+

PageIO

+

Let's first introduce the PageIO, which is used to store the data on disk.

+

A PageIO contains some raw data. As we have to map some logical data that may be wider than a physical fixed size PageIO, we use potentially more than one PageIO to store the data, and we link the PageIOs alltogether.

+

Each PageIO has a height bytes pointer at the beginning, pointing to the next PageIO (or to nothing, if there is no more PageIO in the chain), plus an extra 4 bytes on the first PageIO to define the number of bytes stored in the chain of PageIO. Here is the mapping between a logical page and some PageIOs :

+

PageIO mapping

+

Every PageIOs are contiguous on disk, but the PageIOs used to store a logical page may be located anywhere on the disk, they don't have to be continuous.

+

Here is the structure of a PageIO on disk :

+
    +
  • next page offset (8 bytes) : the offset of the next PageIO, or -1L if no more PageIO is needed
  • +
  • data size (4 bytes) : for the first PageIO, the size of the stored data across all the PageIOs used to store a page.
  • +
  • data (N bytes) : a block of data, which size will be min( PageSize - offset - data size, data size) for the first PageIO or min( PageSize - offset, data size) for any other PageIOs
  • +
+

Logical structure mapping on disk

+

We will now describe how each logical structure is serialized on disk.

+

RecordManager header

+

We keep a few bytes at the beginning of the file to store some critical information about the RecordManager. Here is the list of stored informations :

+
    +
  • The PageIO size (in bytes)
  • +
  • The number of managed BTrees
  • +
  • The offset of the first free page
  • +
  • The offset of the last free page
  • +
+

Here is a picture that shows the header content :

+

RecordManager header

+

We keep a track of the free pages (a free page is a PageIO that is not anymore used, for instance because the data have been deleted.) This is done by keeping a link between each PageIO and by pointing to the first feee PageIO and to the last free PageIO of this list.

+
+

Note We might get rid of the last free page offset.

+
+

At startup, of course, we have no free pages, and those pointers contain the -1 offset.

+

This header is stored in a PageIO, at the very beginning of the file.

+

The RecordManager structure

+

The RecordManager manages BTrees, and we have to store them into PageIOs. How do we do that ?

+

All the BTrees have a header that contains many informations about them, and point to a rootPage which is the current root (so the root for the latest revision). As a RecordManager can manage more than one BTree, we have to find a way to retreive all the BTrees at startup : we use an internal link, so that a BTree points to the next btree. At startup, we read the first BTree which is stored in the second PageIO in the file (so just after the RecordManager header), then we read the next BTree pointed by the first BTree, and so on.

+

The BTree header

+

Each BTree has to keep many informations so that it can be used. Those informations are :

+
    +
  • revision (8 bytes) : the current revision for this BTree. This value is updated after each modification in the BTree.
  • +
  • nbElems (8 bytes) : the total number of elements we have in the BTree. This is updated after each modification either.
  • +
  • rootPage offset (8 bytes) : the position in the file where the rootPage is stored
  • +
  • nextBtree offset (8 bytes) : the position of the next BTree header in the file (or -1 if we don't have any other BTree)
  • +
  • pageSize (4 bytes) : the number of elements we cans store in a Node or a Leaf. It's not related in any possible way with the PageIO size.
  • +
  • nameSize (4 bytes) : The BTree name size
  • +
  • name (nameSize bytes) : the BTree name
  • +
  • keySerializerSize (4 bytes) : The size of the java FQCN for the key serializer
  • +
  • keySerializer (keySerializerSize bytes) : The java FQCN for the key serializer
  • +
  • valueSerializerSize (4 bytes) : The size of the java FQCN for the value serializer
  • +
  • valueSerializer (valueSerializerSize bytes): The java FQCN for the value serializer
  • +
  • dupsAllowed (1 byte): tells if the BTree can have duplicated values.
  • +
+

As we can see, thi sheader can have various length, and if one one the names is long, we may need more than one PageIOs to store it.

+

Here is a diagram which present this data structure on disk :

+

BTreeHeader header

+

Note that a BTree header can be stored on one or many IOPages, depending on its size.

+

All in all, when we have more than one BTree stored in the file, the content of the file which stores the BTree headers will look like this one :

+

BTrees

+

Note that each BTreeHeader has at least one root page, even if it contains no data. In this schema, we show the root page just after the BTree it is associated to, but after a few updates, the root page may perfectly well be stored elswhere on the disk.

+

The Nodes and Leaves

+

Nodes and Leaves are logical BTree pages which are serialized on disk into one to many PageIOs. They have slightly different data structures, as Nodes contains pointers to Leaves, and no data, while Leaves contains data. In any case, both contain the keys. The Node has one ore value than the Leaf, too.

+

On disk, each Node and Leaf are stored in PageIOs, as we said. A Node will have pointers to some other logical pages, and on disk, those pointers will be offset of the first PageIO used to store the logical page it points to.

+

Here is the Node and Leaf data structures once serialized :

+

Node and Leaf

+

Note that we store the size of the serialized data : this is necessary as we have to know how many PageIOs will be needed to store the logical page.

+

The rootPage is just a Node or a Leaf.

+

Potential improvement

+

We can get better performance by serializing the data differently. Instead of storing keys and values as byte arrays prefixed by their length, we could store an array of keys and values' offsets before the associated byte[]. Here is the resulting data structure, once serialized :

+

Node and Leaf, improved

+

(The Node is not described, as it's basically the same data structure, but with one extra value).

+

It does not need more space to serialize the data this way, as the offsets are ints, and in the previous version, those ints are used to store the length of the keys and values anyway.

+

The gain is that we can have access to a given key and value without having to read all the previous keys and values. Also we can now read a leaf or a node without having to deserialize all the keys and values they contain.

+ + + + + +
+
+
+ +
+ +