labs-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Emmanuel L├ęcharny (Confluence) <conflue...@apache.org>
Subject [CONF] Apache Labs > Physical pages
Date Fri, 19 Jul 2013 07:38:00 GMT
<html>
<head>
    <base href="https://cwiki.apache.org/confluence">
            <link rel="stylesheet" href="/confluence/s/en/2176/1/1/_/styles/combined.css?spaceKey=labs&amp;forWysiwyg=true"
type="text/css">
    </head>
<body style="background: white;" bgcolor="white" class="email-body">
<div id="pageContent">
<div id="notificationFormat">
<div class="wiki-content">
<div class="email">
    <h2><a href="https://cwiki.apache.org/confluence/display/labs/Physical+pages">Physical
pages</a></h2>
    <h4>Page <b>edited</b> by             <a href="https://cwiki.apache.org/confluence/display/~elecharny">Emmanuel
L├ęcharny</a>
    </h4>
        <br/>
                         <h4>Changes (12)</h4>
                                 
    
<div id="page-diffs">
                    <table class="diff" cellpadding="0" cellspacing="0">
    
            <tr><td class="diff-snipped" >...<br></td></tr>
            <tr><td class="diff-unchanged" >If we except a small header at the
beginning of the file, everything is stored in *PageIOs* in the file. <br> <br></td></tr>
            <tr><td class="diff-added-lines" style="background-color: #dfd;">h1.
General file structure <br> <br>The file we use to store the data is a plain binary
file, used to store all the BTrees. We can store many btrees in one single file. <br>
<br>This file is considered as a fileSystem, with fixed size &#39;pages&#39;
(a page is an array of bytes). The page size is arbitrary fixed when teh RecordManager is
created, and we will store every logical data n those physical pages, which will require to
spread the logical data in many pages in most of the cases. <br> <br></td></tr>
            <tr><td class="diff-unchanged" >h1. PageIO <br> <br></td></tr>
            <tr><td class="diff-added-lines" style="background-color: #dfd;">Let&#39;s
first introduce the *PageIO*, which is used to store the data on disk. <br> <br></td></tr>
            <tr><td class="diff-unchanged" >a *PageIO* contains some raw data.
As we have to map some logical data that may be wider than a physical fixed size *PageIO*,
we use potentially more than one *PageIO* to store the data, and we link the *PageIOs* alltogether.
<br> <br></td></tr>
            <tr><td class="diff-snipped" >...<br></td></tr>
            <tr><td class="diff-unchanged" >* data (N bytes) : a block of data,
which size will be _min( PageSize - offset - data size, data size)_ for the first *PageIO*
or _min( PageSize - offset, data size)_ for any other *PageIOs* <br> <br></td></tr>
            <tr><td class="diff-deleted-lines" style="color:#999;background-color:#fdd;text-decoration:line-through;">h1.
RecordManager header <br></td></tr>
            <tr><td class="diff-added-lines" style="background-color: #dfd;">h1.
Logical structure mapping on disk <br></td></tr>
            <tr><td class="diff-unchanged" > <br></td></tr>
            <tr><td class="diff-added-lines" style="background-color: #dfd;">h2.
RecordManager header <br> <br></td></tr>
            <tr><td class="diff-unchanged" >We keep a few bytes at the beginning
of the file to store some critical information about the *RecordManager* Here is the list
of stored informations : <br> <br></td></tr>
            <tr><td class="diff-snipped" >...<br></td></tr>
            <tr><td class="diff-unchanged" >At startup, of course, we have no
free pages, and those pointers contain the *-1* offset. <br> <br></td></tr>
            <tr><td class="diff-changed-lines" ><span class="diff-changed-words">h<span
class="diff-deleted-chars"style="color:#999;background-color:#fdd;text-decoration:line-through;">1</span><span
class="diff-added-chars"style="background-color: #dfd;">2</span>.</span> The
internal structure <br></td></tr>
            <tr><td class="diff-unchanged" > <br>The *RecordManager* manages
*BTrees*, and we have to store them into *PageIOs*. How do we do that ? <br></td></tr>
            <tr><td class="diff-snipped" >...<br></td></tr>
            <tr><td class="diff-unchanged" >All the *BTrees* have a header that
contains many informations about them, and point to a *rootPage* which is the current root
(so the root for the latest revision). As a *RecordManager* can manage more than one BTree,
we have to find a way to retreive all the *BTrees* at startup : we use an internal link, so
that a btree points to the next btree. At startup, we read the first BTree which is stored
in the first *PageIO* in the file (so just after the *RecordManager* header), then we read
the next btree pointed by the first btree, and so on. <br> <br></td></tr>
            <tr><td class="diff-changed-lines" ><span class="diff-changed-words">h<span
class="diff-deleted-chars"style="color:#999;background-color:#fdd;text-decoration:line-through;">2</span><span
class="diff-added-chars"style="background-color: #dfd;">3</span>.</span> The
BTree header <br></td></tr>
            <tr><td class="diff-unchanged" > <br>Each *BTree* has to keep
many informations so that it can be used. Those informations are : <br></td></tr>
            <tr><td class="diff-snipped" >...<br></td></tr>
            <tr><td class="diff-unchanged" > <br> <br></td></tr>
            <tr><td class="diff-changed-lines" ><span class="diff-changed-words">h<span
class="diff-deleted-chars"style="color:#999;background-color:#fdd;text-decoration:line-through;">2</span><span
class="diff-added-chars"style="background-color: #dfd;">3</span>.</span> The
Nodes and Leaves <br></td></tr>
            <tr><td class="diff-unchanged" > <br>*Nodes* and *Leaves* are
logical *BTree* pages which are serialized on disk into one to many *PageIOs*. They have slightly
different data structures, as *Nodes* contains pointers to leaves, and no data, while *Leaves*
contains data. <br></td></tr>
            <tr><td class="diff-snipped" >...<br></td></tr>
            <tr><td class="diff-unchanged" >The *rootPage* is just a *Node* or
a *Leaf*. <br> <br></td></tr>
            <tr><td class="diff-changed-lines" ><span class="diff-changed-words">h<span
class="diff-deleted-chars"style="color:#999;background-color:#fdd;text-decoration:line-through;">1</span><span
class="diff-added-chars"style="background-color: #dfd;">2</span>.</span> Special
BTrees <br></td></tr>
            <tr><td class="diff-unchanged" > <br>We have two special *BTrees*
we use to manage the revisions and the copied pages. We will explain what they are good for
<br> <br></td></tr>
            <tr><td class="diff-changed-lines" ><span class="diff-changed-words">h<span
class="diff-deleted-chars"style="color:#999;background-color:#fdd;text-decoration:line-through;">2</span><span
class="diff-added-chars"style="background-color: #dfd;">3</span>.</span> Revision
tree <br></td></tr>
            <tr><td class="diff-unchanged" > <br>We need to keep a track
of each active revision, so that a search can work with a specific revision. The idea is that
when a search starts, it uses the latest revision, but as some modification can occur while
teh search is beng processed, some new revisions will be added. In some case, we may want
to keep a revision active for quite a long time. <br></td></tr>
            <tr><td class="diff-snipped" >...<br></td></tr>
            <tr><td class="diff-unchanged" >This *BTree* is not a MVCC *BTree*,
like the other ones. In other words, we only keep the latest revision of this *BTree* (ie,
all the modified pages are immediately freed) <br> <br></td></tr>
            <tr><td class="diff-changed-lines" ><span class="diff-changed-words">h<span
class="diff-deleted-chars"style="color:#999;background-color:#fdd;text-decoration:line-through;">2</span><span
class="diff-added-chars"style="background-color: #dfd;">3</span>.</span> Copied
pages BTree <br></td></tr>
            <tr><td class="diff-unchanged" > <br>Once we create a new revision,
the pages we copied are not anymore in use except if the revisions they are associated with
are still in use. The problem is that we can&#39;t discard those pages and move them to
the free list until the associated revision is free. <br></td></tr>
            <tr><td class="diff-snipped" >...<br></td></tr>
            <tr><td class="diff-unchanged" >We use a dedicated *BTree* to keep
a track of the copied pages, which will be reclaimed and moved to the free pages list once
the associated revision will be released. <br> <br></td></tr>
            <tr><td class="diff-changed-lines" ><span class="diff-changed-words">h<span
class="diff-deleted-chars"style="color:#999;background-color:#fdd;text-decoration:line-through;">1</span><span
class="diff-added-chars"style="background-color: #dfd;">2</span>.</span> Managing
the free pages <br></td></tr>
            <tr><td class="diff-unchanged" > <br>We have a mechanism to
manage the *PageIO* that are not anymore in use. This is a linked list in which the free pages
are added. If we need some page, we first look into this list, and get back as many *PageIOs*
as we need - until we reach the end of this list. If we free some page, we add them at the
end of the free list. <br></td></tr>
            <tr><td class="diff-snipped" >...<br></td></tr>
    
            </table>
    </div>                            <h4>Full Content</h4>
                    <div class="notificationGreySide">
        <p>The <b>RecordManager</b> stores all the Btrees in one single
file, which is split in many physical pages, all having the same size. </p>

<div class='panelMacro'><table class='infoMacro'><colgroup><col width='24'><col></colgroup><tr><td
valign='top'><img src="/confluence/images/icons/emoticons/information.gif" width="16"
height="16" align="absmiddle" alt="" border="0"></td><td><b>The page
size</b><br />Currently, the choice was to use one single size for all the pages,
regardless the data we store into them. The rationnal is to get close to the OS page size
(frequently 512 bytes or 4096 bytes). This is not necessarily the best choice though, let's
say it's something we might want to change later.</td></tr></table></div>

<p>If we except a small header at the beginning of the file, everything is stored in
<b>PageIOs</b> in the file.</p>

<h1><a name="Physicalpages-Generalfilestructure"></a>General file structure</h1>

<p>The file we use to store the data is a plain binary file, used to store all the BTrees.
We can store many btrees in one single file.</p>

<p>This file is considered as a fileSystem, with fixed size 'pages' (a page is an array
of bytes). The page size is arbitrary fixed when teh RecordManager is created, and we will
store every logical data n those physical pages, which will require to spread the logical
data in many pages in most of the cases.</p>

<h1><a name="Physicalpages-PageIO"></a>PageIO</h1>

<p>Let's first introduce the <b>PageIO</b>, which is used to store the data
on disk.</p>

<p>a <b>PageIO</b> contains some raw data. As we have to map some logical
data that may be wider than a physical fixed size <b>PageIO</b>, we use potentially
more than one <b>PageIO</b> to store the data, and we link the <b>PageIOs</b>
alltogether.</p>

<p>Each <b>PageIO</b> has a height bytes pointer at the beginning, pointing
to the next <b>PageIO</b> (or to nothing, if there is no more <b>PageIO</b>
in the chain), plus an extra 4 bytes on the first <b>PageIO</b> to define the
number of bytes stored in the chain of <b>PageIO</b>. Here is the mapping between
a logical page and some <b>PageIOs</b> :</p>

<p><span class="image-wrap" style=""><img src="/confluence/download/attachments/31824480/PageIOLogical.png?version=1&amp;modificationDate=1370953949000"
style="border: 0px solid black" /></span></p>

<p>Every <b>PageIOs</b> are contiguous on disk, but the <b>PageIOs</b>
used to stor a logical page may be located anywhere on the disk, they don't have to be continuous.</p>

<p>Here is the structure of a <b>PageIO</b> on disk :</p>

<ul>
	<li>next page offset (8 bytes) : the offset of the next <b>PageIO</b>,
or -1L if no more <b>PageIO</b> is needed</li>
	<li>data size (4 bytes) : for the first <b>PageIO</b>, the size of the
stored data across all the <b>PageIOs</b> used to store a page.</li>
	<li>data (N bytes) : a block of data, which size will be <em>min( PageSize -
offset - data size, data size)</em> for the first <b>PageIO</b> or <em>min(
PageSize - offset, data size)</em> for any other <b>PageIOs</b></li>
</ul>


<h1><a name="Physicalpages-Logicalstructuremappingondisk"></a>Logical structure
mapping on disk</h1>

<h2><a name="Physicalpages-RecordManagerheader"></a>RecordManager header</h2>

<p>We keep a few bytes at the beginning of the file to store some critical information
about the <b>RecordManager</b> Here is the list of stored informations :</p>

<ul>
	<li>The <b>PageIO</b> size (in bytes)</li>
	<li>The number of managed BTrees</li>
	<li>The offset of the first free page</li>
	<li>The offset of the last free page</li>
</ul>


<p>Here is a picture that shows the header content :</p>

<p><span class="image-wrap" style=""><img src="/confluence/download/attachments/31824480/RMHeader.png?version=1&amp;modificationDate=1370955465000"
style="border: 0px solid black" /></span></p>

<p>We keep a track of the free pages (a free page is a <b>PageIO</b> that
is not anymore used, for instance because the data have been deleted.) This is done by keeping
a link between each <b>PageIO</b> and by pointing to the first feee <b>PageIO</b>
and to the last free <b>PageIO</b></p>

<p>At startup, of course, we have no free pages, and those pointers contain the <b>-1</b>
offset.</p>

<h2><a name="Physicalpages-Theinternalstructure"></a>The internal structure</h2>

<p>The <b>RecordManager</b> manages <b>BTrees</b>, and we have
to store them into <b>PageIOs</b>. How do we do that ?</p>

<p>All the <b>BTrees</b> have a header that contains many informations about
them, and point to a <b>rootPage</b> which is the current root (so the root for
the latest revision). As a <b>RecordManager</b> can manage more than one BTree,
we have to find a way to retreive all the <b>BTrees</b> at startup : we use an
internal link, so that a btree points to the next btree. At startup, we read the first BTree
which is stored in the first <b>PageIO</b> in the file (so just after the <b>RecordManager</b>
header), then we read the next btree pointed by the first btree, and so on.</p>

<h3><a name="Physicalpages-TheBTreeheader"></a>The BTree header</h3>

<p>Each <b>BTree</b> has to keep many informations so that it can be used.
Those informations are :</p>

<ul>
	<li>revision : the current revision for this <b>BTree</b>. This value is
updated after each modification in the <b>BTree</b>.</li>
	<li>nbElems : the total number of elements we have in the <b>BTree</b>.
This is updated after each modification either.</li>
	<li>rootPage offset : the position in the file where the <b>rootPage</b>
is stored</li>
	<li>nextBtree offset : the position of the next <b>BTree</b> header in
the file (or -1 if we don't have any other <b>BTree</b>)</li>
	<li>pageSize : the number of elements we cans tore in a <b>Node</b> or
a <b>Leaf</b>. It's not related in any possible way with the <b>PageIO</b>
size.</li>
	<li>name : The <b>BTree</b> name</li>
	<li>keySerializer : The java FQCN for the key serializer</li>
	<li>valueSerializer : The java FQCN for the value serializer</li>
	<li>dupsAllowed : tells if the <b>BTree</b> can have duplicated values.</li>
</ul>


<p>Here is a diagram which present this data structure on disk :</p>

<p><span class="image-wrap" style=""><img src="/confluence/download/attachments/31824480/btreeHeader.png?version=1&amp;modificationDate=1370956120000"
style="border: 0px solid black" /></span></p>

<p>Note that a <b>BTree</b> header can be stored on one or many <b>IOPages</b>,
depending on its size.</p>


<h3><a name="Physicalpages-TheNodesandLeaves"></a>The Nodes and Leaves</h3>

<p><b>Nodes</b> and <b>Leaves</b> are logical <b>BTree</b>
pages which are serialized on disk into one to many <b>PageIOs</b>. They have
slightly different data structures, as <b>Nodes</b> contains pointers to leaves,
and no data, while <b>Leaves</b> contains data.</p>

<p>On disk, each <b>Node</b> and <b>Leaf</b> are stored in <b>PageIOs</b>,
as we said. A <b>Node</b> will have pointers to some other logical pages, and
on disk, those pointers will be offset of the first <b>PageIO</b> used to store
the logical page it points to.</p>

<p>Here is the <b>Node</b> and <b>Leaf</b> data structures once
serialized :</p>

<p><span class="image-wrap" style=""><img src="/confluence/download/attachments/31824480/nodeLeaf.png?version=1&amp;modificationDate=1370956626000"
style="border: 0px solid black" /></span></p>

<p>Note that we store the size of the serialized data : this is necessary as we have
to know how much <b>PageIOs</b> will be needed to store the logical page. </p>

<p>The <b>rootPage</b> is just a <b>Node</b> or a <b>Leaf</b>.</p>

<h2><a name="Physicalpages-SpecialBTrees"></a>Special BTrees</h2>

<p>We have two special <b>BTrees</b> we use to manage the revisions and
the copied pages. We will explain what they are good for</p>

<h3><a name="Physicalpages-Revisiontree"></a>Revision tree</h3>

<p>We need to keep a track of each active revision, so that a search can work with a
specific revision. The idea is that when a search starts, it uses the latest revision, but
as some modification can occur while teh search is beng processed, some new revisions will
be added. In some case, we may want to keep a revision active for quite a long time.</p>

<p>So we store the active revisions in a dedicated <b>BTree</b>.</p>

<p>As we may have many <b>BTrees</b>, we have to use a key which is a combinaison
of the <b>BTree</b> name and its revision. So the revision <b>BTree</b>
manage the revisions of all the managed <b>BTrees</b>.</p>

<p>When a revision is not anymore used, we can remove it from the revision <b>BTree</b>.</p>

<p>This <b>BTree</b> is not a MVCC <b>BTree</b>, like the other
ones. In other words, we only keep the latest revision of this <b>BTree</b> (ie,
all the modified pages are immediately freed)</p>

<h3><a name="Physicalpages-CopiedpagesBTree"></a>Copied pages BTree</h3>

<p>Once we create a new revision, the pages we copied are not anymore in use except
if the revisions they are associated with are still in use. The problem is that we can't discard
those pages and move them to the free list until the associated revision is free.</p>

<p>We use a dedicated <b>BTree</b> to keep a track of the copied pages,
which will be reclaimed and moved to the free pages list once the associated revision will
be released.</p>

<h2><a name="Physicalpages-Managingthefreepages"></a>Managing the free pages</h2>

<p>We have a mechanism to manage the <b>PageIO</b> that are not anymore
in use. This is a linked list in which the free pages are added. If we need some page, we
first look into this list, and get back as many <b>PageIOs</b> as we need - until
we reach the end of this list. If we free some page, we add them at the end of the free list.</p>

<p>We always free a logical page, which may be stored into many <b>PageIOs</b>.
The good thing is that those <b>pageIOs</b> are already linked, so we just need
to make the last free <b>PageIO</b> to point on the first freed <b>PageIO</b>,
and to move the pointer to the last free page to the last <b>PageIO</b> used to
store the logical page.</p>
    </div>
        <div id="commentsSection" class="wiki-content pageSection">
        <div style="float: right;" class="grey">
                        <a href="https://cwiki.apache.org/confluence/users/removespacenotification.action?spaceKey=labs">Stop
watching space</a>
            <span style="padding: 0px 5px;">|</span>
                <a href="https://cwiki.apache.org/confluence/users/editmyemailsettings.action">Change
email notification preferences</a>
</div>
        <a href="https://cwiki.apache.org/confluence/display/labs/Physical+pages">View
Online</a>
        |
        <a href="https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=31824480&revisedVersion=9&originalVersion=8">View
Changes</a>
                |
        <a href="https://cwiki.apache.org/confluence/display/labs/Physical+pages?showComments=true&amp;showCommentArea=true#addcomment">Add
Comment</a>
            </div>
</div>
</div>
</div>
</div>
</body>
</html>

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@labs.apache.org
For additional commands, e-mail: commits-help@labs.apache.org


Mime
View raw message