labs-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Emmanuel L├ęcharny (Confluence) <conflue...@apache.org>
Subject [CONF] Apache Labs > Physical pages
Date Fri, 19 Jul 2013 07:32:00 GMT
<html>
<head>
    <base href="https://cwiki.apache.org/confluence">
            <link rel="stylesheet" href="/confluence/s/en/2176/1/1/_/styles/combined.css?spaceKey=labs&amp;forWysiwyg=true"
type="text/css">
    </head>
<body style="background: white;" bgcolor="white" class="email-body">
<div id="pageContent">
<div id="notificationFormat">
<div class="wiki-content">
<div class="email">
    <h2><a href="https://cwiki.apache.org/confluence/display/labs/Physical+pages">Physical
pages</a></h2>
    <h4>Page <b>edited</b> by             <a href="https://cwiki.apache.org/confluence/display/~elecharny">Emmanuel
L├ęcharny</a>
    </h4>
        <br/>
                         <h4>Changes (8)</h4>
                                 
    
<div id="page-diffs">
                    <table class="diff" cellpadding="0" cellspacing="0">
    
            <tr><td class="diff-snipped" >...<br></td></tr>
            <tr><td class="diff-unchanged" >{info} <br> <br></td></tr>
            <tr><td class="diff-changed-lines" >If we except a small header at
the beginning of the file, everything is stored in <span class="diff-changed-words">*PageIO<span
class="diff-added-chars"style="background-color: #dfd;">s</span>*</span> in
the file. <br></td></tr>
            <tr><td class="diff-unchanged" > <br>h1. PageIO <br> <br></td></tr>
            <tr><td class="diff-deleted-lines" style="color:#999;background-color:#fdd;text-decoration:line-through;">a
*PageIO* contains some raw data. As we have to map some logical data that may be wider than
a physical fixed size *PageIO*, we link the *PageIO* so that they contain all the data contained
in a logical page. <br></td></tr>
            <tr><td class="diff-added-lines" style="background-color: #dfd;">a
*PageIO* contains some raw data. As we have to map some logical data that may be wider than
a physical fixed size *PageIO*, we use potentially more than one *PageIO* to store the data,
and we link the *PageIOs* alltogether. <br></td></tr>
            <tr><td class="diff-unchanged" > <br></td></tr>
            <tr><td class="diff-deleted-lines" style="color:#999;background-color:#fdd;text-decoration:line-through;">Each
*PageIO* has a hight byte pointer at the beginning, plus an extra 4 bytes on the first *PageIO*
to define the number of bytes stored. Here is the mapping between a logical page and some
*PageIO* : <br></td></tr>
            <tr><td class="diff-added-lines" style="background-color: #dfd;">Each
*PageIO* has a height bytes pointer at the beginning, pointing to the next *PageIO* (or to
nothing, if there is no more *PageIO* in the chain), plus an extra 4 bytes on the first *PageIO*
to define the number of bytes stored in the chain of *PageIO*. Here is the mapping between
a logical page and some *PageIOs* : <br></td></tr>
            <tr><td class="diff-unchanged" > <br>!PageIOLogical.png! <br>
<br></td></tr>
            <tr><td class="diff-deleted-lines" style="color:#999;background-color:#fdd;text-decoration:line-through;">Every
*PageIO* are contiguous on disk. <br></td></tr>
            <tr><td class="diff-added-lines" style="background-color: #dfd;">Every
*PageIOs* are contiguous on disk, but the *PageIOs* used to stor a logical page may be located
anywhere on the disk, they don&#39;t have to be continuous. <br></td></tr>
            <tr><td class="diff-unchanged" > <br></td></tr>
            <tr><td class="diff-added-lines" style="background-color: #dfd;">Here
is the structure of a *PageIO* on disk : <br> <br>* next page offset (8 bytes)
: the offset of the next *PageIO*, or -1L if no more *PageIO* is needed <br>* data size
(4 bytes) : for the first *PageIO*, the size of the stored data across all the *PageIOs* used
to store a page. <br>* data (N bytes) : a block of data, which size will be _min( PageSize
- offset - data size, data size)_ for the first *PageIO* or _min( PageSize - offset, data
size)_ for any other *PageIOs* <br> <br></td></tr>
            <tr><td class="diff-unchanged" >h1. RecordManager header <br>
<br></td></tr>
            <tr><td class="diff-snipped" >...<br></td></tr>
    
            </table>
    </div>                            <h4>Full Content</h4>
                    <div class="notificationGreySide">
        <p>The <b>RecordManager</b> stores all the Btrees in one single
file, which is split in many physical pages, all having the same size. </p>

<div class='panelMacro'><table class='infoMacro'><colgroup><col width='24'><col></colgroup><tr><td
valign='top'><img src="/confluence/images/icons/emoticons/information.gif" width="16"
height="16" align="absmiddle" alt="" border="0"></td><td><b>The page
size</b><br />Currently, the choice was to use one single size for all the pages,
regardless the data we store into them. The rationnal is to get close to the OS page size
(frequently 512 bytes or 4096 bytes). This is not necessarily the best choice though, let's
say it's something we might want to change later.</td></tr></table></div>

<p>If we except a small header at the beginning of the file, everything is stored in
<b>PageIOs</b> in the file.</p>

<h1><a name="Physicalpages-PageIO"></a>PageIO</h1>

<p>a <b>PageIO</b> contains some raw data. As we have to map some logical
data that may be wider than a physical fixed size <b>PageIO</b>, we use potentially
more than one <b>PageIO</b> to store the data, and we link the <b>PageIOs</b>
alltogether.</p>

<p>Each <b>PageIO</b> has a height bytes pointer at the beginning, pointing
to the next <b>PageIO</b> (or to nothing, if there is no more <b>PageIO</b>
in the chain), plus an extra 4 bytes on the first <b>PageIO</b> to define the
number of bytes stored in the chain of <b>PageIO</b>. Here is the mapping between
a logical page and some <b>PageIOs</b> :</p>

<p><span class="image-wrap" style=""><img src="/confluence/download/attachments/31824480/PageIOLogical.png?version=1&amp;modificationDate=1370953949000"
style="border: 0px solid black" /></span></p>

<p>Every <b>PageIOs</b> are contiguous on disk, but the <b>PageIOs</b>
used to stor a logical page may be located anywhere on the disk, they don't have to be continuous.</p>

<p>Here is the structure of a <b>PageIO</b> on disk :</p>

<ul>
	<li>next page offset (8 bytes) : the offset of the next <b>PageIO</b>,
or -1L if no more <b>PageIO</b> is needed</li>
	<li>data size (4 bytes) : for the first <b>PageIO</b>, the size of the
stored data across all the <b>PageIOs</b> used to store a page.</li>
	<li>data (N bytes) : a block of data, which size will be <em>min( PageSize -
offset - data size, data size)</em> for the first <b>PageIO</b> or <em>min(
PageSize - offset, data size)</em> for any other <b>PageIOs</b></li>
</ul>


<h1><a name="Physicalpages-RecordManagerheader"></a>RecordManager header</h1>

<p>We keep a few bytes at the beginning of the file to store some critical information
about the <b>RecordManager</b> Here is the list of stored informations :</p>

<ul>
	<li>The <b>PageIO</b> size (in bytes)</li>
	<li>The number of managed BTrees</li>
	<li>The offset of the first free page</li>
	<li>The offset of the last free page</li>
</ul>


<p>Here is a picture that shows the header content :</p>

<p><span class="image-wrap" style=""><img src="/confluence/download/attachments/31824480/RMHeader.png?version=1&amp;modificationDate=1370955465000"
style="border: 0px solid black" /></span></p>

<p>We keep a track of the free pages (a free page is a <b>PageIO</b> that
is not anymore used, for instance because the data have been deleted.) This is done by keeping
a link between each <b>PageIO</b> and by pointing to the first feee <b>PageIO</b>
and to the last free <b>PageIO</b></p>

<p>At startup, of course, we have no free pages, and those pointers contain the <b>-1</b>
offset.</p>

<h1><a name="Physicalpages-Theinternalstructure"></a>The internal structure</h1>

<p>The <b>RecordManager</b> manages <b>BTrees</b>, and we have
to store them into <b>PageIOs</b>. How do we do that ?</p>

<p>All the <b>BTrees</b> have a header that contains many informations about
them, and point to a <b>rootPage</b> which is the current root (so the root for
the latest revision). As a <b>RecordManager</b> can manage more than one BTree,
we have to find a way to retreive all the <b>BTrees</b> at startup : we use an
internal link, so that a btree points to the next btree. At startup, we read the first BTree
which is stored in the first <b>PageIO</b> in the file (so just after the <b>RecordManager</b>
header), then we read the next btree pointed by the first btree, and so on.</p>

<h2><a name="Physicalpages-TheBTreeheader"></a>The BTree header</h2>

<p>Each <b>BTree</b> has to keep many informations so that it can be used.
Those informations are :</p>

<ul>
	<li>revision : the current revision for this <b>BTree</b>. This value is
updated after each modification in the <b>BTree</b>.</li>
	<li>nbElems : the total number of elements we have in the <b>BTree</b>.
This is updated after each modification either.</li>
	<li>rootPage offset : the position in the file where the <b>rootPage</b>
is stored</li>
	<li>nextBtree offset : the position of the next <b>BTree</b> header in
the file (or -1 if we don't have any other <b>BTree</b>)</li>
	<li>pageSize : the number of elements we cans tore in a <b>Node</b> or
a <b>Leaf</b>. It's not related in any possible way with the <b>PageIO</b>
size.</li>
	<li>name : The <b>BTree</b> name</li>
	<li>keySerializer : The java FQCN for the key serializer</li>
	<li>valueSerializer : The java FQCN for the value serializer</li>
	<li>dupsAllowed : tells if the <b>BTree</b> can have duplicated values.</li>
</ul>


<p>Here is a diagram which present this data structure on disk :</p>

<p><span class="image-wrap" style=""><img src="/confluence/download/attachments/31824480/btreeHeader.png?version=1&amp;modificationDate=1370956120000"
style="border: 0px solid black" /></span></p>

<p>Note that a <b>BTree</b> header can be stored on one or many <b>IOPages</b>,
depending on its size.</p>


<h2><a name="Physicalpages-TheNodesandLeaves"></a>The Nodes and Leaves</h2>

<p><b>Nodes</b> and <b>Leaves</b> are logical <b>BTree</b>
pages which are serialized on disk into one to many <b>PageIOs</b>. They have
slightly different data structures, as <b>Nodes</b> contains pointers to leaves,
and no data, while <b>Leaves</b> contains data.</p>

<p>On disk, each <b>Node</b> and <b>Leaf</b> are stored in <b>PageIOs</b>,
as we said. A <b>Node</b> will have pointers to some other logical pages, and
on disk, those pointers will be offset of the first <b>PageIO</b> used to store
the logical page it points to.</p>

<p>Here is the <b>Node</b> and <b>Leaf</b> data structures once
serialized :</p>

<p><span class="image-wrap" style=""><img src="/confluence/download/attachments/31824480/nodeLeaf.png?version=1&amp;modificationDate=1370956626000"
style="border: 0px solid black" /></span></p>

<p>Note that we store the size of the serialized data : this is necessary as we have
to know how much <b>PageIOs</b> will be needed to store the logical page. </p>

<p>The <b>rootPage</b> is just a <b>Node</b> or a <b>Leaf</b>.</p>

<h1><a name="Physicalpages-SpecialBTrees"></a>Special BTrees</h1>

<p>We have two special <b>BTrees</b> we use to manage the revisions and
the copied pages. We will explain what they are good for</p>

<h2><a name="Physicalpages-Revisiontree"></a>Revision tree</h2>

<p>We need to keep a track of each active revision, so that a search can work with a
specific revision. The idea is that when a search starts, it uses the latest revision, but
as some modification can occur while teh search is beng processed, some new revisions will
be added. In some case, we may want to keep a revision active for quite a long time.</p>

<p>So we store the active revisions in a dedicated <b>BTree</b>.</p>

<p>As we may have many <b>BTrees</b>, we have to use a key which is a combinaison
of the <b>BTree</b> name and its revision. So the revision <b>BTree</b>
manage the revisions of all the managed <b>BTrees</b>.</p>

<p>When a revision is not anymore used, we can remove it from the revision <b>BTree</b>.</p>

<p>This <b>BTree</b> is not a MVCC <b>BTree</b>, like the other
ones. In other words, we only keep the latest revision of this <b>BTree</b> (ie,
all the modified pages are immediately freed)</p>

<h2><a name="Physicalpages-CopiedpagesBTree"></a>Copied pages BTree</h2>

<p>Once we create a new revision, the pages we copied are not anymore in use except
if the revisions they are associated with are still in use. The problem is that we can't discard
those pages and move them to the free list until the associated revision is free.</p>

<p>We use a dedicated <b>BTree</b> to keep a track of the copied pages,
which will be reclaimed and moved to the free pages list once the associated revision will
be released.</p>

<h1><a name="Physicalpages-Managingthefreepages"></a>Managing the free pages</h1>

<p>We have a mechanism to manage the <b>PageIO</b> that are not anymore
in use. This is a linked list in which the free pages are added. If we need some page, we
first look into this list, and get back as many <b>PageIOs</b> as we need - until
we reach the end of this list. If we free some page, we add them at the end of the free list.</p>

<p>We always free a logical page, which may be stored into many <b>PageIOs</b>.
The good thing is that those <b>pageIOs</b> are already linked, so we just need
to make the last free <b>PageIO</b> to point on the first freed <b>PageIO</b>,
and to move the pointer to the last free page to the last <b>PageIO</b> used to
store the logical page.</p>
    </div>
        <div id="commentsSection" class="wiki-content pageSection">
        <div style="float: right;" class="grey">
                        <a href="https://cwiki.apache.org/confluence/users/removespacenotification.action?spaceKey=labs">Stop
watching space</a>
            <span style="padding: 0px 5px;">|</span>
                <a href="https://cwiki.apache.org/confluence/users/editmyemailsettings.action">Change
email notification preferences</a>
</div>
        <a href="https://cwiki.apache.org/confluence/display/labs/Physical+pages">View
Online</a>
        |
        <a href="https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=31824480&revisedVersion=8&originalVersion=7">View
Changes</a>
                |
        <a href="https://cwiki.apache.org/confluence/display/labs/Physical+pages?showComments=true&amp;showCommentArea=true#addcomment">Add
Comment</a>
            </div>
</div>
</div>
</div>
</div>
</body>
</html>

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@labs.apache.org
For additional commands, e-mail: commits-help@labs.apache.org


Mime
View raw message