directory-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From kayyag...@apache.org
Subject svn commit: r1593222 - /directory/site/trunk/content/mavibot/user-guide/7.3-serializations.mdtext
Date Thu, 08 May 2014 10:06:23 GMT
Author: kayyagari
Date: Thu May  8 10:06:23 2014
New Revision: 1593222

URL: http://svn.apache.org/r1593222
Log:
re-wrote parts of the text

Modified:
    directory/site/trunk/content/mavibot/user-guide/7.3-serializations.mdtext

Modified: directory/site/trunk/content/mavibot/user-guide/7.3-serializations.mdtext
URL: http://svn.apache.org/viewvc/directory/site/trunk/content/mavibot/user-guide/7.3-serializations.mdtext?rev=1593222&r1=1593221&r2=1593222&view=diff
==============================================================================
--- directory/site/trunk/content/mavibot/user-guide/7.3-serializations.mdtext (original)
+++ directory/site/trunk/content/mavibot/user-guide/7.3-serializations.mdtext Thu May  8 10:06:23
2014
@@ -24,11 +24,11 @@ Notice: Licensed to the Apache Software 
 
 # 7.3 - Serializations
 
-The logical pages are serialized before being stored on physical pages. This is a process
that should obviously be reversible. W will describe in this chapter how we serialize **Leaf**
and **Node**.
+The logical pages are serialized before storing in physical pages. This is a process that
should obviously be reversible. In this chapter, we will describe how to serialize a **Leaf**
and **Node**.
 
 ## Leaf serialization
 
-A **Leaf** contains some medata data, and a set of keys and values. A key can have many values,
and we have as many keys as values. That being said, let's describe the internal format of
a serialized leaf :
+A **Leaf** contains some metadata, and a set of keys and values. A key can have many values,
and we have as many keys as values. The internal format of a serialized leaf contains:
 
 <pre>
     +----------------------------------+
@@ -61,81 +61,35 @@ A **Leaf** contains some medata data, an
 
 </pre>
 
-We keep the length of each serialized key and value just because we need to provide a complete
byte[] to the key or value deserializer (ie, the exact number of bytes needed to be able to
deserialize the key or the value).
+The length of each serialized key and value is stored so that a complete byte[] can be passed
to the key or value deserializer (ie, the exact number of bytes needed to be able to deserialize
the key or the value).
 
-The _dataSize_ value is used so that we can know how many bytes we will have to read - thus
the number of physical pages to read - in order to get a full page.
+The _dataSize_ value is used to know how many bytes needs to be read - thus the number of
physical pages to read - in order to get a full page.
 
 ## Leaf deserialization
 
-On the other way, when we deserialize a leaf, we won't deserialize the keys and the values
(it would be too expensive, if the leaf is discarded from memory immediately, when we only
need to read one single key and value). We thus keep the byte[] form of each keys and each
values, which will be deserialized on demand.
+When a leaf needs to be deserialized, not all the keys and the values are deserialized at
once. It would be too expensive, if the leaf is discarded from memory immediately, when we
only need to read one single key and value, hence the keys and values are kept in byte[] form,
and will be deserialized on demand.
 
-We use two data structures to store a key and a value :
+Two data structures are used to store a key and a value :
 * a _KeyHolder_ for the key
 * a _ValueHolder_ for the value
 
 ### KeyHolder
 
 The _KeyHolder_ data structure holds the key in two ways :
-* serialized (raw)
+* serialized (raw (byte[]))
 * deserialized (key)
 
-When we just read the data from disk, we don't deserialize the keys. We do that when needed.
-
-Here is a description of this class :
-
-<pre>
-public class KeyHolder<K>
-{
-    /** The deserialized key */
-    private K key;
-
-    /** The ByteBuffer storing the key */
-    private byte[] raw;
-
-    /** The Key serializer */
-    private ElementSerializer<K> keySerializer;
-}
-</pre>
-
-#### KeyHolder operations
-
-Here is a description of the available methods in the KeyHolder class.
-
-##### Constructors
-
-We have two constructors for this class, one which takes a deserialized key, the other which
takes a byte[].
-
-* _KeyHolder( ElementSerializer<K> keySerializer, K key )_
-
-Here, we need to serailize the key immediately, as we may have to flush the key to the disk.
We then serialize the Key immediately and store the resulting byte[] into the _raw_ field.
-
-
-* KeyHolder( ElementSerializer<K> keySerializer, byte[] raw )
-
-Here, we just get the serialized form. We don't need to deserialize it, as the key might
not be used anytime soon. We thus just update the _raw_ field, and the _key_ field remains
null.
-
-##### getKey()
-
-This method retuns the deserialized key. If it does not exist, then we deserialize it on
the fly using the _raw_ field.
-
-##### setKey()
-
-This method set the key. We immediately serialize it, and store the results in the _raw_
field.
-
-##### getRaw()
-
-Returns the _raw_ field. This method is only visible from the classes in the same package.
-
+The keys are not deserialized immediately after the data was read from disk instead they
are deserialized on demand.
 
 ### ValueHolder
 
-The _ValueHolder_ data structure will store the list of values associated with a key. As
we may have more than one value, we use an internal structure for that purpose. This is a
complex data structure, if we compare it with the _KeyHolder_ one.
+Mavibot supports multiple values for a key. The _ValueHolder_ data structure will store a
set of values associated with a key. This is a complex data structure when compared with the
_KeyHolder_.
 
-In some case, the number of values to store is really big, this we need to use an internal
data structure that allows a quick retrieval of a value, plus we need to be able to copy a
page containing such a value in an efficient way. For these reasons, we use two different
internal data structures :
-* an array up to a threshold
-* a sub-BTree above this threshold
+An array is used as the default container to hold these values. In some cases, the number
of values to be stored is really big, and using an array will impact the performance.
+In such cases the array is replaced by a BTree, this helps in improving the retrieval time,
and also page copying becomes more efficient.
 
-When we reach the threshold, the array is transformed into a BTree, and the way back if we
get below this number. In order to avoid many array <-> btree transformations if we
continusously add and delete a value, the array -> btree threshold is bigger than the btree
-> array threshold.
+When the number of the values stored reaches a threshold, the array is replaced with a BTree,
likewise if the number of values stored are is below the threshold then BTree is replaced
by an array.
+In order to avoid many array <-> btree transformations (e.g. when continusously adding
and deleting a value), the array -> btree threshold is bigger than the btree -> array
threshold.
 
 <pre>
    0---1---2---...---TH-low--...--TH-high---...
@@ -145,63 +99,33 @@ When we reach the threshold, the array i
    <-----Array-----<<--------BTree------------... When we delete values.
 </pre>
 
-It's important to know that the sub-BTree will hold only keys, and no values. The sub-btree
Keys will be the values we have to store.
-
-Here is the description of this class :
-
-<pre>
-public class ValueHolder<V> implements Cloneable
-{
-    /** The deserialized value */
-    private V[] valueArray;
-
-    /** The BTree storing multiple value, if we have moe than a threashold values */
-    private BTree<V, V> valueBtree;
-
-    /** The serialized value */
-    private byte[] raw;
-
-    /** A flag set to true if the values are stored in a BTree */
-    private boolean isSubBtree = false;
-
-    /** The RecordManager */
-    private BTree<?, V> btree;
-
-    /** The Value serializer */
-    private ElementSerializer<V> valueSerializer;
-
-    /** An internal flag used when the values are not yet deserialized */
-    private boolean isRaw = true;
-}
-</pre>
-
-As we can see, we use two different fields to store the data, either the _valueArray_ field
or the _valueBtree_ field. The _raw_ field contains the serialized values when the values
are stored in an array; It's null either if the values are stored in a BTree, or if we already
have deserialized the values.
+It's important to note that the values inserted into the sub-BTree will be stored as keys,
and all the values of this sub-BTree will contain nulls. The sub-btree Keys will be the values
of the key present in the parent-BTree.
 
 #### Raw/deserialized values
 
 One key for obtaining good performances is to avoid any useless deserialization. This is
easy to implement for the _KeyHolder_, as we only store one single key. For values, it's slightly
more complex, as we may have more than one value. The following rules should be followed :
 
-* don't deserialized until necessary (ie, when one need to get one value)
-* don't serialize when unnedded (ie until the values must be written back to disk)
+* don't deserialized until necessary (ie, until one needs to get a value)
+* don't serialize until writing (i.e, until the value(s) must be written to disk)
 
-In fact, we may be in three different states :
+In fact, a value may be in three different states :
 
 * all the values are serialized (when we just read the page from disk)
-* non of the values are serialized (when we just created a ValueHolder with new values)
+* none of the values are serialized (when we just created a ValueHolder with new values)
 * somewhere in the middle, when we are modifying a ValueHolder which has been read from the
disk
 
 The third case is the complex one. We should consider two different cases though :
-* the values are stored in a sub BTree : we don't have to deal with this problem, it's up
to the sub-btree to deal with it
-* the values are stored in an array : we don't want to store half of the values as byte[],
and half of the values as java instances. We must pick either one form or the other. In this
case, as soon as we have to manipulate values in Java, then we need to deserialize all the
values.
+* the values are stored in a sub-BTree : we don't have to deal with this problem, it's up
to the sub-btree to deal with it
+* the values are stored in an array : we don't want to store half of the values as byte[],
and half of the values as java instances. We must pick either one form or the other. In this
case, as soon as we have to manipulate values in Java, we must deserialize all the values.
 
 #### ValueHolder operations
 The possible operations on a ValueHolder are the following :
 
-* add( value ) : Insert a new value into the ValueHolder. If we reach the upper threshold,
then the array is converted into a BTree. In any case, we inject the new value into the array
or the BTree so that we keep all the value ordered (the ValueSerializer must have a _Comparator_).
+* add( value ) : Insert a new value into the ValueHolder. If we reach the upper threshold,
then the array is converted into a BTree. In any case, we inject the new value into the array(in
the correct order using the comparator present in the serializer) or the BTree.
 
-As we need to compare values, they must be deserialised, so we need to do it if it's not
already done (the values are not deserialiezed when the page is read from the disk). Note
that it's not necessary for the sub BTree, as it's up to the sub-btree to deserialize the
keys on the fly
+As we need to compare values, they must be deserialised, so we need to do it if it's not
already done (the values are not deserialiezed when the page is read from the disk). Note
that it's not necessary for the sub-btree, as it's up to the sub-btree to deserialize the
keys on the fly.
 
-The _add_ algorithm will thus be :
+Thus the _add_ algorithm will be :
 
 <pre>
   if the values are not yet deserialized



Mime
View raw message