hadoop-common-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-hadoop Wiki] Update of "Hbase/HbaseArchitecture" by JimKellerman
Date Tue, 13 Feb 2007 03:44:51 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by JimKellerman:
http://wiki.apache.org/lucene-hadoop/Hbase/HbaseArchitecture

------------------------------------------------------------------------------
  or '''column''' for short, the latter is called a ''map column'' or
  '''map''' for short.
  
+ There are two types of ''column'':
+  1. value is arbitrary sequence of bytes
+  1. value is an integer counter. We refer to this type of column as a '''counter'''
+ 
  Google makes no distinction between these two value types and groups
  them under the term ''column family''. They achieve the single valued
  column as a degenerate case of a column family. A single valued column
@@ -59, +63 @@

  application desire to implement a ''locality group'' it can do so by
  simply restricting its map column key set.
  
- We use the terms '''column''' and '''map''' throughout the rest of the document for consistency.
+ We use the terms '''column''', '''map''' and '''counter''' throughout the rest of the document
for consistency.
  
  [[Anchor(conceptual)]]
  == Conceptual View ==
@@ -136, +140 @@

  lock service. The following depicts the layout of the file space:
  
  {{{
- /hbase/table-name1/columnname1
+ /hbase/instance-name/table-name1/columnname1
-                    columnname2
+                                  columnname2
-                    etc.
+                                  etc.
  
-        table-name2/columnname1
+                      table-name2/columnname1
-                    columnname2
+                                  columnname2
-                    etc.
+                                  etc.
  }}}
  
  Each column file contains configuration settings for the column:
-  * whether it is a ''column'' or a ''map''
+  * whether it is a ''column'', ''map'', or ''counter''
-   * if ''column'', whether the value is an integer counter
+   * if ''map'', map compression setting
   * tablet size
   * tablet block size
   * garbage collection policy
@@ -155, +159 @@

    * only keep data less than n days old
   * in memory setting
   * bloom filter enabled/disabled
+  * block compression
  
  Access control for each column is based on the access control setting
  for the column file.
  
  Each Hbase table is served by a number of server processes which
- comprise an Hbase '''instance'''. When a server process is started, it
+ belong to an Hbase '''instance'''. When a server process is started, it
  is told which instance to join.
  
  Since the master server and tablet servers run substantially different
@@ -176, +181 @@

  Thus the run time information looks like:
  
  {{{
- /hbase/table-name1/root-tablet
+ /hbase/instance-name/table-name1/root-tablet
-                    master.lock
+                                  master.lock
-                    servers/
+                                  servers/
-                            tablet-server-1
+                                  tablet-server-1
-                            tablet-server-2
+                                  tablet-server-2
-                            etc.
+                                  etc.
  
-        table-name2/root-tablet
+        instance-name/table-name2/root-tablet
-                    master.lock
+                                  master.lock
-                    servers/
+                                  servers/
-                            tablet-server-1
+                                  tablet-server-1
-                            tablet-server-2
+                                  tablet-server-2
-                            etc.
+                                  etc.
  }}}
  
  and the complete distributed lock server file space for a single table
@@ -222, +227 @@

  = Master Node =
  
   * Handles table creation and deletion
+   * bootstrap
+    * create instance directory in lock server file system
+    * create servers directory in lock server file system
+    * create root tablet in HDFS
+    * save location in lock server file system
+    * create first metadata tablet
+    * write metadata tablet information to root tablet
+   * column/map creation
+    * create file in lock server file system
+    * write configuration settings to file
+    * create new tablet file in HDFS
+    * write tablet information to METADATA table
+    * assign tablet to tablet server
   * Responsible for assigning tablets to tablet servers
   * Detects the addition and expiration of tablet servers
   * Balances tablet server load
@@ -331, +349 @@

   * Can be Memory-mapped
   * Can be shared by two tablets immediately after a split
   * API
-   * Lookup a given key ''(with or without timestamp?)''
+   * Lookup a given key ''(with or without timestamp)''
    * Iterate over key/value pairs
  
  As a first approximation, a Hadoop !MapFile satisfies these requirements. It is a persistent
(lives in the Hadoop DFS), ordered (!MapFile is based on !SequenceFile which is strictly ordered),
immutable (once written, an attempt to open a !MapFile for writing will overwrite the existing
contents) map from keys to values.
@@ -360, +378 @@

    * Root tablet stores the location of all tablets in a METADATA table
    * The "root tablet" is just the first tablet in the METADATA table, it is never split
  
-   ''I really don't like this definition (and I know it is from the Bigtable paper). Aside
from the fact that it is never split, the root tablet is special in another way: it is the
metadata table for the METADATA table.''
+   ''I really don't like this definition (and I know it is from the Bigtable paper). Aside
from the fact that it is never split, the root tablet is special in another way: it is the
metadata table for the METADATA table.'' ''It '''does''' however have the same format as the
rest of the METADATA table.''
  
-   '''Example:''' [[Anchor(example)]]
+  * The row key is a combination of the table name, column or map name and the __'''first
row key'''__ of the tablet.
  
+   ''Note that this differs from Bigtable which stores the location of a tablet under a row
key that is an encoding of the tablet's table ID and it's __'''end row'''__''. ''The reasons
for this are:''
-   ''Suppose you had a table with one column and 4x10^9^ rows. Each row contains about 5KB
of data resulting in a total table size of 200TB.''
- 
-   ''If each tablet holds about 100MB of data, this will require 2x10^6^ tablets and the
same number of rows in the METADATA table.''
- 
-   ''If each metadata row is about 1KB, the METADATA table size required to map all the tablets
is 2x10^9^ bytes. If each METADATA tablet is 100MB that requires 20 METADATA tablets to map
the entire table, and consequently 20 root tablet rows to map the METADATA tablets.''
- 
-  * Stores the location of a tablet under a row key that is an encoding of the tablet's table
ID and it's __'''end row'''__
-   * This seems counter intuitive for a couple of reasons:
-    1. Since a row key can be an arbitrary string, how do you represent the "maximum value"
for the last tablet?
+    1. ''Since a row key can be an arbitrary string, how do you represent the "maximum value"
for the last tablet?''
-    1. As new rows are added with row keys that are greater than the previous maximum value,
is the metadata table only updated when the last tablet does a major compaction?
+    1. ''As new rows are added with row keys that are greater than the previous maximum value,
the metadata table needs to be updated more frequently than if the first row key is stored''
+    1. ''If the first key in a tablet is stored instead, the minimum value could be represented
as the empty string "" and a simple string comparison between a key and the first key would
result in key > first key.'' If you represented the maximum row key as the empty string,
that would require a special case instead of just a simple compare.''
  
-    If the first key in a tablet were stored instead, the minimum value could be represented
as the empty string "" and a simple string comparison between a key and the first key would
result in key > first key.
- 
-    If the keys go from 1 to n, when the tablet is split, the first tablet still has the
row key "" as its first key and the second tablet has a first row key of n/2. So if a key
is presented and its value is < n/2, you know that if it exists, it is in the first tablet
and if the value >= n/2 it is in the second tablet.
+    1. ''If the keys go from 1 to n, when the tablet is split, the first tablet still has
the row key "" as its first key and the second tablet has a first row key of n/2. So if a
key is presented and its value is < n/2, you know that if it exists, it is in the first
tablet and if the value >= n/2 it is in the second tablet.''
- 
-    I suppose you could represent the maximum row key as the empty string but that would
require a special case instead of just a simple compare.
  
   * The "location" map has the ''InMemory'' tuning parameter set
   * Each row stores approximately 1KB of data in memory
   * All events pertaining to each tablet are logged here (such as when a tablet server starts
serving a tablet)
+ 
+ === Format ===
+ 
+ The METADATA table has a single '''map''' column that stores the following data about the
tablet it refers to:
+  * tablet file name
+  * whether the tablet contains a ''column'', ''map'' or ''counter''
+   * if ''map'', compression setting
+  * maximum desired tablet size
+  * tablet block size
+  * block compression setting
+  * garbage collection policy
+  * whether the tablet should be memory resident
+  * whether there is a bloom filter enabled for the tablet
+  * log redo point
+  * when the tablet most recently came on-line
+  * names and ordering of ''sub-tablets'' created as the result of a minor or merging compaction
+ 
+ [[Anchor(example)]]
+ === Example ===
+ 
+ ''Suppose you had a table with one column and 4x10^9^ rows. Each row contains about 5KB
of data resulting in a total table size of 200TB.''
+ 
+ ''If each tablet holds about 100MB of data, this will require 2x10^6^ tablets and the same
number of rows in the METADATA table.''
+ 
+ ''If each metadata row is about 1KB, the METADATA table size required to map all the tablets
is 2x10^9^ bytes. If each METADATA tablet is 100MB that requires 20 METADATA tablets to map
the entire table, and consequently 20 root tablet rows to map the METADATA tablets.''
  
  [[Anchor(ipc)]]
  = Inter-process Communication Messages =
@@ -405, +438 @@

  ||<:> byte ||<(> protocol version number ||
  ||<:> int ||<(> message type (op code) ||
  ||<:> int ||<(> number of parameters ||
- ||||<:> ''repeat the following for each parameter'' ||
+ ||||<:>  ''repeat the following for each parameter''  ||
  ||<:> int ||<(> parameter identifier ||
  ||<:> int ||<(> length of parameter value ||
  ||<:> byte[length] ||<(> parameter value ||
  
  [[Anchor(ipcclientmaster)]]
  == Client to Master Messages ==
+ 
+  * create table (table name)
+  * create column (type: {simple|counter|map}, max-tablet-size, tablet-block-size, gc-policy,
compression-policy, in-memory, bloom-filter)
+  * get column metadata returns values that were specified for create column
+  * set acl (column-name, read, write, control)
+  * get acl (column-name) returns (read, write, control)
+  * get tablet server(tablet) returns server
  
  [[Anchor(ipcclienttablet)]]
  == Client to TabletServer Messages ==

Mime
View raw message