hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-hadoop Wiki] Update of "Hbase/HbaseArchitecture" by JimKellerman
Date Thu, 01 Feb 2007 08:21:48 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by JimKellerman:
http://wiki.apache.org/lucene-hadoop/Hbase/HbaseArchitecture

------------------------------------------------------------------------------
+ This effort is still a "work in progress". Please feel free to add comments, but please
make them stand out by bolding or underlining them. Thanks!
+ 
  = Master Node =
  
   * Handles table creation and deletion
@@ -36, +38 @@

  = Tablet Server =
  
   * Manages and serves a set of Tablets (between 10 and 1000). A tablet is a row range of
the table, sorted in lexicographic order. Tablets comprise two types of data structures: one
or more on-disk structures called SSTables, and an in-memory data structure called the
memtable. Initially, a table consists of a single tablet.
+  * Tablet servers are configured with 1GB of virtual memory
+  * Tablet servers typically serve no more than 1GB of data
+   * If tablets are "full size" (128 MB), then a tablet server limited to 1GB of data will
serve approximately 8 to 10 tablets.
+   * A 100MB tablet contains about 100,000 1K records.
+   * 10 such tablets will therefore contain about 1x10^6^ records.
   * SSTables for a tablet are registered in the METADATA table
-  * Tablet size is 100-200MB by default
+  * Tablet size is 100-200MB by default (128MB typical)
+  * Typical SSTable block size is 64K.
+   * Thus a 64K block of 1K records contains between 50 and 64 records.
+   * A full-size (100-128MB) tablet will contain approximately 1,600-2,000 such blocks (see
the sizing sketch at the end of this list).
   * The set of tablets changes when:
    * a table is created or deleted
    * two existing tablets are merged to form a single larger tablet
    * an existing tablet is split into two smaller tablets
   * Splits tablets that have grown too large
+    * All of a tablet's SSTables can be consolidated into a single SSTable via a ''major'' compaction.
This suggests that tablets are split when they exceed 100-128MB in size.
    * Initiates the split by recording information for the new tablet in the METADATA table
    * Notifies the master of the split
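
   ''The sizing estimates in the bullets above can be checked with simple arithmetic. The Java
sketch below just recomputes them; the constants are the figures quoted above (128MB full-size
tablets, 1KB records, 64KB blocks, ~1GB per tablet server), not values taken from any implementation.''

{{{
// Back-of-the-envelope sizing using the figures quoted in the bullets above.
public class TabletSizing {
    public static void main(String[] args) {
        long tabletBytes = 128L << 20;   // 128 MB full-size tablet
        long recordBytes = 1L << 10;     // 1 KB record
        long blockBytes  = 64L << 10;    // 64 KB SSTable block
        long serverBytes = 1L << 30;     // ~1 GB of data per tablet server

        System.out.println("records per block  = " + blockBytes / recordBytes);   // 64
        System.out.println("blocks per tablet  = " + tabletBytes / blockBytes);   // 2048
        System.out.println("records per tablet = " + tabletBytes / recordBytes);  // ~131,000
        System.out.println("tablets per server = " + serverBytes / tabletBytes);  // 8 (about 10 with 100MB tablets)
        System.out.println("records per server = " + serverBytes / recordBytes);  // ~1x10^6
    }
}
}}}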
  
@@ -60, +71 @@

    * Writes mutation to commit log
    * has a group commit feature to improve performance. A group comprises all the tablets
currently being served by the tablet server.
    * Deletes are just writes with a ''special'' value that indicates that the record is deleted.
+   * Updates memtable.
+    * When a tablet is "brand new", all reads can be satisfied from the memtable.
-   * Updates memtable. When the memtable gets too large ''(how large?)'', the memtable is
written to a new SSTable. This is a ''minor'' compaction. 
+    * When the memtable gets too large ''(how large?)'', the memtable is written to a new
SSTable. This is a ''minor'' compaction. 
  
    ''Since this is not a tablet split, this implies that information about the new SSTable
is not written to the METADATA table. Is this correct? Maybe not because a minor compaction
does write a new redo point into the METADATA table and the redo point could contain a pointer
to the new SSTable.''
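
   ''A minimal Java sketch of the write path just described: append to the commit log, apply
the mutation to the sorted in-memory memtable (deletes are writes of a special value), and flush
the memtable to a new immutable SSTable once it grows past some threshold (a minor compaction).
The class name, the flush threshold, and the tombstone value are illustrative assumptions, not
the actual Bigtable/HBase code.''

{{{
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Sketch of the tablet write path: commit log first, then memtable, then a
// "minor" compaction that turns the full memtable into a new immutable SSTable.
class TabletWriteSketch {
    // Assumed flush threshold; the page asks "how large?" and the answer is not given here.
    static final long MEMTABLE_FLUSH_BYTES = 64L << 20;
    static final String TOMBSTONE = "__DELETED__";     // deletes are writes of a special value

    private final List<String> commitLog = new ArrayList<>();                 // stand-in for the redo log
    private final List<TreeMap<String, String>> sstables = new ArrayList<>(); // flushed, immutable
    private TreeMap<String, String> memtable = new TreeMap<>();               // sorted by key
    private long memtableBytes = 0;

    void write(String key, String value) {
        commitLog.add(key + "=" + value);       // durability first (group commit in practice)
        memtable.put(key, value);
        memtableBytes += key.length() + value.length();
        if (memtableBytes > MEMTABLE_FLUSH_BYTES) {
            minorCompaction();
        }
    }

    void delete(String key) {
        write(key, TOMBSTONE);                  // tombstone masks older values on reads
    }

    private void minorCompaction() {
        sstables.add(new TreeMap<>(memtable));  // memtable is already sorted; "write" an SSTable
        memtable = new TreeMap<>();             // start over with an empty memtable
        memtableBytes = 0;
        // a real implementation would also record a new redo point in the METADATA table
    }
}
}}}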
  
@@ -68, +81 @@

    * Checks that the request is well-formed
    * Checks that the sender is authorized (by reading authorization info from Chubby file,
usually a cache hit)
    * Executes read on merged view of memtable and SSTables. The merged view consists of the
memtable, the most recent SSTable, the next most recent SSTable, ..., the oldest SSTable.
The first hit for a key masks any potential duplicates in older SSTables.
+    * ''At what point does this become too expensive so that a merging compaction is triggered?''
   * Compactions
    * '''minor:''' Writes memtable to SSTable when it reaches a certain size. Writes new redo
point into METADATA table.
    * '''merging:''' Periodically merges a few SSTables and the memtable into one larger SSTable.
This newly generated table may contain deletion entries that suppress deleted data in older
tables.
+    * Since a merging compaction does not necessarily include all the SSTables in the tablet,
how are the candidate SSTables chosen? One might be tempted to merge the oldest SSTables,
but since a merging compaction also includes the memtable, that suggests the most recent
SSTables should be merged with the memtable. Further, a read is more likely to be satisfied
from the most recent SSTables, so this choice makes sense (see the sketch at the end of this
list).
+    * ''How many SSTables are chosen for a merging compaction?''
    * '''major:''' merging compaction that rewrites all SSTables into one SSTable. Contains
no deletion entries
+    * Initiated by master
+    * ''What is the threshold that triggers a major compaction?''
   * Commit Log
    * Stores redo records
    * Contains redo records for all tablets managed by tablet server
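
   ''The sketch below (plain Java, illustrative names only) shows the merged read view and a
merging compaction, which is essentially that view materialized over the memtable and a few of
the newest SSTables. Which and how many SSTables to merge is exactly the open question raised
above; the code just takes whatever list it is given.''

{{{
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

// Sketch of the merged read view: memtable first, then SSTables from newest to oldest;
// the first value found for a key masks any older duplicates (including tombstones).
class MergedViewSketch {
    static final String TOMBSTONE = "__DELETED__";

    // Visible value for key, or null if absent or deleted.
    static String read(String key,
                       TreeMap<String, String> memtable,
                       List<TreeMap<String, String>> sstablesNewestFirst) {
        String v = memtable.get(key);
        for (TreeMap<String, String> sstable : sstablesNewestFirst) {
            if (v != null) break;               // first hit wins; older SSTables are masked
            v = sstable.get(key);
        }
        return TOMBSTONE.equals(v) ? null : v;
    }

    // Merging compaction over the memtable and a few of the newest SSTables: newer values win,
    // and (unlike a major compaction) tombstones are kept so they keep suppressing data in the
    // older SSTables that were not merged.
    static TreeMap<String, String> mergingCompaction(TreeMap<String, String> memtable,
                                                     List<TreeMap<String, String>> newestFirstToMerge) {
        TreeMap<String, String> merged = new TreeMap<>();
        List<TreeMap<String, String>> oldestFirst = new ArrayList<>(newestFirstToMerge);
        Collections.reverse(oldestFirst);
        for (TreeMap<String, String> sstable : oldestFirst) {
            merged.putAll(sstable);              // newer tables overwrite older entries
        }
        merged.putAll(memtable);                 // memtable is newest of all
        return merged;                           // becomes the new, larger SSTable
    }
}
}}}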
@@ -85, +103 @@

    * Reconstructs the memtable by applying all of the updates that have committed since the
redo points
   * Caching
    * Scan Cache: caches the key/value pairs returned by the SSTable interface
-    * Block Cache: caches blocks read from the SSTables
+   * Block Cache: caches blocks read from the SSTables
   * Bloom Filter
    * Optional in-memory structure that reduces disk accesses by indicating whether an SSTable
might contain any data for a given row/column pair (see the sketch at the end of this list)
   * API
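
   ''To make the Bloom filter bullet above concrete, here is a small illustration using Guava's
BloomFilter (a stand-in, not the Bigtable or HBase implementation): before reading an SSTable
from disk, ask the filter whether the key might be present. A "no" answer is definite and the
disk read is skipped; a "yes" may be a false positive, so the SSTable is then consulted.''

{{{
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

// Illustration (using Guava, not the Bigtable/HBase code) of how a per-SSTable Bloom
// filter avoids disk reads for keys the SSTable cannot contain.
class SSTableBloomSketch {
    public static void main(String[] args) {
        BloomFilter<String> filter = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8),
                100_000,   // expected number of keys in this SSTable
                0.01);     // ~1% false-positive rate

        filter.put("row123/anchor:cnn.com");    // populated as the SSTable is written

        if (!filter.mightContain("row999/anchor:cnn.com")) {
            System.out.println("definitely not in this SSTable: skip the disk read");
        }
    }
}
}}}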
@@ -98, +116 @@

   * Immutable (write-once)
   * Sorted list of key/value pairs.
   * Sequence of 64KB blocks
-  * Block index (presumably consists of the start keys for each block)
+  * Block index __consists of the start keys for each block__
   * Compression
    * Per block
    * ''Column family compression?''
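
   ''A small sketch of how the block index of start keys described above can be used: a floor
lookup finds the last block whose start key is <= the key being sought, so at most one 64K block
has to be read (and decompressed) from disk. The TreeMap here is an illustrative stand-in for
the on-disk index.''

{{{
import java.util.Map;
import java.util.TreeMap;

// Sketch of an SSTable block index: start key of each block -> offset of that block in the
// SSTable file. Since keys are sorted, the block that could contain a key is the one with
// the greatest start key <= key (a floor lookup).
class BlockIndexSketch {
    private final TreeMap<String, Long> index = new TreeMap<>();

    void addBlock(String startKey, long fileOffset) {
        index.put(startKey, fileOffset);
    }

    // Offset of the only block that can contain key, or -1 if key precedes all blocks.
    long findBlockOffset(String key) {
        Map.Entry<String, Long> entry = index.floorEntry(key);
        return entry == null ? -1L : entry.getValue();
    }
}
}}}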
@@ -139, +157 @@

  
    ''If each metadata row is about 1KB, the METADATA table size required to map all the tablets
is 2x10^9^ bytes. If each METADATA tablet is 100MB that requires 20 METADATA tablets to map
the entire table, and consequently 20 root tablet rows to map the METADATA tablets.''
  
-  * Stores the location of a tablet under a row key that is an encoding of the tablet's table
ID and it's end row
+  * Stores the location of a tablet under a row key that is an encoding of the tablet's table
ID and its __'''end row'''__
+   * This seems counterintuitive for a couple of reasons:
+    1. Since a row key can be an arbitrary string, how do you represent the "maximum value"
for the last tablet?
+    1. As new rows are added with row keys that are greater than the previous maximum value,
is the metadata table only updated when the last tablet does a major compaction?
+ 
+    If the first key in a tablet were stored instead, the minimum value could be represented
as the empty string "" and a simple string comparison between a key and the tablet's first
key (key >= first key) would show whether the key could belong to that tablet.
+ 
+    If the keys go from 1 to n, when the tablet is split, the first tablet still has the
row key "" as its first key and the second tablet has a first row key of n/2. So if a key
is presented and its value is < n/2, you know that if it exists, it is in the first tablet
and if the value >= n/2 it is in the second tablet.
+ 
+    I suppose you could represent the maximum row key as the empty string, but that would
require a special case instead of just a simple compare (the sketch at the end of this list
illustrates such a special case for an end-row keyed lookup).
+ 
   * The "location" column family is in it's own locality group and has the ''InMemory'' tuning
parameter set
   * Each row stores approximately 1KB of data in memory
   * All events pertaining to each tablet are logged here (such as when a tablet server starts
serving a tablet)
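
   ''The following sketch illustrates the end-row question discussed above. It assumes, purely
for illustration, that the unbounded last tablet is represented by a reserved empty end row
handled as a special case, and that a tablet holds the rows up to and including its end row;
a ceiling lookup on the end row then finds the one tablet whose range contains a given key.''

{{{
import java.util.Map;
import java.util.TreeMap;

// Sketch of locating a tablet when METADATA rows are keyed by (tableId, end row).
// Assumption (not stated in the page): the last tablet's end row is a reserved empty
// string meaning "after every key", handled as a special case below.
class MetadataLookupSketch {
    private final TreeMap<String, String> tabletsByEndRow = new TreeMap<>(); // end row -> tablet location
    private String lastTabletLocation;  // tablet whose end row is the reserved "" (unbounded)

    void registerTablet(String endRow, String location) {
        if (endRow.isEmpty()) {
            lastTabletLocation = location;      // special case: unbounded final tablet
        } else {
            tabletsByEndRow.put(endRow, location);
        }
    }

    // The tablet serving rowKey is the one with the smallest end row >= rowKey,
    // falling back to the unbounded last tablet.
    String locate(String rowKey) {
        Map.Entry<String, String> e = tabletsByEndRow.ceilingEntry(rowKey);
        return e != null ? e.getValue() : lastTabletLocation;
    }
}
}}}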
