hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-hadoop Wiki] Trivial Update of "Hbase/HbaseArchitecture" by JimKellerman
Date Thu, 01 Feb 2007 02:26:13 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by JimKellerman:
http://wiki.apache.org/lucene-hadoop/Hbase/HbaseArchitecture

------------------------------------------------------------------------------
  
  = Tablet Server =
  
-  * Manages and serves a set of Tablets (between 10 and 1000). A tablet is a row range of
the table sorted in lexicographical order. Tablets comprise two types of data structures: one
or more on-disk structures called SSTables, and one or more in-memory data structures called
memtables
+  * Manages and serves a set of Tablets (between 10 and 1000). A tablet is a row range of
the table sorted in lexicographical order. Tablets comprise two types of data structures: one
or more on-disk structures called SSTables, and one or more in-memory data structures called
memtables. Initially, a table consists of a single tablet.
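A tablet can be pictured roughly as the following structure (a minimal sketch; the class and field names are illustrative, not taken from any actual implementation):
{{{
// Illustrative sketch of a tablet; all names here are hypothetical.
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

class Tablet {
    String table;                       // table this tablet belongs to
    String startRow;                    // inclusive start of the row range
    String endRow;                      // exclusive end of the row range
    SortedMap<String, byte[]> memtable = new TreeMap<String, byte[]>();  // in-memory, sorted
    List<String> sstablePaths = new ArrayList<String>();  // immutable on-disk SSTables
}
}}}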
+  * SSTables for a tablet are registered in the METADATA table
   * Tablet size is 100-200MB by default
-  * Memtable rows are marked Copy-on-write during reads to allow writes to happen in parallel
-  * Handles read/write requests to tablets
+  * The set of tablets changes when:
+   * a table is created or deleted
+   * two existing tablets are merged to form a single larger tablet
+   * an existing tablet is split into two smaller tablets
   * Splits tablets that have grown too large
    * Initiates the split by recording information for the new tablet in the METADATA table
    * Notifies the master of the split
  
    ''So if a METADATA tablet splits, that would imply that the root tablet needs to be updated.''
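In outline, the split bookkeeping described above might look like the sketch below (it reuses the Tablet sketch from earlier; MetadataTable and Master are illustrative stubs, not real APIs):
{{{
// Hypothetical outline of a tablet split; nothing here is a real API.
interface MetadataTable { void recordNewTablet(Tablet t); }
interface Master        { void notifySplit(Tablet original, Tablet added); }

class SplitSketch {
    MetadataTable metadata;
    Master master;

    void splitTablet(Tablet t, String splitRow) {
        Tablet upper = new Tablet();        // new tablet covering [splitRow, endRow)
        upper.table = t.table;
        upper.startRow = splitRow;
        upper.endRow = t.endRow;
        t.endRow = splitRow;                // original tablet shrinks to [startRow, splitRow)

        metadata.recordNewTablet(upper);    // 1. record the new tablet in the METADATA table
        master.notifySplit(t, upper);       // 2. notify the master of the split
    }
}
}}}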
  
+  * Memtable rows are marked Copy-on-write during reads to allow writes to happen in parallel
+  * Handles read/write requests to tablets
   * Can be dynamically added or removed
   * Clients communicate directly with tablet servers
   * Announces its existence by creating a uniquely named file in the /servers Chubby directory
   * Stops serving its tablets and kills itself if it cannot renew the lease on its /servers
file
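The register-and-renew behaviour amounts to something like the loop below (purely illustrative; Chubby is Google-internal, so the LockService interface here is a stand-in):
{{{
// Illustrative sketch of tablet-server registration; LockService is a stand-in for Chubby.
interface LockService {
    String createUniqueFile(String dir);   // e.g. a uniquely named file under /servers
    boolean renewLease(String path);       // false if the lease could not be renewed
}

class ServerRegistration implements Runnable {
    private final LockService lockService;

    ServerRegistration(LockService lockService) { this.lockService = lockService; }

    public void run() {
        String myFile = lockService.createUniqueFile("/servers");  // announce existence
        while (lockService.renewLease(myFile)) {
            try { Thread.sleep(10000); } catch (InterruptedException e) { return; }
        }
        System.exit(1);  // lost the lease: stop serving tablets and kill the process
    }
}
}}}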
+  * Writes
+   * Checks that the request is well-formed
+   * Checks that the sender is authorized (by reading authorization info from Chubby file,
usually a cache hit)
+   * Writes mutation to commit log
+    * has a group commit feature to improve performance. A group comprises all the tablets
currently being served by the tablet server.
+   * Deletes are just writes with a ''special'' value that indicates that the record is deleted.
+   * Updates memtable. When the memtable gets too large ''(how large?)'', the memtable is
written to a new SSTable. This is a ''minor'' compaction. 
+ 
+   ''Since this is not a tablet split, this implies that information about the new SSTable
is not written to the METADATA table. Is this correct? Maybe not because a minor compaction
does write a new redo point into the METADATA table and the redo point could contain a pointer
to the new SSTable.''
+ 
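Putting the write steps together, the write path might be shaped roughly as below (a sketch under assumed names, reusing the Tablet sketch above; the flush threshold is a guess, since the paper does not give one):
{{{
// Hypothetical write path; every name is illustrative, not a real HBase/Bigtable API.
class WritePath {
    static final byte[] DELETE_MARKER = new byte[0];             // special value meaning "deleted"
    static final long MEMTABLE_FLUSH_BYTES = 64L * 1024 * 1024;  // assumed threshold

    interface CommitLog { void append(String table, String row, byte[] value); }

    private final CommitLog commitLog;
    WritePath(CommitLog commitLog) { this.commitLog = commitLog; }

    void write(Tablet t, String row, byte[] value) {
        if (row == null || value == null)                 // well-formedness check
            throw new IllegalArgumentException("malformed request");
        // (authorization check against cached Chubby data would go here)
        commitLog.append(t.table, row, value);            // group-committed with other mutations
        t.memtable.put(row, value);                       // apply the mutation in memory
        if (memtableBytes(t) > MEMTABLE_FLUSH_BYTES)
            flushToNewSSTable(t);                         // minor compaction
    }

    void delete(Tablet t, String row) {
        write(t, row, DELETE_MARKER);                     // a delete is a write of the special value
    }

    private long memtableBytes(Tablet t) {
        long n = 0;
        for (byte[] v : t.memtable.values()) n += v.length;
        return n;
    }

    private void flushToNewSSTable(Tablet t) { /* see Compactions below */ }
}
}}}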
+  * Reads
+   * Checks that the request is well-formed
+   * Checks that the sender is authorized (by reading authorization info from Chubby file,
usually a cache hit)
+   * Executes read on merged view of memtable and SSTables. The merged view consists of the
memtable, the most recent SSTable, the next most recent SSTable, ..., the oldest SSTable.
The first hit for a key masks any potential duplicates in older SSTables.
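A sketch of that merged-view lookup, under the same illustrative names (a real read would also treat a deletion entry as "not found"):
{{{
// Hypothetical read over the merged view: memtable first, then SSTables newest to oldest.
import java.util.List;
import java.util.Map;

class ReadPath {
    interface SSTableReader { byte[] get(String row); }   // illustrative stub, returns null on miss

    byte[] read(Map<String, byte[]> memtable, List<SSTableReader> newestFirst, String row) {
        byte[] v = memtable.get(row);           // memtable is consulted first
        if (v != null) return v;
        for (SSTableReader sst : newestFirst) { // then SSTables, most recent first
            v = sst.get(row);
            if (v != null) return v;            // the first hit masks older duplicates
        }
        return null;                            // key not present in any component
    }
}
}}}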
+  * Compactions
+   * '''minor:''' Writes memtable to SSTable when it reaches a certain size. Writes new redo
point into METADATA table.
+   * '''merging:''' Periodically merges a few SSTables and the memtable into one larger SSTable.
This newly generated table may contain deletion entries that suppress deleted data in older
tables.
+   * '''major:''' a merging compaction that rewrites all SSTables into one SSTable. The result contains
no deletion entries.
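A merging compaction can be sketched as a sorted re-merge in which newer entries win; in the major case, deletion entries are dropped (all names are illustrative):
{{{
// Hypothetical compaction sketch: merge a few SSTables plus the memtable into one sorted map.
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CompactionSketch {
    TreeMap<String, byte[]> merge(Map<String, byte[]> memtable,
                                  List<Map<String, byte[]>> newestFirstSSTables,
                                  boolean major) {
        TreeMap<String, byte[]> out = new TreeMap<String, byte[]>();
        // apply oldest to newest so that newer entries overwrite older ones
        for (int i = newestFirstSSTables.size() - 1; i >= 0; i--)
            out.putAll(newestFirstSSTables.get(i));
        out.putAll(memtable);
        if (major) {
            // a major compaction drops deletion entries (empty values in the write sketch)
            Iterator<byte[]> it = out.values().iterator();
            while (it.hasNext())
                if (it.next().length == 0) it.remove();
        }
        return out;   // the result would then be written out as one new SSTable
    }
}
}}}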
   * Commit Log
    * Stores redo records
    * Contains redo records for all tablets managed by tablet server
    * Key consists of <table, row name, log sequence number>
    * To speed recovery when a tablet server dies, the log is sorted by key. This sort is
done by breaking the log into 64MB chunks and is done in parallel on different tablet servers.
The sort is managed by the Master.
    * Two logs are kept, one active and one inactive. When writing to one log becomes slow,
a log sequence number is incremented and the server switches to the other log. During recovery, both
logs are sorted together and the sequence number is used to elide duplicated entries.
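The log record key described above can be modelled roughly as follows (illustrative only; the recovery-time sort uses this ordering so that each tablet's redo records end up contiguous):
{{{
// Hypothetical commit-log record keyed by <table, row name, log sequence number>.
class LogRecord implements Comparable<LogRecord> {
    final String table;     // key part 1
    final String row;       // key part 2
    final long sequence;    // key part 3: log sequence number
    final byte[] value;     // the mutation itself

    LogRecord(String table, String row, long sequence, byte[] value) {
        this.table = table; this.row = row; this.sequence = sequence; this.value = value;
    }

    public int compareTo(LogRecord o) {
        int c = table.compareTo(o.table);
        if (c != 0) return c;
        c = row.compareTo(o.row);
        if (c != 0) return c;
        return (sequence < o.sequence) ? -1 : (sequence == o.sequence ? 0 : 1);
    }
}
}}}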
-  * Moving tablets. When a tablet is moved from one server to another, the tablet server
does a compaction prior to the move to speed up tablet recovery.
+  * Moving tablets. When a tablet is moved from one server to another, the tablet server
does a major compaction prior to the move to speed up tablet recovery.
-  * SSTables for a tablet are registered in the METADATA table
   * Tablet recovery
    * Reads the METADATA table to find SSTable locations and the redo points
    * Reads SSTable indices into memory
    * Reconstructs the memtable by applying all of the updates that have committed since the
redo points
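Recovery then reduces to replaying the sorted log past the redo point, as in the sketch below (reusing the Tablet and LogRecord sketches above; the METADATA and index reads are elided):
{{{
// Hypothetical recovery: rebuild the memtable from redo records newer than the redo point.
import java.util.List;

class RecoverySketch {
    void recover(Tablet t, List<LogRecord> sortedLog, long redoPoint) {
        // SSTable locations and the redo point come from the METADATA table (not shown);
        // SSTable indices would be read into memory here (not shown).
        for (LogRecord r : sortedLog) {
            if (r.table.equals(t.table) && r.sequence > redoPoint) {
                t.memtable.put(r.row, r.value);   // row-range check against the tablet omitted
            }
        }
    }
}
}}}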
-  * Writes
-   * Checks that the request is well-formed
-   * Checks that the sender is authorized (by reading authorization info from Chubby file,
usually a cache hit)
-   * Writes mutation to commit log (has group commit feature to improve performance)
-   * Updates memtable
-  * Reads
-   * Checks that the request is well-formed
-   * Checks that the sender is authorized (by reading authorization info from Chubby file,
usually a cache hit)
-   * Executes read on merged view of SSTables and memtable
-  * Compactions
-   * minor: Writes memtable to SSTable when it reaches a certain size. Writes new redo point
into METADATA table.
-   * merging: Periodically merges a few SSTables and the memtable into one larger SSTable.
This newly generated table may contain deletion entries that suppress deleted data in older
tables.
-   * major: merging compaction that rewrites all SSTables into one SSTable. Contains no deletion
entries
   * Caching
    * Scan Cache: caches the key/value pairs returned by the SSTable interface
    * Block Cache: caches blocks read from the SSTables
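Either cache can be approximated by a simple LRU map (an illustrative sketch only; the real caches are not described in any more detail than above):
{{{
// Illustrative LRU cache in the spirit of the Scan Cache / Block Cache.
import java.util.LinkedHashMap;
import java.util.Map;

class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        super(16, 0.75f, true);   // access-order iteration gives LRU eviction
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict the least recently used entry when over capacity
    }
}
}}}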
@@ -100, +109 @@

    * Per block
    * ''Column family compression?''
   * Can be Memory-mapped
-  * Columns are organized into Locality Groups. Separate SSTable(s) are generated for each
locality group in each tablet.
   * Can be shared by two tablets immediately after a split
   * API
    * Lookup a given key ''(with or without timestamp?)''
    * Iterate over key/value pairs
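Taken together, the requirements above amount to roughly this interface (illustrative shape only, not a real interface):
{{{
// Illustrative shape of the SSTable read API described above.
import java.util.Iterator;
import java.util.Map;

interface SSTable {
    byte[] lookup(byte[] key);                                  // look up a given key
    Iterator<Map.Entry<byte[], byte[]>> scan(byte[] fromKey);   // iterate from a key forward
}
}}}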
  
+ As a first approximation, a Hadoop !MapFile satisfies these requirements. It is a persistent
(lives in the Hadoop DFS), ordered (!MapFile is based on !SequenceFile which is strictly ordered),
immutable (once written, an attempt to open a !MapFile for writing will overwrite the existing
contents) map from keys to values.
+ 
+ Keys and values can be arbitrary byte strings.
+ 
+ Given a key, you can find its value(s). It is possible to iterate over the entire file or
find a key and iterate from that point forward.
+ 
+ It is implemented as a sequence of blocks which are configurable in size.
+ 
+ Unlike SSTable, !MapFile stores its index in a separate file instead of at the end of the
file. This is likely to be more efficient than storing the index at the end of the file and
having to re-write it when the file is being created. Although it uses two files, both are
accessed in a sequential, append-only fashion while the file is being created.
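For concreteness, writing and then reading a !MapFile looks roughly like the snippet below (a sketch assuming the classic, now-deprecated MapFile.Writer/Reader constructors; the DFS path is made up):
{{{
// Rough sketch of using a Hadoop MapFile as the on-disk structure for a tablet.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        String dir = "/hbase/example-sstable";                 // hypothetical DFS directory

        // Write: keys must be appended in sorted order; once closed, the file is immutable.
        MapFile.Writer writer = new MapFile.Writer(conf, fs, dir, Text.class, Text.class);
        writer.append(new Text("row1"), new Text("value1"));
        writer.append(new Text("row2"), new Text("value2"));
        writer.close();

        // Read: look up a given key, then iterate over key/value pairs from that point forward.
        MapFile.Reader reader = new MapFile.Reader(fs, dir, conf);
        Text key = new Text();
        Text value = new Text();
        reader.get(new Text("row1"), value);                   // positions the reader at "row1"
        while (reader.next(key, value)) {                      // continues from that point
            System.out.println(key + " -> " + value);
        }
        reader.close();
    }
}
}}}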
+ 
+ ''Questions have been raised about the suitability of using a !MapFile to implement SSTables:''
+  * ''will the two file implementation put extra stress on the Name Node?''
+  * ''would something else better meet Hbase's requirements?''
+ 
+ ''The consensus to date has been that !MapFile is a good enough approximation to start with,
and that optimization at this point is premature.''
  
  = METADATA Table =
  
@@ -140, +163 @@

  = Configuration / Schema Definition =
  
   * Tablet Size
-  * Column Families
+  * Columns are organized into Locality Groups. Separate SSTable(s) are generated for each
locality group in each tablet.
    * Access Control
    * Garbage Collection (last n, newer than time t)
    * IntegerCounter
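A schema entry covering the options above might be shaped like this (purely illustrative; none of these fields are defined anywhere yet):
{{{
// Hypothetical per-column-family schema descriptor capturing the options listed above.
import java.util.HashSet;
import java.util.Set;

class ColumnFamilySchema {
    String name;
    String localityGroup;        // families in one locality group share SSTables within a tablet
    int maxVersions = 3;         // garbage collection: keep the last n versions
    long maxAgeMillis = -1;      // garbage collection: keep cells newer than time t (-1 = no limit)
    boolean integerCounter;      // treat cell values as atomic integer counters
    Set<String> readers = new HashSet<String>();   // access control: who may read
    Set<String> writers = new HashSet<String>();   // access control: who may write
}
}}}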
