Return-Path: Delivered-To: apmail-lucene-hadoop-commits-archive@locus.apache.org Received: (qmail 20129 invoked from network); 31 Jan 2007 23:22:47 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.2) by minotaur.apache.org with SMTP; 31 Jan 2007 23:22:47 -0000 Received: (qmail 59831 invoked by uid 500); 31 Jan 2007 23:22:54 -0000 Delivered-To: apmail-lucene-hadoop-commits-archive@lucene.apache.org Received: (qmail 59814 invoked by uid 500); 31 Jan 2007 23:22:54 -0000 Mailing-List: contact hadoop-commits-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: hadoop-dev@lucene.apache.org Delivered-To: mailing list hadoop-commits@lucene.apache.org Received: (qmail 59805 invoked by uid 99); 31 Jan 2007 23:22:53 -0000 Received: from herse.apache.org (HELO herse.apache.org) (140.211.11.133) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 31 Jan 2007 15:22:53 -0800 X-ASF-Spam-Status: No, hits=-0.0 required=10.0 tests=SPF_HELO_PASS X-Spam-Check-By: apache.org Received: from [140.211.11.130] (HELO eos.apache.org) (140.211.11.130) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 31 Jan 2007 15:22:46 -0800 Received: from eos.apache.osuosl.org (localhost [127.0.0.1]) by eos.apache.org (Postfix) with ESMTP id C18C75A24F for ; Wed, 31 Jan 2007 23:22:26 +0000 (GMT) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit From: Apache Wiki To: hadoop-commits@lucene.apache.org Date: Wed, 31 Jan 2007 23:22:26 -0000 Message-ID: <20070131232226.14019.8628@eos.apache.osuosl.org> Subject: [Lucene-hadoop Wiki] Update of "Hbase/HbaseArchitecture" by JimKellerman X-Virus-Checked: Checked by ClamAV on apache.org Dear Wiki user, You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification. The following page has been changed by JimKellerman: http://wiki.apache.org/lucene-hadoop/Hbase/HbaseArchitecture The comment on the change is: initial revision, more to come... ------------------------------------------------------------------------------ + = Master Node = - ## page was renamed from HbaseArchitecture - Describe HbaseArchitecture here. To be supplied. + * Handles table creation and deletion + * Responsible for assigning tablets to tablet servers + * Detects the addition and expiration of tablet servers + * Balances tablet server load + * Garbage collects files (SSTables) in GFS by mark-and-sweep + * Handles schema changes, such as the addition of Column families + * Keeps track of the set of live tablet servers + * Keeps current assignment of tablets to tablet servers, including those that are unassigned + * Assigns unassigned tablets to tablet servers with sufficient room + * Polls the /servers directory to discover new tablet servers + * Regularly pings tablet servers for the status of their lock + * If it can't contact a tablet server or it reports that it lost it's lock, the Master will acquire the tablet server's lock on it's /servers Chubby file, and if successful, will delete it. + * Initiates tablet merges ''(when/how does it know to do this?)'' + * Startup + * Acquires unique "master" Chubby lock + * Scans /servers directory in Chubby to find live tablet servers + * Communicates with all tablet servers to discover tablet assignment + * Scans METADATA table to find all tablets and adds those that have not been assigned to the set of unassigned tablets + + + = Chubby = + + * Distributed Lock Service + * Provides a namespace of directories and small files + * Reads and writes to a file are atomic + * Locking + * Both directories and files can be used as locks + * Chubby client maintains a "session" with a Chubby service, which expires if it is unable to renew it's session lease with ''SessionTimeout'' seconds. + * Chubby client provides consistent caching of Chubby files + * Uses + 1. To ensure that there is at most one active master at any time + 1. To store bootstrap location of Bigtable data + 1. To discover tablet servers and finalize tablet server death + 1. To store Bigtable schema information + 1. To store access control lists + + + = Tablet Server = + + * Manages and serves a set of Tablets (between 10 and 1000). A tablet is a row range of the table sorted in lexographical order. Tablets comprise two types of data structures: One or more on-disk structure called an SSTable, and one or more in-memory data structures called memtable + * Tablet size is 100-200MB by default + * Memtable rows are marked Copy-on-write during reads to allow writes to happen in parallel + * Handles read/write requests to tablets + * Splits tablets that have grown too large + * Initiates the split by recording information for the new tablet in the METADATA table + * Notifies the master of the split + + ''So if a METADATA tablet splits, that would imply that the root tablet needs to be updated.'' + + * Can be dynamically added or removed + * Clients communicate directly with tablet servers + * Announces it's existence by creating a uniquely named file in the /servers Chubby directory + * Stops serving it's tablets and kills itself if it cannot renew the lease on it's /servers file + * Commit Log + * Stores redo records + * Contains redo records for all tablets managed by tablet server + * Key consists of + * To speed recovery when a tablet server dies, the log is sorted by key. This sort is done by breaking the log into 64MB chunks and is done in parallel on different tablet servers. The sort is managed by the Master. + * Two logs are kept, one active and one inactive. When writing to one log becomes slow, a log sequence number is incremented, and the other log is switched to. During recovery, both logs are sorted together and the sequence number is used to elided duplicated entries. + * Moving tablets. When a tablet is moved from one server to another, the tablet server does a compaction prior to the move to speed up tablet recovery. + * SSTables for a tablet are registered in the METADATA table + * Tablet recovery + * Reads the METADATA table to find SSTable locations and the redo points + * Reads SSTable indices into memory + * Reconstructs the memtable by applying all of the updates that have committed since the redo points + * Writes + * Checks that the request is well-formed + * Checks that the sender is authorized (by reading authorization info from Chubby file, usually a cache hit) + * Writes mutation to commit log (has group commit feature to improve performance) + * Updates memtable + * Reads + * Checks that the request is well-formed + * Checks that the sender is authorized (by reading authorization info from Chubby file, usually a cache hit) + * Executes read on merged view of SSTables and memtable + * Compactions + * minor: Writes memtable to SSTable when it reaches a certain size. Writes new redo point into METADATA table. + * merging: Periodically merges a few SSTables and the memtable into one larger SSTable. This newly generated table may contain deletion entries that suppress deleted data in older tables. + * major: merging compaction that rewrites all SSTables into one SSTable. Contains no deletion entries + * Caching + * Scan Cache: caches the key/value pairs returned by the SSTable interface + * Block Cache: caches blocks read from the SSTables + * Bloom Filter + * Optional in-memory structure that reduces disk access by + * API + {{{ + LoadTablet() + }}} + + = SSTable = + + * Immutable (write-once) + * Sorted list of key/value pairs. + * Sequence of 64KB blocks + * Block index (presumably consists of the start keys for each block) + * Compression + * Per block + * ''Column family compression?'' + * Can be Memory-mapped + * Columns are organized into Locality Groups. Separate SSTable(s) are generated for each locality group in each tablet. + * Can be shared by two tablets immediately after a split + * API + * Lookup a given key ''(with or without timestamp?)'' + * Iterate over key/value pairs + + + = METADATA Table = + + * Three-level heirarchy + * Chubby file contains the location of the "root tablet" + * Root tablet stores the location of all tablets in a METADATA table + * The "root tablet" is just the first tablet in the METADATA table, it is never split + + ''I really don't like this definition (and I know it is from the Bigtable paper). Aside from the fact that it is never split, the root tablet is special in another way: it is the metadata table for the METADATA table.'' + + ''Suppose you had a table with one column and 4x10^9^ rows. Each row contains about 5KB of data resulting in a total table size of 200TB.'' + + ''If each tablet holds about 100MB of data, this will require 2x10^6^ tablets and the same number of rows in the METADATA table.'' + + ''If each metadata row is about 1KB, the METADATA table size required to map all the tablets is 2x10^9^ bytes. If each METADATA tablet is 100MB that requires 20 METADATA tablets to map the entire table, and consequently 20 root tablet rows to map the METADATA tablets.'' + + * Stores the location of a tablet under a row key that is an encoding of the tablet's table ID and it's end row + * The "location" column family is in it's own locality group and has the ''InMemory'' tuning parameter set + * Each row stores approximately 1KB of data in memory + * All events pertaining to each tablet are logged here (such as when a tablet server starts serving a tablet) + * ["Schema"] + + + = Client Library = + + * Caches tablet locations + * If can't find locaiton, recurses up the heirarchy + * Contacts Chubby directly to find root tablet + * Client library pre-fetches tablet locations by reading metadata for more than one tablet whenever it reads the METADATA table. + + + = Configuration / Schema Definition = + + * Tablet Size + * Column Families + * Access Control + * Garbage Collection (last n, newer than time t) + * IntegerCounter + * Locality Groups + * In Memory tuning parameter + * Use Bloom Filter + + = API = + + {{{ + CreateTable() + ChangeColumnFamilyMetadata(name=ACL, value=foo) + Scanner + FetchColumnFamily + Lookup + RowName + + ScanStream + SetReturnAllVersions + Next + Done + ColumnName + MicroTimestamp + Value + + WriteStream? + Batch interface + Increment IntegerCounter? + }}} + + = Other = + + * Map/Reduce connector + * Client Sawzall script execution in Tablet server space +