hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hbase/HbaseArchitecture" by JimKellerman
Date Fri, 05 Sep 2008 16:59:27 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by JimKellerman:
http://wiki.apache.org/hadoop/Hbase/HbaseArchitecture

The comment on the change is:
major update

------------------------------------------------------------------------------
- This effort is still a "work in progress". Please feel free to add
- comments, but please make them stand out by bolding or underlining
- them. Thanks!
- 
  = Table of Contents =
  
   * [#intro Introduction]
   * [#datamodel Data Model]
    * [#conceptual Conceptual View]
    * [#physical Physical Storage View]
+  * [#arch Architecture and Implementation]
-  * [#client Client API]
-   * [#scanner Scanner API]
-  * [#hregion HRegion (Tablet) Server]
-  * [#master HBase Master Server]
+   * [#master HBaseMaster]
+   * [#hregion HRegionServer]
+   * [#client HBase Client]
-  * [#metadata META Table]
-  * [#summary Summary]
-  * [#status Current Status]
-  * [#comments Comments]
  
  [[Anchor(intro)]]
  = Introduction =
  
- This document gives a quick overview of HBase, the Hadoop simple
- database. It is extremely similar to Google's Bigtable, with a just a
- few differences. If you understand Bigtable, great. If not, you should
- still be able to understand this document.
+ In order to better understand this document, it is highly recommended that the Google [http://labs.google.com/papers/bigtable.html
Bigtable paper] be read first.
+ 
+ HBase is an [http://apache.org/ Apache] open source project whose goal is to provide Bigtable-like
storage for the Hadoop Distributed Computing Environment. Just as Google's [http://labs.google.com/papers/bigtable.html
Bigtable] leverages the distributed data storage provided by the [http://labs.google.com/papers/gfs.html
Google Distributed File System (GFS)], HBase provides Bigtable-like capabilities on top of
the [http://hadoop.apache.org/core/docs/current/hdfs_design.html Hadoop Distributed File System
(HDFS)].
+ 
+ Data is logically organized into tables, rows and columns.  An iterator-like interface is
available for scanning through a row range and, of course, there is the ability to retrieve
a column value for a specific row key. Any particular column may have multiple versions for
the same row key.
  
  [[Anchor(datamodel)]]
  = Data Model =
  
+ HBase uses a data model very similar to that of Bigtable. Applications store data rows in
labeled tables. A data row has a sortable row key and an arbitrary number of columns. The
table is stored sparsely, so that rows in the same table can have widely varying numbers of
columns. 
- HBase uses a data model very similar to that of Bigtable. Users store
- data rows in labelled tables. A data row has a sortable key and an
- arbitrary number of columns. The table is stored sparsely, so that
- rows in the same table can have crazily-varying columns, if the user
- likes.
  
+ A column name has the form ''"<family>:<label>"'' where <family> and <label>
can be arbitrary byte arrays. A table enforces its set of <family>s (called ''"column
families"''). Adjusting the set of families is done by performing administrative operations
on the table. However, new <label>s can be used in any write operation without pre-announcing
them. HBase stores column families physically close on disk, so the items in a given column
family should have roughly the same read/write characteristics and contain similar data.
- A column name has the form "<family>:<label>" where <family> and <label>
- can be any string you like. A single table enforces its set of
- <family>s (called "column families"). You can only adjust this set of
- families by performing administrative operations on the table. However,
- you can use new <label> strings at any write without preannouncing
- it. HBase stores column families physically close on disk. So the items
- in a given column family should have roughly the same write/read
- behavior.
  
+ Only a single row at a time may be locked by default. Row writes are always atomic, but
it is also possible to lock a single row and perform both read and write operations on that
row atomically.
- Writes are row-locked only. You cannot lock multiple rows at once. All
- row-writes are atomic by default.
  
+ An extension was added recently to allow multi-row locking, but this is not the default
behavior and must be explicitly enabled.
- All updates to the database have an associated timestamp. The HBase
- will store a configurable number of versions of a given cell. Clients
- can get data by asking for the "most recent value as of a certain
- time". Or, clients can fetch all available versions at once.
  
  [[Anchor(conceptual)]]
  == Conceptual View ==
  
  Conceptually a table may be thought of as a collection of rows that
  are located by a row key (and optional timestamp) and where any column
- may not have a value for a particular row key (sparse). The following example is a slightly
modified form of the one on page 2 of the [http://labs.google.com/papers/bigtable.html Bigtable
Paper].
+ may not have a value for a particular row key (sparse). The following example is a slightly
modified form of the one on page 2 of the [http://labs.google.com/papers/bigtable.html Bigtable
Paper] (adds a new column family ''"mime:"'').
  
  [[Anchor(datamodelexample)]]
  ||<:> '''Row Key''' ||<:> '''Time Stamp''' ||<:> '''Column''' ''"contents:"''
||||<:> '''Column''' ''"anchor:"'' ||<:> '''Column''' ''"mime:"'' ||
@@ -71, +49 @@

  [[Anchor(physical)]]
  == Physical Storage View ==
  
+ Although at a conceptual level, tables may be viewed as a sparse set of rows, physically
they are stored on a per-column family basis. This is an important consideration for schema
and application designers to keep in mind.
- Although, at a conceptual level, tables may be viewed as a sparse set
- of rows, physically they are stored on a per-column basis. This is an
- important consideration for schema and application designers to keep
- in mind.
  
- Pictorially, the table shown in the [#datamodelexample conceptual view] above would be stored
as
+ Pictorially, the table shown in the [#datamodelexample conceptual view] above would be stored
as follows:
- follows:
  
  ||<:> '''Row Key''' ||<:> '''Time Stamp''' ||<:> '''Column''' ''"contents:"''
||
  ||<^|3> "com.cnn.www" ||<:> t6 ||<:> "<html>..." ||
@@ -97, +71 @@

  
  [[BR]]
  
+ It is important to note in the diagram above that the empty cells shown in the conceptual
view are not stored, since a column-oriented storage format need not represent them. Thus a request
for the value of the ''"contents:"'' column at time stamp t8 would return no value. Similarly,
a request for an ''"anchor:my.look.ca"'' value at time stamp t9 would return no value.
- It is important to note in the diagram above that the empty cells
- shown in the conceptual view are not stored. Thus a request for the
- value of the ''"contents:"'' column at time stamp ''t8'' would return
- a null value. Similarly, a request for an ''"anchor:"'' value at time
- stamp ''t9'' for "my.look.ca" would return a null value.
  
+ However, if no timestamp is supplied, the most recent value for a particular column is
returned; it is also the first one found, since timestamps are stored in descending
order. Thus a request for the values of all columns in the row "com.cnn.www", with no timestamp
specified, would return: the value of ''"contents:"'' from time stamp t6, the value of ''"anchor:cnnsi.com"''
from time stamp t9, the value of ''"anchor:my.look.ca"'' from time stamp t8 and the value
of ''"mime:"'' from time stamp t6.
- However, if no timestamp is supplied, the most recent value for a
- particular column would be returned and would also be the first one
- found since time stamps are stored in descending order. Consequently
- the value returned for ''"contents:"'' if no time stamp is supplied is
- the value for ''t6'' and the value for an ''"anchor:"''  for
- "my.look.ca" if no time stamp is supplied is the value for time stamp
- ''t8''.
  
- === Example ===
  
- To show how data is stored on disk, consider the following example:
+ === Row Ranges: Regions ===
  
+ To an application, a table appears to be a list of tuples sorted by row key ascending, column
name ascending and timestamp descending.  Physically, tables are broken up into row ranges
called ''regions'' (equivalent Bigtable term is ''tablet''). Each row range contains rows
from start-key (inclusive) to end-key (exclusive). A set of regions, sorted appropriately,
forms an entire table. Unlike Bigtable, which identifies a row range by the table name and
end-key, HBase identifies a row range by the table name and start-key.
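+ 
+ For example (the row keys below are hypothetical), a table broken into three regions might be
laid out as follows; the first region has an empty start-key and the last region has an empty
end-key:
+ 
+ {{{
+ Table "webtable" broken into three regions (start-key inclusive, end-key exclusive):
+   region 1:  ""                ->  "com.cnn.www"
+   region 2:  "com.cnn.www"     ->  "com.weather.www"
+   region 3:  "com.weather.www" ->  ""
+ }}}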
- A program writes rows "row[0-9]", column "anchor:foo"; then writes
- rows "row[0-9]"; column "anchor:bar"; and finally writes rows
- "row[0-9]" column "anchor:foo". After flushing the memcache and
- compacting the store, the contents of the !MapFile would look like:
  
+ Each column family in a region is managed by an ''HStore''. Each HStore may have one or
more ''!MapFiles'' (a Hadoop HDFS file format), which are very similar to Google ''SSTables''.
Like SSTables, !MapFiles are immutable once closed. !MapFiles are stored in the Hadoop HDFS.
Other details are the same, except:
+  * !MapFiles cannot currently be mapped into memory.
+  * !MapFiles maintain the sparse index in a separate file rather than at the end of the
file as SSTable does.
+  * HBase extends !MapFile so that a bloom filter can be employed to enhance negative lookup
performance. The hash function employed is one developed by Bob Jenkins.
- {{{
- row=row0, column=anchor:bar, timestamp=1174184619081
- row=row0, column=anchor:foo, timestamp=1174184620720
- row=row0, column=anchor:foo, timestamp=1174184617161
- row=row1, column=anchor:bar, timestamp=1174184619081
- row=row1, column=anchor:foo, timestamp=1174184620721
- row=row1, column=anchor:foo, timestamp=1174184617167
- row=row2, column=anchor:bar, timestamp=1174184619081
- row=row2, column=anchor:foo, timestamp=1174184620724
- row=row2, column=anchor:foo, timestamp=1174184617167
- row=row3, column=anchor:bar, timestamp=1174184619081
- row=row3, column=anchor:foo, timestamp=1174184620724
- row=row3, column=anchor:foo, timestamp=1174184617168
- row=row4, column=anchor:bar, timestamp=1174184619081
- row=row4, column=anchor:foo, timestamp=1174184620724
- row=row4, column=anchor:foo, timestamp=1174184617168
- row=row5, column=anchor:bar, timestamp=1174184619082
- row=row5, column=anchor:foo, timestamp=1174184620725
- row=row5, column=anchor:foo, timestamp=1174184617168
- row=row6, column=anchor:bar, timestamp=1174184619082
- row=row6, column=anchor:foo, timestamp=1174184620725
- row=row6, column=anchor:foo, timestamp=1174184617168
- row=row7, column=anchor:bar, timestamp=1174184619082
- row=row7, column=anchor:foo, timestamp=1174184620725
- row=row7, column=anchor:foo, timestamp=1174184617168
- row=row8, column=anchor:bar, timestamp=1174184619082
- row=row8, column=anchor:foo, timestamp=1174184620725
- row=row8, column=anchor:foo, timestamp=1174184617169
- row=row9, column=anchor:bar, timestamp=1174184619083
- row=row9, column=anchor:foo, timestamp=1174184620725
- row=row9, column=anchor:foo, timestamp=1174184617169
- }}}
  
- Note that column "anchor:foo" is stored twice (because the timestamp
- differs) and that the most recent timestamp is the first of the two
- entries (so the most recent update is always found first). 
+ [[Anchor(arch)]]
+ = Architecture and Implementation =
+ 
+ There are three major components of the HBase architecture:
+  1. The H!BaseMaster (analogous to the Bigtable master server)
+  2. The H!RegionServer (analogous to the Bigtable tablet server)
+  3. The HBase client, defined by org.apache.hadoop.hbase.client.HTable
+ 
+ Each will be discussed in the following sections.
+ 
+ [[Anchor(master)]]
+ == HBaseMaster ==
+ 
+ The H!BaseMaster is responsible for assigning regions to H!RegionServers. The first region
to be assigned is the ''ROOT region'', which locates all the META regions to be assigned. Each
''META region'' maps a number of user regions, which together comprise the tables that a particular
HBase instance serves. Once all the META regions have been assigned, the master will then
assign user regions to the H!RegionServers, attempting to balance the number of regions served
by each H!RegionServer.
+ 
+ It also holds a pointer to the H!RegionServer that is hosting the ROOT region.
+ 
+ The H!BaseMaster also monitors the health of each H!RegionServer, and if it detects that an H!RegionServer
is no longer reachable, it will split the H!RegionServer's write-ahead log so that there is
now one write-ahead log for each region that the H!RegionServer was serving. After it has
accomplished this, it will reassign the regions that were being served by the unreachable
H!RegionServer.
+ 
+ In addition, the H!BaseMaster is responsible for handling table administrative functions
such as on/off-lining of tables, changes to the table schema (adding and removing column families),
etc.
+ 
+ Unlike Bigtable, when the H!BaseMaster dies the cluster will currently shut down. In Bigtable,
a tablet server can still serve tablets after its connection to the master has died. We tie
them together because we do not currently use an external lock-management system as Bigtable does.
In Bigtable, the master allocates tablets and a lock manager (''Chubby'') guarantees atomic access
by tablet servers to tablets. HBase uses just a single central point for all H!RegionServers
to access: the H!BaseMaster.
+ 
+ === The META Table ===
+ 
+ The META table stores information about every user region in HBase. This includes an H!RegionInfo
object containing information such as the start and end row keys and whether the region is on-line
or off-line, and the address of the H!RegionServer that is currently serving the region.
The META table can grow as the number of user regions grows.
+ 
+ === The ROOT Table ===
+ 
+ The ROOT table is confined to a single region and maps all the regions in the META table.
Like the META table, it contains an H!RegionInfo object for each META region and the location
of the H!RegionServer that is serving that META region.
+ 
+ Each row in the ROOT and META tables is approximately 1KB in size. At the default region
size of 256MB, this means that the ROOT region can map 2.6 x 10^5^ META regions, which in
turn map a total of 6.9 x 10^10^ user regions, corresponding to approximately 1.8 x 10^19^ (2^64^)
bytes of user data.
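+ 
+ Roughly, using the figures above:
+ 
+ {{{
+ rows per region          =  256MB / 1KB per row         ~  2.6 x 10^5
+ META regions (via ROOT)  ~  2.6 x 10^5
+ user regions             ~  (2.6 x 10^5)^2              ~  6.9 x 10^10
+ user data                ~  6.9 x 10^10 x 256MB/region  ~  1.8 x 10^19 bytes  ~  2^64
+ }}}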
+ 
+ [[Anchor(hregion)]]
+ == HRegionServer ==
+ 
+ The H!RegionServer is responsible for handling client read and write requests. It communicates
with the H!BaseMaster to get a list of regions to serve and to tell the master that it is
alive. Region assignments and other instructions from the master piggyback on the heartbeat
messages.
+ 
+ === Write Requests ===
+ 
+ When a write request is received, it is first written to a write-ahead log called an ''HLog''.
All write requests for every region the region server is serving are written to the same log.
Once the request has been written to the HLog, it is stored in an in-memory cache called the
''Memcache''. There is one Memcache for each HStore.
+ 
+ === Read Requests ===
+ 
+ Reads are handled by first checking the Memcache and if the requested data is not found,
the !MapFiles are searched for results.
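+ 
+ The following is a deliberately simplified toy model, not HBase source code, intended only to
make the write and read paths described above concrete: every edit is appended to the single
shared HLog first, is then buffered in the Memcache of the relevant HStore, and reads consult
the Memcache before the flushed !MapFiles.
+ 
+ {{{
+ // Toy model (not HBase source code) of the structures described above: one shared
+ // write-ahead log per region server, one Memcache per HStore, and reads that consult
+ // the Memcache before the flushed MapFiles.
+ import java.util.ArrayList;
+ import java.util.HashMap;
+ import java.util.List;
+ import java.util.Map;
+ import java.util.TreeMap;
+ 
+ public class ToyRegionServer {
+   // One write-ahead log shared by every region this server is serving.
+   private final List<String> hlog = new ArrayList<String>();
+   // One in-memory cache per column family (i.e. per HStore), keyed by "row/column".
+   private final Map<String, TreeMap<String, byte[]>> memcache =
+     new HashMap<String, TreeMap<String, byte[]>>();
+   // Flushed, immutable on-disk files per HStore (newest first), modelled as sorted maps.
+   private final Map<String, List<TreeMap<String, byte[]>>> mapFiles =
+     new HashMap<String, List<TreeMap<String, byte[]>>>();
+ 
+   public void write(String row, String column, byte[] value) {
+     hlog.add(row + "/" + column);                  // 1. record the edit in the HLog first
+     String family = column.substring(0, column.indexOf(':') + 1);
+     TreeMap<String, byte[]> cache = memcache.get(family);
+     if (cache == null) {
+       cache = new TreeMap<String, byte[]>();
+       memcache.put(family, cache);
+     }
+     cache.put(row + "/" + column, value);          // 2. then buffer it in the HStore's Memcache
+   }
+ 
+   public byte[] read(String row, String column) {
+     String family = column.substring(0, column.indexOf(':') + 1);
+     String key = row + "/" + column;
+     TreeMap<String, byte[]> cache = memcache.get(family);
+     if (cache != null && cache.containsKey(key)) {
+       return cache.get(key);                       // check the Memcache first
+     }
+     List<TreeMap<String, byte[]>> files = mapFiles.get(family);
+     if (files != null) {
+       for (TreeMap<String, byte[]> file : files) { // then the MapFiles, newest to oldest
+         if (file.containsKey(key)) {
+           return file.get(key);
+         }
+       }
+     }
+     return null;                                   // sparse: no stored value
+   }
+ }
+ }}}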
+ 
+ === Cache Flushes ===
+ 
+ When the Memcache reaches a configurable size, it is flushed to disk, creating a new !MapFile,
and a marker is written to the HLog so that, when the log is replayed, entries written before the last
flush can be skipped. A flush may also be triggered to relieve memory pressure on the region
server.
+ 
+ Cache flushes happen concurrently with the region server processing read and write requests.
Just before the new !MapFile is moved into place, reads and writes are suspended until the
!MapFile has been added to the list of active !MapFiles for the HStore.
+ 
+ === Compactions ===
+ 
+ When the number of !MapFiles exceeds a configurable threshold, a minor compaction is performed
which consolidates the most recently written !MapFiles. A major compaction is performed periodically
which consolidates all the !MapFiles into a single !MapFile. The reason for not always performing
a major compaction is that the oldest !MapFile can be quite large and reading and merging
it with the latest !MapFiles, which are much smaller, can be very time consuming due to the
amount of I/O involved in reading, merging, and writing the contents of the largest !MapFile.
+ 
+ Compactions happen concurrently with the region server processing read and write requests.
Just before the new !MapFile is moved into place, reads and writes are suspended until the
!MapFile has been added to the list of active !MapFiles for the HStore and the !MapFiles that
were merged to create the new !MapFile have been removed.
+ 
+ === Region Splits ===
+ 
+ When the aggregate size of the !MapFiles for an HStore reaches a configurable size (currently
256MB), a region split is requested. Region splits divide the row range of the parent region
in half and happen very quickly because the child regions read from the parent's !MapFile.

+ 
+ The parent region is taken off-line, the region server records the new child regions in
the META region and the master is informed that a split has taken place so that it can assign
the children to region servers. Should the split message be lost, the master will discover
the split has occurred since it periodically scans the META regions for unassigned regions.
+ 
+ Once the parent region is closed, read and write requests for the region are suspended.
The client has a mechanism for detecting a region split and will wait and retry the request
when the new children are on-line.
+ 
+ When a compaction is triggered in a child, the data from the parent is copied to the child.
When both children have performed a compaction, the parent region is garbage collected.
  
  [[Anchor(client)]]
- = Client API =
+ == HBase Client ==
  
+ The HBase client is responsible for finding H!RegionServers that are serving the particular
row range of interest. On instantiation, the HBase client communicates with the H!BaseMaster
to find the location of the ROOT region. This is the only communication between the client
and the master.
+ 
+ Once the ROOT region is located, the client contacts that region server and scans the ROOT
region to find the META region that will contain the location of the user region that contains
the desired row range. It then contacts the region server that is serving that META region
and scans that META region to determine the location of the user region.
+ 
+ After locating the user region, the client contacts the region server serving that region
and issues the read or write request.
+ 
+ This information is cached in the client so that subsequent requests need not go through
this process. 
+ 
+ Should a region be reassigned either by the master for load balancing or because a region
server has died, the client will rescan the META table to determine the new location of the
user region. If the META region has been reassigned, the client will rescan the ROOT region
to determine the new location of the META region. If the ROOT region has been reassigned,
the client will contact the master to determine the new ROOT region location and will locate
the user region by repeating the original process described above.
+ 
+ === Client API ===
+ 
- See the Javadoc for [http://hadoop.apache.org/hbase/docs/r0.2.0/api/org/apache/hadoop/hbase/client/HTable.html
HTable] and [http://hadoop.apache.org/hbase/docs/r0.2.0/api/org/apache/hadoop/hbase/client/HBaseAdmin.html
HBaseAdmin]
+ See the Javadoc for [http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/client/HTable.html
HTable] and [http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/client/HBaseAdmin.html
HBaseAdmin]
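+ 
+ The following sketch shows a basic single-row update and read. It is illustrative only: apart
from the HTable constructor, the class and method names used below (BatchUpdate, Cell, commit,
get) are recalled from the 0.2-era client API and should be verified against the Javadoc above.
+ 
+ {{{
+ // Illustrative sketch only -- check class and method names against the HTable Javadoc above.
+ import org.apache.hadoop.hbase.HBaseConfiguration;
+ import org.apache.hadoop.hbase.client.HTable;
+ import org.apache.hadoop.hbase.io.BatchUpdate;
+ import org.apache.hadoop.hbase.io.Cell;
+ 
+ public class ClientSketch {
+   public static void main(String[] args) throws Exception {
+     HTable table = new HTable(new HBaseConfiguration(), "webtable");
+ 
+     // Write: batch one or more column updates for a single row; the commit is atomic per row.
+     BatchUpdate update = new BatchUpdate("com.cnn.www");
+     update.put("anchor:cnnsi.com", "CNN".getBytes());
+     table.commit(update);
+ 
+     // Read: fetch the most recent version of a single column for a row.
+     Cell cell = table.get("com.cnn.www", "anchor:cnnsi.com");
+     byte[] value = (cell == null) ? null : cell.getValue();
+   }
+ }
+ }}}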
  
  
- [[Anchor(scanner)]]
- == Scanner API ==
+ ==== Scanner API ====
  
- To obtain a scanner, a Cursor-like row 'iterator' that must be closed, [http://hadoop.apache.org/hbase/docs/r0.2.0/api/org/apache/hadoop/hbase/client/HTable.html#HTable(org.apache.hadoop.hbase.HBaseConfiguration,%20java.lang.String)
instantiate an HTable], and then invoke ''getScanner''.  This method returns an [http://hadoop.apache.org/hbase/docs/r0.2.0/api/org/apache/hadoop/hbase/client/Scanner.html
Scanner] against which you call [http://hadoop.apache.org/hbase/docs/r0.2.0/api/org/apache/hadoop/hbase/client/Scanner.html#next()
next] and ultimately [http://hadoop.apache.org/hbase/docs/r0.2.0/api/org/apache/hadoop/hbase/client/Scanner.html#close()
close].
+ To obtain a scanner (a cursor-like row iterator that must be closed), [http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/client/HTable.html#HTable(org.apache.hadoop.hbase.HBaseConfiguration,%20java.lang.String)
instantiate an HTable], and then invoke ''getScanner''.  This method returns an [http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/client/Scanner.html
Scanner] against which you call [http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/client/Scanner.html#next()
next] and ultimately [http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/client/Scanner.html#close()
close].
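+ 
+ The following sketch shows this scanner flow end to end. It is illustrative only: the
''getScanner'' arguments and the type returned by ''next'' are assumptions and should be verified
against the Javadoc links above.
+ 
+ {{{
+ // Illustrative sketch only -- the getScanner arguments and the type returned by next() are
+ // assumptions; only the HTable constructor, getScanner, next and close are taken from the
+ // Javadoc links above.
+ import org.apache.hadoop.hbase.HBaseConfiguration;
+ import org.apache.hadoop.hbase.client.HTable;
+ import org.apache.hadoop.hbase.client.Scanner;
+ import org.apache.hadoop.hbase.io.RowResult;
+ 
+ public class ScanSketch {
+   public static void main(String[] args) throws Exception {
+     HTable table = new HTable(new HBaseConfiguration(), "webtable");
+     Scanner scanner = table.getScanner(new String[] { "anchor:" });  // scan one column family
+     try {
+       for (RowResult row = scanner.next(); row != null; row = scanner.next()) {
+         // ... examine the row key and returned cells here ...
+       }
+     } finally {
+       scanner.close();  // a scanner holds server-side resources and must always be closed
+     }
+   }
+ }
+ }}}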
  
- [[Anchor(hregion)]]
- = HRegion (Tablet) Server =
- 
- To the user, a table seems like a list of data tuples, sorted by row
- key. Physically, tables are broken into HRegions. An HRegion is
- identified by its tablename plus a start/end-key pair. A given HRegion
- with keys <start> and <end> will store all the rows from [<start>,
- <end>]. A set of HRegions, sorted appropriately, forms an entire
- table.
- 
- All data is physically stored using Hadoop's DFS. Data is served to
- clients by a set of H!RegionServers, usually one per machine. A given
- HRegion is served by only one H!RegionServer at a time.
- 
- When a client wants to make updates, it contacts the relevant
- H!RegionServer and commits the update to an HRegion. Upon commit, the
- data is added to the HRegion's HMemcache and to the H!RegionServer's
- HLog. The HMemcache is a memory buffer that stores and serves the
- most-recent updates. The HLog is an on-disk log file that tracks all
- updates. The commit() call will not return to the client until the
- update has been written to the HLog.
- 
- When serving data, the HRegion will first check its HMemcache. If not
- available, it will then check its on-disk HStores. There is an HStore
- for each column family in an HRegion. An HStore might consist of
- multiple on-disk H!StoreFiles. Each H!StoreFile is a B-Tree-like
- structure that allow for relatively fast access.
- 
- Periodically, we invoke HRegion.flushcache() to write the contents of
- the HMemcache to an on-disk HStore's files. This adds a new H!StoreFile
- to each HStore. The HMemcache is then emptied, and we write a special
- token to the HLog, indicating the HMemcache has been flushed.
- 
- On startup, each HRegion checks to see if there have been any writes
- to the HLog since the most-recent invocation of flushcache(). If not,
- then all relevant HRegion data is reflected in the on-disk HStores. If
- yes, the HRegion reconstructs the updates from the HLog, writes them
- to the HMemcache, and then calls flushcache(). Finally, it deletes the
- HLog and is now available for serving data.
- 
- Thus, calling flushcache() infrequently will be less work, but
- HMemcache will consume more memory and the HLog will take a longer
- time to reconstruct upon restart. If flushcache() is called
- frequently, the HMemcache will take less memory, and the HLog will be
- faster to reconstruct, but each flushcache() call imposes some
- overhead.
- 
- The HLog is periodically rolled, so it consists of multiple
- time-sorted files. Whenever we roll the HLog, the HLog will delete all
- old log files that contain only flushed data. Rolling the HLog takes
- very little time and is generally a good idea to do from time to time.
- 
- Each call to flushcache() will add an additional H!StoreFile to each
- HStore. Fetching a value from an HStore can potentially access all of
- its H!StoreFiles. This is time-consuming, so we want to periodically
- compact these H!StoreFiles into a single larger one. This is done by
- calling HStore.compact().
- 
- Compaction is an expensive operation that runs in background.  Its triggered when the number
of H!StoreFiles cross a configurable threshold.
- 
- The Google Bigtable paper has a slightly-confusing hierarchy of major
- and minor compactions. We have just two things to keep in mind:
- 
-  1. A "flushcache()" drives all updates out of the memory buffer into on-disk structures.
Upon flushcache, the log-reconstruction time goes to zero. Each flushcache() will add a new
H!StoreFile to each HStore.
- 
-  1. a "compact()" consolidates all the H!StoreFiles into a single one.
- 
- Unlike Bigtable, Hadoop's HBase allows no period where updates have
- been "committed" but have not been written to the log. This is not
- hard to add, if it's really wanted.
- 
- We can merge two HRegions into a single new HRegion by calling HRegion.merge(HRegion, HRegion).
   
- 
- When a region grows larger than a configurable size, HRegion.splitRegion() is called on
the region server.  Two new regions are created by dividing the parent region.  The new regions
are reported to the master for it to rule which region server should host each of the daughter
splits.  The division is pretty fast mostly because the daughter regions hold references to
the parent's H!StoreFiles -- one to the top half of the parent's H!StoreFiles, and the other
to the bottom half.  While the references are in place, the parent region is marked ''offline''
and hangs around until compactions in the daughters cleans up all parent references at which
time the parent is removed.
- 
- OK, to sum up so far:
- 
-  1. Clients access data in tables.
-  1. tables are broken into HRegions.
-  1. HRegions are served by H!RegionServers. Clients contact an H!RegionServer to access
the data within its row-range.
-  1. HRegions store data in:
-   a. HMemcache, a memory buffer for recent writes
-   a. HLog, a write-log for recent writes
-   a. HStores, an efficient on-disk set of files. One per col-group.
- 
- [[Anchor(master)]]
- = HBase Master Server =
- 
- Each H!RegionServer stays in contact with the single HMaster. The
- HMaster is responsible for telling each H!RegionServer what
- HRegions it should load and make available.
- 
- The HMaster keeps a constant tally of which H!RegionServers are
- alive at any time. If the connection between an H!RegionServer and the
- HMaster times out, then:
- 
-  a. The H!RegionServer kills itself and restarts in an empty state.
-  a. The HMaster assumes the H!RegionServer has died and reallocates its HRegions to other
H!RegionServers
- 
- Note that this is unlike Google's Bigtable, where a !TabletServer can
- still serve Tablets after its connection to the Master has died. We
- tie them together, because we do not use an external lock-management
- system like Bigtable. With Bigtable, there's a Master that allocates
- tablets and a lock manager (Chubby) that guarantees atomic access by
- !TabletServers to tablets. HBase uses just a single central point for
- all H!RegionServers to access: the HMaster.
- 
- (This is no more dangerous than what Bigtable does. Each system is
- reliant on a network structure (whether HMaster or Chubby) that
- must survive for the data system to survive. There may be some
- Chubby-specific advantages, but that's outside HBase's goals right
- now.)
- 
- As H!RegionServers check in with a new HMaster, the HMaster
- asks each H!RegionServer to load in zero or more HRegions. When the
- H!RegionServer dies, the HMaster marks those HRegions as
- unallocated, and attempts to give them to different H!RegionServers.
- 
- Recall that each HRegion is identified by its table name and its
- key-range. Since key ranges are contiguous, and they always start and
- end with NULL, it's enough to simply indicate the start-key.
- 
- Unfortunately, this is not quite enough. Because of merge() and
- split(), we may (for just a moment) have two quite different HRegions
- with the same name. If the system dies at an inopportune moment, both
- HRegions may exist on disk simultaneously. The arbiter of which
- HRegion is "correct" is the HBase meta-information (to be discussed
- shortly). In order to distinguish between different versions of the
- same HRegion, we also add a unique 'regionId' to the HRegion name.
- 
- Thus, we finally get to this identifier for an HRegion: ''tablename + startkey + regionId''
Here's an example where the table is named ''hbaserepository'', the start key is ''w-nk5YNZ8TBb2uWFIRJo7V==''
and the region id is ''6890601455914043877'': ''hbaserepository,w-nk5YNZ8TBb2uWFIRJo7V==,6890601455914043877''
- 
- [[Anchor(metadata)]]
- = META Table =
- 
- We can also use this identifier as a row-label in a different
- HRegion. Thus, the HRegion meta-info is itself stored in an
- HRegion. We call this table, which maps from HRegion identifiers to
- physical H!RegionServer locations, the META table.
- 
- The META table itself can grow large, and may be broken into separate
- HRegions. To locate all components of the META table, we list all META
- HRegions in a ROOT table. The ROOT table is always contained in a
- single HRegion.
- 
- Upon startup, the HMaster immediately attempts to scan the ROOT
- table (because there is only one HRegion for the ROOT table, that
- HRegion's name is hard-coded). It may have to wait for the ROOT table
- to be allocated to an H!RegionServer.
- 
- Once the ROOT table is available, the HMaster scans it and
- learns of all the META HRegions. It then scans the META table. Again,
- the HMaster may have to wait for all the META HRegions to be
- allocated to different H!RegionServers.
- 
- Finally, when the HMaster has scanned the META table, it knows the
- entire set of HRegions. It can then allocate these HRegions to the set
- of H!RegionServers.
- 
- The HMaster keeps the set of currently-available H!RegionServers in
- memory. Since the death of the HMaster means the death of the
- entire system, there's no reason to store this information on
- disk. 
- 
- Unlike Bigtable, which stores information about Tablet->!TabletServer
- mapping in Chubby, Google's distributed lock server, all information
- about the HRegion->H!RegionServer mapping is stored physically in the
- META table (since there is no equivalent to Chubby in the Hadoop
- environment).
- 
- Consequently each row in the META and ROOT tables has three members of
- the "info:" column family:
- 
-  1. '''info:regioninfo''' contains a serialized [http://hadoop.apache.org/hbase/docs/r0.2.0/api/org/apache/hadoop/hbase/HRegionInfo.html
HRegionInfo object]
-  1. '''info:server''' contains a serialized string which is the output from [http://hadoop.apache.org/hbase/docs/r0.2.0/api/org/apache/hadoop/hbase/HServerAddress.html#toString()
HServerAddress.toString()]. This string can be supplied to one of the [http://hadoop.apache.org/hbase/docs/r0.2.0/api/org/apache/hadoop/hbase/HServerAddress.html#HServerAddress(java.lang.String)
HServerAddress constructors].
-  1. '''info:startcode''' a serialized long integer generated by the H!RegionServer when
it starts. The H!RegionServer sends this start code to the master so the master can determine
if the server information in the META and ROOT regions is stale.
- 
- Thus, a client does not need to contact the HMaster after it learns
- the location of the ROOT HRegion. The load on HMaster should be
- relatively small: it deals with timing out H!RegionServers, scanning
- the ROOT and META upon startup, and serving the location of the ROOT
- HRegion.
- 
- The HBase client is fairly complicated, and often needs to navigate the
- ROOT and META HRegions when serving a user's request to scan a
- specific user table. If an H!RegionServer is unavailable or it does not
- have an HRegion it should have, the client will wait and retry. At
- startup or in case of a recent H!RegionServer failure, the correct
- mapping info from HRegion to H!RegionServer may not always be
- available.
- 
- [[Anchor(summary)]]
- = Summary =
- 
-  1. H!RegionServers offer access to HRegions (an HRegion lives at one H!RegionServer)
-  1. H!RegionServers check in with the HMaster
-  1. If the HMaster dies, the whole system dies
-  1. The set of current H!RegionServers is known only to the HMaster
-  1. The mapping between HRegions and H!RegionServers is stored in two special HRegions,
which are allocated to H!RegionServers like any other.
-  1. The ROOT HRegion is a special one, the location of which the HMaster always knows.
-  1. It's the client's responsibility to navigate all this.
- 
- [[Anchor(status)]]
- = Current Status =
- 
- As of this writing (2008/08/22), there are approximately 79,500 lines of code in 
- "hbase/trunk/src/java/org/apache/hadoop/hbase/" directory on the Hadoop SVN trunk.
- 
- Also included in this directory are approximately 14,000 lines of test cases.
- 
- Issues and TODOs:
-  1. Modify RPC and amend how hbase uses RPC.  In particular, some messages will use the
current RPC mechanism and others will use raw sockets.  Raw sockets will be used for sending
the data of some serialized objects; the idea is that we can hard-code the creation/interpretation
of the serialized data and avoid introspection.
-  1. Vuk Ercegovac [[MailTo(vercego AT SPAMFREE us DOT ibm DOT com)]] of IBM Almaden Research
pointed out that keeping HBase HRegion edit logs in HDFS is currently flawed.  HBase writes
edits to logs and to a memcache.  The 'atomic' write to the log is meant to serve as insurance
against abnormal !RegionServer exit: on startup, the log is rerun to reconstruct an HRegion's
last wholesome state. But files in HDFS do not 'exist' until they are cleanly closed -- something
that will not happen if !RegionServer exits without running its 'close'.  This issue will
be addressed in HBase-0.19.0.
-  1. The HMemcache lookup structure is relatively inefficient
- 
- Release News:
-  * HBase-0.2.0 was released in early August and runs on Hadoop-0.17.1.
-  * Work is underway on releasing HBase-0.2.1 for Hadoop-0.17.2.1, and HBase-0.18.0 for Hadoop-0.18.
 Looking farther ahead, HBase-0.19.0 will run with Hadoop-0.19.  Note that the HBase release
numbering schema has changed to align with the version of Hadoop with which it runs: 0.3.x
will be called 0.18.x, and 0.4.x will be called 0.19.x.
- 
- See [https://issues.apache.org/jira/browse/HBASE?report=com.atlassian.jira.plugin.system.project:roadmap-panel&subset=3
hbase issues] for list of whats being currently worked on.
- 
- [[Anchor(comments)]]
- = Comments =
- 
- Please add comments about the architecture below. In the future, as this page grows too
big, it will be split into multiple sub-pages based on the architectural component. Applicable
comments will then be moved to that page. At that point, comments on this page should be related
to an overall architectural issue or one that spans multiple components. Thank you.
- 
- ----
- [[Anchor(rdf)]]
- === It is not Row-Oriented. ===
- 
- by [wiki:udanax Udanax] [[MailTo(webmaster AT SPAMFREE udanax DOT org)]] 2007/02/06
- 
- I think Hbase should be compact (space-efficient), fast and should be able to manage high-demand
load. It should be able to handle sparse tables efficiently. So, for wide and sparse data,
Hbase must store data by columns like C-Store does.
- 
-  ''I agree. See the sections on the [#conceptual conceptual data model] and the [#physical
physical data model]. -- JimKellerman 2007/05/30''
- 
- A column-oriented system handles NULLs more easily with significantly smaller performance
overhead,
- and supports both Horizontal and Vertical Parallel Processing.
- 
-  ''Bigtable (and Hbase) do not store nulls. If there is no value for a particular key, then
an empty or null value will be returned -- JimKellerman 2007/05/30''
- 
- Let's consider the following case:
- You may be familiar to RDF(Resource Description Framework) Storage from W3C, which is
- 
-  * Storing and managing very large amounts of structured data
-  * Row/column space can be sparse
-  * Columns are in the form of (family: optional qualifier). This is a RDF Properties 
-  * Columns have type information  
- 
-   ''In both Bigtable, and Hbase, there is no notion of type. Keys and values in Bigtable
are arbitrary strings. In Hbase, values are an arbitrary byte array. -- JimKellerman 2007/05/30''
- 
-  * Because of the design of the system, columns are easy to create (and are created implicitly)

- 
-   ''In Bigtable, column families are easy to create but they require administration priviliges
(Access Control Lists control who can manipulate the schema. New column family members can
be created at any time. Hbase follows this metaphor. -- JimKellerman 2007/05/30''
- 
-  * Column families can be split into locality groups (Ontologies!) 
- 
- Let's assume a large amount of RDF documents are stored in the system.
- And then, vertical(column) data set by one of RDF properties can be read fast from Table,
because it is column-stored.
- Please let me know if you don't agree with me.
- 
