hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-hadoop Wiki] Update of "Hbase/HbaseArchitecture" by JimKellerman
Date Wed, 28 Feb 2007 22:34:07 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by JimKellerman:
http://wiki.apache.org/lucene-hadoop/Hbase/HbaseArchitecture

------------------------------------------------------------------------------
- This effort is still a "work in progress". Please feel free to add comments, but please make them stand out by bolding or underlining them. Thanks!
+ This effort is still a "work in progress". Please feel free to add
+ comments, but please make them stand out by bolding or underlining
+ them. Thanks!
+ 
+ '''NOTE:''' This document has been replaced by the contents of the
+ README file provided by Michael Cafarella along with an initial code
+ base that is attached to
+ [http://issues.apache.org/jira/browse/HADOOP-1045 Hadoop Jira Issue 1045]
+ 
+ Where appropriate, portions of the old document will be merged into
+ this document in the future.
  
  = Table of Contents =
  
+  * [#intro Introduction]
   * [#datamodel Data Model]
+  * [#hregion HRegion (Tablet) Server]
+  * [#master HBase Master Server]
-   * [#columnvaluetypes Column Value Types]
-   * [#conceptual Conceptual View]
-   * [#physical Physical Storage View]
-  * [#schema Schema Definition / Configuration]
-  * [#chubby Distributed Lock Server]
-  * [#masternode Master Node]
-  * [#tabletserver Tablet Server]
-  * [#sstable SSTable]
-  * [#metadata METADATA Table]
+  * [#metadata META Table]
+  * [#summary Summary]
+  * [#status Current Status]
-  * [#ipc Inter-process Communication Messages]
-   * [#wireformat IPC Message Wire Format]
-   * [#ipcclientmaster Client to Master Messages]
-   * [#ipcclienttablet Client to Tablet Server Messages]
-   * [#ipcmastertablet Master to Tablet Server Messages]
-   * [#ipctabletmaster Tablet Server to Master Messages]
-  * [#clientlib Client Library]
-   * [#api API]
-    * [#accesscontrol Access Control Entries]
-    * [#columnmeta Column Metadata]
-    * [#adminapi Administrative API]
-    * [#dataapi Client Data Access API]
-  * [#other Other]
   * [#comments Comments]
+ 
+ [[Anchor(intro)]]
+ = Introduction =
+ 
+ This document gives a quick overview of HBase, the Hadoop simple
+ database. It is extremely similar to Google's Bigtable, with just a
+ few differences. If you understand Bigtable, great. If not, you should
+ still be able to understand this document.
  
  [[Anchor(datamodel)]]
  = Data Model =
  
- A Hbase table is a sparse, distributed, persistent, multi-dimensional
- sorted map. The map is indexed by a row key, column key, and a
- timestamp. Each value in the map is an uninterpreted array of bytes.
+ HBase uses a data model very similar to that of Bigtable. Users store
+ data rows in labelled tables. A data row has a sortable key and an
+ arbitrary number of columns. The table is stored sparsely, so that
+ rows in the same table can have crazily-varying columns, if the user
+ likes.
  
- (row:string, column:string, time:long) -> byte[]
+ A column name has the form "<group>:<label>", where <group> and
+ <label> can be arbitrary strings. A single table enforces its set of
+ <group>s (called "column groups"). You can only adjust this set of
+ groups by performing administrative operations on the table. However,
+ you may use new <label> strings in any write without announcing them
+ in advance. HBase stores the members of a column group physically
+ close together on disk, so the items in a given column group should
+ have roughly the same read/write characteristics.
  
- [[Anchor(columnvaluetypes)]]
- == Column Value Types ==
+ Writes are row-locked only. You cannot lock multiple rows at once. All
+ row-writes are atomic by default.
  
- A column may have a single value for a specified row key or it may
- have a map of key value pairs. The former is called a ''value column''
- or '''column''' for short, the latter is called a ''map column'' or
- '''map''' for short.
+ All updates to the database have an associated timestamp. HBase will
+ store a configurable number of versions of a given cell. Clients
+ can get data by asking for the "most recent value as of a certain
+ time". Or, clients can fetch all available versions at once.
  
- There are two types of ''column'':
-  1. value is arbitrary sequence of bytes
-  1. value is an integer counter. We refer to this type of column as a '''counter'''
- 
- Google makes no distinction between these two value types and groups
- them under the term ''column family''. They achieve the single valued
- column as a degenerate case of a column family. A single valued column
- has no column key in Bigtable.
- 
- In the general case, Google allows arbitrary keys in a column
- family. However, they also provide a specialization called a
- ''locality group'' in which the column keys are limited to a specific
- enumerated set. In the example given on page 6 of the
- [http://labs.google.com/papers/bigtable.html Bigtable Paper], they
- define a locality group that contains web page metadata and has
- specific keys for language and checksums.
- 
- We feel that this is an unnecessary complication of the platform, and
- will support '''columns''' and '''maps''' only. Should a client
- application desire to implement a ''locality group'' it can do so by
- simply restricting its map column key set.
- 
- We use the terms '''column''', '''map''' and '''counter''' throughout the rest of the document for consistency.
- 
- [[Anchor(conceptual)]]
- == Conceptual View ==
- 
- Conceptually, a table may be thought of as a collection of rows that
- are located by a row key (and optional timestamp) and where any column
- may not have a value for a particular row key (sparse). The following example is a slightly modified form of the one on page 2 of the [http://labs.google.com/papers/bigtable.html Bigtable Paper].
- 
- [[Anchor(datamodelexample)]]
- ||<:|2> '''Row Key''' ||<:|2> '''Time Stamp''' ||<:|2> '''Column''' ''"contents"'' |||| '''Map''' ''"anchor"'' ||<:|2> '''Column''' ''"mime"'' ||
- ||<:> '''key''' ||<:> '''value''' ||
- ||<^|5> "com.cnn.www" ||<:> t9 || ||<)> "cnnsi.com" ||<:> "CNN" || ||
- ||<:> t8 || ||<)> "my.look.ca" ||<:> "CNN.com" || ||
- ||<:> t6 ||<:> "<html>..." || || ||<:> "text/html" ||
- ||<:> t5 ||<:> `"<html>..."` || || || ||
- ||<:> t3 ||<:> `"<html>..."` || || || ||
- 
- [[Anchor(physical)]]
- == Physical Storage View ==
- 
- Although, at a conceptual level, tables may be viewed as a sparse set
- of rows, physically they are stored on a per-column basis. This is an
- important consideration for schema and application designers to keep
- in mind.
- 
- Scanning through a range of key values for a particular column will
- always be much faster than accessing the values for each column for a
- given row key. Consequently, values that will be used together should
- either be encoded together into a single column value or a map
- should be considered for grouping values.
- 
- Pictorially, the table shown in the [#datamodelexample conceptual view] above would be stored as
- follows:
- 
- ||<:> '''Row Key''' ||<:> '''Time Stamp''' ||<:> '''Column''' ''"contents'' ||
- ||<^|3> "com.cnn.www" ||<:> t6 ||<:> "<html>..." ||
- ||<:> t5 ||<:> `"<html>..."` ||
- ||<:> t3 ||<:> `"<html>..."` ||
- 
- [[BR]]
- 
- ||<:|2> '''Row Key''' ||<:|2> '''Time Stamp''' |||| '''Map''' ''"anchor"'' ||
- ||<:> '''key''' ||<:> '''value''' ||
- ||<^|2> "com.cnn.www" ||<:> t9 ||<)> "cnnsi.com" ||<:> "CNN" ||
- ||<:> t8 ||<)> "my.look.ca" ||<:> "CNN.com" ||
- 
- [[BR]]
- 
- ||<:> '''Row Key''' ||<:> '''Time Stamp''' ||<:> '''Column''' ''"mime"'' ||
- || "com.cnn.www" ||<:> t6 ||<:> "text/html" ||
- 
- [[BR]]
- 
- It is important to note in the diagram above that the empty cells
- shown in the conceptual view are not stored. Thus a request for the
- value of the ''"contents"'' column at time stamp ''t8'' would return
- a null value. Similarly, a request for an ''"anchor"'' value at time
- stamp ''t9'' for "my.look.ca" would return a null value.
- 
- However, if no timestamp is supplied, the most recent value for a
- particular column would be returned and would also be the first one
- found since time stamps are stored in descending order. Consequently
- the value returned for ''"contents"'' if no time stamp is supplied is
- the value for ''t6'' and the value for an ''"anchor"''  for
- "my.look.ca" if no time stamp is supplied is the value for time stamp
- ''t8''.
-  
- [[Anchor(schema)]]
+ [[Anchor(hregion)]]
- = Schema Definition / Configuration =
+ = HRegion (Tablet) Server =
  
- The schema is stored in the [#chubby Distributed Lock Server] file
- space. This allows Hbase to leverage the availability of a number of
- small files, access controls and locking mechanisms provided by the
- lock service. The following depicts the layout of the file space:
+ To the user, a table seems like a list of data tuples, sorted by row
+ key. Physically, tables are broken into HRegions. An HRegion is
+ identified by its tablename plus a start/end-key pair. A given HRegion
+ with keys <start> and <end> will store all the rows from (<start>,
+ <end>]. A set of HRegions, sorted appropriately, forms an entire
+ table.
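+ 
+ For illustration, here is a minimal sketch of mapping a row key to the
+ HRegion that serves it, assuming regions are indexed by end-key and
+ each region holds the rows in (<start>, <end>]. The names used here
+ (RegionLocator, regionFor) are made up for this example.
+ 
+ {{{
+ import java.util.SortedMap;
+ import java.util.TreeMap;
+ 
+ /** Illustrative sketch: pick the HRegion whose key range covers a row. */
+ public class RegionLocator {
+   // A sentinel that sorts after any real row key stands in for the
+   // unbounded end-key of the table's last region.
+   private static final String END_OF_TABLE = "\uffff\uffff";
+ 
+   // end-key -> region name, for a single table
+   private final TreeMap<String, String> regionsByEndKey =
+       new TreeMap<String, String>();
+ 
+   public void addRegion(String endKey, String regionName) {
+     regionsByEndKey.put(endKey == null ? END_OF_TABLE : endKey, regionName);
+   }
+ 
+   /** A region with range (start, end] serves row when start < row <= end. */
+   public String regionFor(String row) {
+     SortedMap<String, String> tail = regionsByEndKey.tailMap(row); // end >= row
+     return tail.isEmpty() ? null : tail.get(tail.firstKey());
+   }
+ }
+ }}}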
  
+ All data is physically stored using Hadoop's DFS. Data is served to
+ clients by a set of H!RegionServers, usually one per machine. A given
+ HRegion is served by only one H!RegionServer at a time.
- {{{
- /hbase/instance-name/table-name1/columnname1
-                                  columnname2
-                                  etc.
  
-                      table-name2/columnname1
-                                  columnname2
-                                  etc.
- }}}
+ When a client wants to make updates, it contacts the relevant
+ H!RegionServer and commits the update to an HRegion. Upon commit, the
+ data is added to the HRegion's HMemcache and to the H!RegionServer's
+ HLog. The HMemcache is a memory buffer that stores and serves the
+ most-recent updates. The HLog is an on-disk log file that tracks all
+ updates. The commit() call will not return to the client until the
+ update has been written to the HLog.
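+ 
+ A minimal sketch of that commit path, with HLog and HMemcache reduced
+ to stand-in interfaces (the real classes in the HADOOP-1045 patch have
+ richer signatures):
+ 
+ {{{
+ import java.io.IOException;
+ 
+ /** Sketch only: durable log write first, then the in-memory buffer. */
+ class CommitSketch {
+   interface Log      { void append(String row, String col, long ts, byte[] v) throws IOException; }
+   interface Memcache { void add(String row, String col, long ts, byte[] v); }
+ 
+   private final Log log;
+   private final Memcache memcache;
+ 
+   CommitSketch(Log log, Memcache memcache) {
+     this.log = log;
+     this.memcache = memcache;
+   }
+ 
+   /** Does not return to the caller until the update is in the on-disk log. */
+   synchronized void commit(String row, String col, long ts, byte[] value)
+       throws IOException {
+     log.append(row, col, ts, value);   // write-ahead log entry
+     memcache.add(row, col, ts, value); // then serve it from memory
+   }
+ }
+ }}}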
  
+ When serving data, the HRegion will first check its HMemcache. If the
+ requested data is not there, it will then check its on-disk
+ HStores. There is an HStore for each column group in an HRegion. An
+ HStore might consist of multiple on-disk H!StoreFiles. Each
+ H!StoreFile is a B-Tree-like structure that allows for relatively
+ fast access.
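+ 
+ The corresponding read path, again as a hedged sketch rather than the
+ actual HRegion code: consult the HMemcache first, then fall back to
+ the on-disk stores from newest to oldest.
+ 
+ {{{
+ import java.io.IOException;
+ 
+ /** Sketch only: first hit wins, newer data masks older data. */
+ class ReadSketch {
+   interface VersionedStore {                       // stands in for HMemcache / HStoreFile
+     byte[] get(String row, String col) throws IOException;  // null if absent
+   }
+ 
+   private final VersionedStore memcache;
+   private final VersionedStore[] storesNewestFirst;
+ 
+   ReadSketch(VersionedStore memcache, VersionedStore[] storesNewestFirst) {
+     this.memcache = memcache;
+     this.storesNewestFirst = storesNewestFirst;
+   }
+ 
+   byte[] get(String row, String col) throws IOException {
+     byte[] value = memcache.get(row, col);         // most recent updates
+     if (value != null) return value;
+     for (VersionedStore store : storesNewestFirst) {
+       value = store.get(row, col);                 // older, flushed data
+       if (value != null) return value;
+     }
+     return null;                                   // cell never written
+   }
+ }
+ }}}
+ 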
- Each column file contains configuration settings for the column:
-  * whether it is a ''column'', ''map'', or ''counter''
-   * if ''map'', map compression setting
-  * tablet size
-  * tablet block size
-  * garbage collection policy
-   * keep latest n versions
-   * only keep data less than n days old
-  * in memory setting
-  * bloom filter enabled/disabled
-  * block compression
  
- Access control for each column is based on the access control setting
- for the column file.
+ Periodically, we invoke HRegion.flushcache() to write the contents of
+ the HMemcache to on-disk HStore files. This adds a new H!StoreFile to
+ each HStore. The HMemcache is then emptied, and we write a special
+ token to the HLog, indicating that the HMemcache has been flushed.
  
- Each Hbase table is served by a number of server processes which
- belong to an Hbase '''instance'''. When a server process is started, it
- is told which instance to join.
+ On startup, each HRegion checks to see if there have been any writes
+ to the HLog since the most-recent invocation of flushcache(). If not,
+ then all relevant HRegion data is reflected in the on-disk HStores. If
+ so, the HRegion reconstructs the updates from the HLog, writes them
+ to the HMemcache, and then calls flushcache(). Finally, it deletes the
+ HLog and is now available for serving data.
  
- Since the master server and tablet servers run substantially different
- code, the server process knows what role it will play based on which
- code set was loaded when the process was started.
+ Thus, calling flushcache() infrequently means less work, but the
+ HMemcache will consume more memory and the HLog will take longer to
+ reconstruct upon restart. If flushcache() is called frequently, the
+ HMemcache will take less memory and the HLog will be faster to
+ reconstruct, but each flushcache() call imposes some overhead.
  
+ The HLog is periodically rolled, so it consists of multiple
+ time-sorted files. Whenever the HLog is rolled, it deletes all old log
+ files that contain only flushed data. Rolling the HLog takes very
+ little time and is generally a good idea to do from time to time.
- In addition to the schema information stored in the distributed lock
- server file space, run time information is also stored for each table:
-  * the location of the root tablet for the table
-  * the master server lock file
-  * the list of available tablet servers
  
- Thus the run time information looks like:
+ Each call to flushcache() will add an additional H!StoreFile to each
+ HStore. Reading from an HStore can then potentially touch all of its
+ H!StoreFiles. This is time-consuming, so we want to periodically
+ compact these H!StoreFiles into a single larger one. This is done by
+ calling HStore.compact().
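+ 
+ Conceptually, a compaction is a k-way merge of sorted files. The
+ sketch below shows the idea on plain sorted iterators of cell keys;
+ the real HStore.compact() works on H!StoreFiles and handles details
+ omitted here.
+ 
+ {{{
+ import java.util.Comparator;
+ import java.util.Iterator;
+ import java.util.List;
+ import java.util.PriorityQueue;
+ 
+ /** Illustration only: merge several sorted inputs into one sorted output. */
+ class CompactionSketch {
+ 
+   /** One input file together with the next cell key it will yield. */
+   static class Source {
+     final Iterator<String> cells;
+     String next;
+     Source(Iterator<String> cells) { this.cells = cells; advance(); }
+     void advance() { next = cells.hasNext() ? cells.next() : null; }
+   }
+ 
+   static void compact(List<Iterator<String>> inputs, List<String> output) {
+     PriorityQueue<Source> heap = new PriorityQueue<Source>(
+         Math.max(1, inputs.size()),
+         new Comparator<Source>() {
+           public int compare(Source a, Source b) { return a.next.compareTo(b.next); }
+         });
+     for (Iterator<String> in : inputs) {
+       Source s = new Source(in);
+       if (s.next != null) heap.add(s);
+     }
+     while (!heap.isEmpty()) {
+       Source s = heap.poll();
+       output.add(s.next);          // emit the smallest remaining cell key
+       s.advance();
+       if (s.next != null) heap.add(s);
+     }
+   }
+ }
+ }}}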
  
+ Compaction is a very expensive operation. It's done automatically at
+ startup, and should probably be done periodically during operation.
- {{{
- /hbase/instance-name/table-name1/root-tablet
-                                  master.lock
-                                  servers/
-                                  tablet-server-1
-                                  tablet-server-2
-                                  etc.
  
+ The Google Bigtable paper has a slightly-confusing hierarchy of major
+ and minor compactions. We have just two things to keep in mind:
-        instance-name/table-name2/root-tablet
-                                  master.lock
-                                  servers/
-                                  tablet-server-1
-                                  tablet-server-2
-                                  etc.
- }}}
  
+  1. A "flushcache()" drives all updates out of the memory buffer into on-disk structures. Upon flushcache, the log-reconstruction time goes to zero. Each flushcache() will add a new H!StoreFile to each HStore.
- and the complete distributed lock server file space for a single table
- would look like:
  
+  1. A "compact()" consolidates all the H!StoreFiles into a single one. It's expensive, and is always done at startup.
- {{{
- /hbase/table-name1/columnname1
-                    columnname2
-                    etc.
-                    root-tablet
-                    master.lock
-                    servers/
-                            tablet-server-1
-                            tablet-server-2
-                            etc.
- }}}
  
- [[Anchor(chubby)]]
- = Distributed Lock Server - a ''Chubby'' clone =
+ Unlike Bigtable, Hadoop's HBase allows no period where updates have
+ been "committed" but have not been written to the log. This is not
+ hard to add, if it's really wanted.
  
+ We can merge two HRegions into a single new HRegion by calling
+ HRegion.closeAndMerge(). We can split an HRegion into two smaller
+ HRegions by calling HRegion.closeAndSplit().
- [:DistributedLockServer:See the Hadoop Distributed Lock Server Project Page]
-  * Uses
-   1. To ensure that there is at most one active master at any time
-   1. To store bootstrap location of Bigtable data
-   1. To discover tablet servers and finalize tablet server death
-   1. To store Bigtable schema information
-   1. To store access control lists
  
+ OK, to sum up so far:
  
+  1. Clients access data in tables.
+  1. Tables are broken into HRegions.
+  1. HRegions are served by H!RegionServers. Clients contact an H!RegionServer to access the data within its row-range.
+  1. HRegions store data in:
+   a. HMemcache, a memory buffer for recent writes
+   a. HLog, a write-log for recent writes
+   a. HStores, an efficient set of on-disk files, one per column group.
+      (HStores use H!StoreFiles.)
+ 
- [[Anchor(masternode)]]
+ [[Anchor(master)]]
- = Master Node =
+ = HBase Master Server =
  
+ Each H!RegionServer stays in contact with the single H!BaseMaster. The
+ H!BaseMaster is responsible for telling each H!RegionServer which
+ HRegions it should load and make available.
-  * Handles table creation and deletion
-   * bootstrap
-    * create instance directory in lock server file system
-    * create servers directory in lock server file system
-    * create root tablet in HDFS
-    * save location in lock server file system
-    * create first metadata tablet
-    * write metadata tablet information to root tablet
-   * column/map creation
-    * create file in lock server file system
-    * write configuration settings to file
-    * create new tablet file in HDFS
-    * write tablet information to METADATA table
-    * assign tablet to tablet server
-  * Responsible for assigning tablets to tablet servers
-  * Detects the addition and expiration of tablet servers
-  * Balances tablet server load
-  * Garbage collects files (SSTables) in GFS by mark-and-sweep
-  * Handles schema changes, such as the addition of Columns and Maps
-  * Keeps track of the set of live tablet servers
-  * Keeps current assignment of tablets to tablet servers, including those that are unassigned
-  * Assigns unassigned tablets to tablet servers with sufficient room
-  * Polls the /servers directory to discover new tablet servers
-  * Regularly pings tablet servers for the status of their lock
-  * If it can't contact a tablet server or it reports that it lost its lock, the Master will acquire the tablet server's lock on its /servers Chubby file, and if successful, will delete it.
-  * Initiates tablet merges ''(when/how does it know to do this?)''
-  * Startup
-   * Acquires unique "master" Chubby lock
-   * Scans /servers directory in Chubby to find live tablet servers
-   * Communicates with all tablet servers to discover tablet assignment
-   * Scans METADATA table to find all tablets and adds those that have not been assigned to the set of unassigned tablets
  
+ The H!BaseMaster keeps a constant tally of which H!RegionServers are
+ alive at any time. If the connection between an H!RegionServer and the
+ H!BaseMaster times out, then:
  
- [[Anchor(tabletserver)]]
- = Tablet Server =
+  a. The H!RegionServer kills itself and restarts in an empty state.
+  b. The H!BaseMaster assumes the H!RegionServer has died and reallocates its HRegions to other H!RegionServers
  
+ Note that this is unlike Google's Bigtable, where a !TabletServer can
+ still serve Tablets after its connection to the Master has died. We
+ tie them together because we do not use an external lock-management
+ system as Bigtable does. With Bigtable, there's a Master that allocates
+ tablets and a lock manager (Chubby) that guarantees atomic access by
+ !TabletServers to tablets. HBase uses just a single central point for
+ all H!RegionServers to access: the H!BaseMaster.
-  * Manages and serves a set of Tablets (between 10 and 1000). A tablet is a row range of the table sorted in lexicographical order. Tablets comprise two types of data structures: one or more on-disk structures called SSTables, and one or more in-memory data structures called memtables. Initially, a table consists of a single tablet.
-  * Tablet servers are configured with 1GB of virtual memory
-  * Tablet servers typically serve no more than 1GB of data
-   * If tablets are "full size" (128 MB), then a tablet server limited to 1GB of data will serve approximately 10 tablets.
-   * A 100MB tablet contains about 100,000 1K records.
-   * 10 tablets of 1K records will contain about 1x10^6^ records.
-  * SSTables for a tablet are registered in the METADATA table
-  * Tablet size is 100-200MB by default (128MB typical)
-  * Typical SSTable block size is 64K.
-   * Thus a 64K block of 1K records contains between 50 and 64 records.
-   * A 100MB tablet will contain approximately 2,000 blocks.
-  * The set of tablets changes when:
-   * a table is created or deleted
-   * two existing tablets are merged to form a single larger tablet
-   * an existing tablet is split into two smaller tablets
-  * Splits tablets that have grown too large
-    * All tablets can be consolidated into a single SSTable via a ''major'' compaction. This suggests that tablets are split when they exceed 100-128MB in size.
-   * Initiates the split by recording information for the new tablet in the METADATA table
-   * Notifies the master of the split
  
-   ''So if a METADATA tablet splits, that would imply that the root tablet needs to be updated.''
+ (This is no more dangerous than what Bigtable does. Each system is
+ reliant on a network structure (whether H!BaseMaster or Chubby) that
+ must survive for the data system to survive. There may be some
+ Chubby-specific advantages, but that's outside HBase's goals right
+ now.)
  
+ As H!RegionServers check in with a new H!BaseMaster, the H!BaseMaster
+ asks each H!RegionServer to load zero or more HRegions. When an
+ H!RegionServer dies, the H!BaseMaster marks its HRegions as
+ unallocated, and attempts to give them to different H!RegionServers.
-  * Memtable rows are marked Copy-on-write during reads to allow writes to happen in parallel
-  * Handles read/write requests to tablets
-  * Can be dynamically added or removed
-  * Clients communicate directly with tablet servers
-  * Announces its existence by creating a uniquely named file in the /servers Chubby directory
-  * Stops serving its tablets and kills itself if it cannot renew the lease on its /servers file
-  * Writes
-   * Checks that the request is well-formed
-   * Checks that the sender is authorized (by reading authorization info from Chubby file, usually a cache hit)
-   * Writes mutation to commit log
-    * has group commit feature to improve performance. A group is comprised of all the tablets currently being served by the tablet server.
-   * Deletes are just writes with a ''special'' value that indicates that the record is deleted.
-   * Updates memtable.
-    * When a tablet is "brand new", all reads can be satisfied from the memtable.
-    * When the memtable gets too large ''(how large?)'', the memtable is written to a new SSTable. This is a ''minor'' compaction. 
  
-   ''Minor and merging compactions write a new redo point into the METADATA table. The redo point contains information about the new SSTables.''
+ Recall that each HRegion is identified by its table name and its
+ key-range. Since key ranges are contiguous, and the first start-key
+ and the last end-key of a table are always NULL, it's enough to
+ simply indicate the end-key.
  
+ Unfortunately, this is not quite enough. Because of merge() and
+ split(), we may (for just a moment) have two quite different HRegions
+ with the same name. If the system dies at an inopportune moment, both
+ HRegions may exist on disk simultaneously. The arbiter of which
+ HRegion is "correct" is the HBase meta-information (to be discussed
+ shortly). In order to distinguish between different versions of the
+ same HRegion, we also add a unique 'regionId' to the HRegion name.
-  * Reads
-   * Checks that the request is well-formed
-   * Checks that the sender is authorized (by reading authorization info from Chubby file, usually a cache hit)
-   * Executes read on merged view of memtable and SSTables. The merged view consists of the memtable, the most recent SSTable, the next most recent SSTable, ..., the oldest SSTable. The first hit for a key masks any potential duplicates in older SSTables.
-    * At what point does this become too expensive so that a ''merging compaction'' is triggered?
-  * Compactions
-   * '''minor:''' Writes memtable to SSTable when it reaches a certain size. Writes new redo point into METADATA table.
-   * '''merging:''' Periodically merges a few SSTables and the memtable into one larger SSTable. This newly generated table may contain deletion entries that suppress deleted data in older tables.
-    * Since a merging compaction does not necessarily include all the SSTables in the tablet, how are the candidate SSTables chosen? One might be tempted to merge the oldest SSTables, but since a merging compaction also includes the memtable, that would indicate that the most recent SSTables should be merged with the memtable. Further, a read is more likely to be satisfied from the most recent SSTables, so this makes sense.
-    * ''How many SSTables are chosen for a merging compaction?''
-   * '''major:''' merging compaction that rewrites all SSTables into one SSTable. Contains no deletion entries
-    * Initiated by master
-    * ''What is the threshold that triggers a major compaction?''
-  * Commit Log
-   * Stores redo records
-   * Contains redo records for all tablets managed by tablet server
-   * Key consists of <table, row name, log sequence number>
-   * To speed recovery when a tablet server dies, the log is sorted by key. This sort is done by breaking the log into 64MB chunks and is done in parallel on different tablet servers. The sort is managed by the Master.
-   * Two logs are kept, one active and one inactive. When writing to one log becomes slow, a log sequence number is incremented, and the other log is switched to. During recovery, both logs are sorted together and the sequence number is used to elide duplicate entries.
-  * Moving tablets. When a tablet is moved from one server to another, the tablet server does a major compaction prior to the move to speed up tablet recovery.
-  * Tablet recovery
-   * Reads the METADATA table to find SSTable locations and the redo points
-   * Reads SSTable indices into memory
-   * Reconstructs the memtable by applying all of the updates that have committed since the redo points
-  * Caching
-   * Scan Cache: caches the key/value pairs returned by the SSTable interface
-   * Block Cache: caches blocks read from the SSTables
-  * Bloom Filter
-   * Optional in-memory structure that reduces disk access by
  
+ Thus, we finally get to this identifier for an HRegion:
- [[Anchor(sstable)]]
- = SSTable =
  
+ tablename + endkey + regionId.
-  * Immutable (write-once)
-  * Sorted list of key/value pairs.
-  * Sequence of 64KB blocks
-  * Block index __consists of the start keys for each block__
-  * Compression
-   * Per block
-   * Per Map compression
-  * Can be Memory-mapped
-  * Can be shared by two tablets immediately after a split
-  * API
-   * Lookup a given key ''(with or without timestamp)''
-   * Iterate over key/value pairs
  
+ You can see this identifier being constructed in
+ HRegion.buildRegionName().
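+ 
+ A sketch of that construction (the delimiter and exact formatting used
+ by the real HRegion.buildRegionName() may differ; this only shows the
+ three components):
+ 
+ {{{
+ /** Illustration of tablename + endkey + regionId; not the real code. */
+ class RegionNameSketch {
+   static String buildRegionName(String tableName, String endKey, long regionId) {
+     String end = (endKey == null) ? "" : endKey;  // last region has an empty end-key
+     return tableName + "," + end + "," + regionId;
+   }
+ }
+ 
+ // e.g. buildRegionName("webtable", "com.example.www", 1024)
+ //   -> "webtable,com.example.www,1024"
+ }}}
+ 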
- As a first approximation, a Hadoop !MapFile satisfies these requirements. It is a persistent (lives in the Hadoop DFS), ordered (!MapFile is based on !SequenceFile which is strictly ordered), immutable (once written, an attempt to open a !MapFile for writing will overwrite the existing contents) map from keys to values.
- 
- Keys are arbitrary Java strings (String) and values are treated as an array of bytes (byte[]).
- 
- Because entries are also timestamped, it is possible to have multiple values for the same key.
- 
- Given a key, you can find its value(s). It is possible to iterate over the entire file or find a key and iterate from that point forward.
- 
- It is implemented as a sequence of blocks which are configurable in size.
- 
- Unlike SSTable, !MapFile stores its index in a separate file instead of at the end of the file. This is likely to be more efficient than storing the index at the end of the file and having to re-write it when the file is being created. Although it uses two files, both are accessed in a sequential append only fashion when the file is being created.
- 
- ''Questions have been raised about the suitability of using a !MapFile to implement SSTables:''
-  * ''will the two file implementation put extra stress on the Name Node?''
-  * ''would something else better meet Hbase's requirements?''
- 
- ''The consensus to date has been that !MapFile is a good enough approximation to start with and optimization at this point is premature.''
  
  [[Anchor(metadata)]]
- = METADATA Table =
+ = META Table =
  
-  * Three-level hierarchy
-   * Chubby file contains the location of the "root tablet"
-   * Root tablet stores the location of all tablets in a METADATA table
-   * The "root tablet" is just the first tablet in the METADATA table, it is never split
+ We can also use this identifier as a row-label in a different
+ HRegion. Thus, the HRegion meta-info is itself stored in an
+ HRegion. We call this table, which maps from HRegion identifiers to
+ physical H!RegionServer locations, the META table.
  
-    ''I really don't like this definition (and I know it is from the Bigtable paper). Aside from the fact that it is never split, the root tablet is special in another way: it is the metadata table for the METADATA table.'' ''It '''does''' however have the same format as the rest of the METADATA table.''
+ The META table itself can grow large, and may be broken into separate
+ HRegions. To locate all components of the META table, we list all META
+ HRegions in a ROOT table. The ROOT table is always contained in a
+ single HRegion.
  
-  * The row key is a combination of the table name, column or map name and the __'''first row key'''__ of the tablet.
+ Upon startup, the H!BaseMaster immediately attempts to scan the ROOT
+ table (because there is only one HRegion for the ROOT table, that
+ HRegion's name is hard-coded). It may have to wait for the ROOT table
+ to be allocated to an H!RegionServer.
  
-   ''Note that this differs from Bigtable, which stores the location of a tablet under a row key that is an encoding of the tablet's table ID and its __'''end row'''__''. ''The reasons for this are:''
-    1. ''Since a row key can be an arbitrary string, how do you represent the "maximum value" for the last tablet?''
-    1. ''As new rows are added with row keys that are greater than the previous maximum value, the metadata table needs to be updated more frequently than if the first row key is stored''
-    1. ''If the first key in a tablet is stored instead, the minimum value could be represented as the empty string "" and a simple string comparison between a key and the first key would result in key > first key. If you represented the maximum row key as the empty string, that would require a special case instead of just a simple compare.''
+ Once the ROOT table is available, the H!BaseMaster can scan it and
+ learn of all the META HRegions. It then scans the META table. Again,
+ the H!BaseMaster may have to wait for all the META HRegions to be
+ allocated to different H!RegionServers.
  
-    1. ''If the keys go from 1 to n, when the tablet is split, the first tablet still has the row key "" as its first key and the second tablet has a first row key of n/2. So if a key is presented and its value is < n/2, you know that if it exists, it is in the first tablet and if the value >= n/2 it is in the second tablet.''
+ Finally, when the H!BaseMaster has scanned the META table, it knows the
+ entire set of HRegions. It can then allocate these HRegions to the set
+ of H!RegionServers.
  
-   ''If it turns out that there is a good reason that the last row key should be stored instead of the first row key, then this can be changed in the future.''
+ The H!BaseMaster keeps the set of currently-available H!RegionServers
+ in memory. Since the death of the H!BaseMaster means the death of the
+ entire system, there's no reason to store this information on
+ disk. All information about the HRegion->H!RegionServer mapping is
+ stored physically in the ROOT and META tables. Thus, a client does
+ not need to contact the H!BaseMaster after it learns the location of
+ the ROOT HRegion. The load on the H!BaseMaster should be relatively
+ small: it deals with timing out H!RegionServers, scanning the ROOT and
+ META tables upon startup, and serving the location of the ROOT HRegion.
  
-  * The "location" map has the ''InMemory'' tuning parameter set
-  * Each row stores approximately 1KB of data in memory
-  * All events pertaining to each tablet are logged here (such as when a tablet server starts serving a tablet)
+ The HClient is fairly complicated, and often needs to navigate the
+ ROOT and META HRegions when serving a user's request to scan a
+ specific user table. If an H!RegionServer is unavailable or it does not
+ have an HRegion it should have, the HClient will wait and retry. At
+ startup or in case of a recent H!RegionServer failure, the correct
+ mapping info from HRegion to H!RegionServer may not always be
+ available.
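+ 
+ The shape of that navigation, under the assumptions above (hard-coded
+ ROOT region name, ROOT rows point at META regions, META rows point at
+ user regions), might look like the following sketch. The interface and
+ the "-ROOT-" name are invented here; the real HClient differs.
+ 
+ {{{
+ import java.io.IOException;
+ 
+ /** Illustration only: ROOT -> META -> user region, with retry on failure. */
+ class LookupSketch {
+   interface RegionReader {
+     /** Scan the named region for the given row-label; return the value found. */
+     String lookup(String regionName, String rowLabel) throws IOException;
+   }
+ 
+   private static final String ROOT_REGION = "-ROOT-";  // assumed hard-coded name
+   private final RegionReader reader;
+ 
+   LookupSketch(RegionReader reader) { this.reader = reader; }
+ 
+   /** Find the region (and hence the server) holding a row of a user table. */
+   String locate(String table, String row) throws InterruptedException {
+     while (true) {
+       try {
+         // 1. ROOT tells us which META region covers this table/row.
+         String metaRegion = reader.lookup(ROOT_REGION, table + "," + row);
+         // 2. That META region tells us which user region holds the row.
+         return reader.lookup(metaRegion, table + "," + row);
+       } catch (IOException notYetAvailable) {
+         Thread.sleep(1000);   // wait and retry, as described above
+       }
+     }
+   }
+ }
+ }}}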
  
- === Format ===
+ [[Anchor(summary)]]
+ = Summary =
  
+  1. H!RegionServers offer access to HRegions (an HRegion lives at one H!RegionServer)
+  1. H!RegionServers check in with the H!BaseMaster
+  1. If the H!BaseMaster dies, the whole system dies
+  1. The set of current H!RegionServers is known only to the H!BaseMaster
+  1. The mapping between HRegions and H!RegionServers is stored in two special tables (ROOT and META), whose HRegions are allocated to H!RegionServers like any others.
+  1. The ROOT HRegion is a special one, the location of which the H!BaseMaster always knows.
+  1. It's the HClient's responsibility to navigate all this.
- The METADATA table has a single '''map''' column that stores the following data about the tablet it refers to:
-  * tablet file name
-  * whether the tablet contains a ''column'', ''map'' or ''counter''
-   * if ''map'', compression setting
-  * maximum desired tablet size
-  * tablet block size
-  * block compression setting
-  * garbage collection policy
-  * whether the tablet should be memory resident
-  * whether there is a bloom filter enabled for the tablet
-  * log redo point
-  * when the tablet most recently came on-line
-  * names and ordering of ''sub-tablets'' created as the result of a minor or merging compaction
  
- [[Anchor(example)]]
- === Example ===
- 
- ''Suppose you had a table with one column and 4x10^9^ rows. Each row contains about 5KB of data resulting in a total table size of 200TB.''
- 
- ''If each tablet holds about 100MB of data, this will require 2x10^6^ tablets and the same number of rows in the METADATA table.''
- 
- ''If each metadata row is about 1KB, the METADATA table size required to map all the tablets is 2x10^9^ bytes. If each METADATA tablet is 100MB that requires 20 METADATA tablets to map the entire table, and consequently 20 root tablet rows to map the METADATA tablets.''
- 
- [[Anchor(ipc)]]
- = Inter-process Communication Messages =
- 
- Interactions between the client and master, the client and the tablet
- server, and between the master and tablet servers occur as messages
- transferred over a socket connection. The following sections document
- the general format of the messages, and the specific messages for each
- interaction. 
- 
- [[Anchor(wireformat)]]
- == IPC Message Wire Format ==
- 
- All IPC messages have the same basic format but differ in the details
- of the operation being performed and the parameters that are
- passed. The basic format of a message is:
- 
- ||<:> '''Size''' ||<(> '''Description''' ||
- ||<:> byte ||<(> protocol version number ||
- ||<:> int ||<(> message type (op code) ||
- ||<:> int ||<(> number of parameters ||
- ||||<style="text-align: center; font-style: italic;"> repeat for each parameter ||
- ||<:> int ||<(> parameter identifier ||
- ||<:> int ||<(> length of parameter value ||
- ||<:> byte[length] ||<(> parameter value ||
- 
- [[Anchor(ipcclientmaster)]]
- == Client to Master Messages ==
- 
-  * create table (table-name)
-  * delete table (table-name)
-  * create column (table-name, column-name, type: {simple|counter|map}, max-tablet-size, tablet-block-size, gc-policy, compression-policy, in-memory, bloom-filter)
-  * delete column (table-name, column-name)
-  * get column metadata returns values that were specified for create column
-  * set acl (column-name, read, write, control)
-  * get acl (column-name) returns (read, write, control)
-  * get tablet server(tablet) returns server
- 
- [[Anchor(ipcclienttablet)]]
- == Client to TabletServer Messages ==
- 
-  * read (row-key, column-name, [column-key (for maps)], [timestamp]) returns value
-   * for maps, if no column-key is specified, returns the entire map for the specified row key
-   * ''Question: Is there a need to split up large return values into multiple messages?''
-   * The table name is implied since this message is sent directly to the tablet server for the specified row-key
- 
-  * scan (starting-row-key, column-name, [starting-timestamp]) returns a tablet block size of data for the specified column and starting point.
- 
-  * write (row-key, column-name, [column-key (for maps)], timestamp, value)
- 
-  * batchUpdate (row-key, List<parameters>, timestamp)
-   * batchUpdates can only be used when updating multiple values in a map for a specific row key or if the tablets for all the specified columns are all served by the same tablet server.
-   * the parameter list consists of column name - value pairs (for columns) or column-name - column-key - value triples (for maps)
- 
-  * incrementCounter (row-key, column-name, timestamp, increment)
- 
- [[Anchor(ipcmastertablet)]]
- == Master to Tablet Server Messages ==
- 
-  * ''Note: the master uses the client API to read METADATA tablets, so no special messages are required for this activity.''
- 
-  * "Are you ok?"
-   * response (indicates tablet server alive)
-    * if the tablet server is serving any tablets, the response will also include a list of tablets being served.
- 
-  * start serving tablets (list-of-tablets-to-serve)
-   * response indicates tablets being served and any that it failed to start serving
- 
-  * major compaction (tablet)
-   * response indicates compaction complete
- 
-  * stop serving tablets (list-of-tablets-to-stop-serving)
-   * response indicates that tablets are no longer being served
- 
-  * shut down
-   * response indicates that shutdown was successful
- 
- [[Anchor(ipctabletmaster)]]
- == Tablet Server to Master Messages ==
- 
-  * tablet split (old-tablet, new-tablet-1, new-tablet-2)
- 
- [[Anchor(clientlib)]]
- = Client Library =
- 
-  * Caches tablet locations
-  * If it can't find a location, it recurses up the hierarchy
-  * Contacts Chubby directly to find root tablet
-  * Client library pre-fetches tablet locations by reading metadata for more than one tablet whenever it reads the METADATA table.
- 
- 
- [[Anchor(api)]]
+ [[Anchor(status)]]
- == API ==
+ = Current Status =
  
- There are four components to the client API. Two are supporting
- classes for access control entries and column metadata and two interfaces: one for administrative functions such as creating tables and columns and one for client access to the data.
+ As of this writing, there are just shy of 7000 lines of code in the
+ "hbase" directory, in a patch attached to
+ [http://issues.apache.org/jira/browse/HADOOP-1045 Hadoop Jira Issue 1045]
  
+ All of the single-machine operations (safe-committing, merging,
+ splitting, versioning, flushing, compacting, log-recovery) are
+ complete, have been tested, and seem to work great.
- [[Anchor(accesscontrol)]]
- === Access Control Entries ===
- {{{
- package org.apache.hbase.client;
  
+ The multi-machine stuff (the H!BaseMaster, the H!RegionServer, and the
+ HClient) has not been fully tested. The reason is that the HClient is
+ still incomplete, so the rest of the distributed code cannot be fully
+ tested. I think it's good, but can't be sure until the HClient is
+ done. However, the code is now very clean and in a state where other
+ people can understand it and contribute.
- /** Access Control List Entry (ACE) */
- public class AccessControlEntry {
-   
-   public enum Permission {              // The kind of access this entry applies to
-     READ,                               // read access
-     WRITE,                              // write access
-     CONTROL                             // control (can change ACL) access
-   }
-   
-   /** Authorized users */
-   private String[] authorizedUsers;
-   
-   /** Permission granted */
-   private Permission permission;
-   
-   /**
-    * Constructs an ACE for the specified permission with an empty user list
-    * 
-    * @param p - access permission
-    */
-   public AccessControlEntry(Permission p) {
-     this.permission = p;
-     this.authorizedUsers = new String[0];
-   }
-   
-   /**
-    * Constructs an ACE for the specified permission with the specified users
-    * 
-    * @param p           - access permission
-    * @param users       - user list
-    */
-   public AccessControlEntry(Permission p, String[] users) {
-     this.permission = p;
-     this.authorizedUsers = users;
-   }
  
+ Other related features and TODOs:
-   // Accessors
-   public String[] getUsers() { return this.authorizedUsers; }
-   public Permission getPermission() { return this.permission; }
-   
-   public void setUsers(String[] users) { this.authorizedUsers = users; }
-   public void setPermission(Permission p) { this.permission = p; }
- }
- }}}
  
+  1. Single-machine log reconstruction works great, but distributed log recovery is not yet implemented. This is relatively easy, involving just a sort of the log entries and placing the shards into the right DFS directories.
+  1. Data compression is not yet implemented, but there is an obvious place to do so in the HStore.
+  1. We need easy interfaces to !MapReduce jobs, so they can scan tables
+  1. The HMemcache lookup structure is relatively inefficient
+  1. File compaction is relatively slow; we should have a more conservative algorithm for deciding when to apply compaction.
+  1. For the getFull() operation, use of Bloom filters would speed things up
+  1. We need stress-test and performance-number tools for the whole system
+  1. There's some HRegion-specific testing code that worked fine during development, but it has to be rewritten so it works against an HRegion while it's hosted by an H!RegionServer, and connected to an H!BaseMaster. This code is at the bottom of the HRegion.java file.
- [[Anchor(columnmeta)]]
- === Column Metadata ===
- {{{
- package org.apache.hbase.client;
- 
- public class ColumnMetadata {
- 
-   /** The kinds of columns */
-   public enum ColumnType {
-     COLUMN,                 // a 'normal' column - one value per row key
-     COUNTER,                // an integer counter - one value per row key
-     MAP                     // a 'map' of key/value pairs for the same row key
-   }
- 
-   /** Garbage collection policies */
-   public enum GarbageCollectionPolicy {
-     LATEST_N_VERSIONS,      // keep latest N versions
-     NEWER_THAN_N_DAYS       // keep versions newer than N days old
-   }
-   
-   public static final int DEFAULT_MAX_TABLET_SIZE = 128 * 1024 * 1024;  // 128MB
-   public static final int DEFAULT_BLOCK_SIZE = 64 * 1024;               // 64KB
-   public static final int DEFAULT_GC_N = 3; // latest 3 versions or newer than 3 days old
- 
-   // Accessor methods
-   
-   public ColumnType getColumnType() { return columnType; }
-   public boolean mapCompressionEnabled() { return mapCompressionEnabled; }
-   public GarbageCollectionPolicy getGarbageCollectionPolicy() { return gcPolicy; }
-   public int getGarbageCollectionN() { return gcN; }
-   public int getMaxTabletSize() { return maxTabletSize; }
-   public int getBlockSize() { return blockSize; }
-   public boolean blockCompressionEnabled() { return compressionEnabled; }
-   public boolean isInMemory() { return inMemory; }
-   public boolean bloomFilterEnabled() { return bloomFilterEnabled; }
-   
-   // Set value methods
-   
-   public void setColumnType(ColumnType type) { this.columnType = type; }
-   public void setMapCompression(boolean compression) { 
-     this.mapCompressionEnabled = compression;
-   }
-   public void setGarbageCollection(GarbageCollectionPolicy policy, int n) {
-     this.gcPolicy = policy;
-     this.gcN = n;
-   }
-   public void setMaxTabletSize(int size) { this.maxTabletSize = size; }
-   public void setBlockSize(int size) { this.blockSize = size; }
-   public void setBlockCompression(boolean compression) {
-     this.compressionEnabled = compression;
-   }
-   public void setInMemory(boolean isInMemory) { this.inMemory = isInMemory; }
-   public void setBloomFilter(boolean bloomFilter) {
-     this.bloomFilterEnabled = bloomFilter;
-   }
-   
-   /** Constructor. Sets default values */
-   public ColumnMetadata() {
-     this.columnType = ColumnType.COLUMN;
-     this.mapCompressionEnabled = false;
-     this.gcPolicy = GarbageCollectionPolicy.LATEST_N_VERSIONS;
-     this.gcN = DEFAULT_GC_N;
-     this.maxTabletSize = DEFAULT_MAX_TABLET_SIZE;
-     this.blockSize = DEFAULT_BLOCK_SIZE;
-     this.compressionEnabled = false;
-     this.inMemory = false;
-     this.bloomFilterEnabled = false;
-   }
-   
-   /**
-    * Constructor that requires all the settings to be specified
-    * 
-    * @param type                - ColumnType
-    * @param compressMap         - if map, is map compressed
-    * @param policy              - GarbageCollectionPolicy
-    * @param gcn                 - The N for garbage collection
-    * @param tabletSize          - Max tablet size
-    * @param blocksize           - Tablet block size
-    * @param compressBlocks      - Enable block compression
-    * @param isInMemory          - Keep tablet in memory
-    * @param enableBloomFilter   - Enable tablet bloom filter
-    */
-   public ColumnMetadata(ColumnType type,
-       boolean compressMap,
-       GarbageCollectionPolicy policy,
-       int gcn,
-       int tabletSize,
-       int blocksize,
-       boolean compressBlocks,
-       boolean isInMemory,
-       boolean enableBloomFilter
-       ) {
-     this.columnType = type;
-     this.mapCompressionEnabled = compressMap;
-     this.gcPolicy = policy;
-     this.gcN = gcn;
-     this.maxTabletSize = tabletSize;
-     this.blockSize = blocksize;
-     this.inMemory = isInMemory;
-     this.bloomFilterEnabled = enableBloomFilter;
-   }
-   
-   // Implementation details - how the values are currently stored
-   
-   private ColumnType columnType;                // The column type
-   private boolean mapCompressionEnabled;        // If map, whether map should be compressed
-   
-   private GarbageCollectionPolicy gcPolicy;     // Garbage collection policy
-   private int gcN;                              // The N for garbage collection
-   
-   private int maxTabletSize;                    // Maximum tablet size
-   private int blockSize;                        // Tablet block size
-   
-   private boolean compressionEnabled;           // block compression setting
-   private boolean inMemory;                     // tablet should be kept in memory
-   private boolean bloomFilterEnabled;           // column has a bloom filter
- }
- }}}
- 
- [[Anchor(adminapi)]]
- === Administrative API ===
- 
- {{{
- package org.apache.hbase.client;
- 
- import java.util.List;
- 
- /** Interface for Hbase administrative functions */
- public interface HbaseAdministration {
-   
-   /**
-    * Create a Hbase table
-    * 
-    * @param tableName   - name of the table to create
-    */
-   public void createTable(String tableName);
- 
-   /**
-    * Create a column in an Hbase table
-    * 
-    * @param tableName   - name of the table
-    * @param columnName  - name of the column
-    * @param columnDescription - column description metadata 
-    */
-   public void createColumn(String tableName, String columnName, ColumnMetadata columnDescription );
-   
-   /**
-    * Fetch metadata for a column
-    * 
-    * @param tableName   - name of table
-    * @param columnName  - name of column
-    * @return            - metadata for specified column
-    */
-   public ColumnMetadata getColumnMetadata(String tableName, String columnName);
-   
-   /**
-    * Deletes the specified table
-    * 
-    * @param tableName   - table to be deleted
-    */
-   public void deleteTable(String tableName);
-   
-   /**
-    * Deletes the specified column from the specified table
-    * 
-    * @param tableName   - name of table
-    * @param columnName  - name of column to be deleted
-    */
-   public void deleteColumn(String tableName, String columnName);
-   
-   /**
-    * Set access control for a column
-    * 
-    * @param tableName   - name of table
-    * @param columnName  - name of column
-    * @param acl         - access to set
-    */
-   public void setColumnACE(String tableName, String columnName, AccessControlEntry acl);
-   
-   /**
-    * Set access control list for a column. Specifies more than one kind of access.
-    * 
-    * @param tableName   - name of table
-    * @param columnName  - name of column
-    * @param acls        - List of ACEs to set
-    */
-   public void setColumnACL(String tableName, String columnName, List<AccessControlEntry> acls);
-   
-   /**
-    * Return all the access control entries for the specified column.
-    * 
-    * @param tableName   - name of the table
-    * @param columnName  - name of the column
-    * @return            - List of access control entries for column
-    */
-   public List<AccessControlEntry> getColumnACL(String tableName, String columnName);
- }
- }}}
- 
- [[Anchor(dataapi)]]
- === Client Data Access API ===
- 
- {{{
- package org.apache.hbase.client;
- 
- import java.util.Iterator;
- import java.util.Map;
- 
- /** General Hbase client API */
- public interface HbaseClient {
- 
-   public enum OpenAccess {
-     READ_ACCESS,
-     WRITE_ACCESS
-   }
-   
-   /**
-    * Open a table for read or write access
-    * 
-    * @param tableName   - table to open
-    * @param access      - open access (read or write)
-    */
-   public void openTable(String tableName, OpenAccess access);
-   
-   /**
-    * Get the value of a 'normal' column
-    * 
-    * @param columnName  - name of column
-    * @param rowKey      - select row of column
-    * @param timestamp   - value timestamp (zero for most recent value)
-    * @return            - column value
-    */
-   public byte[] read(String columnName, String rowKey, long timestamp);
-   
-   /**
-    * Get the value of a 'counter' column for the specified row
-    * 
-    * @param columnName  - name of counter column
-    * @param rowKey      - select row of column
-    * @return            - value of counter
-    */
-   public int read(String columnName, String rowKey);
-   
-   /**
-    * Get the value of a map column given the specified row and column keys
-    * 
-    * @param columnName  - name of column
-    * @param rowKey      - select row of column
-    * @param columnKey   - select column map member
-    * @param timestamp   - value timestamp (zero for most recent value)
-    * @return
-    */
-   public byte[] read(String columnName, String rowKey, String columnKey, long timestamp);
-   
-   /**
-    * Increment the value of a counter column (by one)
-    * 
-    * @param columnName  - name of the column
-    * @param rowKey      - select row of column
-    */
-   public void incrementColumn(String columnName, String rowKey);
- 
-   /**
-    * Increment the value of a counter column (by the specified increment)
-    * 
-    * @param columnName  - name of the column
-    * @param rowKey      - select row of column
-    * @param increment   - amount to increment counter
-    */
-   public void incrementColumn(String columnName, String rowKey, int increment);
-   
-   /**
-    * Write the specified value into the specified column at the specified row
-    * with the specified timestamp.
-    * 
-    * @param columnName  - name of column
-    * @param rowKey      - select row of column
-    * @param timestamp   - timestamp for value (if zero, current time is used)
-    * @param value       - value to store
-    */
-   public void write(String columnName, String rowKey, long timestamp, byte[] value);
-   
-   /**
-    * Write a single value into the specified map column at the specified row for 
-    * the specified timestamp and specified map key
-    * 
-    * @param columnName  - name of column
-    * @param rowKey      - select row of column
-    * @param timestamp   - timestamp for value (if zero, current time is used)
-    * @param columnKey   - column map key
-    * @param value       - value to associate with column map key.
-    */
-   public void write(String columnName, String rowKey, long timestamp, String columnKey, byte[] value);
-   
-   /**
-    * Write the specified key/value map to the specified column in the specified
-    * row for the specified timestamp
-    * 
-    * @param columnName  - name of column
-    * @param rowKey      - select row of column
-    * @param timestamp   - timestamp for value (if zero, current time is used)
-    * @param value       - Map of key/value pairs to store
-    */
-   public void write(String columnName, String rowKey, long timestamp, Map<String, byte[]> value);
-   
-   /**
-    * Delete the value in the specified row of the specified column for the specified
-    * timestamp. (Note: for map columns, deletes all the values for the specified
-    * row key and timestamp)
-    * 
-    * @param columnName  - name of column
-    * @param rowKey      - select row of column
-    * @param timestamp   - timestamp of value (if zero, most recent value is deleted.
-    *                      if -1, oldest value is deleted)
-    */
-   public void delete(String columnName, String rowKey, long timestamp);
-   
-   /**
-    * Delete the value associated with the specified map key in the specified row
-    * of the specified column for the specified timestamp.
-    * 
-    * @param columnName  - name of column
-    * @param rowKey      - select row of column
-    * @param timestamp   - timestamp of value (if zero, most recent value is deleted.
-    *                      if -1, oldest value is deleted)
-    * @param columnKey   - column map key of value to be deleted
-    */
-   public void delete(String columnName, String rowKey, long timestamp, String columnKey);
- 
-   public abstract interface HbaseIterator extends Iterator, Iterable {
-     
-     /**
-      * @return  - current row key
-      */
-     public String getRowKey();
-     
-     /**
-      * @return  - timestamp of current value
-      */
-     public int getTimestamp();
-     
-   }
- 
-   /** iterate over a 'normal' column */
-   public interface columnIterator extends HbaseIterator {
-     /**
-      * @return  - current value
-      */
-     public byte[] getValue();
-   }
-   
-   /**
-    * Get an iterator for the specified normal column
-    * 
-    * @param columnName  - name of column
-    * @param rowKey      - starting row key (if null, start at beginning of table)
-    * @param timestamp   - timestamp of value (if zero, return most recent version,
-    *                      if -1, return all versions)
-    * @return            - iterator object
-    */
-   public columnIterator getColumnIterator(String columnName, String rowKey, long timestamp);
-   
-   /** iterate over a 'counter' column */
-   public interface counterIterator extends HbaseIterator {
-     /**
-      * @return  - current value
-      */
-     public int getValue();
-   }
-   
-   /**
-    * Get an iterator for the specified counter column
-    * 
-    * @param columnName  - name of column
-    * @param rowKey      - starting row key (if null, start at beginning of table)
-    * @param timestamp   - timestamp of value (if zero, return most recent version,
-    *                      if -1, return all versions)
-    * @return            - iterator object
-    */
-   public counterIterator getCounterIterator(String columnName, String rowKey, long timestamp);
-   
-   public interface mapIterator extends HbaseIterator {
-     
-     /**
-      * Get an iterator for all values for a particular row key
-      * 
-      * @return            - iterator object
-      */
-     public Iterator<Map<String, byte[]>> getValueIterator();
-   }
- 
-   /**
-    * Get an iterator for the specified map column
-    * 
-    * @param columnName  - name of column 
-    * @param rowKey      - starting row key (if null, start at beginning of table)
-    * @param timestamp   - timestamp of value (if zero, return most recent version,
-    *                      if -1, return all versions)
-    * @return            - iterator object
-    */
-   public mapIterator getMapIterator(String columnName, String rowKey, long timestamp);
-   
- }
- }}}
- 
- [[Anchor(other)]]
- = Other =
- 
-  * Map/Reduce connector
-  * Client Sawzall script execution in Tablet server space
  
  [[Anchor(comments)]]
  = Comments =
