hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/HBaseIntegration" by JohnSichi
Date Wed, 03 Mar 2010 02:58:24 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/HBaseIntegration" page has been changed by JohnSichi.
http://wiki.apache.org/hadoop/Hive/HBaseIntegration?action=diff&rev1=2&rev2=3

--------------------------------------------------

= Introduction =

This page documents the Hive/HBase integration support originally
introduced in
[[https://issues.apache.org/jira/browse/HIVE-705|HIVE-705]].  This
feature allows HiveQL statements to access
[[http://hadoop.apache.org/hbase|HBase]] tables for both reads (SELECT)
and writes (INSERT).  It is even possible to combine access to HBase
tables with native Hive tables via joins and unions.

This feature is a work in progress, and suggestions for its
improvement are very welcome.
= Storage Handlers =

Before proceeding, please read [[Hive/StorageHandlers]] for an overview
of the generic storage handler framework on which HBase integration depends.
= Usage =

The storage handler is built as an independent module,
{{{hive_hbase-handler.jar}}}, which must be available on the Hive
client auxpath, along with the HBase and Zookeeper jars.  It also
requires the configuration property {{{hbase.master}}} in order to
connect to the HBase master.

Here's an example using the CLI from a source build environment:

{{{
$HIVE_SRC/build/dist/bin/hive --auxpath $HIVE_SRC/build/hbase-handler/hive_hbase-handler.jar,$HIVE_SRC/hbase-handler/lib/hbase-0.20.3.jar,$HIVE_SRC/hbase-handler/lib/zookeeper-3.2.2.jar -hiveconf hbase.master=hbase.yoyodyne.com:60000
}}}

The handler requires Hadoop 0.20 or higher, and has only been tested
with dependency versions hadoop-0.20.0, hbase-0.20.3, and zookeeper-3.2.2.

In order to create a new HBase table which is to be managed by Hive,
use the STORED BY clause on CREATE TABLE:

{{{
CREATE TABLE hbase_table_1(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" = "cf:string",
"hbase.table.name" = "xyz"
);
}}}

The {{{hbase.columns.mapping}}} property is required and will be
explained in the next section.  The {{{hbase.table.name}}} property
is optional; it controls the name of the table as known by HBase, and
allows the Hive table to have a different name.  In this example, the
table is known as {{{hbase_table_1}}} within Hive, and as {{{xyz}}}
within HBase.  If not specified, then the Hive and HBase table names
will be identical.

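Once such a table exists, it can be written to and read back with
ordinary HiveQL statements.  A minimal sketch, assuming a native Hive
table named {{{pokes}}} with columns {{{(foo int, bar string)}}} (the
source table name and columns here are illustrative, not part of this
page):

{{{
-- populate the HBase-backed table from a native Hive table
-- (pokes is a hypothetical source table)
INSERT OVERWRITE TABLE hbase_table_1 SELECT foo, bar FROM pokes;

-- read the rows back through Hive
SELECT * FROM hbase_table_1;
}}}
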
If instead you want to give Hive access to an existing HBase table,
use CREATE EXTERNAL TABLE:

{{{
CREATE EXTERNAL TABLE hbase_table_2(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" = "cf:string",
"hbase.table.name" = "some_existing_table"
);
}}}

Again, {{{hbase.columns.mapping}}} is required (and will be
validated against the existing HBase table's column families), whereas
{{{hbase.table.name}}} is optional.

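As noted in the introduction, HBase-backed tables can also be combined
with native Hive tables in a single query.  A sketch, assuming a
native Hive table {{{states(abbreviation string, full_name string)}}}
(a hypothetical table used only for illustration):

{{{
-- join an HBase-backed table against a native Hive table
SELECT h.key, h.value, s.full_name
FROM hbase_table_1 h JOIN states s ON (h.value = s.abbreviation);
}}}
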
= Column Mapping =

The column mapping support currently available is somewhat
cumbersome and restrictive:

 * the first column in the Hive table automatically becomes the key in the HBase table
 * for each subsequent Hive column, the table creator must specify a corresponding entry in the comma-delimited {{{hbase.columns.mapping}}} string (so for a Hive table with n columns, the string should have n-1 entries)
 * a mapping entry is of the form {{{column-family-name:[column-type]}}}
 * if no column-type is given, then the Hive column will map to all columns in the corresponding HBase column family, and the Hive MAP datatype will be used to allow access to these (possibly sparse) columns

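For example, following the rules above, an entire column family can be
exposed through a Hive MAP column by omitting the column-type from its
mapping entry.  A sketch (the table name and the {{{cf}}} family name
are illustrative):

{{{
-- key maps to the HBase row key; the MAP column maps to
-- every column in family cf (possibly sparse)
CREATE TABLE hbase_table_3(key int, value map<string,int>)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" = "cf:"
);
}}}
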
TBD: details on how HBase columns are named within a family, and how
primitive and map values are serialized

= Potential Followups =

 * more flexible column mapping (HIVE-806)
 * default column mapping in cases where no mapping spec is given
 * filter/projection pushdown
 * implement virtual partitions corresponding to HBase timestamps
 * allow per-table hbase.master configuration
 * run profiler and minimize any per-row overhead in column mapping
 * user-defined routines for lookups and data loads via the HBase client API (HIVE-758 and HIVE-791)
 * support a fast-path mode in which no map/reduce is used for simple queries (go through the HBase client API instead?)

= Build =

Code for the storage handler is located under
{{{hive/trunk/hbase-handler}}}.  The Hive build automatically enables
the storage handler build for {{{hadoop.version=0.20.x}}}, and
disables it for any other Hadoop version.  This behavior can be
overridden by setting the ant property {{{hbase.enabled}}} to either
{{{true}}} or {{{false}}}.

HBase and Zookeeper dependencies are currently checked in under
{{{hbase-handler/lib}}}.  We will convert this to use Ivy instead once
the corresponding POMs are available.

= Tests =

Class-level unit tests are provided under
{{{hbase-handler/src/test/org/apache/hadoop/hive/hbase}}}.

Positive QL tests are under {{{hbase-handler/src/test/queries}}}.
These use an HBase+Zookeeper mini-cluster for hosting the fixture
tables, so no real HBase installation is needed in order to run them.
Run them like this:

{{{
ant test -Dtestcase=TestHBaseCliDriver -Dqfile=hbase_queries.q
}}}

An Eclipse launch template remains to be defined.

TBD: how to set up a mini-cluster server for ad hoc testing from the CLI

= Links =

For another project which adds SQL-like query language support on top
of HBase, see [[http://www.hbql.com|HBQL]] (unrelated to Hive).

