hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/HBaseBulkLoad" by JohnSichi
Date Fri, 09 Apr 2010 21:15:24 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/HBaseBulkLoad" page has been changed by JohnSichi.
http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad?action=diff&rev1=5&rev2=6

--------------------------------------------------

- '''under construction'''
- 
  This page explains how to use Hive to bulk load data into a new (empty) HBase table per [[https://issues.apache.org/jira/browse/HIVE-1295|HIVE-1295]].  (If you're not using a build that contains this functionality yet, you'll need to build from source and make sure this patch is applied.)
  
  = Overview =
@@ -38, +36 @@

  
  Currently there are a number of constraints here:
  
- * The target table must be new (you can't bulk load into an existing table)
+  * The target table must be new (you can't bulk load into an existing table)
- * The target table can only have a single column family ([[http://issues.apache.org/jira/browse/HBASE-1861|HBASE-1861]])
+  * The target table can only have a single column family ([[http://issues.apache.org/jira/browse/HBASE-1861|HBASE-1861]])
- * The target table cannot be sparse (every row will have the same set of columns); this
should be easy to fix by either allowing a MAP value to be read from Hive, and/or by allowing
rows to be read from Hive in pivoted form (one row per HBase cell)
+  * The target table cannot be sparse (every row will have the same set of columns); this
should be easy to fix by either allowing a MAP value to be read from Hive, and/or by allowing
rows to be read from Hive in pivoted form (one row per HBase cell)
  
  Besides dealing with these constraints, probably the most important work is deciding how you want to assign an HBase row key to each row coming from Hive.  To avoid inconsistencies between lexical and binary comparators, it is simplest to design a string row key and use it consistently all the way through.  If you want to combine multiple columns into the key, use Hive's string concat expression for this purpose.  You can use CREATE VIEW to tack on your row key logically without having to update any existing data in Hive.
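
  For example, here is a minimal sketch of such a view (the {{{transactions}}} table and its columns follow the running example on this page; the view and key names are illustrative):

  {{{
  -- illustrative: combine two columns into a single string row key
  CREATE VIEW transactions_with_key AS
  SELECT concat(user_name, '_', transaction_id) AS hbase_rowkey,
         user_name,
         amount
  FROM transactions;
  }}}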
  
@@ -123, +121 @@

  cluster by transaction_id;
  }}}
  
- The CREATE TABLE creates a dummy table which controls how the output of the sort is written.
 Note that it uses {{{HiveHFileOutputFormat}}} to do this, with the table property {{{hfile.family.path}}}
used to control the destination directory for the output.  Again, be sure to set the inputformat/outputformat
exactly as specified.  
+ The CREATE TABLE creates a dummy table which controls how the output of the sort is written.
 Note that it uses {{{HiveHFileOutputFormat}}} to do this, with the table property {{{hfile.family.path}}}
used to control the destination directory for the output.  Again, be sure to set the inputformat/outputformat
exactly as specified.
  
  The {{{cf}}} in the path specifies the name of the column family which will be created in
HBase, so the directory name you choose here is important.  (Note that we're not actually
using an HBase table here; {{{HiveHFileOutputFormat}}} writes directly to files.)
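
  For reference, the dummy table definition might look something like this (a sketch only; the {{{hbsort}}} name, column list, and input format are assumptions, while {{{/tmp/hbsort/cf}}} matches the staging directory used below):

  {{{
  -- dummy table: holds no data itself; it only directs the sorted output
  -- into HFiles under /tmp/hbsort/cf
  CREATE TABLE hbsort(transaction_id string, user_name string, amount double)
  STORED AS
  INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.hbase.HiveHFileOutputFormat'
  TBLPROPERTIES ('hfile.family.path' = '/tmp/hbsort/cf');
  }}}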
  
@@ -137, +135 @@

  
  If Hive and HBase are running in different clusters, use [[http://hadoop.apache.org/common/docs/current/distcp.html|distcp]]
to copy the files from one to the other.
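
  For example (the NameNode hosts here are hypothetical):

  {{{
  hadoop distcp hdfs://hive-nn:8020/tmp/hbsort hdfs://hbase-nn:8020/tmp/hbsort
  }}}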
  
- Once the files are in the HBase cluster, use the {{{bin/loadtable.rb}}} script which comes
with HBase to import:
+ Once the files are in the HBase cluster, use the {{{bin/loadtable.rb}}} script which comes
with HBase to import them:
  
  {{{
  hbase org.jruby.Main loadtable.rb transactions /tmp/hbsort
@@ -147, +145 @@

  
  After this script finishes, you may need to wait a minute or two for the new table to be
picked up by the HBase meta scanner.  Use the hbase shell to verify that the new table was
created correctly, and do some sanity queries to locate individual cells and make sure they
can be found.
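
  For example, from the hbase shell (the row key value is illustrative):

  {{{
  describe 'transactions'
  scan 'transactions', {LIMIT => 5}
  get 'transactions', 'some_transaction_id'
  }}}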
  
+ = Map New Table Back Into Hive =
+ 
+ Finally, if you'd like to access the HBase table you just created via Hive:
+ 
+ {{{
+ CREATE EXTERNAL TABLE hbase_transactions(transaction_id string, user_name string, amount double, ...)
+ STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
+ WITH SERDEPROPERTIES ("hbase.columns.mapping" = "cf:user_name,cf:amount,...")
+ TBLPROPERTIES("hbase.table.name" = "transactions");
+ }}}
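+ 
+ A quick sanity query then confirms the mapping works (for example):
+ 
+ {{{
+ select count(1) from hbase_transactions;
+ }}}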
+ 
+ = Followups Needed =
+ 
+  * Support sparse tables
+  * Support loading binary data representations once [[https://issues.apache.org/jira/browse/HIVE-1245|HIVE-1245]] is fixed
+  * Support assignment of timestamps
+  * Provide control over file parameters such as compression
+  * Support multiple column families once [[http://issues.apache.org/jira/browse/HBASE-1861|HBASE-1861]] is implemented
+  * Wrap it all up into the ideal single-INSERT-with-auto-sampling job...
+ 
