hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/HBaseBulkLoad" by JohnSichi
Date Fri, 09 Apr 2010 00:46:26 GMT

The "Hive/HBaseBulkLoad" page has been changed by JohnSichi.


New page:
'''under construction'''

This page explains how to use Hive to bulk load data into a new (empty) HBase table per [[https://issues.apache.org/jira/browse/HIVE-1295|HIVE-1295]].

= Overview =

Ideally, bulk load from Hive into HBase would be as simple as this:

CREATE TABLE new_hbase_table(rowkey string, x int, y int) 
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "cf:x,cf:y");

SET hive.hbase.bulk=true;

INSERT OVERWRITE TABLE new_hbase_table
SELECT ... FROM hive_query;

However, things aren't ''quite'' as simple as that yet.  Instead, a multistep procedure is
required involving both SQL and shell script commands.  It should still be a lot easier and
more flexible than writing your own map/reduce program, and over time we can enhance Hive
to move closer to the ideal.

The procedure is based on [[http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#bulk|underlying HBase recommendations]], and involves the following steps:

 1. Decide on the number of reducers you're planning to use for parallelizing the sorting and HFile creation. This depends on the size of your data as well as cluster resources available.
 1. Run Hive commands which will create a file containing "splitter" keys which will be used for range-partitioning the data during sort.
 1. Prepare a staging location in HDFS where the HFiles will be generated.
 1. Run Hive commands which will execute the sort and generate the HFiles.
 1. (Optional: if HBase and Hive are running in different clusters, distcp the generated files from the Hive cluster to the HBase cluster.)
 1. Run HBase script {{{loadtable.rb}}} to move the files into a new HBase table.
 1. (Optional: register the HBase table as an external table in Hive so you can access it from there.)
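
To make the first two steps more concrete, here is a minimal Python sketch of what range-partitioning with splitter keys means. It is only an illustration of the idea, not the code Hive or HBase runs: the helper names {{{choose_splitters}}} and {{{reducer_for}}}, the sample row keys, and the choice of four reducers are all assumptions made up for this example. The point it shows is that N reducers need N-1 splitter keys, and that each row key is routed to the reducer whose key range contains it, so every reducer produces output covering one contiguous, non-overlapping range of row keys.

```python
from bisect import bisect_right


def choose_splitters(sorted_sample_keys, num_reducers):
    """Pick num_reducers - 1 splitter keys from a sorted sample of row keys.

    Hypothetical helper for illustration only; it assumes the sample is
    already sorted and contains at least num_reducers keys.
    """
    if num_reducers < 1:
        raise ValueError("num_reducers must be at least 1")
    n = len(sorted_sample_keys)
    if n < num_reducers:
        raise ValueError("sample must contain at least num_reducers keys")
    # The key at position i * n / num_reducers marks the start of range i,
    # so ranges 1 .. num_reducers - 1 each contribute one splitter.
    return [sorted_sample_keys[(i * n) // num_reducers]
            for i in range(1, num_reducers)]


def reducer_for(rowkey, splitters):
    """Return the index of the range (reducer) that a row key falls into.

    Keys below the first splitter go to reducer 0; keys greater than or
    equal to the last splitter go to the final reducer.
    """
    return bisect_right(splitters, rowkey)


# Made-up sample of row keys, sorted lexicographically.
sample = sorted("k%03d" % i for i in range(100))

# Assumption for this example: we decided on 4 reducers in step 1.
splitters = choose_splitters(sample, 4)
print(splitters)

for key in ("k000", "k025", "k074", "k099"):
    print(key, "-> reducer", reducer_for(key, splitters))
```

In this sketch a splitter key belongs to the range it starts, so a row key equal to a splitter goes to the higher-numbered reducer. If the sample is skewed, evenly spaced positions in the sample still give roughly equal-sized ranges, which is why the splitters are taken from the data rather than chosen by hand.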

The rest of this page explains each step in greater detail.
