hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/HBaseBulkLoad" by JohnSichi
Date Fri, 09 Apr 2010 01:48:54 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/HBaseBulkLoad" page has been changed by JohnSichi.


  '''under construction'''
- This page explains how to use Hive to bulk load data into a new (empty) HBase table per
+ This page explains how to use Hive to bulk load data into a new (empty) HBase table per
[[https://issues.apache.org/jira/browse/HIVE-1295|HIVE-1295]].  (If you're not using a build
which contains this functionality yet, you'll need to build from source and make sure this
patch is applied.)
  = Overview =
@@ -16, +16 @@

  SET hive.hbase.bulk=true;
  INSERT OVERWRITE new_hbase_table
- SELECT ... FROM hive_query;
+ SELECT rowid_expression, x, y FROM ...any_hive_query...;
  However, things aren't ''quite'' as straightforward as that yet.  Instead, a procedure involving
a series of SQL commands is required.  It should still be a lot easier and more flexible than
writing your own map/reduce program, and over time we hope to enhance Hive to move closer
to the ideal.
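As a rough illustration of what that manual procedure tends to look like (a hedged sketch only: the table names, column names, and staging path below are hypothetical and not from this page), the sort step range-partitions the data across reducers using Hadoop's TotalOrderPartitioner:
{{{
-- Illustrative sketch; hbase_staging, source_table, rowkey, and the
-- /tmp path are hypothetical names, not part of the documented procedure.
set mapred.reduce.tasks=12;
set hive.mapred.partitioner=org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
set total.order.partitioner.path=/tmp/hbase_splits;

INSERT OVERWRITE TABLE hbase_staging
SELECT rowkey, x, y FROM source_table
CLUSTER BY rowkey;
}}}
The sorted output under the staging location can then be handed to HBase's bulk-load machinery; the sections below walk through each step.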
@@ -37, +37 @@

  '''tbd:  provide some example numbers based on Facebook experiments'''
- = Run Sampling for Range Partitioning =
+ = Prepare Range Partitioning =
+ In order to perform a parallel sort on the data, we need to range-partition it.  The idea
is to divide the space of row keys up into nearly equal-sized ranges, one per reducer.  The
details will vary according to your source data, and you may need to run a number of exploratory
Hive queries in order to come up with a good enough set of ranges.  As an oversimplified example,
suppose your row keys are transaction timestamps, you have a year's worth of data starting
from January, your data growth is constant month-over-month, and you want to run 12 reducers.
 In that case, you could use a query such as this one:
+ {{{
+ select transaction_id
+ from
+ (select month,max(transaction_id) as transaction_id
+  from transactions
+  group by month) m
+ order by transaction_id
+ limit 11
+ }}}
+ Note that we only want 11 values for breaking the data into 12 ranges, so we drop the
largest key (the max transaction_id for the last month).  Also note that there are usually
much cheaper ways to come up with good split keys; '''this is just an example'''.
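One cheaper approach, sketched here as a hypothetical example (the table and bucket count are illustrative, not from this page), is to scan only a small random sample of rows with TABLESAMPLE instead of aggregating the whole table:
{{{
-- Hypothetical sketch: sample roughly 0.1% of rows at random,
-- sort the sampled keys, then pick 11 evenly spaced values from
-- the output by hand (or with a follow-up query) as split points.
SELECT transaction_id
FROM transactions TABLESAMPLE(BUCKET 1 OUT OF 1000 ON rand()) s
SORT BY transaction_id;
}}}
Because split keys only need to be approximately balanced, a sample of this kind is usually good enough and avoids a full scan.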
  = Prepare Staging Location =
