From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/HBaseBulkLoad" by JohnSichi
Date Fri, 09 Apr 2010 19:45:48 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/HBaseBulkLoad" page has been changed by JohnSichi.
http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad?action=diff&rev1=3&rev2=4

--------------------------------------------------

  SET hive.hbase.bulk=true;
  
  INSERT OVERWRITE TABLE new_hbase_table
- SELECT rowid_expression, x, y FROM ...any_hive_query...;
+ SELECT rowkey_expression, x, y FROM ...any_hive_query...;
  }}}
  
  However, things aren't ''quite'' as straightforward as that yet.  Instead, a procedure involving
a series of SQL commands is required.  It should still be a lot easier and more flexible than
writing your own map/reduce program, and over time we hope to enhance Hive to move closer
to the ideal.
@@ -39, +39 @@

  
  = Prepare Range Partitioning =
  
- In order to perform a parallel sort on the data, we need to range-partition it.  The idea
is to divide the space of row keys up into nearly equal-sized ranges, one per reducer.  The
details will vary according to your source data, and you may need to run a number of exploratory
Hive queries in order to come up with a good enough set of ranges.  As an oversimplified example,
suppose your row keys are transaction timestamps, you have a year's worth of data starting
from January, your data growth is constant month-over-month, and you want to run 12 reducers.
 In that case, you could use a query such as this one:
+ In order to perform a parallel sort on the data, we need to range-partition it.  The idea
is to divide the space of row keys up into nearly equal-sized ranges, one per reducer.  The
details will vary according to your source data, and you may need to run a number of exploratory
Hive queries in order to come up with a good enough set of ranges.  As a highly contrived
example, suppose your row keys are sequence-generated transaction ID strings (possibly with
gaps), you have a year's worth of data starting from January, your data growth is constant
month-over-month, and you want to run 12 reducers.  In that case, you could use a query such
as this one:
  
  {{{
  select transaction_id
@@ -51, +51 @@

  limit 11
  }}}
  
- Note that we only want 11 values for breaking the data into 12 ranges, so we drop the max
timestamp for the last month.  Also note that there are usually much cheaper ways to come
up with good split keys; '''this is just an example'''.
+ Note that we only want 11 values for breaking the data into 12 ranges, so we drop the max transaction ID for the last month.  Also note that the ORDER BY is necessary for producing the range start keys in ascending order.
+ 
+ ''Important:'' there are usually much cheaper ways to come up with good split keys; '''this
is just an example to give you an idea of the kind of result your sampling query should produce'''.
+ 
+ In this example, our partitioning key is a single column.  In theory, you could partition
over a compound key, but if you're going to do this, it might be safest to pre-concatenate
your keys into a single string column (maybe using a view) which will serve as your HBase
rowkey; this will avoid potential inconsistencies in comparators later on.
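+ 
+ As a rough sketch of that idea (the {{{region}}} column and the {{{~}}} separator here are made up purely for illustration), a view could pre-concatenate a compound key into a single string rowkey:
+ 
+ {{{
+ -- hypothetical view exposing one string rowkey built from two key columns
+ create view transactions_keyed as
+ select concat(region, '~', transaction_id) as rowkey, user_name, amount
+ from transactions;
+ }}}
+ 
+ The sampling, sort, and load steps would then refer to {{{rowkey}}} instead of {{{transaction_id}}}.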
+ 
+ Once you have your sampling query defined, the next step is to save its results to a properly
formatted file which will be used in a subsequent step.  To do this, run commands like the
following:
+ 
+ {{{
+ create external table hb_range_keys(transaction_id_range_start string)
+ row format serde 
+ 'org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe'
+ stored as 
+ inputformat 
+ 'org.apache.hadoop.mapred.TextInputFormat'
+ outputformat 
+ 'org.apache.hadoop.hive.ql.io.HiveNullValueSequenceFileOutputFormat'
+ location '/tmp/hb_range_keys';
+ 
+ insert overwrite table hb_range_keys
+ select transaction_id
+ from
+ (select month,max(transaction_id) as transaction_id
+  from transactions
+  group by month) m
+ order by transaction_id
+ limit 11;
+ }}}
+ 
+ The first command creates an external table defining the format of the file to be created;
be sure to set the serde and inputformat/outputformat exactly as specified.  
+ 
+ The second command populates it (using the sampling query previously defined).  Usage of
ORDER BY guarantees that a single file will be produced in directory {{{/tmp/hb_range_keys}}}.
 The filename is unknown, but it is necessary to reference the file by name later, so run
a command such as the following to copy it to a specific name:
+ 
+ {{{
+ dfs -cp /tmp/hb_range_keys/* /tmp/hb_range_key_list
+ }}}
  
  = Prepare Staging Location =
  
+ The sort is going to produce a lot of data, so make sure you have sufficient space in your
HDFS cluster, and choose the location where the files will be staged.  We'll use {{{/tmp/hbsort}}}
in this example.
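+ 
+ One simple way to check remaining HDFS capacity before picking a staging location (these shell commands are only a suggestion, not part of the recipe itself):
+ 
+ {{{
+ # report overall DFS capacity and remaining space
+ hadoop dfsadmin -report
+ # see how much space is already used under /tmp
+ hadoop fs -du /tmp
+ }}}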
+ 
  = Sort Data =
+ 
+ Now comes the big step:  running a sort over all of the data to be bulk loaded.  Make sure
that your Hive instance has {{{hive_hbase-handler.jar}}} available on its auxpath.
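+ 
+ For instance, you might start the CLI like this (the jar locations are examples only; depending on your setup you may also need the HBase jar on the auxpath):
+ 
+ {{{
+ hive --auxpath /path/to/hive_hbase-handler.jar,/path/to/hbase.jar
+ }}}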
+ 
+ {{{
+ set mapred.reduce.tasks=12;
+ set hive.mapred.partitioner=org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
+ set total.order.partitioner.natural.order=false;
+ set total.order.partitioner.path=/tmp/hb_range_key_list;
+ 
+ create table hbsort(transaction_id string, user_name string, amount double, ...)
+ stored as
+ INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
+ OUTPUTFORMAT 'org.apache.hadoop.hive.hbase.HiveHFileOutputFormat'
+ TBLPROPERTIES ('hfile.family.path' = '/tmp/hbsort/cf');
+ 
+ insert overwrite table hbsort
+ select transaction_id, user_name, amount, ...
+ from transactions
+ cluster by transaction_id;
+ }}}
+ 
+ The CREATE TABLE creates a dummy table which controls how the output of the sort is written.
 Note that it uses {{{HiveHFileOutputFormat}}} to do this, with the table property {{{hfile.family.path}}}
used to control the destination directory for the output.  Again, be sure to set the inputformat/outputformat
exactly as specified.  
+ 
+ The {{{cf}}} in the path specifies the name of the column family which will be created in
HBase, so the directory name you choose here is important.  (Note that we're not actually
using an HBase table here; {{{HiveHFileOutputFormat}}} writes directly to files.)
+ 
+ The CLUSTER BY clause provides the keys to be used by the partitioner; be sure that it matches
the range partitioning that you came up with in the earlier step.
+ 
+ The first column in the SELECT list is interpreted as the rowkey; subsequent columns become cell values (all in a single column family, so their column names are important).
  
  = Run HBase Script =
  
+ Once the sort job completes successfully, one final step is required for importing the result
files into HBase.
+ 
+ If Hive and HBase are running in different clusters, use [[http://hadoop.apache.org/common/docs/current/distcp.html|distcp]]
to copy the files from one to the other.
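+ 
+ A distcp invocation would look something like this (the namenode hostnames and port are placeholders for your own clusters):
+ 
+ {{{
+ hadoop distcp hdfs://hive-cluster-nn:8020/tmp/hbsort hdfs://hbase-cluster-nn:8020/tmp/hbsort
+ }}}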
+ 
+ Once the files are in the HBase cluster, use the {{{bin/loadtable.rb}}} script which comes
with HBase to import:
+ 
+ {{{
+ hbase org.jruby.Main loadtable.rb transactions /tmp/hbsort
+ }}}
+ 
+ The first argument ({{{transactions}}}) specifies the name of the new HBase table.  For the second argument, pass the staging directory name, not the column family child directory.
+ 
+ After this script finishes, you may need to wait a minute or two for the new table to be
picked up by the HBase meta scanner.  Use the hbase shell to verify that the new table was
created correctly, and do some sanity queries to locate individual cells and make sure they
can be found.
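+ 
+ For example, from the hbase shell (the rowkey below is a placeholder; substitute one you know exists in your data):
+ 
+ {{{
+ hbase shell
+ hbase> describe 'transactions'
+ hbase> count 'transactions'
+ hbase> get 'transactions', 'some-known-rowkey'
+ }}}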
+ 
