hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Trivial Update of "Hive/HBaseBulkLoad" by CarlSteinbach
Date Tue, 08 Jun 2010 21:45:16 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/HBaseBulkLoad" page has been changed by CarlSteinbach.
http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad?action=diff&rev1=17&rev2=18

--------------------------------------------------

+ = Hive HBase Bulk Load =
+ 
+ <<TableOfContents>>
+ 
  This page explains how to use Hive to bulk load data into a new (empty) HBase table per
[[https://issues.apache.org/jira/browse/HIVE-1295|HIVE-1295]].  (If your build does not yet
include this functionality, you'll need to build from source and make sure this patch and
HIVE-1321 are both applied.)
  
- = Overview =
+ == Overview ==
  
  Ideally, bulk load from Hive into HBase would be part of [[Hive/HBaseIntegration]], making
it as simple as this:
  
@@ -32, +36 @@

  
  The rest of this page explains each step in greater detail.
  
- = Decide on Target HBase Schema =
+ == Decide on Target HBase Schema ==
  
  Currently there are a number of constraints here:
  
@@ -42, +46 @@

  
  Besides dealing with these constraints, the most important work here is deciding how to
assign an HBase row key to each row coming from Hive.  To avoid inconsistencies
between lexical and binary comparators, it is simplest to design a string row key and use
it consistently all the way through.  If you want to combine multiple columns into the key,
use Hive's string concat expression for this purpose.  You can use CREATE VIEW to add
your row key logically without having to update any existing data in Hive.
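
The ordering concern above can be illustrated outside Hive. The following is a minimal Python sketch, not part of the wiki page: `make_row_key`, its parameters, and the fixed width of 10 are hypothetical names and values chosen for illustration. It shows that unpadded numeric strings sort differently from the numbers they represent, and that a fixed-width, zero-padded composite string key keeps both orders in agreement. In Hive the key itself would be built with the string concat expression mentioned above.

```python
# Hypothetical helper (an assumption, not from the wiki page): build a string
# row key whose lexical order matches the intended numeric order.
def make_row_key(user_id: int, txn_id: int, width: int = 10) -> str:
    # Zero-padding each part to a fixed width keeps string order and
    # numeric order in agreement.
    return f"{user_id:0{width}d}_{txn_id:0{width}d}"


# Unpadded numeric strings do not sort in numeric order.
unpadded = sorted(str(n) for n in [9, 10, 100])

# Padded composite keys sort in the same order as the underlying numbers.
padded = sorted(make_row_key(1, n) for n in [100, 9, 10])

print(unpadded)
print(padded)
```

Whatever key format is chosen, the point is to pick one string representation and use it in every step, so that the sort order produced by Hive is the order HBase expects.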
  
- = Estimate Resources Needed =
+ == Estimate Resources Needed ==
  
  TBD:  provide some example numbers based on Facebook experiments; also reference [[http://www.hpl.hp.com/hosted/sortbenchmark/YahooHadoop.pdf|Hadoop Terasort]]
  
- = Prepare Range Partitioning =
+ == Prepare Range Partitioning ==
  
  In order to perform a parallel sort on the data, we need to range-partition it.  The idea
is to divide the space of row keys up into nearly equal-sized ranges, one per reducer.  The
details will vary according to your source data, and you may need to run a number of exploratory
Hive queries in order to come up with a good enough set of ranges.  As a highly contrived
example, suppose your row keys are sequence-generated transaction ID strings (possibly with
gaps), you have a year's worth of data starting from January, your data growth is constant
month-over-month, and you want to run 12 reducers.  In that case, you could use a query such
as this one:
  
@@ -95, +99 @@

  dfs -cp /tmp/hb_range_keys/* /tmp/hb_range_key_list;
  }}}
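
The idea of dividing the row key space into nearly equal-sized ranges, one per reducer, can also be sketched in a few lines of Python. This is an illustration under stated assumptions, not the page's Hive query: `choose_split_keys` is a hypothetical helper, and the sample keys are made up. For N reducers it returns N - 1 boundary keys taken at evenly spaced positions in an already sorted list of keys.

```python
def choose_split_keys(sorted_keys, num_reducers):
    """Return num_reducers - 1 boundary keys that cut sorted_keys into
    nearly equal-sized ranges (hypothetical helper for illustration)."""
    if num_reducers < 1:
        raise ValueError("num_reducers must be at least 1")
    if not sorted_keys:
        raise ValueError("sorted_keys must not be empty")
    n = len(sorted_keys)
    # Boundary i is the first key of range i, for i = 1 .. num_reducers - 1.
    return [sorted_keys[(i * n) // num_reducers] for i in range(1, num_reducers)]


# Made-up sample: 120 zero-padded transaction IDs split across 12 reducers.
keys = [f"txn{n:06d}" for n in range(120)]
splits = choose_split_keys(keys, 12)

print(len(splits))
print(splits[0], splits[-1])
```

On real data the keys would come from a sample rather than the full table, and skewed key distributions may call for hand-tuned boundaries.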
  
- = Prepare Staging Location =
+ == Prepare Staging Location ==
  
  The sort is going to produce a lot of data, so make sure you have sufficient space in your
HDFS cluster, and choose the location where the files will be staged.  We'll use {{{/tmp/hbsort}}}
in this example.
  
@@ -106, +110 @@

  dfs -mkdir /tmp/hbsort;
  }}}
  
- = Sort Data =
+ == Sort Data ==
  
  Now comes the big step:  running a sort over all of the data to be bulk loaded.  Make sure
that your Hive instance has the HBase jars available on its auxpath.
  
@@ -138, +142 @@

  
  The first column in the SELECT list is interpreted as the rowkey; subsequent columns become
cell values (all in a single column family, so their column names are important).
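
As a rough mental model of this mapping, the Python sketch below turns one row of the SELECT list into a row key plus a set of cells. It is an assumption-laden illustration, not the Hive implementation: `row_to_cells` is a hypothetical function, the family name `cf` and the column names are made up, and it assumes that each Hive column name is used as the column qualifier within the single column family.

```python
def row_to_cells(column_names, row, family="cf"):
    """Map one row to (rowkey, {"family:qualifier": value}).

    Hypothetical model: assumes Hive column names become qualifiers
    inside a single column family named by `family`.
    """
    if len(column_names) != len(row):
        raise ValueError("column_names and row must have the same length")
    rowkey = row[0]  # the first SELECT column is interpreted as the row key
    cells = {
        f"{family}:{name}": value
        for name, value in zip(column_names[1:], row[1:])
    }
    return rowkey, cells


# Made-up example row.
rowkey, cells = row_to_cells(
    ["transaction_id", "user_name", "amount"],
    ["txn000010", "alice", "12.50"],
)
print(rowkey)
print(cells)
```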
  
- = Run HBase Script =
+ == Run HBase Script ==
  
  Once the sort job completes successfully, one final step is required for importing the result
files into HBase.
  
@@ -154, +158 @@

  
  After this script finishes, you may need to wait a minute or two for the new table to be
picked up by the HBase meta scanner.  Use the hbase shell to verify that the new table was
created correctly, and run a few sanity queries to look up individual cells and confirm that
they can be found.
  
- = Map New Table Back Into Hive =
+ == Map New Table Back Into Hive ==
  
  Finally, if you'd like to access the HBase table you just created via Hive:
  
@@ -165, +169 @@

  TBLPROPERTIES("hbase.table.name" = "transactions");
  }}}
  
- = Followups Needed =
+ == Followups Needed ==
  
   * Support sparse tables
   * Support loading binary data representations once HIVE-1245 is fixed
