hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/HBaseBulkLoad" by JohnSichi
Date Fri, 09 Apr 2010 20:06:36 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/HBaseBulkLoad" page has been changed by JohnSichi.
http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad?action=diff&rev1=4&rev2=5

--------------------------------------------------

  
  The procedure is based on [[http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#bulk|underlying
HBase recommendations]], and involves the following steps:
  
+  1. Decide how you want the data to look once it has been loaded into HBase.
   1. Decide on the number of reducers you're planning to use for parallelizing the sorting and HFile creation.  This depends on the size of your data as well as the cluster resources available.
   1. Run Hive sampling commands to create a file of "splitter" keys, which will be used to range-partition the data during the sort.
   1. Prepare a staging location in HDFS where the HFiles will be generated.
@@ -33, +34 @@

  
  The rest of this page explains each step in greater detail.
  
+ = Decide on Target HBase Schema =
+ 
+ Currently there are a number of constraints here:
+ 
+ * The target table must be new (you can't bulk load into an existing table)
+ * The target table can only have a single column family ([[http://issues.apache.org/jira/browse/HBASE-1861|HBASE-1861]])
+  * The target table cannot be sparse (every row will have the same set of columns); this should be easy to fix by either allowing a MAP value to be read from Hive or allowing rows to be read from Hive in pivoted form (one row per HBase cell)
+ 
+ Besides dealing with these constraints, probably the most important work here is deciding how you want to assign an HBase row key to each row coming from Hive.  To avoid inconsistencies between lexical and binary comparators, it is simplest to design a string row key and use it consistently all the way through.  If you want to combine multiple columns into the key, use Hive's string concat expression for this purpose.  You can use CREATE VIEW to tack on your rowkey logically without having to update any existing data in Hive; see the sketch below.
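+ 
+ As a hedged illustration (the table and column names here are hypothetical, not taken from this page), a view can assemble a composite string rowkey without rewriting the underlying data:
+ 
+ {{{
+ -- hypothetical source table: transactions(customer_id STRING, ts STRING, amount DOUBLE)
+ -- concatenate the key parts into a single string rowkey, exposed through a view
+ CREATE VIEW transactions_with_rowkey AS
+ SELECT concat(customer_id, '_', ts) AS rowkey, amount
+ FROM transactions;
+ }}}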
+ 
  = Estimate Resources Needed =
  
- '''tbd:  provide some example numbers based on Facebook experiments'''
+ TBD:  provide some example numbers based on Facebook experiments; also reference [[http://www.hpl.hp.com/hosted/sortbenchmark/YahooHadoop.pdf|Hadoop
Terasort]]
  
  = Prepare Range Partitioning =
  
@@ -54, +65 @@

  Note that we want only 11 values for breaking the data into 12 ranges (n ranges need just n-1 split points), so we drop the max timestamp for the last month.  Also note that the ORDER BY is necessary for producing the range start keys in ascending order.
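  
  A hedged sketch of the kind of sampling query meant here (the transactions table, ts column, and month grouping are illustrative assumptions, not taken from this page):
  
  {{{
  -- the max timestamp of each of the first 11 months, sorted ascending,
  -- gives 11 split keys that carve a year of data into 12 ranges
  SELECT ts
  FROM (SELECT month, max(ts) AS ts
        FROM transactions
        GROUP BY month) m
  ORDER BY ts
  LIMIT 11;
  }}}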
  
  ''Important:'' there are usually much cheaper ways to come up with good split keys; '''this
is just an example to give you an idea of the kind of result your sampling query should produce'''.
- 
- In this example, our partitioning key is a single column.  In theory, you could partition
over a compound key, but if you're going to do this, it might be safest to pre-concatenate
your keys into a single string column (maybe using a view) which will serve as your HBase
rowkey; this will avoid potential inconsistencies in comparators later on.
  
  Once you have defined your sampling query, the next step is to save its results to a properly formatted file, which will be used in a subsequent step.  To do this, run commands like the following:
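  
  The commands themselves fall beyond the end of this diff excerpt.  As a hedged sketch of their general shape (the table name, column names, and HDFS path are assumptions; the serde and format classes are existing Hive classes suitable for writing a sequence file of sorted keys with null values):
  
  {{{
  -- external table stored as a sequence file of binary-sortable keys
  -- with null values, the layout a total-order partitioner can consume
  CREATE EXTERNAL TABLE hb_range_keys(range_start STRING)
  ROW FORMAT SERDE
    'org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe'
  STORED AS
    INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveNullValueSequenceFileOutputFormat'
  LOCATION '/tmp/hb_range_keys';
  
  -- populate it with the sampling query's results
  INSERT OVERWRITE TABLE hb_range_keys
  SELECT ts
  FROM (SELECT month, max(ts) AS ts FROM transactions GROUP BY month) m
  ORDER BY ts
  LIMIT 11;
  }}}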
  
