hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/HBaseBulkLoad" by JohnSichi
Date Fri, 09 Apr 2010 00:58:51 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/HBaseBulkLoad" page has been changed by JohnSichi.
http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad?action=diff&rev1=1&rev2=2

--------------------------------------------------

  
  = Overview =
  
- Ideally, bulk load from Hive into HBase would be as simple as this:
+ Ideally, bulk load from Hive into HBase would be part of [[Hive/HBaseIntegration]], making it as simple as this:
  
  {{{
  CREATE TABLE new_hbase_table(rowkey string, x int, y int) 
@@ -19, +19 @@

  SELECT ... FROM hive_query;
  }}}
  
- However, things aren't ''quite'' as simple as that yet.  Instead, a multistep procedure is required involving both SQL and shell script commands.  It should still be a lot easier and more flexible than writing your own map/reduce program, and over time we can enhance Hive to move closer to the ideal.
+ However, things aren't ''quite'' as straightforward as that yet.  Instead, a procedure involving a series of SQL commands is required.  It should still be a lot easier and more flexible than writing your own map/reduce program, and over time we hope to enhance Hive to move closer to the ideal.
  
  The procedure is based on [[http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#bulk|underlying HBase recommendations]], and involves the following steps:
  
   1. Decide on the number of reducers you're planning to use for parallelizing the sorting and HFile creation.  This depends on the size of your data as well as on the cluster resources available.
 -  1. Run Hive commands which will create a file containing "splitter" keys which will be used for range-partitioning the data during sort.
 +  1. Run Hive sampling commands which will create a file containing "splitter" keys, to be used for range-partitioning the data during the sort.
   1. Prepare a staging location in HDFS where the HFiles will be generated.
   1. Run Hive commands which will execute the sort and generate the HFiles.
   1. (Optional:  if HBase and Hive are running in different clusters, distcp the generated files from the Hive cluster to the HBase cluster.)
@@ -33, +33 @@

  
  The rest of this page explains each step in greater detail.
  
+ = Estimate Resources Needed =
+ 
+ '''tbd:  provide some example numbers based on Facebook experiments'''
+ 
+ = Run Sampling for Range Partitioning =
+ 
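+ As an illustrative sketch only (the table names, column name, and bucket count below are hypothetical, not taken from this page), the sampling step might use Hive's {{{TABLESAMPLE}}} clause to pull a small fraction of keys from which the splitter keys are then chosen:
+ 
+ {{{
+ -- Hypothetical names; samples roughly 1/10000 of the keys.
+ CREATE TABLE hb_range_keys(rowkey_range_start string);
+ INSERT OVERWRITE TABLE hb_range_keys
+ SELECT rowkey
+ FROM source_table TABLESAMPLE(BUCKET 1 OUT OF 10000 ON rowkey) s;
+ }}}
+ 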
+ = Prepare Staging Location =
+ 
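+ As a sketch, the staging location can simply be an HDFS directory that the Hive user has write access to (the path below is illustrative):
+ 
+ {{{
+ hadoop fs -mkdir /tmp/hbase_hfile_staging
+ }}}
+ 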
+ = Sort Data =
+ 
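+ As a rough sketch (table and column names are hypothetical, and the partitioner settings assume the splitter-key file produced by the sampling step was written to the given HDFS path), the sort amounts to a Hive insert that range-partitions on the row key across the chosen number of reducers:
+ 
+ {{{
+ -- Hypothetical names; 12 reducers, each producing one sorted key range.
+ SET mapred.reduce.tasks=12;
+ SET hive.mapred.partitioner=org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
+ SET total.order.partitioner.path=/tmp/hb_range_keys_file;
+ INSERT OVERWRITE TABLE hbsort
+ SELECT rowkey, x, y
+ FROM hive_source_table
+ CLUSTER BY rowkey;
+ }}}
+ 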
+ = Run HBase Script =
+ 
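+ As a sketch of this final step, the HBase 0.20 bulk-load documentation linked above ships a {{{loadtable.rb}}} script for pointing a table at the generated HFiles (the table name and path below are illustrative):
+ 
+ {{{
+ hbase org.jruby.Main bin/loadtable.rb new_hbase_table /tmp/hbase_hfile_staging
+ }}}
+ 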
