From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/HBaseBulkLoad" by JohnSichi
Date Mon, 16 Aug 2010 21:42:12 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/HBaseBulkLoad" page has been changed by JohnSichi.
http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad?action=diff&rev1=18&rev2=19

--------------------------------------------------

  
  == Prepare Range Partitioning ==
  
- In order to perform a parallel sort on the data, we need to range-partition it.  The idea is to divide the space of row keys up into nearly equal-sized ranges, one per reducer.  The details will vary according to your source data, and you may need to run a number of exploratory Hive queries in order to come up with a good enough set of ranges.  As a highly contrived example, suppose your row keys are sequence-generated transaction ID strings (possibly with gaps), you have a year's worth of data starting from January, your data growth is constant month-over-month, and you want to run 12 reducers.  In that case, you could use a query such as this one:
+ In order to perform a parallel sort on the data, we need to range-partition it.  The idea is to divide the space of row keys up into nearly equal-sized ranges, one per reducer used in the parallel sort.  The details will vary according to your source data, and you may need to run a number of exploratory Hive queries in order to come up with a good enough set of ranges.  Here's one example:
  
  {{{
- select transaction_id
- from
- (select month,max(transaction_id) as transaction_id
-  from transactions
-  group by month) m
- order by transaction_id
- limit 11
+ set mapred.reduce.tasks=1;
+ create temporary function row_sequence as
+ 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
+ select transaction_id from
+ (select transaction_id
+ from transactions
+ tablesample(bucket 1 out of 10000 on transaction_id) s
+ order by transaction_id
+ limit 10000000) x
+ where (row_sequence() % 910000)=0
+ limit 11;
  }}}
  
+ This works by ordering all of the rows in a sample of the table (using a single reducer), and then selecting every nth row (here n=910000).  The value of n is chosen by dividing the total number of rows in the sample by the desired number of ranges, e.g. 12 in this case (one more than the number of partitioning keys produced by the LIMIT clause).  The assumption here is that the distribution in the sample matches the overall distribution in the table; if this is not the case, the resulting partition keys will lead to skew in the parallel sort.
- Note that we only want 11 values for breaking the data into 12 ranges, so we drop the max timestamp for the last month.  Also note that the ORDER BY is necessary for producing the range start keys in ascending order.
- 
- ''Important:'' there are usually much cheaper ways to come up with good split keys; '''this is just an example to give you an idea of the kind of result your sampling query should produce'''.
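  As a sanity check on the choice of n, you can compute it directly from the sample rather than estimating it.  The query below is a minimal sketch, not part of the load procedure itself; it assumes the same transactions table and 0.01% bucket sample used above, and 12 as the desired number of ranges:
  
  {{{
  -- Sketch: estimate n as (number of sampled rows) / (number of ranges),
  -- using the same bucket sample as the partitioning query above.
  select cast(count(*) / 12 as int) as n
  from transactions
  tablesample(bucket 1 out of 10000 on transaction_id) s;
  }}}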
  
  Once you have your sampling query defined, the next step is to save its results to a properly formatted file which will be used in a subsequent step.  To do this, run commands like the following:
  
@@ -82, +84 @@

  location '/tmp/hb_range_keys';
  
  insert overwrite table hb_range_keys
- select transaction_id
- from
- (select month,max(transaction_id) as transaction_id
-  from transactions
-  group by month) m
- order by transaction_id
+ select transaction_id from
+ (select transaction_id
+ from transactions
+ tablesample(bucket 1 out of 10000 on transaction_id) s
+ order by transaction_id
+ limit 10000000) x
+ where (row_sequence() % 910000)=0
  limit 11;
  }}}
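  
  To verify the result, one quick check (a sketch, assuming the '/tmp/hb_range_keys' location declared above) is to list the directory from the Hive CLI; it should now contain the file(s) holding the 11 range start keys:
  
  {{{
  dfs -ls /tmp/hb_range_keys;
  }}}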
  
