From: Apache Wiki
To: Apache Wiki
Date: Fri, 09 Apr 2010 19:45:48 -0000
Message-ID: <20100409194548.19240.86955@eos.apache.org>
Subject: [Hadoop Wiki] Update of "Hive/HBaseBulkLoad" by JohnSichi

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/HBaseBulkLoad" page has been changed by JohnSichi.
http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad?action=diff&rev1=3&rev2=4

--------------------------------------------------
  SET hive.hbase.bulk=true;

  INSERT OVERWRITE new_hbase_table
- SELECT rowid_expression, x, y FROM ...any_hive_query...;
+ SELECT rowkey_expression, x, y FROM ...any_hive_query...;
  }}}

  However, things aren't ''quite'' as straightforward as that yet. Instead, a procedure involving a series of SQL commands is required. It should still be a lot easier and more flexible than writing your own map/reduce program, and over time we hope to enhance Hive to move closer to the ideal.

@@ -39, +39 @@

  = Prepare Range Partitioning =

- In order to perform a parallel sort on the data, we need to range-partition it. The idea is to divide the space of row keys up into nearly equal-sized ranges, one per reducer. The details will vary according to your source data, and you may need to run a number of exploratory Hive queries in order to come up with a good enough set of ranges. As an oversimplified example, suppose your row keys are transaction timestamps, you have a year's worth of data starting from January, your data growth is constant month-over-month, and you want to run 12 reducers. In that case, you could use a query such as this one:
+ In order to perform a parallel sort on the data, we need to range-partition it. The idea is to divide the space of row keys up into nearly equal-sized ranges, one per reducer. The details will vary according to your source data, and you may need to run a number of exploratory Hive queries in order to come up with a good enough set of ranges. As a highly contrived example, suppose your row keys are sequence-generated transaction ID strings (possibly with gaps), you have a year's worth of data starting from January, your data growth is constant month-over-month, and you want to run 12 reducers. In that case, you could use a query such as this one:

  {{{
  select transaction_id
@@ -51, +51 @@
  limit 11
  }}}

- Note that we only want 11 values for breaking the data into 12 ranges, so we drop the max timestamp for the last month. Also note that there are usually much cheaper ways to come up with good split keys; '''this is just an example'''.
+ Note that we only want 11 values for breaking the data into 12 ranges, so we drop the max transaction ID for the last month. Also note that the ORDER BY is necessary for producing the range start keys in ascending order.
+
+ ''Important:'' there are usually much cheaper ways to come up with good split keys; '''this is just an example to give you an idea of the kind of result your sampling query should produce'''.
+
+ In this example, our partitioning key is a single column. In theory, you could partition over a compound key, but if you're going to do this, it might be safest to pre-concatenate your keys into a single string column (maybe using a view) which will serve as your HBase rowkey; this will avoid potential inconsistencies in comparators later on.
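
For example, a minimal sketch of such a pre-concatenation view, assuming a hypothetical source table {{{raw_transactions}}} whose natural key is the column pair {{{(store_id, txn_seq)}}}:

{{{
-- Sketch only: raw_transactions, store_id, and txn_seq are hypothetical names.
-- If txn_seq is numeric, consider zero-padding it so that the string ordering
-- of the concatenated rowkey matches its numeric ordering.
create view transactions_keyed as
select concat(store_id, '~', cast(txn_seq as string)) as transaction_id,
       user_name,
       amount
from raw_transactions;
}}}

The sampling and sorting queries in the later steps would then select from {{{transactions_keyed}}} and treat {{{transaction_id}}} as the single-column rowkey.
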
+ Once you have your sampling query defined, the next step is to save its results to a properly formatted file which will be used in a subsequent step. To do this, run commands like the following:
+
+ {{{
+ create external table hb_range_keys(transaction_id_range_start string)
+ row format serde
+   'org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe'
+ stored as
+   inputformat
+     'org.apache.hadoop.mapred.TextInputFormat'
+   outputformat
+     'org.apache.hadoop.hive.ql.io.HiveNullValueSequenceFileOutputFormat'
+ location '/tmp/hb_range_keys';
+
+ insert overwrite table hb_range_keys
+ select transaction_id
+ from
+   (select month, max(transaction_id) as transaction_id
+    from transactions
+    group by month) m
+ order by transaction_id
+ limit 11;
+ }}}
+
+ The first command creates an external table defining the format of the file to be created; be sure to set the serde and inputformat/outputformat exactly as specified.
+
+ The second command populates it (using the sampling query previously defined). Usage of ORDER BY guarantees that a single file will be produced in directory {{{/tmp/hb_range_keys}}}. The filename is unknown, but it is necessary to reference the file by name later, so run a command such as the following to copy it to a specific name:
+
+ {{{
+ dfs -cp /tmp/hb_range_keys/* /tmp/hb_range_key_list
+ }}}

  = Prepare Staging Location =

+ The sort is going to produce a lot of data, so make sure you have sufficient space in your HDFS cluster, and choose the location where the files will be staged. We'll use {{{/tmp/hbsort}}} in this example.
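
For example (a quick sketch, not a required part of the procedure), you can check how much space the cluster has left and clear out any leftovers from a previous attempt before launching the sort:

{{{
# Show configured capacity and DFS remaining for the cluster.
hadoop dfsadmin -report

# Remove stale output from an earlier run so /tmp/hbsort starts out empty.
hadoop fs -rmr /tmp/hbsort
}}}
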
  = Sort Data =

+ Now comes the big step: running a sort over all of the data to be bulk loaded. Make sure that your Hive instance has {{{hive_hbase-handler.jar}}} available on its auxpath.
+
+ {{{
+ set mapred.reduce.tasks=12;
+ set hive.mapred.partitioner=org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
+ set total.order.partitioner.natural.order=false;
+ set total.order.partitioner.path=/tmp/hb_range_key_list;
+
+ create table hbsort(transaction_id string, user_name string, amount double, ...)
+ stored as
+   INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
+   OUTPUTFORMAT 'org.apache.hadoop.hive.hbase.HiveHFileOutputFormat'
+ TBLPROPERTIES ('hfile.family.path' = '/tmp/hbsort/cf');
+
+ insert overwrite table hbsort
+ select transaction_id, user_name, amount, ...
+ from transactions
+ cluster by transaction_id;
+ }}}
+
+ The CREATE TABLE creates a dummy table which controls how the output of the sort is written. Note that it uses {{{HiveHFileOutputFormat}}} to do this, with the table property {{{hfile.family.path}}} used to control the destination directory for the output. Again, be sure to set the inputformat/outputformat exactly as specified.
+
+ The {{{cf}}} in the path specifies the name of the column family which will be created in HBase, so the directory name you choose here is important. (Note that we're not actually using an HBase table here; {{{HiveHFileOutputFormat}}} writes directly to files.)
+
+ The CLUSTER BY clause provides the keys to be used by the partitioner; be sure that it matches the range partitioning that you came up with in the earlier step.
+
+ The first column in the SELECT list is interpreted as the rowkey; subsequent columns become cell values (all in a single column family), so their column names are important.

  = Run HBase Script =

+ Once the sort job completes successfully, one final step is required for importing the result files into HBase.
+
+ If Hive and HBase are running in different clusters, use [[http://hadoop.apache.org/common/docs/current/distcp.html|distcp]] to copy the files from one to the other.
+
+ Once the files are in the HBase cluster, use the {{{bin/loadtable.rb}}} script which comes with HBase to import them:
+
+ {{{
+ hbase org.jruby.Main loadtable.rb transactions /tmp/hbsort
+ }}}
+
+ The first argument ({{{transactions}}}) specifies the name of the new HBase table. For the second argument, pass the staging directory name, not the column family child directory.
+
+ After this script finishes, you may need to wait a minute or two for the new table to be picked up by the HBase meta scanner. Use the hbase shell to verify that the new table was created correctly, and do some sanity queries to locate individual cells and make sure they can be found.
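
For example, sanity checks along these lines can be run from the hbase shell (a sketch; replace the row key shown with one you know exists in your data):

{{{
$ hbase shell
hbase> describe 'transactions'              # confirm the table exists with the 'cf' column family
hbase> scan 'transactions', {LIMIT => 5}    # spot-check a few rows
hbase> get 'transactions', 'some-known-transaction-id'
}}}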