From: Apache Wiki
To: Apache Wiki
Date: Fri, 09 Apr 2010 19:45:48 -0000
Message-ID: <20100409194548.19240.86955@eos.apache.org>
Subject: [Hadoop Wiki] Update of "Hive/HBaseBulkLoad" by JohnSichi

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/HBaseBulkLoad" page has been changed by JohnSichi.
http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad?action=diff&rev1=3&rev2=4

--------------------------------------------------
  SET hive.hbase.bulk=true;

  INSERT OVERWRITE new_hbase_table
- SELECT rowid_expression, x, y FROM ...any_hive_query...;
+ SELECT rowkey_expression, x, y FROM ...any_hive_query...;
  }}}

  However, things aren't ''quite'' as straightforward as that yet. Instead, a procedure involving a series of SQL commands is required. It should still be a lot easier and more flexible than writing your own map/reduce program, and over time we hope to enhance Hive to move closer to the ideal.

@@ -39, +39 @@

  = Prepare Range Partitioning =

- In order to perform a parallel sort on the data, we need to range-partition it. The idea is to divide the space of row keys up into nearly equal-sized ranges, one per reducer. The details will vary according to your source data, and you may need to run a number of exploratory Hive queries in order to come up with a good enough set of ranges. As an oversimplified example, suppose your row keys are transaction timestamps, you have a year's worth of data starting from January, your data growth is constant month-over-month, and you want to run 12 reducers. In that case, you could use a query such as this one:
+ In order to perform a parallel sort on the data, we need to range-partition it. The idea is to divide the space of row keys up into nearly equal-sized ranges, one per reducer. The details will vary according to your source data, and you may need to run a number of exploratory Hive queries in order to come up with a good enough set of ranges. As a highly contrived example, suppose your row keys are sequence-generated transaction ID strings (possibly with gaps), you have a year's worth of data starting from January, your data growth is constant month-over-month, and you want to run 12 reducers. In that case, you could use a query such as this one:

  {{{
  select transaction_id
@@ -51, +51 @@
  limit 11
  }}}

- Note that we only want 11 values for breaking the data into 12 ranges, so we drop the max timestamp for the last month. Also note that there are usually much cheaper ways to come up with good split keys; '''this is just an example'''.
+ Note that we only want 11 values for breaking the data into 12 ranges, so we drop the max transaction ID for the last month. Also note that the ORDER BY is necessary for producing the range start keys in ascending order.
+
+ ''Important:'' there are usually much cheaper ways to come up with good split keys; '''this is just an example to give you an idea of the kind of result your sampling query should produce'''.
+
+ In this example, our partitioning key is a single column. In theory, you could partition over a compound key, but if you're going to do this, it might be safest to pre-concatenate your keys into a single string column (maybe using a view) which will serve as your HBase rowkey; this will avoid potential inconsistencies in comparators later on.
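
For example, a minimal sketch of such a pre-concatenation view, assuming a hypothetical source table {{{raw_transactions}}} whose natural key is the column pair {{{(store_id, txn_seq)}}}:

{{{
-- Sketch only: raw_transactions, store_id, and txn_seq are hypothetical names.
-- If txn_seq is numeric, consider zero-padding it so that the string ordering
-- of the concatenated rowkey matches its numeric ordering.
create view transactions_keyed as
select concat(store_id, '~', cast(txn_seq as string)) as transaction_id,
       user_name,
       amount
from raw_transactions;
}}}

The sampling and sorting queries in the later steps would then select from {{{transactions_keyed}}} and treat {{{transaction_id}}} as the single-column rowkey.
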
+ Once you have your sampling query defined, the next step is to save its results to a properly formatted file which will be used in a subsequent step. To do this, run commands like the following:
+
+ {{{
+ create external table hb_range_keys(transaction_id_range_start string)
+ row format serde
+   'org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe'
+ stored as
+   inputformat
+     'org.apache.hadoop.mapred.TextInputFormat'
+   outputformat
+     'org.apache.hadoop.hive.ql.io.HiveNullValueSequenceFileOutputFormat'
+ location '/tmp/hb_range_keys';
+
+ insert overwrite table hb_range_keys
+ select transaction_id
+ from
+   (select month, max(transaction_id) as transaction_id
+    from transactions
+    group by month) m
+ order by transaction_id
+ limit 11;
+ }}}
+
+ The first command creates an external table defining the format of the file to be created; be sure to set the serde and inputformat/outputformat exactly as specified.
+
+ The second command populates it (using the sampling query previously defined). Usage of ORDER BY guarantees that a single file will be produced in directory {{{/tmp/hb_range_keys}}}. The filename is unknown, but it is necessary to reference the file by name later, so run a command such as the following to copy it to a specific name:
+
+ {{{
+ dfs -cp /tmp/hb_range_keys/* /tmp/hb_range_key_list
+ }}}

  = Prepare Staging Location =

+ The sort is going to produce a lot of data, so make sure you have sufficient space in your HDFS cluster, and choose the location where the files will be staged. We'll use {{{/tmp/hbsort}}} in this example.
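
For example (a quick sketch, not a required part of the procedure), you can check how much space the cluster has left and clear out any leftovers from a previous attempt before launching the sort:

{{{
# Show configured capacity and DFS remaining for the cluster.
hadoop dfsadmin -report

# Remove stale output from an earlier run so /tmp/hbsort starts out empty.
hadoop fs -rmr /tmp/hbsort
}}}
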
  = Sort Data =

+ Now comes the big step: running a sort over all of the data to be bulk loaded. Make sure that your Hive instance has {{{hive_hbase-handler.jar}}} available on its auxpath.
+
+ {{{
+ set mapred.reduce.tasks=12;
+ set hive.mapred.partitioner=org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
+ set total.order.partitioner.natural.order=false;
+ set total.order.partitioner.path=/tmp/hb_range_key_list;
+
+ create table hbsort(transaction_id string, user_name string, amount double, ...)
+ stored as
+   INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
+   OUTPUTFORMAT 'org.apache.hadoop.hive.hbase.HiveHFileOutputFormat'
+ TBLPROPERTIES ('hfile.family.path' = '/tmp/hbsort/cf');
+
+ insert overwrite table hbsort
+ select transaction_id, user_name, amount, ...
+ from transactions
+ cluster by transaction_id;
+ }}}
+
+ The CREATE TABLE creates a dummy table which controls how the output of the sort is written. Note that it uses {{{HiveHFileOutputFormat}}} to do this, with the table property {{{hfile.family.path}}} used to control the destination directory for the output. Again, be sure to set the inputformat/outputformat exactly as specified.
+
+ The {{{cf}}} in the path specifies the name of the column family which will be created in HBase, so the directory name you choose here is important. (Note that we're not actually using an HBase table here; {{{HiveHFileOutputFormat}}} writes directly to files.)
+
+ The CLUSTER BY clause provides the keys to be used by the partitioner; be sure that it matches the range partitioning that you came up with in the earlier step.
+
+ The first column in the SELECT list is interpreted as the rowkey; subsequent columns become cell values (all in a single column family), so their column names are important.

  = Run HBase Script =

+ Once the sort job completes successfully, one final step is required for importing the result files into HBase.
+
+ If Hive and HBase are running in different clusters, use [[http://hadoop.apache.org/common/docs/current/distcp.html|distcp]] to copy the files from one to the other.
+
+ Once the files are in the HBase cluster, use the {{{bin/loadtable.rb}}} script which comes with HBase to import them:
+
+ {{{
+ hbase org.jruby.Main loadtable.rb transactions /tmp/hbsort
+ }}}
+
+ The first argument ({{{transactions}}}) specifies the name of the new HBase table. For the second argument, pass the staging directory name, not the column family child directory.
+
+ After this script finishes, you may need to wait a minute or two for the new table to be picked up by the HBase meta scanner. Use the hbase shell to verify that the new table was created correctly, and do some sanity queries to locate individual cells and make sure they can be found.
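
For example, sanity checks along these lines can be run from the hbase shell (a sketch; replace the row key shown with one you know exists in your data):

{{{
$ hbase shell
hbase> describe 'transactions'              # confirm the table exists with the 'cf' column family
hbase> scan 'transactions', {LIMIT => 5}    # spot-check a few rows
hbase> get 'transactions', 'some-known-transaction-id'
}}}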