From: David Poisson
To: user@hbase.apache.org
Date: Fri, 31 May 2013 16:19:56 -0400
Subject: Best practices for loading data into hbase

Hi,

We are still very new at all of this hbase/hadoop/mapreduce stuff. We are looking for the best practices that fit our requirements. We are currently using the latest Cloudera VMware image (single node) for our development tests.

The problem is as follows: we have multiple sources in different formats (XML, CSV, etc.), which are dumps of existing systems. As one might expect, there will be an initial "import" of the data into hbase, and afterwards the systems would most likely dump whatever data they have accumulated since the initial import or since the last data dump.
Another thing: we require an intermediary step, so that we can ensure that all of a source's data has been successfully processed before it reaches production. The flow would look like:

XML data file --(MR job)--> intermediate (hbase table or hfile?) --(MR job)--> production tables in hbase

We're guessing we can't use something like a transaction in hbase, so we thought about using an intermediate step: is that how things are normally done?

As we import data into hbase, we will be populating several tables that link data parts together (account X in System 1 == account Y in System 2) as tuples in 3 tables. Currently this is done by a mapreduce job which reads the XML source and uses MultiTableOutputFormat to "put" data into those 3 hbase tables (a simplified sketch of the mapper is in the P.S. below). This method isn't that fast on our test sample (2 minutes for 5 MB), so we are looking at optimizing the loading of data.

We have been researching bulk loading, but we are unsure of a couple of things:

Once we have processed an XML file and populated our 3 "production" hbase tables, could we bulk load another XML file and append the new data to those tables, or would it overwrite what was written before?

In order to bulk load, we need to output a file using HFileOutputFormat. Since MultiHFileOutputFormat doesn't seem to officially exist yet (still in the works, right?), should we process our input XML file with 3 MapReduce jobs instead of 1 and output an hfile for each? Those hfiles could then become our intermediate step: if all 3 hfiles were created without errors, the run was successful and we bulk load them into hbase. (The second sketch in the P.S. shows the per-table flow we have in mind.)

Can you experiment with bulk loading inside a VMware image? We're experiencing problems with the partitions file not being found, with the following exception:

java.lang.Exception: java.lang.IllegalArgumentException: Can't read partitions file
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:404)
Caused by: java.lang.IllegalArgumentException: Can't read partitions file
    at org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner.setConf(TotalOrderPartitioner.java:108)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:70)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:130)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:588)

We also tried another idea to speed things up: what if, instead of doing individual puts, we passed a list of puts to put() (e.g. htable.put(putList))? Internally in hbase, would there be less overhead compared to multiple calls to put()? It seems to be faster; however, since we're not going through context.write, I'm guessing this will lead to problems later on, right? (The third sketch in the P.S. shows what we tried.)

Turning off WAL on puts to speed things up isn't an option, since data loss would be unacceptable, even if the chances of a failure occurring are slim.

Thanks,
David
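
P.S. In case it helps to see what we are describing, here are some simplified sketches. First, our current mapper feeding MultiTableOutputFormat. The table, family and field names are made up for illustration, and the split() stands in for our real XML parsing; with MultiTableOutputFormat, the output key names the table each Put goes to:

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class XmlToTablesMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

  // Output keys name the target table; MultiTableOutputFormat routes each
  // Put to the table named by its key.
  private static final ImmutableBytesWritable ACCOUNTS =
      new ImmutableBytesWritable(Bytes.toBytes("accounts"));
  private static final ImmutableBytesWritable LINKS =
      new ImmutableBytesWritable(Bytes.toBytes("links"));

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Stand-in for the real XML parsing: fields[0] is the account id,
    // fields[1] the account name, fields[2] the matching id in System 2.
    String[] fields = value.toString().split(",");
    byte[] row = Bytes.toBytes(fields[0]);

    Put accountPut = new Put(row);
    accountPut.add(Bytes.toBytes("d"), Bytes.toBytes("name"),
        Bytes.toBytes(fields[1]));
    context.write(ACCOUNTS, accountPut);

    Put linkPut = new Put(row);
    linkPut.add(Bytes.toBytes("d"), Bytes.toBytes("sys2id"),
        Bytes.toBytes(fields[2]));
    context.write(LINKS, linkPut);
  }
}

(The real job writes to a third table as well; the two here are enough to show the pattern.)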
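
Second, the per-table bulk-load flow we are considering: one job like this per production table. BulkLoadDriver, XmlToPutsMapper, the paths and the table name are placeholders; XmlToPutsMapper would emit (ImmutableBytesWritable row key, Put) pairs for a single table. If we read the code right, configureIncrementalLoad() is what sets up TotalOrderPartitioner and writes the partitions file that our exception above complains about:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "xml-to-hfiles");
    job.setJarByClass(BulkLoadDriver.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setMapperClass(XmlToPutsMapper.class); // placeholder single-table mapper
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    Path hfileDir = new Path(args[1]);
    FileOutputFormat.setOutputPath(job, hfileDir);

    HTable table = new HTable(conf, "accounts");
    // Sets the reducer, output format and TotalOrderPartitioner, and writes
    // the partitions file from the table's current region boundaries.
    HFileOutputFormat.configureIncrementalLoad(job, table);

    if (job.waitForCompletion(true)) {
      // Moves the finished HFiles into the table's regions. As we understand
      // it, this only adds store files, so an earlier load is not overwritten.
      new LoadIncrementalHFiles(conf).doBulkLoad(hfileDir, table);
    }
  }
}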
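
Third, the batched-put experiment: we buffer Puts in the mapper and flush them in chunks with htable.put(putList) instead of going through context.write (table name, family and batch size are again illustrative). Our understanding is that the list form groups puts into fewer RPCs per region server, which would explain the speedup we saw:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BatchedPutMapper
    extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

  private static final int BATCH_SIZE = 1000; // illustrative batch size

  private HTable table;
  private final List<Put> buffer = new ArrayList<Put>();

  @Override
  protected void setup(Context context) throws IOException {
    table = new HTable(HBaseConfiguration.create(context.getConfiguration()),
        "accounts"); // illustrative table name
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException {
    // Stand-in for the real XML parsing, as in the first sketch.
    String[] fields = value.toString().split(",");
    Put put = new Put(Bytes.toBytes(fields[0]));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("name"),
        Bytes.toBytes(fields[1]));
    buffer.add(put);
    if (buffer.size() >= BATCH_SIZE) {
      table.put(buffer); // one batched call instead of BATCH_SIZE calls
      buffer.clear();
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    if (!buffer.isEmpty()) {
      table.put(buffer); // flush whatever is left
    }
    table.close();
  }
}

Is this the pattern that will bite us later, given that the job's real output never goes through the framework?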