From: Ted Yu
To: dev@hbase.apache.org
Date: Mon, 17 Dec 2012 19:28:14 -0800
Subject: Re: HBase Map/Reduce Data Ingest Performance

Experts from Cloudera would be more familiar with security in
hadoop-0.20.2-cdh3u.

If you can show us the exception (using pastebin, for example), that would
help find the root cause.

Cheers

On Mon, Dec 17, 2012 at 6:30 PM, Upender K. Nimbekar <
upender.kumar@gmail.com> wrote:

> Thanks! I'm calling doBulkLoad() from the mapper cleanup() method, but I
> am running into permission issues when the hbase user tries to import the
> HFiles into HBase. I am not sure whether there is a way to change the
> permissions of the target HDFS files via HFileOutputFormat.
>
>
> On Mon, Dec 17, 2012 at 7:52 PM, Ted Yu wrote:
>
> > I think the second approach is better.
> >
> > Cheers
> >
> > On Mon, Dec 17, 2012 at 11:11 AM, Upender K. Nimbekar <
> > upender.kumar@gmail.com> wrote:
> >
> > > Sure. I can try that. Just curious: out of these two strategies,
> > > which one do you think is better? Do you have any experience of
> > > trying one or the other?
> > >
> > > Thanks
> > > Upen
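For context, a minimal sketch (not from the thread) of the kind of driver-side workaround being discussed: relax permissions on the HFile output directory before handing it to LoadIncrementalHFiles.doBulkLoad(), so the region servers (running as the hbase user) can move the files into place. The table name, output path, and the world-writable permission choice are illustrative assumptions, not what the original poster ran.

// Sketch: open up permissions on the HFile output directory, then bulk-load
// from the driver. Table name and path below are hypothetical placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoadDriver {

  // Recursively grant rwx to everyone under the HFile output directory so
  // the hbase user can rename the files during the bulk load.
  private static void openPermissions(FileSystem fs, Path dir) throws Exception {
    FsPermission all = new FsPermission(FsAction.ALL, FsAction.ALL, FsAction.ALL);
    fs.setPermission(dir, all);
    for (FileStatus stat : fs.listStatus(dir)) {
      if (stat.isDir()) {
        openPermissions(fs, stat.getPath());
      } else {
        fs.setPermission(stat.getPath(), all);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Path hfileDir = new Path("/tmp/hfile-output");   // hypothetical output dir
    openPermissions(FileSystem.get(conf), hfileDir);

    HTable table = new HTable(conf, "my_table");     // hypothetical table
    new LoadIncrementalHFiles(conf).doBulkLoad(hfileDir, table);
  }
}

Whether world-writable permissions are acceptable depends on the cluster's security setup; on a secured cluster a group-based arrangement between the submitting user and the hbase user would be preferable.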
> > >
> > > On Mon, Dec 17, 2012 at 12:45 PM, Ted Yu wrote:
> > >
> > > > Thanks for sharing your experiences.
> > > >
> > > > Have you considered upgrading to HBase 0.92 or 0.94? There have been
> > > > several bug fixes and enhancements to the
> > > > LoadIncrementalHFiles.doBulkLoad() API in newer HBase releases.
> > > >
> > > > Cheers
> > > >
> > > > On Mon, Dec 17, 2012 at 7:34 AM, Upender K. Nimbekar <
> > > > upender.kumar@gmail.com> wrote:
> > > >
> > > > > Hi All,
> > > > > I have a question about improving Map/Reduce job performance while
> > > > > ingesting huge amounts of data into HBase using HFileOutputFormat.
> > > > > Here is what we are using:
> > > > >
> > > > > 1) *Cloudera hadoop-0.20.2-cdh3u*
> > > > > 2) *hbase-0.90.40cdh3u2*
> > > > >
> > > > > I've used two different strategies, as described below:
> > > > >
> > > > > *Strategy #1:* Pre-split the table with 10 regions per region
> > > > > server, then kick off the Hadoop job with
> > > > > HFileOutputFormat.configureIncrementalLoad. This mechanism creates
> > > > > reduce tasks equal to the number of regions (10 per region server).
> > > > > We used the hash of each record as the map output key. Each mapper
> > > > > finished in an acceptable amount of time, but the reduce tasks took
> > > > > forever: first the copy/shuffle phase took considerable time, and
> > > > > then the sort phase took forever to finish.
> > > > > We tried to address this by constructing the key as
> > > > > "fixedhash1"_"hash2", where "fixedhash1" is fixed for all the
> > > > > records of a given mapper. The idea was to reduce shuffling and
> > > > > copying from each mapper, but even this did not save us any time
> > > > > and the reduce step still took a significant amount of time. I
> > > > > played with adjusting the number of pre-split regions in both
> > > > > directions, but to no avail. This led us to Strategy #2, where we
> > > > > got rid of the reduce step.
> > > > >
> > > > > *QUESTION:* Is there anything I could have done better in this
> > > > > strategy to make the reduce step finish faster? Do I need to
> > > > > produce row keys differently than "hash1"_"hash2" of the text? Is
> > > > > it a known issue with CDH3 or HBase 0.90? Please help me
> > > > > troubleshoot.
> > > > >
> > > > > *Strategy #2:* Pre-split the table with 10 regions per region
> > > > > server, then kick off the Hadoop job with
> > > > > HFileOutputFormat.configureIncrementalLoad, but set the number of
> > > > > reducers to 0. In this strategy (the current one), I pre-sorted all
> > > > > the mapper output using a TreeSet before writing it out. With the
> > > > > number of reducers set to 0, the mappers write directly to HFiles.
> > > > > This was great because the job (with no reduce phase) finished very
> > > > > fast and the HFiles were written very quickly. Then I used the
> > > > > *LoadIncrementalHFiles.doBulkLoad()* API to move the HFiles into
> > > > > HBase, calling it from the driver class on successful completion of
> > > > > the job. This works much better than Strategy #1 in terms of
> > > > > performance, but the doBulkLoad() call in the driver sometimes
> > > > > takes a long time when there is a huge amount of data.
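For reference, a minimal sketch of what the Strategy #2 mapper described above might look like: KeyValues are buffered in a TreeSet ordered by KeyValue.COMPARATOR and emitted sorted in cleanup(), so that with zero reducers HFileOutputFormat still receives its keys in order. The MD5-hex row key derivation and the column family/qualifier names are illustrative assumptions.

// Sketch of a sorted-output mapper for the zero-reducer bulk-load path.
// Row-key derivation and column names are made up for illustration.
import java.io.IOException;
import java.util.TreeSet;

import org.apache.commons.codec.digest.DigestUtils;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SortedIngestMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {

  // Buffer ordered by KeyValue.COMPARATOR; the whole input split is held in
  // memory, so the split size bounds the mapper's heap usage.
  private final TreeSet<KeyValue> buffer = new TreeSet<KeyValue>(KeyValue.COMPARATOR);

  @Override
  protected void map(LongWritable offset, Text line, Context context) {
    // Hash-based row key, as in the thread; family/qualifier are hypothetical.
    byte[] row = Bytes.toBytes(DigestUtils.md5Hex(line.toString()));
    buffer.add(new KeyValue(row, Bytes.toBytes("cf"), Bytes.toBytes("raw"),
        Bytes.toBytes(line.toString())));
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // Emit in KeyValue order; with numReduceTasks = 0 these go straight to HFiles.
    for (KeyValue kv : buffer) {
      context.write(new ImmutableBytesWritable(kv.getRow()), kv);
    }
  }
}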
> > > > >
> > > > > *QUESTION:* Is there any way to make doBulkLoad() run faster? Can I
> > > > > call this API from the mapper directly, instead of waiting for the
> > > > > whole job to finish first? I've also used the HBase
> > > > > "completebulkload" utility, but it has two issues. First, I do not
> > > > > see any performance improvement with it. Second, it needs to be run
> > > > > separately from the Hadoop job driver class, and we wanted to
> > > > > integrate the two pieces, so we used
> > > > > *LoadIncrementalHFiles.doBulkLoad()*.
> > > > >
> > > > > Also, we used the HBase RegionSplitter to pre-split the regions,
> > > > > but HBase 0.90 doesn't have the option to pass ALGORITHM. Is that
> > > > > something we need to worry about?
> > > > >
> > > > > Please help point me in the right direction to address this
> > > > > problem.
> > > > >
> > > > > Thanks
> > > > > Upen
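On the RegionSplitter point, one possible alternative (a sketch under stated assumptions, not necessarily what was run) is to pre-split the table at creation time with explicit split keys via HBaseAdmin.createTable(desc, splitKeys), which sidesteps the missing ALGORITHM option. Since the row keys in this thread are hex hashes, evenly spaced hex prefixes are a plausible boundary choice. The table name, column family, and region count below are assumptions.

// Sketch: create a pre-split table with evenly spaced hex-prefix boundaries,
// suitable for hex-hash row keys. Names and counts are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("my_table");  // hypothetical
    desc.addFamily(new HColumnDescriptor("cf"));

    int numRegions = 100;            // e.g. 10 regions x 10 region servers
    byte[][] splits = new byte[numRegions - 1][];
    for (int i = 1; i < numRegions; i++) {
      // Evenly spaced 4-hex-digit prefixes over "0000".."ffff".
      long boundary = (0x10000L * i) / numRegions;
      splits[i - 1] = Bytes.toBytes(String.format("%04x", boundary));
    }
    admin.createTable(desc, splits);
  }
}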