From: "Upender K. Nimbekar"
To: dev@hbase.apache.org
Date: Mon, 17 Dec 2012 10:34:42 -0500
Subject: HBase Map/Reduce Data Ingest Performance

Hi All,

I have a question about improving Map/Reduce job performance while ingesting a huge amount of data into HBase using HFileOutputFormat. Here is what we are using:

1) *Cloudera hadoop-0.20.2-cdh3u*
2) *hbase-0.90.4-cdh3u2*

I've used 2 different strategies, as described below:

*Strategy #1:* Pre-split the regions with 10 regions per region server, and then kick off the Hadoop job with HFileOutputFormat.configureIncrementalLoad. This mechanism creates one reduce task per region (so 10 per region server in our case). We used the hash of each record as the map output key. With this setup each mapper finished in an acceptable amount of time, but the reduce tasks took forever. We found that first the copy/shuffle phase took a considerable amount of time, and then the sort phase took even longer. We tried to address this by constructing the key as "fixedhash1"_"hash2", where "fixedhash1" is fixed for all the records of a given mapper. The idea was to reduce the shuffling/copying from each mapper. But even this solution didn't save us any time, and the reduce step still took a significant amount of time to finish. I played with adjusting the number of pre-split regions in both directions, but to no avail. This led us to move to Strategy #2, where we got rid of the reduce step altogether.
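To make that concrete, the Strategy #1 driver and mapper look roughly like this. This is a trimmed-down sketch, not our production code: the names (BulkIngestDriver, HashKeyMapper, the "ingest_table" table, the "cf"/"raw" column) are placeholders and the real record parsing is omitted.

  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.KeyValue;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
  import org.apache.hadoop.hbase.util.Bytes;
  import org.apache.hadoop.hbase.util.MD5Hash;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class BulkIngestDriver {

    // Illustrative mapper: each input line becomes one cell.
    public static class HashKeyMapper
        extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        // "fixedhash1" is constant for all records of a given mapper
        // (derived here from the task id); "hash2" is the per-record hash.
        String fixedHash1 = MD5Hash.getMD5AsHex(
            Bytes.toBytes(context.getTaskAttemptID().getTaskID().toString()));
        String hash2 = MD5Hash.getMD5AsHex(Bytes.toBytes(line.toString()));
        byte[] row = Bytes.toBytes(fixedHash1 + "_" + hash2);
        KeyValue kv = new KeyValue(row, Bytes.toBytes("cf"), Bytes.toBytes("raw"),
            Bytes.toBytes(line.toString()));
        context.write(new ImmutableBytesWritable(row), kv);
      }
    }

    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      Job job = new Job(conf, "hbase-bulk-ingest");
      job.setJarByClass(BulkIngestDriver.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));

      job.setMapperClass(HashKeyMapper.class);
      job.setMapOutputKeyClass(ImmutableBytesWritable.class);
      job.setMapOutputValueClass(KeyValue.class);

      // The table was pre-split up front with ~10 regions per region server.
      HTable table = new HTable(conf, "ingest_table");

      // Wires in HFileOutputFormat, TotalOrderPartitioner and KeyValueSortReducer,
      // and sets one reduce task per region of the pre-split table.
      HFileOutputFormat.configureIncrementalLoad(job, table);

      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }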
*QUESTION:* Is there anything I could have done better in this strategy to make the reduce step finish faster? Do I need to produce row keys differently than "hash1"_"hash2" of the text? Is it a known issue with CDH3 or HBase 0.90? Please help me troubleshoot.

*Strategy #2:* Pre-split the regions with 10 regions per region server, and then kick off the Hadoop job with HFileOutputFormat.configureIncrementalLoad, but set the number of reducers to 0. In this strategy (the current one), I pre-sort all of a mapper's records in a TreeSet before writing them out. With the number of reducers set to 0, the mappers write directly to HFiles. This was cool because the job (with no reduce phase) finished very fast, and we noticed the HFiles got written very quickly. Then I used the *LoadIncrementalHFiles.doBulkLoad()* API to move the HFiles into HBase, calling it from the driver class on successful completion of the job. This is working much better than Strategy #1 in terms of performance, but the doBulkLoad() call in the driver sometimes takes a long time when there is a huge amount of data.

*QUESTION:* Is there any way to make the doBulkLoad() run faster? Can I call this API from the mapper directly, instead of waiting for the whole job to finish first? I've also used the HBase "completebulkload" utility, but it has two issues. First, I do not see any performance improvement with it. Second, it needs to be run separately from the Hadoop job driver class, and we wanted to integrate the two pieces, which is why we call *LoadIncrementalHFiles.doBulkLoad()* directly.

Also, we used the HBase RegionSplitter to pre-split the regions, but the HBase 0.90 version doesn't have the option to pass an ALGORITHM. Is that something we need to worry about?

Please point me in the right direction to address this problem.

Thanks
Upen
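P.S. In case it helps, the Strategy #2 driver is roughly the following. Again, this is a sketch rather than our exact code: MapOnlyBulkIngestDriver, PreSortingMapper (which buffers each split's cells in a TreeSet and emits them in row-key order, since HFileOutputFormat needs sorted keys) and the table/path names are placeholders.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.KeyValue;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
  import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class MapOnlyBulkIngestDriver {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      Path input = new Path(args[0]);
      Path hfileDir = new Path(args[1]);

      Job job = new Job(conf, "hbase-bulk-ingest-map-only");
      job.setJarByClass(MapOnlyBulkIngestDriver.class);
      FileInputFormat.addInputPath(job, input);
      FileOutputFormat.setOutputPath(job, hfileDir);

      // Mapper emits <row key, KeyValue>, pre-sorted per split via a TreeSet.
      job.setMapperClass(PreSortingMapper.class);
      job.setMapOutputKeyClass(ImmutableBytesWritable.class);
      job.setMapOutputValueClass(KeyValue.class);

      HTable table = new HTable(conf, "ingest_table");   // pre-split table
      HFileOutputFormat.configureIncrementalLoad(job, table);

      // configureIncrementalLoad sets one reducer per region, so zero the
      // reducers afterwards to make the job map-only.
      job.setNumReduceTasks(0);

      if (!job.waitForCompletion(true)) {
        System.exit(1);
      }

      // Hand the finished HFiles to the region servers; this is the step
      // that gets slow for us with very large data sets.
      LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
      loader.doBulkLoad(hfileDir, table);
    }
  }

The one thing to note is that setNumReduceTasks(0) comes after configureIncrementalLoad, because configureIncrementalLoad would otherwise leave one reducer per region configured.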