Subject: RE: how to load big files into Hbase without crashing?
Date: Tue, 12 Jan 2010 21:35:34 +0000
From: "Clements, Michael" <Michael.Clements@disney.com>
To: mapreduce-user@hadoop.apache.org

This leads to one quick & easy question: how does one reduce the number of map tasks for a job?

My goal is to limit the number of map tasks so they don't overwhelm the HBase region servers.

The docs point in several directions. There's a method job.setNumReduceTasks(), but no setNumMapTasks(). There is a job configuration setting setNumMapTasks(), but it's deprecated, and the docs say it can only increase, not reduce, the number of tasks. There's InputFormat and its subclasses, which do the actual file splits, but there is no single method to simply set the number of splits. One would have to write his own subclass that measures the total size of all input files, divides by the desired number of mappers, and splits it all up.

That last option is not trivial, but it is doable. Before I jump in I figured I'd ask if there is an easier way.
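A minimal sketch of that "measure the total size and divide" idea, done from the driver rather than a full InputFormat subclass, assuming the new org.apache.hadoop.mapreduce API. The class name, helper name, and desired map count are illustrative placeholders, not an existing API; it also assumes a flat input directory, and you still get at least one split per input file.

    import java.io.IOException;

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizing {
        /** Raise the minimum split size so roughly desiredMaps splits are produced. */
        public static void capMapTasks(Job job, Path input, int desiredMaps)
                throws IOException {
            FileSystem fs = input.getFileSystem(job.getConfiguration());

            // Sum the sizes of the files directly under the input directory.
            long totalBytes = 0;
            for (FileStatus stat : fs.listStatus(input)) {
                totalBytes += stat.getLen();
            }

            // FileInputFormat picks splitSize = max(minSize, min(maxSize, blockSize)),
            // so pushing minSize above the block size makes splits bigger and maps fewer.
            long splitSize = Math.max(1L, totalBytes / Math.max(1, desiredMaps));
            FileInputFormat.setMinInputSplitSize(job, splitSize);
            FileInputFormat.addInputPath(job, input);
        }
    }

The trade-off is that fewer, larger splits mean longer-running map tasks and coarser failure recovery, but for an HBase bulk upload that is usually an acceptable price for not flooding the region servers.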
Thanks

-----Original Message-----
From: mapreduce-user-return-267-Michael.Clements=disney.com@hadoop.apache.org On Behalf Of Clements, Michael
Sent: Tuesday, January 12, 2010 10:53 AM
To: mapreduce-user@hadoop.apache.org
Subject: how to load big files into Hbase without crashing?

I have a 15-node Hadoop cluster that works for most jobs, but every time I upload large data files into HBase, the job fails.

I surmise that the file (15GB in size) is big enough that the resulting map tasks (about 55 at once) swamp the region server processes. Each cluster node is also an HBase region server, so there are at minimum about 4 tasks per region server. But when the table is small there are few regions, so each region server ends up serving many more tasks. For example, if the table starts out empty there is a single region, so a single region server has to handle calls from all 55 tasks. It can't handle this, the tasks give up, and the job fails.

This is just conjecture on my part. Does it sound reasonable? If so, what methods are there to prevent this?

Limiting the number of tasks for the upload job is one obvious solution, but what is a good limit? The more general question is: how many map tasks can a typical region server support?

Limiting the number of tasks by hand is also tedious and error-prone. It requires somebody to look at the HBase table, see how many regions it has and on which servers, and manually configure the job accordingly. If the job is big enough, the number of regions will grow during the job and the initial task counts won't be ideal anymore.

Ideally, the Hadoop framework would be smart enough to look at how many regions and region servers exist and dynamically allocate a reasonable number of tasks. Does the community have any knowledge or techniques to handle this?

Thanks

Michael Clements
Solutions Architect
michael.clements@disney.com
206 664-4374 office
360 317 5051 mobile
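A rough sketch of the "look at how many regions exist" idea, reusing the capMapTasks helper sketched earlier in the thread. It assumes the HBase 0.20-era client API (HTable.getStartKeys() to count regions); the class name, table name handling, and the maps-per-region factor are illustrative guesses, not a documented recipe.

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.mapreduce.Job;

    public class RegionAwareSizing {
        /**
         * Scale the upload job's map count to the table's current region count,
         * then cap the number of splits via the minimum split size.
         */
        public static void sizeToRegions(Job job, String tableName, Path input,
                int mapsPerRegion) throws IOException {
            HTable table = new HTable(
                    new HBaseConfiguration(job.getConfiguration()), tableName);

            // One start key per region, so this is the region count at submit time.
            int regions = table.getStartKeys().length;

            int desiredMaps = Math.max(1, regions * mapsPerRegion);
            SplitSizing.capMapTasks(job, input, desiredMaps);
        }
    }

This only captures the region count at job submission; as the mail points out, a big enough upload will split regions while it runs, so the chosen task count goes stale and has to be conservative.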