Subject: RE: how to load big files into Hbase without crashing?
Date: Tue, 12 Jan 2010 21:35:34 +0000
From: "Clements, Michael" <Michael.Clements@disney.com>
To: mapreduce-user@hadoop.apache.org

This leads to one quick & easy question: how does one reduce the number of map tasks for a job?

My goal is to limit the number of map tasks so they don't overwhelm the HBase region servers.

The docs point in several directions. There's a method job.setNumReduceTasks(), but no setNumMapTasks(). There is a job configuration setting setNumMapTasks(), but it's deprecated, and the docs say it can only increase, not reduce, the number of tasks. There's InputFormat and its subclasses, which do the actual file splits, but there is no single method to simply set the number of splits. One would have to write his own subclass that measures the total size of all input files, divides by the desired number of mappers, and splits it all up.

That last option is not trivial, but it is doable. Before I jump in I figured I'd ask if there is an easier way.
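A minimal sketch of that "measure the total size and divide" idea, done from the driver rather than a full InputFormat subclass, assuming the new org.apache.hadoop.mapreduce API. The class name, helper name, and desired map count are illustrative placeholders, not an existing API; it also assumes a flat input directory, and you still get at least one split per input file.

    import java.io.IOException;

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizing {
        /** Raise the minimum split size so roughly desiredMaps splits are produced. */
        public static void capMapTasks(Job job, Path input, int desiredMaps)
                throws IOException {
            FileSystem fs = input.getFileSystem(job.getConfiguration());

            // Sum the sizes of the files directly under the input directory.
            long totalBytes = 0;
            for (FileStatus stat : fs.listStatus(input)) {
                totalBytes += stat.getLen();
            }

            // FileInputFormat picks splitSize = max(minSize, min(maxSize, blockSize)),
            // so pushing minSize above the block size makes splits bigger and maps fewer.
            long splitSize = Math.max(1L, totalBytes / Math.max(1, desiredMaps));
            FileInputFormat.setMinInputSplitSize(job, splitSize);
            FileInputFormat.addInputPath(job, input);
        }
    }

The trade-off is that fewer, larger splits mean longer-running map tasks and coarser failure recovery, but for an HBase bulk upload that is usually an acceptable price for not flooding the region servers.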
Thanks

-----Original Message-----
From: mapreduce-user-return-267-Michael.Clements=disney.com@hadoop.apache.org On Behalf Of Clements, Michael
Sent: Tuesday, January 12, 2010 10:53 AM
To: mapreduce-user@hadoop.apache.org
Subject: how to load big files into Hbase without crashing?

I have a 15-node Hadoop cluster that works for most jobs, but every time I upload large data files into HBase, the job fails.

I surmise that the file (15GB in size) is big enough that the resulting map tasks (about 55 at once) swamp the region server processes. Each cluster node is also an HBase region server, so there are at minimum about 4 tasks per region server. But when the table is small there are few regions, so each region server ends up serving many more tasks. For example, if the table starts out empty there is a single region, so a single region server has to handle calls from all 55 tasks. It can't handle this, the tasks give up, and the job fails.

This is just conjecture on my part. Does it sound reasonable? If so, what methods are there to prevent this?

Limiting the number of tasks for the upload job is one obvious solution, but what is a good limit? The more general question is: how many map tasks can a typical region server support?

Limiting the number of tasks by hand is also tedious and error-prone. It requires somebody to look at the HBase table, see how many regions it has and on which servers, and manually configure the job accordingly. If the job is big enough, the number of regions will grow during the job and the initial task counts won't be ideal anymore.

Ideally, the Hadoop framework would be smart enough to look at how many regions and region servers exist and dynamically allocate a reasonable number of tasks. Does the community have any knowledge or techniques to handle this?

Thanks

Michael Clements
Solutions Architect
michael.clements@disney.com
206 664-4374 office
360 317 5051 mobile
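A rough sketch of the "look at how many regions exist" idea, reusing the capMapTasks helper sketched earlier in the thread. It assumes the HBase 0.20-era client API (HTable.getStartKeys() to count regions); the class name, table name handling, and the maps-per-region factor are illustrative guesses, not a documented recipe.

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.mapreduce.Job;

    public class RegionAwareSizing {
        /**
         * Scale the upload job's map count to the table's current region count,
         * then cap the number of splits via the minimum split size.
         */
        public static void sizeToRegions(Job job, String tableName, Path input,
                int mapsPerRegion) throws IOException {
            HTable table = new HTable(
                    new HBaseConfiguration(job.getConfiguration()), tableName);

            // One start key per region, so this is the region count at submit time.
            int regions = table.getStartKeys().length;

            int desiredMaps = Math.max(1, regions * mapsPerRegion);
            SplitSizing.capMapTasks(job, input, desiredMaps);
        }
    }

This only captures the region count at job submission; as the mail points out, a big enough upload will split regions while it runs, so the chosen task count goes stale and has to be conservative.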