Subject: how to load big files into HBase without crashing?
From: "Clements, Michael" <Michael.Clements@disney.com>
To: mapreduce-user@hadoop.apache.org
Date: Tue, 12 Jan 2010 18:53:05 +0000

I have a 15-node Hadoop cluster that works for most jobs, but every time I upload a large data file into HBase, the job fails. My surmise is that the file (15 GB) is big enough that the job spawns many tasks (about 55 at once), and together they swamp the region server processes.

Each cluster node is also an HBase region server, so at a minimum each region server sees about 4 tasks. But when the table is small there are only a few regions, so each region server that hosts one must serve many more tasks. For example, if the table starts out empty there is a single region, and that one region server has to handle calls from all 55 tasks. It can't handle the load, the tasks give up, and the job fails.

This is just conjecture on my part. Does it sound reasonable? If so, what methods are there to prevent it? Limiting the number of tasks for the upload job is one obvious solution, but what is a good limit? The more general question is: how many map tasks can a typical region server support?
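For concreteness, this is roughly how I'm capping the maps by hand today (a minimal sketch against the 0.20-era JobConf API; the driver class name and the cap of 8 are placeholders I made up, and setNumMapTasks() is only a hint, not a hard limit):

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class UploadJobDriver {  // hypothetical driver class name
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(UploadJobDriver.class);
            conf.setJobName("hbase-upload");

            // Cap the maps by hand. This is only a hint: with the old API,
            // FileInputFormat never makes a split larger than one HDFS
            // block, so a 15 GB file can still produce a map per block.
            conf.setNumMapTasks(8);  // 8 is an arbitrary guess

            // ... set the mapper, input path, TableOutputFormat, etc. ...
            JobClient.runJob(conf);
        }
    }

Even if that works, picking the number 8 (or anything else) is the part I don't know how to do in a principled way.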
Limiting the number of tasks is tedious and error-prone: somebody has to look at the HBase table, see how many regions it has and on which servers, and configure the job by hand to match. And if the job is big enough, the number of regions will grow while it runs, so the initial task counts won't stay ideal anyway.

Ideally, the Hadoop framework would be smart enough to look at how many regions and region servers exist and dynamically allocate a reasonable number of tasks (the P.S. below sketches the kind of thing I mean). Does the community have any knowledge or techniques for handling this?

Thanks,

Michael Clements
Solutions Architect
michael.clements@disney.com
206-664-4374 office
360-317-5051 mobile
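P.S. To make "dynamically allocate" concrete, here is the kind of driver I was imagining (a rough sketch assuming HBase 0.20's client API; the class name, the table name "uploads", and the TASKS_PER_REGION value are all invented for illustration):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.mapred.JobConf;

    public class RegionAwareDriver {  // hypothetical class name
        // Pure guesswork: how many concurrent writers one region can absorb.
        private static final int TASKS_PER_REGION = 2;

        public static void main(String[] args) throws Exception {
            HBaseConfiguration hconf = new HBaseConfiguration();
            HTable table = new HTable(hconf, "uploads");  // placeholder table

            // One start key per region, so this is the current region count.
            int regions = table.getStartKeys().length;

            JobConf job = new JobConf(RegionAwareDriver.class);
            job.setNumMapTasks(Math.max(1, regions * TASKS_PER_REGION));
            // ... configure the rest of the upload job and submit as usual ...
        }
    }

The obvious weakness is that this only samples the region count at submit time, so a long job that splits regions mid-run would still outgrow whatever number it picks.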