hadoop-mapreduce-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: how to load big files into Hbase without crashing?
Date Tue, 12 Jan 2010 21:45:03 GMT
Michael,

This question should be addressed to the hbase-user mailing list, as it
is strictly about HBase's usage of MapReduce; the framework itself
doesn't have any knowledge of how the region servers are configured. I
CC'd it.

Uploading into an empty table is always a problem, as you saw, since
there's no load distribution. I would recommend instead writing the
HFiles directly, as documented here:
http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#bulk
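
To give you an idea, the driver for such a job looks roughly like the
untested sketch below: "MyKeyValueMapper" and the paths are made up, the
rest is from the org.apache.hadoop.hbase.mapreduce package and the new
mapreduce API, and the full setup (including the TotalOrderPartitioner
and the loadtable.rb load step) is in the link above:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.KeyValue;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class PrepareHFiles {
    public static void main(String[] args) throws Exception {
      Configuration conf = new HBaseConfiguration();  // HBaseConfiguration.create() in later versions
      Job job = new Job(conf, "prepare-hfiles");
      job.setJarByClass(MyKeyValueMapper.class);      // made-up mapper emitting row key / KeyValue pairs
      job.setMapperClass(MyKeyValueMapper.class);
      job.setMapOutputKeyClass(ImmutableBytesWritable.class);
      job.setMapOutputValueClass(KeyValue.class);
      job.setOutputFormatClass(HFileOutputFormat.class);  // write HFiles instead of going through the region servers
      FileInputFormat.addInputPath(job, new Path("/user/michael/input"));
      FileOutputFormat.setOutputPath(job, new Path("/user/michael/hfiles"));
      // The map output must be totally ordered by row key (TotalOrderPartitioner),
      // and the resulting HFiles are then loaded into the table with bin/loadtable.rb.
      job.waitForCompletion(true);
    }
  }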

Other information that would be useful to us: HBase/Hadoop versions,
hardware used, optimizations used to do the insert, and configuration files.

Thx,

J-D

On Tue, Jan 12, 2010 at 1:35 PM, Clements, Michael
<Michael.Clements@disney.com> wrote:
> This leads to one quick & easy question: how does one reduce the number
> of map tasks for a job? My goal is to limit the # of Map tasks so they
> don't overwhelm the HBase region servers.
>
> The Docs point in several directions.
>
> There's a method job.setNumReduceTasks(), but no setNumMapTasks().
>
> There is a job configuration setting, setNumMapTasks(), but it's
> deprecated, and its docs say it can only increase, not reduce, the
> number of tasks.
>
> There's InputFormat and its subclasses, which do the actual file splits,
> but no single method to simply set the number of splits. One would have
> to write a subclass that measures the total size of all input files,
> divides by the desired # of mappers, and splits it all up.
>
> The last option is not trivial, but it is doable. Before I jump in, I
> figured I'd ask if there is an easier way.
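>
> For example, would it be enough to just raise the minimum split size in
> the driver? Rough, untested sketch (new org.apache.hadoop.mapreduce API;
> the input path and the desired task count are placeholders, and "job" is
> the already-configured Job):
>
>   // Cap the number of map tasks by forcing bigger splits.
>   // Needs org.apache.hadoop.fs.{FileSystem,FileStatus,Path} and
>   // org.apache.hadoop.mapreduce.lib.input.FileInputFormat.
>   FileSystem fs = FileSystem.get(job.getConfiguration());
>   long totalBytes = 0;
>   for (FileStatus f : fs.listStatus(new Path("/user/michael/input"))) {
>     totalBytes += f.getLen();
>   }
>   int desiredMaps = 15;   // placeholder: roughly one task per region server
>   FileInputFormat.setMinInputSplitSize(job, Math.max(1L, totalBytes / desiredMaps));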
>
> Thanks
>
> -----Original Message-----
> From: mapreduce-user-return-267-Michael.Clements=disney.com@hadoop.apache.org
> [mailto:mapreduce-user-return-267-Michael.Clements=disney.com@hadoop.apache.org]
> On Behalf Of Clements, Michael
> Sent: Tuesday, January 12, 2010 10:53 AM
> To: mapreduce-user@hadoop.apache.org
> Subject: how to load big files into Hbase without crashing?
>
> I have a 15-node Hadoop cluster that works for most jobs, but every
> time I upload large data files into HBase, the job fails.
>
> I surmise that this file (15 GB in size) is big enough that it generates
> so many tasks (about 55 at once) that they swamp the region server
> processes.
>
> Each cluster node is also an HBase region server, so at a minimum each
> region server sees about 4 tasks. But when the table is small, there are
> few regions, so each region server that holds one is handling many more
> tasks. For example, if the table starts out empty there is a single
> region, so a single region server has to handle calls from all 55 tasks.
> It can't handle this, so the tasks give up and the job fails.
>
> This is just conjecture on my part. Does it sound reasonable?
>
> If so, what methods are there to prevent this? Limiting the number of
> tasks for the upload job is one obvious solution, but what is a good
> limit? The more general question is, how many map tasks can a typical
> region server support?
>
> Limiting the number of tasks is tedious and error-prone, as it requires
> somebody to look at the HBase table, see how many regions it has and on
> which servers they sit, and manually configure the job accordingly. If
> the job is big enough, the number of regions will grow during the job
> and the initial task counts won't be ideal anymore.
>
> Ideally, the Hadoop framework would be smart enough to look at how many
> regions & region servers exist and dynamically allocate a reasonable
> number of tasks.
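>
> Something along these lines is what I'm picturing (untested sketch;
> "mytable" is a placeholder and the constructors vary a bit between
> HBase versions):
>
>   // Size the upload job from the table's current region count instead of
>   // hard-coding a task limit. Needs org.apache.hadoop.hbase.HBaseConfiguration
>   // and org.apache.hadoop.hbase.client.HTable.
>   HBaseConfiguration conf = new HBaseConfiguration();
>   HTable table = new HTable(conf, "mytable");
>   int regions = table.getStartKeys().length;      // one start key per region
>   int desiredMaps = Math.max(1, regions * 2);     // e.g. ~2 concurrent writers per region
>   // ...then feed desiredMaps into the split-size calculation above.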
>
> Does the community have any knowledge or techniques to handle this?
>
> Thanks
>
> Michael Clements
> Solutions Architect
> michael.clements@disney.com
> 206 664-4374 office
> 360 317 5051 mobile
>
>
>
