hadoop-user mailing list archives

From jeremy p <athomewithagroove...@gmail.com>
Subject Re: What happens when you have fewer input files than mapper slots?
Date Tue, 19 Mar 2013 21:18:50 GMT
Thank you for your help.

We're using MRv1.  I've tried setting mapred.tasktracker.map.tasks.maximum
and mapred.map.tasks, and neither one helped me at all.

Per-job control is definitely what I need.  I need to be able to say, "For
Job A, only use one mapper per node, but for Job B, use 16 mappers per
node".  I have not found any way to do this.

I will definitely look into schedulers.  Are there any examples you can
point me to where someone does what I need to do?


On Tue, Mar 19, 2013 at 2:08 PM, Rahul Jain <rjain7@gmail.com> wrote:

> Which version of Hadoop are you using? MRv1 or MRv2 (YARN)?
> For MRv2 (YARN), you can pretty much achieve this using:
> yarn.nodemanager.resource.memory-mb (system-wide setting)
> and
> mapreduce.map.memory.mb (job-level setting)
> e.g. if yarn.nodemanager.resource.memory-mb=100
> and mapreduce.map.memory.mb=40
> a maximum of two mappers can run on a node at any time.
> For MRv1, the equivalent is to control the mapper slots on each
> machine with mapred.tasktracker.map.tasks.maximum. Of course, this does
> not give you 'per job' control over mappers.
> In both cases, you can also use a scheduler with 'pools / queues'
> capability to restrict the overall use of grid resources. Do
> read the fair scheduler and capacity scheduler documentation...
> -Rahul
> On Tue, Mar 19, 2013 at 1:55 PM, jeremy p <athomewithagroovebox@gmail.com> wrote:
>> Short version : let's say you have 20 nodes, and each node has 10 mapper
>> slots.  You start a job with 20 very small input files.  How is the work
>> distributed to the cluster?  Will it be even, with each node spawning one
>> mapper task?  Is there any way of predicting or controlling how the work
>> will be distributed?
>> Long version : My cluster is currently used for two different jobs.  The
>> cluster is currently optimized for Job A, so each node has a maximum of 18
>> mapper slots.  However, I also need to run Job B.  Job B is VERY
>> CPU-intensive, so we really only want one mapper to run on a node at any
>> given time.  I've done a bunch of research, and it doesn't seem like Hadoop
>> gives you any way to set the maximum number of mappers per node on a
>> per-job basis.  I'm at my wit's end here, and considering some rather
>> egregious workarounds.  If you can think of anything that can help me, I'd
>> very much appreciate it.
>> Thanks!
>> --Jeremy
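
A note on the YARN arithmetic in Rahul's reply: the limit he describes is simply how many map containers of a given size fit into a node's memory budget. Here is a minimal sketch of that calculation in Python. The function name and the 18432 MB node size are hypothetical, chosen only to mirror the "18 mappers for Job A, one mapper for Job B" goal from the thread. The sketch looks at memory only and ignores anything else a real scheduler may take into account (for example, how it rounds container requests), so treat the results as an upper bound rather than a guarantee.

```python
def max_concurrent_mappers(node_memory_mb: int, map_memory_mb: int) -> int:
    """Upper bound on map containers that fit on one node, by memory alone.

    node_memory_mb stands for yarn.nodemanager.resource.memory-mb (per node).
    map_memory_mb stands for mapreduce.map.memory.mb (per job).
    """
    if node_memory_mb < 0 or map_memory_mb <= 0:
        raise ValueError("node memory must be >= 0 and map memory must be > 0")
    # Only whole containers can be scheduled, so use integer division.
    return node_memory_mb // map_memory_mb


# The example from the thread: 100 MB per node, 40 MB per mapper.
print(max_concurrent_mappers(100, 40))

# Hypothetical per-job settings on a node offering 18432 MB to YARN:
print(max_concurrent_mappers(18432, 1024))   # a light job, small containers
print(max_concurrent_mappers(18432, 18432))  # a heavy job, one container per node
```

Because mapreduce.map.memory.mb is a job-level setting, each job can pick its own container size, which is what gives per-job control over mappers per node on MRv2.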
