hadoop-hdfs-user mailing list archives

From jeremy p <athomewithagroove...@gmail.com>
Subject Re: What happens when you have fewer input files than mapper slots?
Date Tue, 19 Mar 2013 22:38:09 GMT
The job we need to run executes some third-party code that utilizes
multiple cores.  The only way the job will get done in a timely fashion is
if we give it all the cores available on the machine.  This is not a task
that can be split up.

Yes, I know, it's not ideal, but this is the situation I have to deal with.

On Tue, Mar 19, 2013 at 3:15 PM, hari <haribaha@gmail.com> wrote:

> This may not be what you were looking for, but I was curious when you
> mentioned that you would want to run only one map task because it is
> cpu-intensive. Map tasks are supposed to be cpu-intensive, aren't they?
> If the maximum map slots are 10, that would suggest you have close to 10
> cores available on each node. So if you run only one map task, no matter
> how cpu-intensive it is, it can max out only one core, and the remaining
> 9 cores would sit under-utilized. You could still run 9 more map tasks on
> that machine.
> Alternatively, if your node's core count is well below 10, you might be
> better off setting the mapper slots to a lower value anyway.
> On Tue, Mar 19, 2013 at 5:18 PM, jeremy p <athomewithagroovebox@gmail.com> wrote:
>> Thank you for your help.
>> We're using MRv1.  I've tried
>> setting mapred.tasktracker.map.tasks.maximum and mapred.map.tasks, and
>> neither one helped me at all.
>> Per-job control is definitely what I need.  I need to be able to say,
>> "For Job A, only use one mapper per node, but for Job B, use 16 mappers per
>> node".  I have not found any way to do this.
>> I will definitely look into schedulers.  Are there any examples you can
>> point me to where someone does what I need to do?
>> --Jeremy
>> On Tue, Mar 19, 2013 at 2:08 PM, Rahul Jain <rjain7@gmail.com> wrote:
>>> Which version of Hadoop are you using? MRv1 or MRv2 (YARN)?
>>> For MRv2 (YARN), you can pretty much achieve this using:
>>> yarn.nodemanager.resource.memory-mb (system-wide setting)
>>> and
>>> mapreduce.map.memory.mb (job-level setting)
>>> e.g. if yarn.nodemanager.resource.memory-mb = 100
>>> and mapreduce.map.memory.mb = 40,
>>> a maximum of two mappers can run on a node at any time.
>>> For MRv1, the equivalent approach is to control mapper slots on each
>>> machine via mapred.tasktracker.map.tasks.maximum; of course, this does
>>> not give you 'per job' control over mappers.
>>> In both cases, you can additionally use a scheduler with 'pools/queues'
>>> capability to restrict the overall use of grid resources. Do read the
>>> fair scheduler and capacity scheduler documentation...
>>> -Rahul
>>> On Tue, Mar 19, 2013 at 1:55 PM, jeremy p <
>>> athomewithagroovebox@gmail.com> wrote:
>>>> Short version : let's say you have 20 nodes, and each node has 10
>>>> mapper slots.  You start a job with 20 very small input files.  How is the
>>>> work distributed to the cluster?  Will it be even, with each node spawning
>>>> one mapper task?  Is there any way of predicting or controlling how the
>>>> work will be distributed?
>>>> Long version : My cluster is currently used for two different jobs.
>>>> It is optimized for Job A, so each node has a maximum of 18 mapper
>>>> slots.  However, I also need to run Job B.  Job B is VERY
>>>> cpu-intensive, so we really only want one mapper to run on a node at any
>>>> given time.  I've done a bunch of research, and it doesn't seem like Hadoop
>>>> gives you any way to set the maximum number of mappers per node on a
>>>> per-job basis.  I'm at my wit's end here, and considering some rather
>>>> egregious workarounds.  If you can think of anything that can help me, I'd
>>>> very much appreciate it.
>>>> Thanks!
>>>> --Jeremy
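
Rahul's MRv2 suggestion in the thread can be sketched as configuration. This is a minimal sketch, assuming a standard `yarn-site.xml` on each NodeManager plus a per-job override; the property names come from the thread, but the memory values here are illustrative assumptions, not recommendations:

```xml
<!-- yarn-site.xml on each NodeManager: total memory YARN may allocate on
     the node. 16384 MB is an illustrative value, not from the thread. -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>16384</value>
</property>

<!-- Job-level setting (submitted with the CPU-heavy job): request the whole
     node for each map task, so at most one map runs per node for this job. -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>16384</value>
</property>
```

With these values, floor(16384 / 16384) = 1 map task of the CPU-heavy job fits on a node at a time, while another job submitted with, say, mapreduce.map.memory.mb = 1024 could run 16 maps per node concurrently. This is the same arithmetic as Rahul's 100/40 example, which yields floor(100 / 40) = 2 concurrent mappers.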
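
For the MRv1 route Rahul mentions, the slot cap is a TaskTracker-wide setting in `mapred-site.xml`. A minimal sketch (the value 18 matches Jeremy's Job A tuning; the placement is standard, but note this caps every job on the node, which is exactly the limitation discussed in the thread):

```xml
<!-- mapred-site.xml on each TaskTracker: hard cap on simultaneous map
     tasks. Applies to all jobs on this node; MRv1 has no per-job override. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>18</value>
</property>
```

Changing this value requires restarting the TaskTracker, which is part of why it cannot serve as per-job control and why the scheduler-based approaches (fair scheduler pools, capacity scheduler queues) come up as the alternative.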

View raw message