hadoop-mapreduce-user mailing list archives

From jeremy p <athomewithagroove...@gmail.com>
Subject Re: What happens when you have fewer input files than mapper slots?
Date Fri, 22 Mar 2013 21:14:20 GMT
Apologies -- I don't understand this advice: "If the evenness is the goal
you can also write your own input format that return empty locations for
each split and read the small files in map task directly."  How would
manually reading the files into the map task help me?  Hadoop would still
spawn multiple mappers per machine, which is what I'm trying to avoid.  I'm
trying to get one mapper per machine for this job.

--Jeremy

On Thu, Mar 21, 2013 at 11:44 AM, Luke Lu <llu@apache.org> wrote:

>
>> Short version: let's say you have 20 nodes, and each node has 10 mapper
>> slots.  You start a job with 20 very small input files.  How is the work
>> distributed to the cluster?  Will it be even, with each node spawning one
>> mapper task?  Is there any way of predicting or controlling how the work
>> will be distributed?
>
>
> You're right in expecting that the tasks of the small job will likely be
> evenly distributed among the 20 nodes, provided the 20 files are evenly
> distributed among the nodes and there are free slots on every node.
>
>
>> Long version: My cluster is currently used for two different jobs.  The
>> cluster is currently optimized for Job A, so each node has a maximum of 18
>> mapper slots.  However, I also need to run Job B.  Job B is VERY
>> CPU-intensive, so we really only want one mapper to run on a node at any
>> given time.  I've done a bunch of research, and it doesn't seem like Hadoop
>> gives you any way to set the maximum number of mappers per node on a
>> per-job basis.  I'm at my wit's end here, and considering some rather
>> egregious workarounds.  If you can think of anything that can help me, I'd
>> very much appreciate it.
>>
>
> Are you seeing that Job B's tasks are not being evenly distributed to each
> node? You can check the locations of the files with hadoop fsck. If the evenness is
> the goal you can also write your own input format that return empty
> locations for each split and read the small files in map task directly. If
> you're using Hadoop 1.0.x and the fair scheduler, you might need to set
> mapred.fairscheduler.assignmultiple to false in mapred-site.xml (JT restart
> required) to work around a bug in the fair scheduler (MAPREDUCE-2905) that
> causes tasks to be assigned unevenly. The bug is fixed in Hadoop 1.1+.
>
> __Luke
>
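
For reference, the fair scheduler workaround Luke mentions would look roughly
like the fragment below in mapred-site.xml. This is only a sketch: the property
name and value are the ones given in his message, the surrounding
<configuration> element is assumed from the usual Hadoop config file layout,
and it is only relevant on Hadoop 1.0.x with the fair scheduler enabled.

```xml
<configuration>
  <!-- Workaround for MAPREDUCE-2905 on Hadoop 1.0.x with the fair scheduler.
       Per the thread, the bug is fixed in Hadoop 1.1+, so this should not be
       needed there. -->
  <property>
    <name>mapred.fairscheduler.assignmultiple</name>
    <value>false</value>
  </property>
</configuration>
```

As noted in the message, the JobTracker has to be restarted for the change to
take effect. The setting is aimed at how evenly tasks are spread across nodes;
nothing in the thread suggests it caps a job at one mapper per node.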
