hadoop-mapreduce-user mailing list archives

From Chiku Singh <hakise...@gmail.com>
Subject Re: How the number of mapper tasks is calculated
Date Tue, 26 Jul 2011 03:53:29 GMT
What is your use case? Why would you want to use only 5 mappers rather than
all 10 task trackers?

"If an individual file is so large that it will affect seek time it will be
split to several Splits" (http://wiki.apache.org/hadoop/HadoopMapReduce)

"if a split span over more than one dfs block, you lose the data locality
scheduling benefits." (https://issues.apache.org/jira/browse/HADOOP-2560)
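
To make the split arithmetic concrete, here is a rough sketch in plain Java. It does not call Hadoop at all: the class name SplitEstimate and its methods are made up for illustration, and the formula (split size = max(minSize, min(maxSize, blockSize)), with a 1.1 slop factor on the last split) is my reading of how the stock FileInputFormat behaves, so check it against the version you actually run. The 64 MB block size and the five 1 GB files are assumed values, not taken from your cluster.

```java
// Hypothetical stand-alone estimate of FileInputFormat-style split counts.
// Nothing here is Hadoop API; it only mirrors the arithmetic as I understand it.
class SplitEstimate {
    // A trailing chunk up to 10% larger than the split size is not split again.
    private static final double SPLIT_SLOP = 1.1;

    // Split size is the block size, clamped between minSize and maxSize.
    public static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Number of splits (and therefore map tasks) produced for one file.
    public static int countSplits(long fileLength, boolean splittable, long splitSize) {
        if (fileLength == 0 || !splittable) {
            return 1; // assumption: the whole file becomes a single split
        }
        int splits = 0;
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits++; // last, possibly short, split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = computeSplitSize(64 * mb, 1L, Long.MAX_VALUE);
        int splittableTotal = 0;
        int wholeFileTotal = 0;
        for (int i = 0; i < 5; i++) { // five input files of 1 GB each (assumed)
            splittableTotal += countSplits(1024 * mb, true, splitSize);
            wholeFileTotal += countSplits(1024 * mb, false, splitSize);
        }
        System.out.println("splittable files:     " + splittableTotal + " map tasks");
        System.out.println("non-splittable files: " + wholeFileTotal + " map tasks");
    }
}
```

Under these assumed numbers the splittable case gives 80 map tasks and the non-splittable case gives 5, so a jump from 5 to "more than 40" mappers is what you would expect to see if the files are in fact being split per block.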

On Tue, Jul 26, 2011 at 12:53 AM, Anfernee Xu <anfernee.xu@gmail.com> wrote:

> I have a general question about how the number of mapper tasks is
> calculated. As far as I know, the number is primarily based on the number of
> splits. Say I have 5 splits and 10 tasktrackers running in the
> cluster: I will have 5 mapper tasks running in my MR job, right?
>
> But what I found is that sometimes, if the input is huge (5 GB), I still
> have 5 splits, which is on purpose, but I get more than 40 mapper tasks
> running. How does this happen? If I compress the huge input to a smaller
> size, the number of mappers goes back to 5. Is something tricky happening
> here related to the DFS block locations of the input?
>
> BTW, our InputFormat is a special kind of FileInputFormat which does not
> split each file; instead, we copy each file to DFS, and the location of the
> file on DFS is the input key to the mapper task.
>
> --
> --Anfernee
>
