hadoop-common-user mailing list archives

From abhishek sharma <absha...@usc.edu>
Subject posted again: how are the splits for map tasks computed?
Date Thu, 25 Mar 2010 02:27:01 GMT
I made a mistake in my earlier post, so here is the corrected version.

I have a job ("loadgen") with a single input file, part-00000, of size
1368654 bytes.

So when I submit this job, I get the following output:

INFO mapred.FileInputFormat: Total input paths to process : 1

However, in the JobTracker log, I see the following entry:

 Split info for job:job_201003131110_0043 with 2 splits

and subsequently 2 map tasks are started to process these two splits.
The size of input splits to these 2 map tasks is 6843283. So the input
is divided equally into two splits.

My question is: Why are two map tasks created instead of one and why
is the combined size of the two splits greater than the size of my
input?

I also noticed that if I run the same job with 2 inputs (say)
part-00000 and part-00001, then only 2 map tasks are created.

My understanding was that the number of map tasks should equal the
number of input files.

Thanks,
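
For reference, here is a rough sketch of how the old-API
org.apache.hadoop.mapred.FileInputFormat appears to compute splits. It is
not the actual Hadoop source. It assumes the split count is driven by the
mapred.map.tasks hint (which I believe defaults to 2 in releases of this
era), a 64 MB block size, a minimum split size of 1 byte, and splittable
input. The function names and the tuple layout are made up for
illustration.

```python
# Sketch of old-API FileInputFormat-style split computation (assumptions above).
SPLIT_SLOP = 1.1  # a final chunk up to 10% over split_size is not split further


def compute_split_size(goal_size, min_size, block_size):
    # The target size per split is capped by the block size and floored by min_size.
    return max(min_size, min(goal_size, block_size))


def get_splits(file_sizes, num_splits_hint, block_size=64 * 1024 * 1024, min_size=1):
    """Return a list of (file_index, offset, length) tuples (hypothetical layout)."""
    total_size = sum(file_sizes)
    # goal_size depends on the requested number of map tasks, not on the file count.
    goal_size = total_size // max(num_splits_hint, 1)
    split_size = compute_split_size(goal_size, min_size, block_size)

    splits = []
    for index, length in enumerate(file_sizes):
        remaining = length
        while remaining / split_size > SPLIT_SLOP:
            splits.append((index, length - remaining, split_size))
            remaining -= split_size
        if remaining != 0:
            splits.append((index, length - remaining, remaining))
    return splits


one_file = get_splits([1368654], num_splits_hint=2)
two_files = get_splits([1368654, 1368654], num_splits_hint=2)
print(len(one_file), [s[2] for s in one_file])
print(len(two_files), [s[2] for s in two_files])
```

Under these assumptions, a single 1368654-byte file with a hint of 2 yields
two splits of 684327 bytes each, while two files of that size yield one
split per file. That would explain both observations: the number of map
tasks follows the hint and the split size, not simply the number of input
files.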
