hadoop-mapreduce-user mailing list archives

From "Agarwal, Nikhil" <Nikhil.Agar...@netapp.com>
Subject How to combine input files for a MapReduce job
Date Mon, 13 May 2013 07:20:21 GMT

I have a 3-node cluster, with the JobTracker running on one machine and TaskTrackers on the other
two. Instead of using HDFS, I have written my own FileSystem implementation. As an experiment,
I kept 1000 text files (all of the same size) on both slave nodes and ran a simple WordCount
MR job. It took around 50 minutes to complete. Afterwards, I concatenated all 1000
files into a single file and ran the WordCount MR job again; it took 35 seconds. From the JobTracker
UI I could make out that the problem is the number of mappers that the JobTracker
creates: for 1000 files it creates 1000 maps, and for 1 file it creates 1 map (irrespective
of file size).

Thus, is there a way to reduce the number of mappers? That is, can I control the number of mappers
through some configuration parameter so that Hadoop combines files until they reach
some specified size (say, 64 MB) and then creates 1 map per 64 MB block?
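
To make the behaviour I am asking about concrete, here is a minimal sketch of the grouping I have in mind. It is plain Java, not Hadoop code: the class SplitPacker and the method pack are hypothetical names made up for illustration, and the 1 MB file size in the example is an assumption (I have not stated my real file sizes). It simply packs whole files into groups of at most a given size, with one map task per group.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: greedily packs whole files into
// groups of at most maxSplitBytes, where each group would feed one map task.
class SplitPacker {
    static List<List<Long>> pack(List<Long> fileSizes, long maxSplitBytes) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Close the current group if adding this file would exceed the cap.
            if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
                groups.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            // Files are never split here, so a file larger than the cap
            // ends up alone in its own group.
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        List<Long> sizes = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            sizes.add(mb); // assumed 1 MB per file, for illustration only
        }
        // Number of map tasks if files are combined up to 64 MB per group.
        System.out.println(pack(sizes, 64 * mb).size() + " map tasks");
    }
}
```

With these assumed sizes, the 1000 files would be handled by a handful of maps instead of 1000, which is the kind of reduction I am hoping a configuration parameter can give me.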

Also, I wanted to know how to see which file is being submitted to which TaskTracker. If
that is not possible, how do I check whether any data transfer is happening between my
slave nodes during an MR job?

Sorry for so many questions, and thank you for your time.

