hadoop-user mailing list archives

From "Agarwal, Nikhil" <Nikhil.Agar...@netapp.com>
Subject RE: How to combine input files for a MapReduce job
Date Mon, 13 May 2013 07:55:56 GMT

@Harsh: Thanks for the reply. Would the patch work with the Hadoop 1.0.4 release?

-----Original Message-----
From: Harsh J [mailto:harsh@cloudera.com] 
Sent: Monday, May 13, 2013 1:03 PM
To: <user@hadoop.apache.org>
Subject: Re: How to combine input files for a MapReduce job

For the "control number of mappers" question: you can use CombineFileInputFormat
(http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html),
which is designed for cases like this. However, it will not match the speed you get out of
a single large file (or a few large files), because you still pay the file open/close
overhead for every small file, and that will bog you down.
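
To make the "club files until they reach a target size" idea concrete, here is a minimal
sketch of greedy size-based grouping. It is not Hadoop's actual CombineFileInputFormat logic
(which also takes node and rack locality into account); the function name
combine_into_splits, the file names, and the 64 KB file size are all hypothetical and
chosen only for illustration.

```python
from typing import List, Tuple


def combine_into_splits(files: List[Tuple[str, int]],
                        max_split_bytes: int) -> List[List[str]]:
    """Greedily group (name, size_in_bytes) pairs into splits.

    A split is closed as soon as adding the next file would push it past
    max_split_bytes. A single file larger than the limit still gets a split
    of its own; this sketch never breaks a file into pieces.
    """
    if max_split_bytes <= 0:
        raise ValueError("max_split_bytes must be positive")

    splits: List[List[str]] = []
    current: List[str] = []
    current_bytes = 0

    for name, size in files:
        # Close the current split if this file would overflow it.
        if current and current_bytes + size > max_split_bytes:
            splits.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size

    if current:
        splits.append(current)
    return splits


# Hypothetical input: 1000 equally sized 64 KB text files (62.5 MB in total).
files = [(f"part-{i:04d}.txt", 64 * 1024) for i in range(1000)]
splits = combine_into_splits(files, 64 * 1024 * 1024)  # 64 MB target
print(len(splits))  # 1 split, i.e. one map task instead of 1000
```

With one map per split, the number of map tasks follows the total input size rather than
the file count, although each small file still has to be opened and closed inside that map.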

For the "which file is being submitted to which" question: having
https://issues.apache.org/jira/browse/MAPREDUCE-3678 in the version/distribution of Apache
Hadoop you use would help.

On Mon, May 13, 2013 at 12:50 PM, Agarwal, Nikhil <Nikhil.Agarwal@netapp.com> wrote:
> Hi,
> I have a 3-node cluster, with the JobTracker running on one machine and
> TaskTrackers on the other two. Instead of using HDFS, I have written my
> own FileSystem implementation. As an experiment, I kept 1000 text files
> (all of the same size) on both slave nodes and ran a simple Wordcount MR
> job. It took around 50 mins to complete. Afterwards, I concatenated all
> 1000 files into a single file and ran the Wordcount MR job again; it took
> 35 secs. From the JobTracker UI I could make out that the problem is the
> number of mappers the JobTracker creates: for 1000 files it creates 1000
> maps, and for 1 file it creates 1 map (irrespective of file size).
> Thus, is there a way to reduce the number of mappers, i.e. can I control
> the number of mappers through some configuration parameter so that Hadoop
> would club the files together until they reach some specified size (say,
> 64 MB) and then make 1 map per 64 MB block?
> Also, I wanted to know how to see which file is being submitted to which
> TaskTracker or, if that is not possible, how I can check whether any data
> transfer is happening between my slave nodes during an MR job.
> Sorry for so many questions, and thank you for your time.
> Regards,
> Nikhil

Harsh J
