From: "Agarwal, Nikhil"
To: "user@hadoop.apache.org"
Subject: RE: How to combine input files for a MapReduce job
Date: Mon, 13 May 2013 11:12:44 +0000

Hi,

I got it. The log info is printed in the userlogs folder on the slave nodes, in the file syslog.

Thanks,
Nikhil

-----Original Message-----
From: Agarwal, Nikhil
Sent: Monday, May 13, 2013 4:10 PM
To: 'user@hadoop.apache.org'
Subject: RE: How to combine input files for a MapReduce job

Hi Harsh,

I applied the changes from the patch to the Hadoop source code, but can you please tell me exactly where this log is printed? I checked the log files of the JobTracker and TaskTracker but it is not there. It is also not printed in the _logs folder created inside the output directory of the MR job.

Regards,
Nikhil

-----Original Message-----
From: Harsh J [mailto:harsh@cloudera.com]
Sent: Monday, May 13, 2013 1:28 PM
To:
Subject: Re: How to combine input files for a MapReduce job

Yes, I believe the branch-1 patch attached there should apply cleanly to 1.0.4.

On Mon, May 13, 2013 at 1:25 PM, Agarwal, Nikhil wrote:
> Hi,
>
> @Harsh: Thanks for the reply. Would the patch work in the Hadoop 1.0.4 release?
>
> -----Original Message-----
> From: Harsh J [mailto:harsh@cloudera.com]
> Sent: Monday, May 13, 2013 1:03 PM
> To:
> Subject: Re: How to combine input files for a MapReduce job
>
> For the "control number of mappers" question: You can use
> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html
> which is designed to solve similar cases. However, you cannot beat the
> speed you get out of a single large file (or a few large files), as
> you'll still have file open/close overheads which will bog you down.
>
> For the "which file is being submitted to which" question: Having
> https://issues.apache.org/jira/browse/MAPREDUCE-3678 in the
> version/distribution of Apache Hadoop you use would help.
>
> On Mon, May 13, 2013 at 12:50 PM, Agarwal, Nikhil wrote:
>> Hi,
>>
>> I have a 3-node cluster, with the JobTracker running on one machine
>> and TaskTrackers on the other two. Instead of using HDFS, I have
>> written my own FileSystem implementation. As an experiment, I kept
>> 1000 text files (all of the same size) on both slave nodes and ran a
>> simple WordCount MR job. It took around 50 minutes to complete.
>> Afterwards, I concatenated all 1000 files into a single file and ran
>> the WordCount MR job again; it took 35 seconds. From the JobTracker UI
>> I could make out that the problem is the number of mappers the
>> JobTracker creates: for 1000 files it creates 1000 maps, and for 1
>> file it creates 1 map (irrespective of file size).
>>
>> Thus, is there a way to reduce the number of mappers, i.e. can I
>> control the number of mappers through some configuration parameter so
>> that Hadoop would club the files together until it reaches some
>> specified size (say, 64 MB) and then make 1 map per 64 MB block?
>>
>> Also, I wanted to know how to see which file is being submitted to
>> which TaskTracker, or, if that is not possible, how do I check whether
>> any data transfer is happening between my slave nodes during an MR job?
>>
>> Sorry for so many questions, and thank you for your time.
>>
>> Regards,
>>
>> Nikhil
>
> --
> Harsh J

--
Harsh J
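[Archive note] Wiring up the CombineFileInputFormat that Harsh points to, against the old (org.apache.hadoop.mapred) API this thread targets, might look roughly like the sketch below. The class names CombinedTextInputFormat and PerFileLineReader are hypothetical, and the record-reader constructor contract should be checked against the javadoc of your exact Hadoop version:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
import org.apache.hadoop.mapred.lib.CombineFileSplit;

// Sketch only: CombineFileInputFormat is abstract in the mapred API,
// so a small subclass supplies the record reader for combined splits.
public class CombinedTextInputFormat
    extends CombineFileInputFormat<LongWritable, Text> {

  @Override
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf conf, Reporter reporter) throws IOException {
    // CombineFileRecordReader hands each file chunk inside the combined
    // split to a fresh PerFileLineReader, constructed reflectively.
    return new CombineFileRecordReader<LongWritable, Text>(
        conf, (CombineFileSplit) split, reporter,
        (Class) PerFileLineReader.class);
  }

  // Thin wrapper that reads one chunk of the combined split with the
  // ordinary text LineRecordReader. CombineFileRecordReader expects this
  // exact constructor shape: (CombineFileSplit, Configuration, Reporter,
  // Integer chunk index).
  public static class PerFileLineReader
      implements RecordReader<LongWritable, Text> {
    private final LineRecordReader delegate;

    public PerFileLineReader(CombineFileSplit split, Configuration conf,
        Reporter reporter, Integer idx) throws IOException {
      delegate = new LineRecordReader(conf, new FileSplit(
          split.getPath(idx), split.getOffset(idx),
          split.getLength(idx), (String[]) null));
    }

    public boolean next(LongWritable key, Text value) throws IOException {
      return delegate.next(key, value);
    }
    public LongWritable createKey() { return delegate.createKey(); }
    public Text createValue() { return delegate.createValue(); }
    public long getPos() throws IOException { return delegate.getPos(); }
    public float getProgress() throws IOException {
      return delegate.getProgress();
    }
    public void close() throws IOException { delegate.close(); }
  }
}
```

In the job driver, something like `conf.setInputFormat(CombinedTextInputFormat.class)` plus a split-size cap, e.g. `conf.setLong("mapred.max.split.size", 64L * 1024 * 1024)`, should then yield roughly one map per 64 MB of small files rather than one map per file (the property name is as read by CombineFileInputFormat in Hadoop 1.x; verify against your version).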