From: "Agarwal, Nikhil"
To: "user@hadoop.apache.org"
Subject: RE: How to combine input files for a MapReduce job
Date: Mon, 13 May 2013 11:12:44 +0000

Hi,

I got it. The log info is printed in the userlogs folder on the slave nodes, in the file syslog.

Thanks,
Nikhil

-----Original Message-----
From: Agarwal, Nikhil
Sent: Monday, May 13, 2013 4:10 PM
To: 'user@hadoop.apache.org'
Subject: RE: How to combine input files for a MapReduce job

Hi Harsh,

I applied the changes from the patch to the Hadoop source code, but can you please tell me exactly where this log is printed? I checked the log files of the JobTracker and TaskTracker but it is not there. It is also not printed in the _logs folder created inside the output directory of the MR job.

Regards,
Nikhil

-----Original Message-----
From: Harsh J [mailto:harsh@cloudera.com]
Sent: Monday, May 13, 2013 1:28 PM
To:
Subject: Re: How to combine input files for a MapReduce job

Yes, I believe the branch-1 patch attached there should apply cleanly to 1.0.4.

On Mon, May 13, 2013 at 1:25 PM, Agarwal, Nikhil wrote:
> Hi,
>
> @Harsh: Thanks for the reply. Would the patch work in the Hadoop 1.0.4 release?
>
> -----Original Message-----
> From: Harsh J [mailto:harsh@cloudera.com]
> Sent: Monday, May 13, 2013 1:03 PM
> To:
> Subject: Re: How to combine input files for a MapReduce job
>
> For the "control number of mappers" question: You can use
> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html
> which is designed to solve similar cases. However, you cannot beat the
> speed you get out of a single large file (or a few large files), as
> you'll still have file open/close overheads which will bog you down.
>
> For the "which file is being submitted to which" question: Having
> https://issues.apache.org/jira/browse/MAPREDUCE-3678 in the
> version/distribution of Apache Hadoop you use would help.
>
> On Mon, May 13, 2013 at 12:50 PM, Agarwal, Nikhil wrote:
>> Hi,
>>
>> I have a 3-node cluster, with the JobTracker running on one machine
>> and TaskTrackers on the other two. Instead of using HDFS, I have
>> written my own FileSystem implementation. As an experiment, I kept
>> 1000 text files (all of the same size) on both slave nodes and ran a
>> simple WordCount MR job. It took around 50 minutes to complete.
>> Afterwards, I concatenated all 1000 files into a single file and ran
>> the WordCount MR job again; it took 35 seconds. From the JobTracker UI
>> I could make out that the problem is the number of mappers the
>> JobTracker creates: for 1000 files it creates 1000 maps, and for 1
>> file it creates 1 map (irrespective of file size).
>>
>> Thus, is there a way to reduce the number of mappers, i.e. can I
>> control the number of mappers through some configuration parameter so
>> that Hadoop would club the files together until it reaches some
>> specified size (say, 64 MB) and then make 1 map per 64 MB block?
>>
>> Also, I wanted to know how to see which file is being submitted to
>> which TaskTracker, or, if that is not possible, how do I check whether
>> any data transfer is happening between my slave nodes during an MR job?
>>
>> Sorry for so many questions, and thank you for your time.
>>
>> Regards,
>>
>> Nikhil
>
> --
> Harsh J

--
Harsh J
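[Archive note] Wiring up the CombineFileInputFormat that Harsh points to, against the old (org.apache.hadoop.mapred) API this thread targets, might look roughly like the sketch below. The class names CombinedTextInputFormat and PerFileLineReader are hypothetical, and the record-reader constructor contract should be checked against the javadoc of your exact Hadoop version:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
import org.apache.hadoop.mapred.lib.CombineFileSplit;

// Sketch only: CombineFileInputFormat is abstract in the mapred API,
// so a small subclass supplies the record reader for combined splits.
public class CombinedTextInputFormat
    extends CombineFileInputFormat<LongWritable, Text> {

  @Override
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf conf, Reporter reporter) throws IOException {
    // CombineFileRecordReader hands each file chunk inside the combined
    // split to a fresh PerFileLineReader, constructed reflectively.
    return new CombineFileRecordReader<LongWritable, Text>(
        conf, (CombineFileSplit) split, reporter,
        (Class) PerFileLineReader.class);
  }

  // Thin wrapper that reads one chunk of the combined split with the
  // ordinary text LineRecordReader. CombineFileRecordReader expects this
  // exact constructor shape: (CombineFileSplit, Configuration, Reporter,
  // Integer chunk index).
  public static class PerFileLineReader
      implements RecordReader<LongWritable, Text> {
    private final LineRecordReader delegate;

    public PerFileLineReader(CombineFileSplit split, Configuration conf,
        Reporter reporter, Integer idx) throws IOException {
      delegate = new LineRecordReader(conf, new FileSplit(
          split.getPath(idx), split.getOffset(idx),
          split.getLength(idx), (String[]) null));
    }

    public boolean next(LongWritable key, Text value) throws IOException {
      return delegate.next(key, value);
    }
    public LongWritable createKey() { return delegate.createKey(); }
    public Text createValue() { return delegate.createValue(); }
    public long getPos() throws IOException { return delegate.getPos(); }
    public float getProgress() throws IOException {
      return delegate.getProgress();
    }
    public void close() throws IOException { delegate.close(); }
  }
}
```

In the job driver, something like `conf.setInputFormat(CombinedTextInputFormat.class)` plus a split-size cap, e.g. `conf.setLong("mapred.max.split.size", 64L * 1024 * 1024)`, should then yield roughly one map per 64 MB of small files rather than one map per file (the property name is as read by CombineFileInputFormat in Hadoop 1.x; verify against your version).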