From: Harsh J <harsh@cloudera.com>
Date: Mon, 13 May 2013 13:28:24 +0530
Subject: Re: How to combine input files for a MapReduce job
To: user@hadoop.apache.org

Yes, I believe the branch-1 patch attached there should apply cleanly
to 1.0.4.
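For the mapper-count side of it, here is a rough, untested sketch of
the CombineFileInputFormat route against the 1.x mapred API. The names
CombinedTextInputFormat and LineReaderWrapper below are made up for
illustration (they are not shipped classes); the wrapper is needed
because the 1.x mapred.lib class is abstract:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
import org.apache.hadoop.mapred.lib.CombineFileSplit;

// Concrete subclass: the 1.x mapred.lib CombineFileInputFormat is
// abstract, so we plug in a reader that hands each pooled file to a
// plain LineRecordReader.
public class CombinedTextInputFormat
    extends CombineFileInputFormat<LongWritable, Text> {

  @Override
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf conf, Reporter reporter)
      throws IOException {
    return new CombineFileRecordReader<LongWritable, Text>(
        conf, (CombineFileSplit) split, reporter,
        (Class) LineReaderWrapper.class);
  }

  // CombineFileRecordReader instantiates one of these per file in the
  // combined split, through exactly this constructor signature.
  public static class LineReaderWrapper
      implements RecordReader<LongWritable, Text> {

    private final LineRecordReader delegate;

    public LineReaderWrapper(CombineFileSplit split, Configuration conf,
        Reporter reporter, Integer index) throws IOException {
      // Carve the single file at 'index' back out of the combined split.
      FileSplit fileSplit = new FileSplit(split.getPath(index),
          split.getOffset(index), split.getLength(index),
          split.getLocations());
      delegate = new LineRecordReader(conf, fileSplit);
    }

    public boolean next(LongWritable key, Text value) throws IOException {
      return delegate.next(key, value);
    }
    public LongWritable createKey() { return delegate.createKey(); }
    public Text createValue() { return delegate.createValue(); }
    public long getPos() throws IOException { return delegate.getPos(); }
    public float getProgress() throws IOException {
      return delegate.getProgress();
    }
    public void close() throws IOException { delegate.close(); }
  }
}

And in the driver, something like:

// Pool small files into combined splits of at most ~64 MB per map.
JobConf job = new JobConf(CombinedTextInputFormat.class);
job.setInputFormat(CombinedTextInputFormat.class);
job.setLong("mapred.max.split.size", 64L * 1024 * 1024);

Setting "mapred.max.split.size" is what caps the pooling at roughly
64 MB per map in 1.x; without it the format may pack everything on a
node into a single split.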
On Mon, May 13, 2013 at 1:25 PM, Agarwal, Nikhil wrote:
> Hi,
>
> @Harsh: Thanks for the reply. Would the patch work with the Hadoop
> 1.0.4 release?
>
> -----Original Message-----
> From: Harsh J [mailto:harsh@cloudera.com]
> Sent: Monday, May 13, 2013 1:03 PM
> To: user@hadoop.apache.org
> Subject: Re: How to combine input files for a MapReduce job
>
> For the "control number of mappers" question: you can use
> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html
> which is designed to solve similar cases. However, you cannot beat
> the speed you get out of a single large file (or a few large files),
> as you'll still have file open/close overheads which will bog you
> down.
>
> For the "which file is being submitted to which" question: having
> https://issues.apache.org/jira/browse/MAPREDUCE-3678 in the
> version/distribution of Apache Hadoop you use would help.
>
> On Mon, May 13, 2013 at 12:50 PM, Agarwal, Nikhil wrote:
>> Hi,
>>
>> I have a 3-node cluster, with the JobTracker running on one machine
>> and TaskTrackers on the other two. Instead of using HDFS, I have
>> written my own FileSystem implementation. As an experiment, I kept
>> 1000 text files (all of the same size) on both slave nodes and ran a
>> simple Wordcount MR job. It took around 50 mins to complete the
>> task. Afterwards, I concatenated all 1000 files into a single file
>> and ran the Wordcount MR job again; it took 35 secs. From the
>> JobTracker UI I could make out that the problem is the number of
>> mappers the JobTracker creates: for 1000 files it creates 1000 maps,
>> and for 1 file it creates 1 map (irrespective of file size).
>>
>> Thus, is there a way to reduce the number of mappers, i.e., can I
>> control the number of mappers through some configuration parameter
>> so that Hadoop would club the files together until it reaches some
>> specified size (say, 64 MB) and then make 1 map per 64 MB block?
>>
>> Also, I wanted to know how to see which file is being submitted to
>> which TaskTracker, or, if that is not possible, how do I check
>> whether any data transfer is happening between my slave nodes during
>> an MR job?
>>
>> Sorry for so many questions, and thank you for your time.
>>
>> Regards,
>> Nikhil
>
> --
> Harsh J

--
Harsh J