From: Bejoy Ks <bejoy.hadoop@gmail.com>
To: mapreduce-user@hadoop.apache.org
Date: Fri, 17 Feb 2012 15:08:59 +0530
Subject: Re: num of reducer

Hi Thamizh

MultiFileInputFormat / CombineFileInputFormat is typically used where the
input files are relatively small (less than a block each). When you use
these, you lose some data locality, because the splits a single mapper
processes won't all be on the same node.

TextInputFormat, by default, spawns one mapper per block (not one per
file), so it preserves data locality much better than MultiFileInputFormat.
If your mappers are not very short-lived and do a decent amount of
processing, you can go with TextInputFormat.

The one consideration is that, on your input, the job may spawn a large
number of map tasks and occupy almost all the map slots in your cluster;
any other jobs triggered in the meantime would have to wait for free slots.
You may want to configure a scheduler so that parallel jobs, if any, get a
fair share of slots.
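If it helps, the switch is just a driver change. Here is a minimal sketch
against the old mapred API (the class name MapOnlyJob and the
IdentityMapper stand-in are placeholders, not anything from your job):

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.lib.IdentityMapper;

    public class MapOnlyJob {
      public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(MapOnlyJob.class);
        conf.setJobName("map-only-text");

        // One split (hence one mapper) per HDFS block, and at least
        // one mapper per file.
        conf.setInputFormat(TextInputFormat.class);

        // IdentityMapper is a stand-in; plug in your real mapper and
        // its output types here.
        conf.setMapperClass(IdentityMapper.class);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        // Zero reducers: map output goes straight to the output path.
        conf.setNumReduceTasks(0);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
      }
    }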
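And if you do need to share the cluster, the Fair Scheduler (a contrib
module in your line of releases) plugs into the JobTracker. Roughly, the
site configuration would look like the below -- the property names are
from the fair scheduler docs and the pools file path is only a
placeholder, so do verify them against your exact release:

    <!-- JobTracker site config, with the fair scheduler jar on its
         classpath -->
    <property>
      <name>mapred.jobtracker.taskScheduler</name>
      <value>org.apache.hadoop.mapred.FairScheduler</value>
    </property>
    <property>
      <!-- optional: per-pool shares; the path is a placeholder -->
      <name>mapred.fairscheduler.allocation.file</name>
      <value>/etc/hadoop/pools.xml</value>
    </property>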
Regards
Bejoy.K.S


On Fri, Feb 17, 2012 at 10:26 AM, Thamizhannal Paramasivam <
thamizhannal.p@gmail.com> wrote:

> Thank you so much to Joey & Bejoy for your suggestions.
>
> The job's input path has 1300-1400 text files, each of 100-200MB.
>
> I thought TextInputFormat spawns a single mapper per file, and
> MultiFileInputFormat spawns fewer mappers (fewer than the 1300-1400
> files), each processing many input files.
>
> Which input format do you think would be most appropriate in my case,
> and why?
>
> Looking forward to your reply.
>
> Thanks,
> Thamizh
>
>
> On Thu, Feb 16, 2012 at 10:06 PM, Joey Echeverria <joey@cloudera.com> wrote:
>
>> Is your data size 100-200MB *total*?
>>
>> If so, then this is the expected behavior for MultiFileInputFormat. As
>> Bejoy says, you can switch to TextInputFormat to get one mapper per
>> block (at a minimum, one mapper per file).
>>
>> -Joey
>>
>>
>> On Thu, Feb 16, 2012 at 11:03 AM, Thamizhannal Paramasivam <
>> thamizhannal.p@gmail.com> wrote:
>>
>>> Here is the input format setup for the mapper:
>>> InputFormat: MultiFileInputFormat
>>> MapperOutputKey: Text
>>> MapperOutputValue: CustomWritable
>>>
>>> I am not in a position to upgrade from hadoop-0.19.2 for some reason.
>>>
>>> I checked the number of mappers on the JobTracker.
>>>
>>> Thanks,
>>> Thamizh
>>>
>>>
>>> On Thu, Feb 16, 2012 at 6:56 PM, Joey Echeverria <joey@cloudera.com> wrote:
>>>
>>>> Hi Tamil,
>>>>
>>>> I'd recommend upgrading to a newer release, as 0.19.2 is very old. As
>>>> for your question, most input formats should set the number of
>>>> mappers correctly. What input format are you using? Where did you see
>>>> the number of tasks assigned to the job?
>>>>
>>>> -Joey
>>>>
>>>>
>>>> On Thu, Feb 16, 2012 at 1:40 AM, Thamizhannal Paramasivam <
>>>> thamizhannal.p@gmail.com> wrote:
>>>>
>>>>> Hi All,
>>>>> I am using hadoop-0.19.2 and running a mapper-only job on the
>>>>> cluster. Its input path has >1000 files of 100-200MB each. Since it
>>>>> is a mapper-only job, I set the number of reducers to 0. It is using
>>>>> 2 mappers to run over all the input files. If we don't state the
>>>>> number of mappers, wouldn't it pick one mapper per input file? Or
>>>>> wouldn't the default pick a fair number of mappers according to the
>>>>> number of input files?
>>>>> Thanks,
>>>>> tamil
>>>>
>>>>
>>>> --
>>>> Joseph Echeverria
>>>> Cloudera, Inc.
>>>> 443.305.9434
>>
>>
>> --
>> Joseph Echeverria
>> Cloudera, Inc.
>> 443.305.9434