Date: Wed, 22 Aug 2012 08:57:31 +0300
Subject: Re: why is num of map tasks gets overridden?
From: nutch buddy <nutch.buddy@gmail.com>
To: user@hadoop.apache.org

So what can I do if I have a given input and my job needs a lot of memory per map task?
I can't control the number of map tasks, and my total memory per machine is limited, so I'll eventually fill each machine's memory.

On Tue, Aug 21, 2012 at 3:52 PM, Bertrand Dechoux wrote:

>> Actually, controlling the number of maps is subtle. The mapred.map.tasks
>> parameter is just a hint to the InputFormat for the number of maps. The
>> default InputFormat behavior is to split the total number of bytes into the
>> right number of fragments. However, in the default case the DFS block size
>> of the input files is treated as an upper bound for input splits. A lower
>> bound on the split size can be set via mapred.min.split.size. Thus, if you
>> expect 10TB of input data and have 128MB DFS blocks, you'll end up with 82k
>> maps, unless your mapred.map.tasks is even larger. Ultimately the
>> InputFormat determines the number of maps.
>
> http://wiki.apache.org/hadoop/HowManyMapsAndReduces
>
> Bertrand
>
> On Tue, Aug 21, 2012 at 2:19 PM, nutch buddy wrote:
>
>> I configured a job in Hadoop and set the number of map tasks in the code to 8.
>>
>> Then I ran the job and it got 152 map tasks. I can't see why it's being
>> overridden or where the 152 comes from.
>>
>> mapred-site.xml has mapred.map.tasks set to 24.
>>
>> Any idea?
>
> --
> Bertrand Dechoux
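For the memory-bound follow-up question, one approach (a sketch, using the pre-YARN property names current at the time of this thread; the values here are examples, not recommendations) is to cap the number of concurrent map slots per node and enlarge the splits so fewer maps are created overall:

```xml
<!-- mapred-site.xml: sketch for a memory-bound job -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value> <!-- at most 4 map tasks running concurrently per node -->
</property>
<property>
  <name>mapred.min.split.size</name>
  <value>1073741824</value> <!-- 1 GB lower bound on splits: fewer, larger maps -->
</property>
```

The first property bounds memory use per machine regardless of how many maps the job has in total; the second reduces the total map count by overriding the block-size-derived split size.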
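The arithmetic in the quoted reply can be checked with a short sketch. This is not Hadoop code; it is an illustrative model of the default split logic, where the effective split size is `max(minSplitSize, min(maxSplitSize, blockSize))` and the function names are made up for this example:

```python
import math

def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    # Effective split size: block size bounded below by the configured
    # minimum (mapred.min.split.size) and above by the maximum.
    return max(min_size, min(max_size, block_size))

def estimate_map_tasks(total_bytes, block_size, min_size=1):
    # One map task per input split, rounding up for the last partial split.
    split = compute_split_size(block_size, min_size)
    return math.ceil(total_bytes / split)

TB, GB, MB = 2**40, 2**30, 2**20

# 10 TB of input with 128 MB DFS blocks -> the "82k maps" from the reply
print(estimate_map_tasks(10 * TB, 128 * MB))                     # 81920

# Raising the lower bound to 1 GB yields fewer, larger maps
print(estimate_map_tasks(10 * TB, 128 * MB, min_size=1 * GB))    # 10240
```

This also shows why setting mapred.map.tasks to 8 or 24 had no effect: with enough input blocks, the split count wins.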