Subject: Re: Why is Hadoop always running just 4 tasks?
From: Adam Kawa <kawa.adam@gmail.com>
To: user@hadoop.apache.org
Date: Wed, 11 Dec 2013 20:46:04 +0100

I am not sure if Hadoop detects that. I guess that it will run one map task
for them. Please let me know if I am wrong.

2013/12/11 Dror, Ittay <idror@akamai.com>

> OK, thank you for the solution.
>
> BTW, I just concatenated several .gz files together with cat (without
> uncompressing them first), so they should each uncompress individually.
>
> From: Adam Kawa <kawa.adam@gmail.com>
> Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
> Date: Wednesday, December 11, 2013 9:33 PM
> To: "user@hadoop.apache.org" <user@hadoop.apache.org>
> Subject: Re: Why is Hadoop always running just 4 tasks?
>
> mapred.map.tasks is rather a hint to the InputFormat
> (http://wiki.apache.org/hadoop/HowManyMapsAndReduces), and it is ignored
> in your case.
>
> You are processing gz files, and the InputFormat has an isSplitable
> method that returns false for gz files, so each map task processes a
> whole file. This is a property of gzip: you cannot uncompress part of a
> gzipped file; to uncompress it, you must read it from the beginning to
> the end.
>
> 2013/12/11 Dror, Ittay <idror@akamai.com>
>
>> Thank you.
>>
>> The command is:
>> hadoop jar /tmp/Algo-0.0.1.jar com.twitter.scalding.Tool com.akamai.Algo
>> --hdfs --header --input /algo/input{0..3}.gz --output /algo/output
>>
>> Btw, the Hadoop version is 1.2.1.
>>
>> Not sure what driver you are referring to.
>>
>> Regards,
>> Ittay
>>
>> From: Mirko Kämpf <mirko.kaempf@gmail.com>
>> Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
>> Date: Wednesday, December 11, 2013 6:21 PM
>> To: "user@hadoop.apache.org" <user@hadoop.apache.org>
>> Subject: Re: Why is Hadoop always running just 4 tasks?
>>
>> Hi,
>>
>> What is the command you execute to submit the job?
>> Please also share the driver code, so we can troubleshoot better.
>>
>> Best wishes,
>> Mirko
>>
>> 2013/12/11 Dror, Ittay <idror@akamai.com>
>>
>>> I have a cluster of 4 machines with 24 cores and 7 disks each.
>>>
>>> On each node I copied a 500G file from local disk, so I have 4 files in
>>> HDFS with many blocks. My replication factor is 1.
>>>
>>> I run a job (a Scalding flow), and while there are 96 reducers pending,
>>> there are only 4 active map tasks.
>>>
>>> What am I doing wrong? Below is the configuration.
>>>
>>> Thanks,
>>> Ittay
>>>
>>> <configuration>
>>>   <property>
>>>     <name>mapred.job.tracker</name>
>>>     <value>master:54311</value>
>>>   </property>
>>>
>>>   <property>
>>>     <name>mapred.map.tasks</name>
>>>     <value>96</value>
>>>   </property>
>>>
>>>   <property>
>>>     <name>mapred.reduce.tasks</name>
>>>     <value>96</value>
>>>   </property>
>>>
>>>   <property>
>>>     <name>mapred.local.dir</name>
>>>     <value>/hdfs/0/mapred/local,/hdfs/1/mapred/local,/hdfs/2/mapred/local,/hdfs/3/mapred/local,/hdfs/4/mapred/local,/hdfs/5/mapred/local,/hdfs/6/mapred/local,/hdfs/7/mapred/local</value>
>>>   </property>
>>>
>>>   <property>
>>>     <name>mapred.tasktracker.map.tasks.maximum</name>
>>>     <value>24</value>
>>>   </property>
>>>
>>>   <property>
>>>     <name>mapred.tasktracker.reduce.tasks.maximum</name>
>>>     <value>24</value>
>>>   </property>
>>> </configuration>
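
For reference, the splittability check described above boils down to
something like the sketch below. It is written against the Hadoop 1.x
org.apache.hadoop.mapred API, and the class name GzipAwareTextInputFormat
is made up for illustration (it is not part of Hadoop):

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.mapred.TextInputFormat;

// Sketch (hypothetical subclass): plain text files may be cut into many
// block-sized splits, but a file that resolves to a compression codec such
// as gzip is declared unsplittable and is processed whole by one mapper.
public class GzipAwareTextInputFormat extends TextInputFormat {

  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    CompressionCodec codec =
        new CompressionCodecFactory(fs.getConf()).getCodec(file);
    // null codec => uncompressed input => splittable (many map tasks).
    // A gzip codec => one map task per file, which is why four .gz inputs
    // can never yield more than four mappers.
    return codec == null;
  }
}

In other words, with four unsplittable .gz inputs the job can never have
more than four map tasks, no matter how high mapred.map.tasks or
mapred.tasktracker.map.tasks.maximum are set.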