Subject: Re: number of mapper tasks
From: Vinod Kumar Vavilapalli <vinodkv@hortonworks.com>
To: user@hadoop.apache.org
Date: Tue, 29 Jan 2013 12:08:22 -0800

Tried looking at your code; it's a bit involved. Instead of trying to run
the whole job, unit-test your input format. Test getSplits(): whatever
number of splits that method returns will be the number of mappers that
run.

You can also use the LocalJobRunner for this - set mapred.job.tracker to
"local" and run your job locally on your machine instead of on a cluster.

HTH,
+Vinod
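A minimal sketch of such a test, assuming the CSVNLineInputFormat from the
GitHub links below is on the classpath in the package shown there, and the
Hadoop 1.x mapreduce API (the input path is a placeholder):

    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CSVNLineInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitCountCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Run in-process (LocalJobRunner) against the local
            // filesystem instead of a cluster.
            conf.set("mapred.job.tracker", "local");
            conf.set("fs.default.name", "file:///");

            Job job = new Job(conf);
            // Placeholder: point this at a real sample file.
            FileInputFormat.addInputPath(job, new Path("/tmp/sample.csv"));

            // One map task is launched per split returned here.
            CSVNLineInputFormat inputFormat = new CSVNLineInputFormat();
            List<InputSplit> splits = inputFormat.getSplits(job);
            System.out.println("splits = " + splits.size());
        }
    }

Running this once against a plain file and once against the zipped version
should show directly where the single split comes from.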
On Tue, Jan 29, 2013 at 4:53 AM, Marcelo Elias Del Valle <mvallebr@gmail.com> wrote:

> Hello,
>
> I have been able to make this work. I don't know why, but when the
> input file is zipped (read as an input stream) it creates only 1 mapper.
> However, when it's not zipped, it creates more mappers (running 3
> instances it created 4 mappers, and running 5 instances it created 8
> mappers).
> I really would like to know why this happens, and even with this number
> of mappers, I would like to know why more mappers aren't created. I was
> reading part of the book "Hadoop: The Definitive Guide" (
> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats)
> which says:
>
> "The JobClient calls the getSplits() method, passing the desired number
> of map tasks as the numSplits argument. This number is treated as a hint,
> as InputFormat implementations are free to return a different number of
> splits to the number specified in numSplits. Having calculated the
> splits, the client sends them to the jobtracker, which uses their storage
> locations to schedule map tasks to process them on the tasktrackers. ..."
>
> I am not sure how to get more info.
>
> Would you recommend trying to find the answer in the book? Or should I
> read the Hadoop source code directly?
>
> Best regards,
> Marcelo.
>
>
> 2013/1/29 Marcelo Elias Del Valle <mvallebr@gmail.com>
>
>> I implemented my custom input format. Here is how I used it:
>>
>> https://github.com/mvallebr/CSVInputFormat/blob/master/src/test/java/org/apache/hadoop/mapreduce/lib/input/test/CSVTestRunner.java
>>
>> As you can see, I do:
>> importerJob.setInputFormatClass(CSVNLineInputFormat.class);
>>
>> And here are the input format and the line reader:
>>
>> https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVNLineInputFormat.java
>> https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVLineRecordReader.java
>>
>> In this input format, I completely ignore the size-based split
>> parameters and compute the splits by number of lines. The number of
>> lines per map can be controlled by the same parameter used in
>> NLineInputFormat:
>>
>> public static final String LINES_PER_MAP =
>>     "mapreduce.input.lineinputformat.linespermap";
>>
>> However, it really has no effect on the number of maps.
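For comparison, this is how the stock NLineInputFormat consumes that same
key - a minimal sketch, where the job name and the value 500 are only
illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

    public class LinesPerMapExample {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "lines-per-map-example");
            job.setInputFormatClass(NLineInputFormat.class);
            // Sets mapreduce.input.lineinputformat.linespermap = 500:
            // an input of N lines then yields about N/500 splits, and
            // therefore about N/500 map tasks.
            NLineInputFormat.setNumLinesPerSplit(job, 500);
        }
    }

If a custom format reads the same key but the map count never moves, the
getSplits() unit test sketched above is the quickest way to see whether
the key is actually being honored.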
>>
>> 2013/1/29 Vinod Kumar Vavilapalli <vinodkv@hortonworks.com>
>>
>>> Regarding your original question, you can use the min and max split
>>> settings to control the number of maps:
>>> http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html
>>> See #setMinInputSplitSize and #setMaxInputSplitSize, or use
>>> mapred.min.split.size directly.
>>>
>>> W.r.t. your custom input format, are you sure your job is using this
>>> InputFormat and not the default one?
>>>
>>> HTH,
>>> +Vinod Kumar Vavilapalli
>>> Hortonworks Inc.
>>> http://hortonworks.com/
>>>
>>> On Jan 28, 2013, at 12:56 PM, Marcelo Elias Del Valle wrote:
>>>
>>> Just to complement the last question, I have implemented the getSplits
>>> method in my input format:
>>>
>>> https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVNLineInputFormat.java
>>>
>>> However, it still doesn't create more than 2 map tasks. Is there
>>> something I could do to ensure more map tasks are created?
>>>
>>> Thanks,
>>> Marcelo.
>>>
>>>
>>> 2013/1/28 Marcelo Elias Del Valle <mvallebr@gmail.com>
>>>
>>>> Sorry for asking too many questions, but the answers are really
>>>> helping.
>>>>
>>>> 2013/1/28 Harsh J <harsh@cloudera.com>
>>>>
>>>>> This seems CPU-oriented. You probably want the NLineInputFormat? See
>>>>> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html
>>>>> This should let you spawn more maps as well, based on your N factor.
>>>>
>>>> Indeed, CPU is my bottleneck. That's why I want more things running
>>>> in parallel. Actually, I wrote my own InputFormat to be able to
>>>> process multiline CSVs: https://github.com/mvallebr/CSVInputFormat
>>>> I could change it to read several lines at a time, but would this
>>>> alone allow more tasks to run in parallel?
>>>>
>>>>> Not really - "slots" are capacities rather than split factors
>>>>> themselves. You can have N slots always available, but your job has
>>>>> to supply as many map tasks (based on its input/needs/etc.) to use
>>>>> them up.
>>>>
>>>> But how can I do that (supply map tasks) in my job? By changing its
>>>> code? Hadoop config?
>>>>
>>>>> Unless your job sets the number of reducers to 0 manually, 1 default
>>>>> reducer is always run that waits to see if it has any outputs from
>>>>> maps. If it does not receive any outputs after maps have all
>>>>> completed, it dies out with behavior equivalent to a NOP.
>>>>
>>>> Ok, I did job.setNumReduceTasks(0); I guess this will solve that
>>>> part, thanks!
>>>>
>>>> --
>>>> Marcelo Elias Del Valle
>>>> http://mvalle.com - @mvallebr

--
+Vinod
Hortonworks Inc.
http://hortonworks.com/
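Put together, the suggestions in this thread amount to a driver along
these lines - a minimal sketch, where the class name, paths, and split
sizes are placeholders, and the split-size calls only affect formats that
keep FileInputFormat's default split computation:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ImportDriver {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "csv-import");

            // Force more splits out of a FileInputFormat-based format:
            // with a 16 MB cap, a 128 MB input yields roughly 8 maps.
            FileInputFormat.setMinInputSplitSize(job, 1L);
            FileInputFormat.setMaxInputSplitSize(job, 16 * 1024 * 1024L);

            // Map-only job: no default reducer waiting on map output.
            job.setNumReduceTasks(0);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

A format whose getSplits() ignores these settings - like the line-based
one discussed above - won't react to the two split-size calls at all,
which is exactly what the unit test makes visible.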