From: Bejoy Ks <bejoy.hadoop@gmail.com>
To: "Satish Setty (HCL Financial Services)" <Satish.Setty@hcl.com>
Cc: mapreduce-user@hadoop.apache.org
Date: Mon, 9 Jan 2012 23:13:12 +0530
Subject: Re: hadoop

Hi Satish

It would be good if you don't cross post your queries. Just post it once on
the right list.

What is your value for mapred.max.split.size? Try setting these values as
well:

mapred.min.split.size=0 (this is the default value)
mapred.max.split.size=40

Try executing your job once you apply these changes on top of the other
changes you made.
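A minimal sketch of setting these at the job level with the 0.20.x JobConf
API (class and path names are illustrative only). Note that in 0.20.x the
old-API org.apache.hadoop.mapred.FileInputFormat computes splits from
mapred.min.split.size plus the requested number of maps, while it is the
new-API org.apache.hadoop.mapreduce.lib.input.FileInputFormat that reads
mapred.max.split.size, so whether the 40-byte cap takes effect depends on
which API and InputFormat the job actually uses:

// Sketch: per-job split-size settings with the 0.20.x "old" mapred API.
// Class name, job name and paths are placeholders.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class TinySplitJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(TinySplitJob.class);
    conf.setJobName("tiny-split-demo");
    conf.setInputFormat(TextInputFormat.class);

    // 0 is already the default lower bound on split size.
    conf.set("mapred.min.split.size", "0");
    // Cap each split at 40 bytes, provided the InputFormat in use
    // consults this property (see the note above).
    conf.set("mapred.max.split.size", "40");

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}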
Regards
Bejoy.K.S

On Mon, Jan 9, 2012 at 5:09 PM, Satish Setty (HCL Financial Services)
<Satish.Setty@hcl.com> wrote:

> Hi Bejoy,
>
> Even with the settings below, the number of map tasks never goes beyond 2.
> Is there any way to make it spawn 10 tasks? Basically it should behave like
> a compute grid - computation in parallel.
>
> <property>
>   <name>io.bytes.per.checksum</name>
>   <value>30</value>
>   <description>The number of bytes per checksum. Must not be larger than
>   io.file.buffer.size.</description>
> </property>
>
> <property>
>   <name>dfs.block.size</name>
>   <value>30</value>
>   <description>The default block size for new files.</description>
> </property>
>
> <property>
>   <name>mapred.tasktracker.map.tasks.maximum</name>
>   <value>10</value>
>   <description>The maximum number of map tasks that will be run
>   simultaneously by a task tracker.</description>
> </property>
>
> ------------------------------
> From: Satish Setty (HCL Financial Services)
> Sent: Monday, January 09, 2012 1:21 PM
> To: Bejoy Ks
> Cc: mapreduce-user@hadoop.apache.org
> Subject: RE: hadoop
>
> Hi Bejoy,
>
> In HDFS I have set the block size to 40 bytes. The input data set is as
> below:
>
> data1   (5*8=40 bytes)
> data2
> ......
> data10
>
> But I still see only 2 map tasks spawned; there should have been at least
> 10 map tasks. I am not sure how it works internally. The line-feed approach
> does not work [as you explained below].
>
> Thanks
> ------------------------------
> From: Satish Setty (HCL Financial Services)
> Sent: Saturday, January 07, 2012 9:17 PM
> To: Bejoy Ks
> Cc: mapreduce-user@hadoop.apache.org
> Subject: RE: hadoop
>
> Thanks Bejoy - great information - will try it out.
>
> For the problem below I meant a single node with a high configuration: 8
> CPUs and 8 GB of memory. Hence the example of 10 data items separated by
> line feeds. We want to utilize the full power of the machine, so we want at
> least 10 map tasks - each task needs to perform a highly complex
> mathematical simulation. At present it looks like the split size (in bytes)
> of the file data is the only way to specify the number of map tasks, but I
> would prefer some criterion like a line feed.
>
> In the example below, 'data1' corresponds to 5*8=40 bytes; if I have data1
> .... data10, in theory I should see 10 map tasks with a split size of 40
> bytes.
>
> How do I perform logging - where is the log (Apache logger) data written?
> System.out may not show up, since the tasks run as background processes.
>
> Regards
>
> ------------------------------
> From: Bejoy Ks [bejoy.hadoop@gmail.com]
> Sent: Saturday, January 07, 2012 7:35 PM
> To: Satish Setty (HCL Financial Services)
> Cc: mapreduce-user@hadoop.apache.org
> Subject: Re: hadoop
>
> Hi Satish
>      Please find some pointers inline.
>
> Problem - As per the documentation, file splits correspond to the number of
> map tasks. The file split is governed by the block size - 64 MB in
> hadoop-0.20.203.0. Where can I find the default settings for the various
> parameters like block size and number of map/reduce tasks?
>
> [Bejoy] I'd rather state it the other way round: the number of map tasks
> triggered by an MR job is determined by the number of input splits (and the
> input format). If you use TextInputFormat with default settings, the number
> of input splits is equal to the number of HDFS blocks occupied by the
> input. The size of an input split is equal to the HDFS block size by
> default (64 MB). If you want more than one split per HDFS block, you need
> to set a value less than 64 MB for mapred.max.split.size.
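For reference, the split size that the new-API FileInputFormat computes is
usually described as max(minSplitSize, min(maxSplitSize, blockSize)). A small
worked sketch with the numbers from this thread (values illustrative, not a
claim about any particular cluster):

// Sketch: the split-size rule used by the new-API FileInputFormat,
// applied to the numbers discussed in this thread.
public class SplitSizeMath {
  static long computeSplitSize(long minSize, long maxSize, long blockSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
  }

  public static void main(String[] args) {
    long blockSize = 64L * 1024 * 1024;  // default dfs.block.size, 64 MB

    // With defaults, the split size equals the block size: one split per block.
    long withDefaults = computeSplitSize(0, Long.MAX_VALUE, blockSize);
    // With mapred.max.split.size=40, each split is capped at 40 bytes.
    long withCap = computeSplitSize(0, 40, blockSize);

    System.out.println("default split size: " + withDefaults);
    System.out.println("capped split size : " + withCap);
    // A ~400-byte input would then yield roughly 400 / 40 = 10 splits,
    // and hence roughly 10 map tasks.
  }
}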
>
> You can find pretty much all of the default configuration values in the
> downloaded tarball at
> hadoop-0.20.*/src/mapred/mapred-default.xml
> hadoop-0.20.*/src/hdfs/hdfs-default.xml
> hadoop-0.20.*/src/core/core-default.xml
>
> If you want to alter some of these values you can provide the same in
> $HADOOP_HOME/conf/mapred-site.xml
> $HADOOP_HOME/conf/hdfs-site.xml
> $HADOOP_HOME/conf/core-site.xml
>
> The values provided in *-site.xml are taken into account only if the
> corresponding parameter is not marked final in *-default.xml. If it is not
> final, the value provided in *-site.xml overrides the value in
> *-default.xml for that configuration parameter.
>
> I require at least 10 map tasks, which is the same as the number of "line
> feeds". Each corresponds to a complex calculation to be done by a map task,
> so I can have optimal CPU utilization - 8 CPUs.
>
> [Bejoy] Hadoop is a good choice for processing large amounts of data. It is
> not wise to choose one mapper per record/line in a file, as creating a map
> task is itself expensive, with JVM spawning and so on. Currently you may
> have 10 records in your input, but I believe you are just testing Hadoop in
> a dev environment; in production that wouldn't be the case - there could be
> n files having m records each, and m can be in the millions (just assuming
> based on my experience). On larger data sets you may not need to split on
> line boundaries. There can be multiple lines in a file, and if you use
> TextInputFormat, each map task processes one line at an instant; if you
> have n map tasks, then n lines can be processed at an instant, one by each
> map task. On larger data volumes, map tasks are spawned on specific nodes
> primarily based on data locality, then on available task slots on the
> data-local node, and so on. It is possible that if you have a 10-node
> cluster and 10 HDFS blocks corresponding to an input file, and all the
> blocks are present on only 8 nodes with sufficient task slots available on
> all 8, then the tasks for your job may be executed on those 8 nodes alone
> instead of 10. So there is a chance that CPU utilization won't be 100%
> balanced across the nodes in a cluster.
> I'm not really sure how you can spawn map tasks based on line feeds in a
> file. Let us wait for others to comment on this.
> Also, if you are using MapReduce for parallel computation alone, make sure
> you set the number of reducers to zero; with that you can save a lot of
> time that would otherwise be spent on the sort and shuffle phases.
> (-D mapred.reduce.tasks=0)
>
> The behaviour of map tasks looks strange to me: sometimes if I set the
> number of map tasks in the program via jobconf.set(...) it takes 2 or 8.
>
> [Bejoy] The number of map tasks is determined by the input splits and the
> input format used by your job; you cannot reliably set it. Even if you set
> mapred.map.tasks at the job level it is not guaranteed to be honoured. But
> you can definitely specify the number of reduce tasks at the job level with
> job.setNumReduceTasks(n) or mapred.reduce.tasks. If not set, it takes the
> default value for reduce tasks specified in the conf files.
>
> I see some files like part-00001...
> Are they partitions?
>
> [Bejoy] The part-000* files correspond to reducers. You'd have n files if
> you have n reducers, as one reducer produces one output file.
>
> Hope it helps!
>
> Regards
> Bejoy.KS
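A minimal sketch of such a map-only job with the 0.20.x API (IdentityMapper
stands in for the real simulation mapper; class names and paths are
illustrative):

// Sketch: a map-only job - zero reducers skips the sort/shuffle phase
// entirely, as suggested above.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MapOnlyJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MapOnlyJob.class);
    conf.setJobName("map-only-demo");

    conf.setMapperClass(IdentityMapper.class); // placeholder for the simulation mapper
    conf.setNumReduceTasks(0);                 // equivalent to -D mapred.reduce.tasks=0
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}

With zero reducers each map task writes its output file directly, so the
part-0000N files mentioned above then correspond to map tasks rather than
reducers.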
> On Sat, Jan 7, 2012 at 3:32 PM, Satish Setty (HCL Financial Services)
> <Satish.Setty@hcl.com> wrote:
>
>> Hi Bijoy,
>>
>> Just finished the installation and tested the sample applications.
>>
>> Problem - As per the documentation, file splits correspond to the number
>> of map tasks. The file split is governed by the block size - 64 MB in
>> hadoop-0.20.203.0. Where can I find the default settings for the various
>> parameters like block size and number of map/reduce tasks?
>>
>> Is it possible to control the file split by "line feed - \n"? I tried
>> giving sample input -> jobconf -> TextInputFormat
>>
>> date1
>> date2
>> date3
>> .......
>> ......
>> date10
>>
>> But when I run it I see the number of map tasks is 2 or 1.
>> I require at least 10 map tasks, which is the same as the number of "line
>> feeds". Each corresponds to a complex calculation to be done by a map
>> task, so I can have optimal CPU utilization - 8 CPUs.
>>
>> The behaviour of map tasks looks strange to me: sometimes if I set the
>> number of map tasks in the program via jobconf.set(...) it takes 2 or 8.
>> I see some files like part-00001...
>> Are they partitions?
>>
>> Thanks
>> ------------------------------
>> From: Satish Setty (HCL Financial Services)
>> Sent: Friday, January 06, 2012 12:29 PM
>> To: bejoy.hadoop@gmail.com
>> Subject: FW: hadoop
>>
>> Thanks Bejoy. Extremely useful information. We will try it and come back.
>> Regarding the web application [jobtracker web UI] - does this require
>> deployment, or does the application server container come built in with
>> Hadoop?
>>
>> Regards
>>
>> ------------------------------
>> From: Bejoy Ks [bejoy.hadoop@gmail.com]
>> Sent: Friday, January 06, 2012 12:54 AM
>> To: mapreduce-user@hadoop.apache.org
>> Subject: Re: hadoop
>>
>> Hi Satish
>>      Please find some pointers inline.
>>
>> (a) How do we know the number of map tasks spawned? Can this be
>> controlled? We notice only 4 JVMs running on a single node - namenode,
>> datanode, jobtracker, tasktracker. As we understand it, one map task is
>> spawned per split, so we should see that many additional JVMs.
>>
>> [Bejoy] namenode, datanode, jobtracker, tasktracker and secondaryNameNode
>> are the default Hadoop processes; they do not depend on your tasks. Your
>> custom tasks are launched in separate JVMs. You can control the maximum
>> number of mappers running on each tasktracker at an instant by setting
>> mapred.tasktracker.map.tasks.maximum. By default every task (map or
>> reduce) is executed in its own JVM, and once the task is completed the
>> JVM is destroyed. You are right that, by default, one map task is
>> launched per input split.
>> Just check the jobtracker web UI
>> (http://nameNodeHostName:50030/jobtracker.jsp); it gives you all the
>> details of a job, including the number of map tasks spawned by it. If you
>> want to run multiple tasktracker and datanode instances on the same
>> machine you need to ensure that there are no port conflicts.
>>
>> (b) Our mapper class has to perform complex computations and has plenty
>> of dependent jars, so how do we add all those jars to the classpath while
>> running the application? Since we need to perform parallel computations,
>> we need many map tasks running in parallel on different data, all on the
>> same machine in different JVMs.
>>
>> [Bejoy] If these dependent jars are used by almost all your applications,
>> include them in the classpath of all your nodes (in your case, just one
>> node). Alternatively you can use the -libjars option while submitting
>> your job. For more details refer to
>> http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
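One caveat worth noting: -libjars is handled by GenericOptionsParser, so the
job driver generally has to go through ToolRunner (or parse the generic
options itself) for the option to be picked up. A sketch, with placeholder
class, jar and path names:

// Sketch: a driver that implements Tool so that generic options such as
// -libjars are parsed before the job-specific arguments.
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SimulationDriver extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    // getConf() carries whatever the generic options (e.g. -libjars) set up.
    JobConf conf = new JobConf(getConf(), SimulationDriver.class);
    conf.setJobName("simulation");
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // Example submission (jar names and paths are placeholders):
    //   hadoop jar simulation.jar SimulationDriver \
    //       -libjars dep1.jar,dep2.jar /input /output
    System.exit(ToolRunner.run(new SimulationDriver(), args));
  }
}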
>>
>> (c) How does the data split happen? JobClient does not talk about data
>> splits. As we understand it, we format the distributed file system, run
>> start-all.sh and then "hadoop fs -put". Does this write data to all
>> datanodes? We are unable to see the physical location. How does the split
>> happen from this HDFS source?
>>
>> [Bejoy] Input files are split into blocks during the copy into HDFS
>> itself; the size of each block is determined by the Hadoop configuration
>> of your cluster. The name node decides on which datanodes these blocks,
>> including their replicas, are to be placed, and these details are passed
>> on to the client. The client copies each block to one datanode, and from
>> that datanode the block is replicated to the other datanodes. The
>> splitting of a file happens at the HDFS API level.
>>
>> thanks
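On the "unable to see the physical location" point: hadoop fsck <path> -files
-blocks -locations reports block placement from the command line, and the
same information is available programmatically through the FileSystem API. A
sketch (the path argument is a placeholder):

// Sketch: listing the datanodes that hold each block of an HDFS file,
// making the block placement described above visible.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // picks up *-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);
    FileStatus status = fs.getFileStatus(new Path(args[0]));
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (int i = 0; i < blocks.length; i++) {
      System.out.println("block " + i + " on hosts: "
          + java.util.Arrays.toString(blocks[i].getHosts()));
    }
  }
}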