hadoop-mapreduce-user mailing list archives

From Bejoy Ks <bejoy.had...@gmail.com>
Subject Re: hadoop
Date Sat, 07 Jan 2012 14:05:45 GMT
Hi Satish
      Please find some pointers inline

Problem - As per the documentation, file splits correspond to the number of map
tasks.  A file split is governed by the block size - 64 MB in hadoop-0.20.203.0.
Where can I find the default settings for various parameters like block size
and number of map/reduce tasks?

[Bejoy] I'd rather state it the other way round: the number of map tasks
triggered by an MR job is determined by the number of input splits (and the
input format). If you use TextInputFormat with default settings, the number of
input splits is equal to the number of HDFS blocks occupied by the input. The
size of an input split is equal to the HDFS block size by default (64 MB). If
you want more than one split per HDFS block, you need to set
mapred.max.split.size to a value less than 64 MB.
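
A minimal sketch of how that could be set at job level (this assumes the new
org.apache.hadoop.mapreduce API, whose FileInputFormat honours
mapred.max.split.size in the 0.20 line; the class name, the 32 MB figure and
the identity mapper are just placeholders for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallSplitsDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Cap each split at 32 MB, so one 64 MB block yields roughly two
    // input splits and therefore roughly two map tasks.
    conf.setLong("mapred.max.split.size", 32L * 1024 * 1024);

    Job job = new Job(conf, "small-splits-example");
    job.setJarByClass(SmallSplitsDriver.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setMapperClass(Mapper.class); // identity mapper, just for the example
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}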

You can find pretty much all of the default configuration values in the
downloaded tarball at
hadoop-0.20.*/src/mapred/mapred-default.xml
hadoop-0.20.*/src/hdfs/hdfs-default.xml
hadoop-0.20.*/src/core/core-default.xml

If you want to alter any of these values, you can set them in
$HADOOP_HOME/conf/mapred-site.xml
$HADOOP_HOME/conf/hdfs-site.xml
$HADOOP_HOME/conf/core-site.xml

The values provided in *-site.xml override the values in *-default.xml for the
corresponding configuration parameters. If a parameter is marked final in a
site file, it cannot be overridden further (for example at the job level).
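
If you want to check which value actually wins on your setup, a quick sketch
(my own illustration; it assumes $HADOOP_HOME/conf is on the classpath, as it
is when you run the class through bin/hadoop, and the class name is just a
placeholder):

import org.apache.hadoop.mapred.JobConf;

public class ShowEffectiveConf {
  public static void main(String[] args) {
    // JobConf layers core-default.xml, core-site.xml, mapred-default.xml and
    // mapred-site.xml; for a given property the *-site.xml value wins over
    // the *-default.xml value unless the property is marked final.
    JobConf conf = new JobConf();
    System.out.println("mapred.reduce.tasks   = "
        + conf.get("mapred.reduce.tasks"));
    System.out.println("mapred.max.split.size = "
        + conf.get("mapred.max.split.size", "(unset)"));
  }
}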

I require at least 10 map tasks, which is the same as the number of "line
feeds". Each corresponds to a complex calculation to be done by a map task, so
I can have optimal CPU utilization - 8 CPUs.

[Bejoy] Hadoop is a good choice for processing large amounts of data. It is not
wise to choose one mapper per record/line in a file, as the creation of a map
task is itself expensive, with JVM spawning and so on. Currently you may have
10 records in your input, but I believe you are just testing Hadoop in a dev
environment; in production that wouldn't be the case - there could be n files
with m records each, and this m can be in the millions (just assuming based on
my experience). On larger data sets you may not need to split on line
boundaries. There can be multiple lines in a file, and with TextInputFormat a
map task still processes just one line at any instant; if you have n map tasks,
then n lines can be processed at the same instant, one by each map task. On
larger data volumes, map tasks are spawned on specific nodes primarily based on
data locality, then on the available task slots on the data-local nodes, and so
on. If you have a 10-node cluster and 10 HDFS blocks corresponding to an input
file, and all of those blocks happen to be present on only 8 nodes that have
sufficient task slots available, then the tasks for your job may be executed on
those 8 nodes alone instead of 10. So there is a chance that CPU utilization
won't be 100% balanced across the nodes in a cluster.
               I'm not really sure how you can spawn map tasks based on line
feeds in a file. Let us wait for others to comment on this.
           Also, if you are using MapReduce for parallel computation alone,
then make sure you set the number of reducers to zero; with that you can save a
lot of time that would otherwise be spent on the sort and shuffle phases
(-D mapred.reduce.tasks=0).
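
To make that concrete, a minimal map-only driver could look like this (my own
sketch using the old mapred API and an identity mapper; the class name and
argument handling are placeholders):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MapOnlyJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MapOnlyJob.class);
    conf.setJobName("map-only-example");
    conf.setInputFormat(TextInputFormat.class);
    conf.setMapperClass(IdentityMapper.class);
    // Zero reducers: map output is written straight to the output directory
    // and the sort/shuffle phase is skipped entirely.
    conf.setNumReduceTasks(0);
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}

If the driver goes through ToolRunner/GenericOptionsParser instead, passing
-D mapred.reduce.tasks=0 on the command line has the same effect.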

The behaviour of map tasks looks strange to me, as sometimes if I give
jobconf.set(num map tasks) in the program it takes 2 or 8.

[Bejoy] There is no default value for the number of map tasks; it is determined
by the input splits and the input format used by your job. You cannot force the
number of map tasks: even if you set mapred.map.tasks at the job level it is
only treated as a hint and is generally not honoured. But you can definitely
specify the number of reduce tasks at the job level, via
job.setNumReduceTasks(n) or mapred.reduce.tasks. If not set, the default value
for reduce tasks specified in the conf files is used.
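
A tiny fragment to show the difference (again my own sketch with the old API;
the numbers are arbitrary):

import org.apache.hadoop.mapred.JobConf;

public class ReducerCountExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    conf.setNumMapTasks(10);   // only a hint; the real count comes from the splits
    conf.setNumReduceTasks(4); // honoured: 4 reduce tasks -> part-00000..part-00003
    // Equivalent to setNumReduceTasks(4):
    // conf.setInt("mapred.reduce.tasks", 4);
    System.out.println("mapred.reduce.tasks = " + conf.get("mapred.reduce.tasks"));
  }
}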


I see some files like part-00001...
Are they partitions?

[Bejoy] The part-000* files correspond to reducers. You'd have n such files if
you have n reducers, as each reducer produces one output file.

Hope it helps!..

Regards
Bejoy.KS


On Sat, Jan 7, 2012 at 3:32 PM, Satish Setty (HCL Financial Services) <
Satish.Setty@hcl.com> wrote:

>  Hi Bejoy,
>
> Just finished installation and tested sample applications.
>
> Problem - As per the documentation, file splits correspond to the number of map
> tasks.  A file split is governed by the block size - 64 MB in hadoop-0.20.203.0.
> Where can I find the default settings for various parameters like block size
> and number of map/reduce tasks?
>
> Is it possible to control the file split by "line feed - \n"? I tried giving
> sample input -> jobconf -> TextInputFormat
>
> date1
> date2
> date3
> .......
> ......
> date10
>
> But when I run it I see the number of map tasks = 2 or 1.
> I require at least 10 map tasks, which is the same as the number of "line
> feeds". Each corresponds to a complex calculation to be done by a map task,
> so I can have optimal CPU utilization - 8 CPUs.
>
> The behaviour of map tasks looks strange to me, as sometimes if I give
> jobconf.set(num map tasks) in the program it takes 2 or 8.  I see some files
> like part-00001...
> Are they partitions?
>
> Thanks
>  ------------------------------
> *From:* Satish Setty (HCL Financial Services)
> *Sent:* Friday, January 06, 2012 12:29 PM
> *To:* bejoy.hadoop@gmail.com
> *Subject:* FW: hadoop
>
>
>  Thanks Bejoy. Extremely useful information. We will try and come back.
> Does the web app [jobtracker web UI] require separate deployment, or does the
> application server container come built in with Hadoop?
>
> Regards
>
>  ------------------------------
> *From:* Bejoy Ks [bejoy.hadoop@gmail.com]
> *Sent:* Friday, January 06, 2012 12:54 AM
> *To:* mapreduce-user@hadoop.apache.org
> *Subject:* Re: hadoop
>
>  Hi Satish
>         Please find some pointers in line
>
> (a) How do we know the number of map tasks spawned?  Can this be controlled?
> We notice only 4 JVMs running on a single node - namenode, datanode,
> jobtracker, tasktracker. As we understand it, as many map tasks are spawned
> as there are splits, so we should see that many additional JVMs.
>
> [Bejoy] namenode, datanode, jobtracker, tasktracker and secondaryNameNode are
> the default daemons in Hadoop; they are not dependent on your tasks. Your
> custom tasks are launched in separate JVMs. You can control the maximum
> number of mappers running on each tasktracker at an instant by setting
> mapred.tasktracker.map.tasks.maximum. By default all tasks (map or reduce)
> are executed in individual JVMs, and once a task is completed its JVM is
> destroyed. You are right, by default one map task is launched per input split.
> Just check the jobtracker web UI (
> http://nameNodeHostName:50030/jobtracker.jsp); it would give you all the
> details on the job, including the number of map tasks spawned by the job. If
> you want to run multiple tasktracker and datanode instances on the same
> machine you need to ensure that there are no port conflicts.
>
> (b) Our mapper class should perform complex computations - it has plenty of
> dependent jars, so how do we add all the jars to the classpath while running
> the application? Since we need to perform parallel computations, we need many
> map tasks running in parallel with different data, all on the same machine in
> different JVMs.
>
> [Bejoy] If these dependent jars are used by almost all your applications,
> include them in the classpath on all your nodes (in your case just one node).
> Alternatively you can use the -libjars option while submitting your job.
> For more details refer to
>
> http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
>
> (c) How does the data split happen?  The JobClient does not talk about data
> splits. As we understand it, we format the distributed file system, run
> start-all.sh and then "hadoop fs -put". Does this write data to all the
> datanodes? We are unable to see the physical location. How does the split
> happen from this HDFS source?
>
> [Bejoy] Input files are split into blocks during the copy into HDFS itself;
> the size of each block is determined by the hadoop configuration of your
> cluster. The namenode decides on which datanodes these blocks and their
> replicas are to be placed, and these details are passed on to the client. The
> client copies each block to one datanode, and from that datanode the block is
> replicated to the other datanodes. The splitting of a file happens at the
> HDFS API level.
>
> thanks
>
