hadoop-mapreduce-user mailing list archives

From bejoy.had...@gmail.com
Subject Re: hadoop
Date Tue, 10 Jan 2012 03:57:14 GMT
Hi Satish
       After changing dfs.block.size to 40, did you recopy the files? Changing dfs.block.size
won't affect the existing files in HDFS; it only applies to new files you copy in. In short,
with dfs.block.size=40, mapred.min.split.size=0 and mapred.max.split.size=40 in place, do a
copyFromLocal and try executing your job on this newly copied data.
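
For reference, a minimal sketch (not from this thread; the paths are made up) of what that
recopy could look like in code, assuming the 0.20 FileSystem API. The same thing can be done
with hadoop fs -copyFromLocal once the properties are in hdfs-site.xml; the split-size
properties only matter at job-submission time, so they belong on the job configuration rather
than on the copy.

    // Hypothetical sketch: re-copy the input so the new file is written with
    // 40-byte HDFS blocks. Paths are invented for illustration.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RecopyWithSmallBlocks {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Block size is fixed at write time, so it must be set before the copy.
            conf.set("dfs.block.size", "40");        // bytes, the value discussed in the thread
            conf.set("io.bytes.per.checksum", "40"); // block size must be a multiple of this
            FileSystem fs = FileSystem.get(conf);
            // Roughly equivalent to: hadoop fs -copyFromLocal input.txt /user/satish/input.txt
            fs.copyFromLocalFile(new Path("input.txt"), new Path("/user/satish/input.txt"));
        }
    }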


Regards
Bejoy K S

-----Original Message-----
From: "Satish Setty (HCL Financial Services)" <Satish.Setty@hcl.com>
Date: Tue, 10 Jan 2012 08:57:37 
To: Bejoy Ks<bejoy.hadoop@gmail.com>
Cc: mapreduce-user@hadoop.apache.org
Subject: RE: hadoop

  
Hi Bejoy, 
  

 Thanks for the help. I changed the values mapred.min.split.size=0 and mapred.max.split.size=40,
but the job counters do not reflect any change.
For posting, kindly let me know the correct link/mail id - at present I am sending directly to
your account [Bejoy Ks, bejoy.hadoop@gmail.com], which has been a great help to me.
  
Posting to the group account mapreduce-user@hadoop.apache.org bounces back.
  
 
 
 Group                         Counter                            Map      Reduce   Total
 File Input Format Counters    Bytes Read                         61       0        61
 Job Counters                  SLOTS_MILLIS_MAPS                  0        0        3,886
                               Launched map tasks                 0        0        2
                               Data-local map tasks               0        0        2
 FileSystemCounters            HDFS_BYTES_READ                    267      0        267
                               FILE_BYTES_WRITTEN                 58,134   0        58,134
 Map-Reduce Framework          Map output materialized bytes      0        0        0
                               Combine output records             0        0        0
                               Map input records                  9        0        9
                               Spilled Records                    0        0        0
                               Map output bytes                   70       0        70
                               Map input bytes                    54       0        54
                               SPLIT_RAW_BYTES                    206      0        206
                               Map output records                 7        0        7
                               Combine input records              0        0        0
 
----------------
 From: Bejoy Ks [bejoy.hadoop@gmail.com]
 Sent: Monday, January 09, 2012 11:13 PM
 To: Satish Setty (HCL Financial Services)
 Cc: mapreduce-user@hadoop.apache.org
 Subject: Re: hadoop
 
 
 
Hi Satish
       It would be good if you don't cross post your queries. Just post it once on the
right list.
 
       What is your value for mapred.max.split.size? Try setting these values as well

 mapred.min.split.size=0 (it is the default value)
 mapred.max.split.size=40
 
 Try executing your job once you apply these changes on top of the others you already made. 
 
 Regards
 Bejoy.K.S
 
 
On Mon, Jan 9, 2012 at 5:09 PM, Satish Setty (HCL Financial Services) <Satish.Setty@hcl.com> wrote:
 
 
Hi Bejoy, 
  
Even with the settings below, map tasks never go beyond 2. Is there any way to make this spawn
10 tasks? Basically it should work like a compute grid - computation in parallel. 
  
<property>
   <name>io.bytes.per.checksum</name>
   <value>30</value>
   <description>The number of bytes per checksum.  Must not be larger than
   io.file.buffer.size.</description>
 </property> 

 <property>
   <name>dfs.block.size</name>
    <value>30</value>
   <description>The default block size for new files.</description>
 </property>
 
<property>
   <name>mapred.tasktracker.map.tasks.maximum</name>
   <value>10</value>
   <description>The maximum number of map tasks that will be run
   simultaneously by a task tracker.
   </description>
 </property>
 
  
 
----------------
 
From: Satish Setty (HCL Financial Services)
 Sent: Monday, January 09, 2012 1:21 PM
 To: Bejoy Ks
 Cc: mapreduce-user@hadoop.apache.org
 Subject: RE: hadoop
 
Hi Bejoy, 
  
In HDFS I have set the block size to 40 bytes. The input data set is as below: 
data1   (5*8=40 bytes) 
data2 
...... 
data10 
  
  
But I still see only 2 map tasks spawned; there should have been at least 10. I am not sure
how this works internally. Splitting on line feeds does not work [as you have explained below]. 
  
Thanks 
 
----------------
 From: Satish Setty (HCL Financial Services)
 Sent: Saturday, January 07, 2012 9:17 PM
 To: Bejoy Ks
 Cc: mapreduce-user@hadoop.apache.org
 Subject: RE: hadoop
 
 
 
 
Thanks Bejoy - great information - will try out. 
  
For the problem below I meant a single node with a high configuration: 8 CPUs and 8 GB memory.
Hence I am taking an example of 10 data items separated by line feeds. We want to utilize the
full power of the machine, so we want at least 10 map tasks; each task needs to perform a highly
complex mathematical simulation. At present it looks like split size (in bytes) is the only way
to control the number of map tasks, but I would prefer some criterion like line feeds or similar.

  
In the example below, 'data1' corresponds to 5*8=40 bytes; if I have data1 ... data10, in theory
I should see 10 map tasks with a split size of 40 bytes. 
  
How do I perform logging - where is the log (Apache logger) data written? System.out output may
not show up, as the tasks run as background processes. 
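
For reference, a minimal sketch of mapper-side logging with Apache Commons Logging, assuming
the old 0.20 mapred API; the class and key/value types are invented. On 0.20 the per-task logs
(including anything written to System.out) typically end up under logs/userlogs/<attempt-id>/
on the tasktracker node, and are also reachable from the task pages of the jobtracker web UI.

    // Hypothetical example: logging from inside a mapper with Commons Logging.
    import java.io.IOException;
    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class SimulationMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        private static final Log LOG = LogFactory.getLog(SimulationMapper.class);

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            // Goes to the task's syslog file under logs/userlogs/<attempt-id>/
            LOG.info("Processing record: " + value);
            // System.out.println(...) would land in the stdout file in the same directory.
            output.collect(new Text("result"), value);
        }
    }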
  
Regards 
  
  
 
----------------
 From: Bejoy Ks [bejoy.hadoop@gmail.com]
 Sent: Saturday, January 07, 2012 7:35 PM
 To: Satish Setty (HCL Financial Services)
 Cc: mapreduce-user@hadoop.apache.org
 Subject: Re: hadoop
 
 
 
Hi Satish
       Please find some pointers inline
 
 Problem - As per the documentation, file splits correspond to the number of map tasks. File
split size is governed by the block size - 64 MB in hadoop-0.20.203.0. Where can I find the
default settings for various parameters like block size and number of map/reduce tasks?
 
 [Bejoy] I'd rather state it the other way round: the number of map tasks triggered by an MR job
is determined by the number of input splits (and the input format). If you use TextInputFormat
with default settings, the number of input splits is equal to the number of HDFS blocks occupied
by the input. By default the size of an input split is equal to the HDFS block size (64 MB). If
you want more splits within a single HDFS block, you need to set mapred.max.split.size to a
value less than 64 MB. 
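
 For illustration, a rough driver sketch (not from the thread; the class name and paths are
invented) using the org.apache.hadoop.mapreduce API, whose FileInputFormat helpers correspond
to the mapred.min.split.size / mapred.max.split.size properties mentioned above:

    // Hypothetical driver: cap the split size so that one small HDFS block can
    // still be carved into several input splits (and so several map tasks).
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SmallSplitJob {
        public static class PassThroughMapper
                extends Mapper<LongWritable, Text, LongWritable, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws java.io.IOException, InterruptedException {
                context.write(key, value); // stand-in for the real computation
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "small-split-demo");
            job.setJarByClass(SmallSplitJob.class);
            job.setMapperClass(PassThroughMapper.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setNumReduceTasks(0); // map-only, as suggested later in the thread
            // One split will not exceed 40 bytes, so a 400-byte file yields ~10 splits.
            FileInputFormat.setMinInputSplitSize(job, 0);
            FileInputFormat.setMaxInputSplitSize(job, 40);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }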
 
 You can find pretty much all default configuration values from the downloaded .tar at
 hadoop-0.20.*/src/mapred/mapred-default.xml
 hadoop-0.20.*/src/hdfs/hdfs-default.xml
 hadoop-0.20.*/src/core/core-default.xml
 
 If you want to alter some of these values then you can provide the same in 
 $HADOOP_HOME/conf/mapred-site.xml
 $HADOOP_HOME/conf/hdfs-site.xml
 $HADOOP_HOME/conf/core-site.xml
 
 The values provided in *-site.xml are taken into account only if they are not marked as final
in *-default.xml. If not final, the values provided in *-site.xml override the values in
*-default.xml for the corresponding configuration parameter.
 
 I require at least 10 map tasks, which is the same as the number of "line feeds". Each
corresponds to a complex calculation to be done by a map task, so I can have optimal CPU
utilization - 8 CPUs.
 
 [Bejoy] Hadoop is a good choice for processing large amounts of data. It is not wise to choose
one mapper per record/line in a file, as creation of a map task is itself expensive, with JVM
spawning and so on. Currently you may have 10 records in your input, but I believe you are just
testing Hadoop in a dev environment; in production that wouldn't be the case - there could be
n files having m records each, and this m can be in the millions (just assuming based on my
experience). On larger data sets you may not need to split on line boundaries. There can be
multiple lines in a file, and if you use TextInputFormat just one line is processed by a map
task at an instant; if you have n map tasks then n lines could be getting processed at any
instant of the map execution time frame, one by each map task. On larger data volumes map tasks
are spawned on specific nodes primarily based on data locality, then on available task slots on
the data-local node, and so on. It is possible that if you have a 10-node cluster and 10 HDFS
blocks corresponding to an input file, and all the blocks happen to be present on only 8 nodes
with sufficient task slots available on all 8, then tasks for your job may be executed on those
8 nodes alone instead of 10. So there are chances that there won't be 100% balanced CPU
utilization across nodes in a cluster. 
                I'm not really sure how you can spawn map tasks based on line
feeds in a file. Let us wait for others to comment on this. 
            Also, if you are using MapReduce for parallel computation alone, then make sure you
set the number of reducers to zero; with that you can save a lot of time that would otherwise
be spent on the sort and shuffle phases. 
 (-D mapred.reduce.tasks=0)
 
 
The behaviour of map tasks looks strange to me: sometimes, if I set the number of map tasks in
the program via jobconf.set(...), it takes 2 or 8.  
 
 [Bejoy] There is no default value for the number of map tasks; it is determined by the input
splits and the input format used by your job. You cannot force the number of map tasks: even if
you set mapred.map.tasks at the job level, it is not honored. But you can definitely specify the
number of reduce tasks at the job level via job.setNumReduceTasks(n) or mapred.reduce.tasks; if
not set, it takes the default value for reduce tasks specified in the conf files.
 
 
 I see some files like part-00001... 
Are they partitions? 
 [Bejoy] The part-000* files correspond to reducers. You'd have n files if you have n reducers,
as one reducer produces one output file.
 
 Hope it helps!..
 
 Regards
 Bejoy.KS
 
 
 
On Sat, Jan 7, 2012 at 3:32 PM, Satish Setty (HCL Financial Services) <Satish.Setty@hcl.com> wrote:
 
 
Hi Bejoy, 
 
Just finished installation and tested sample applications. 
  
Problem - As per the documentation, file splits correspond to the number of map tasks. File
split size is governed by the block size - 64 MB in hadoop-0.20.203.0. Where can I find the
default settings for various parameters like block size and number of map/reduce tasks? 
  
Is it possible to control the file split by line feed ("\n")? I tried giving the sample input
below with jobconf -> TextInputFormat: 
  
date1   
date2 
date3 
....... 
...... 
date10 
  
But when I run it I see the number of map tasks = 2 or 1. 
I require at least 10 map tasks, which is the same as the number of "line feeds". Each
corresponds to a complex calculation to be done by a map task, so I can have optimal CPU
utilization - 8 CPUs.
  
The behaviour of map tasks looks strange to me: sometimes, if I set the number of map tasks in
the program via jobconf.set(...), it takes 2 or 8.  I see some files like part-00001... 
Are they partitions? 
  
Thanks 
 
----------------
 From: Satish Setty (HCL Financial Services)
 Sent: Friday, January 06, 2012 12:29 PM
 To: bejoy.hadoop@gmail.com
 Subject: FW: hadoop
 
Thanks Bejoy, extremely useful information. We will try it out and come back. Regarding the web
application [jobtracker web UI]: does this require deployment, or does an application server
container come built in with Hadoop? 
  
Regards 
  
 
----------------
 From: Bejoy Ks [bejoy.hadoop@gmail.com]
 Sent: Friday, January 06, 2012 12:54 AM
 To: mapreduce-user@hadoop.apache.org
 Subject: Re: hadoop
 
Hi Satish
         Please find some pointers in line
 
 (a) How do we know the number of map tasks spawned? Can this be controlled? We notice only
4 JVMs running on a single node: namenode, datanode, jobtracker and tasktracker. As we understand
it, one map task is spawned per split, so we should see a corresponding increase in the number
of JVMs.
 
 [Bejoy] namenode, datanode, jobtracker, tasktracker and secondaryNameNode are the default Hadoop
processes; they do not depend on your tasks. Your tasks are launched in separate JVMs. You can
control the maximum number of mappers running on a tasktracker at any instant by setting
mapred.tasktracker.map.tasks.maximum. By default all tasks (map or reduce) are executed in
individual JVMs, and once a task is completed its JVM is destroyed. You are right, by default
one map task is launched per input split.
 Just check the jobtracker web UI (http://nameNodeHostName:50030/jobtracker.jsp); it gives you
all the details on the job, including the number of map tasks spawned by it. If you want to run
multiple tasktracker and datanode instances on the same machine, you need to ensure that there
are no port conflicts.
 
 (b) Our mapper class should perform complex computations and has plenty of dependent jars, so
how do we add all the jars to the classpath while running the application? Since we need to
perform parallel computations, we need many map tasks running in parallel on different data,
all on the same machine in different JVMs.
 
 [Bejoy] If these dependent jars are used by almost all your applications, include them in the
classpath of all your nodes (in your case just one node). Alternatively you can use the -libjars
option while submitting your job. For more details refer to
 http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
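
 A hedged sketch of the -libjars route (driver and jar names are invented): generic options such
as -libjars and -D key=value are only parsed when the driver goes through GenericOptionsParser,
which is usually done by implementing Tool and launching via ToolRunner:

    // Hypothetical driver skeleton showing the Tool/ToolRunner pattern that lets
    // generic options such as -libjars (and -D key=value) be applied to the job.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class SimulationDriver extends Configured implements Tool {
        public int run(String[] args) throws Exception {
            JobConf conf = new JobConf(getConf(), SimulationDriver.class);
            conf.setJobName("simulation");
            // Mapper, input/output formats, split sizes, etc. would be set here.
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);
            return 0;
        }

        public static void main(String[] args) throws Exception {
            // Example invocation (jar names are made up):
            //   hadoop jar simulation.jar SimulationDriver \
            //       -libjars mathlib.jar,stats.jar /input /output
            System.exit(ToolRunner.run(new Configuration(), new SimulationDriver(), args));
        }
    }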
 
 (c) How does the data split happen? JobClient does not talk about data splits. As we understand
it, we format the distributed file system, run start-all.sh and then "hadoop fs -put". Does this
write the data to all datanodes? We are unable to see the physical location. How does the split
happen from this HDFS source?
 
 [Bejoy] Input files are split into blocks during the copy into HDFS itself; the size of each
block is determined from the Hadoop configuration of your cluster. The namenode decides which
datanodes each block and its replicas are to be placed on, and these details are passed on to
the client. The client copies each block to one datanode, and from that datanode the block is
replicated to the other datanodes. The splitting of a file into blocks happens at the HDFS API
level.
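
 On the "unable to see the physical location" point: block placement can be inspected with
hadoop fsck <path> -files -blocks -locations, or programmatically with a small sketch like the
following (the path is hypothetical):

    // Hypothetical sketch: list which datanodes hold each block of a file in HDFS.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockLocations {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            FileStatus status = fs.getFileStatus(new Path("/user/satish/input.txt"));
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset=" + block.getOffset()
                        + " length=" + block.getLength()
                        + " hosts=" + java.util.Arrays.toString(block.getHosts()));
            }
        }
    }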
 
 thanks 

 