hadoop-user mailing list archives

From: java8964 java8964 <java8...@hotmail.com>
Subject: RE: running map tasks in remote node
Date: Thu, 22 Aug 2013 11:10:53 GMT
If you don't plan to use HDFS, what kind of shared file system are you going to use across
the cluster? NFS? For what you want to do, even though it doesn't make much sense, the first
problem you need to solve is the shared file system.
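For instance, as a rough sketch (assuming the shared directory were an NFS mount at /mnt/shared,
visible at the same path on every node; the path and class name are illustrative only), each node
would have to be able to reach the data through file:// URLs on that mount:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Quick check that a node can see the shared (non-HDFS) input folder.
public class SharedMountCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Use the local/NFS file system instead of HDFS.
    FileSystem fs = FileSystem.get(URI.create("file:///"), conf);
    // Every node must produce the same listing for the job to work.
    for (FileStatus status : fs.listStatus(new Path("file:///mnt/shared/input"))) {
      System.out.println(status.getPath() + "\t" + status.getLen());
    }
  }
}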
Second, if you want to process the files file by file, instead of block by block as in HDFS,
then you need to use a WholeFileInputFormat (google how to write one). That way you don't
need a file listing all the files to be processed; just put them into one folder on the shared
file system and pass that folder to your MR job. As long as each node can access the folder
through some file system URL, each file will be processed by its own mapper.
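A minimal sketch of such a WholeFileInputFormat, written against the new
org.apache.hadoop.mapreduce API (the class and reader names are just illustrative),
could look like this:

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Emits one record per file: the key is empty, the value is the whole file content.
public class WholeFileInputFormat
    extends FileInputFormat<NullWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    // Never split a file, so one file maps to exactly one map task.
    return false;
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new WholeFileRecordReader();
  }

  // Reads the entire file backing the split as a single record.
  public static class WholeFileRecordReader
      extends RecordReader<NullWritable, BytesWritable> {

    private FileSplit fileSplit;
    private TaskAttemptContext context;
    private final BytesWritable value = new BytesWritable();
    private boolean processed = false;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
      this.fileSplit = (FileSplit) split;
      this.context = context;
    }

    @Override
    public boolean nextKeyValue() throws IOException {
      if (processed) {
        return false;
      }
      byte[] contents = new byte[(int) fileSplit.getLength()];
      Path file = fileSplit.getPath();
      FileSystem fs = file.getFileSystem(context.getConfiguration());
      FSDataInputStream in = null;
      try {
        in = fs.open(file);
        IOUtils.readFully(in, contents, 0, contents.length);
        value.set(contents, 0, contents.length);
      } finally {
        IOUtils.closeStream(in);
      }
      processed = true;
      return true;
    }

    @Override
    public NullWritable getCurrentKey() {
      return NullWritable.get();
    }

    @Override
    public BytesWritable getCurrentValue() {
      return value;
    }

    @Override
    public float getProgress() {
      return processed ? 1.0f : 0.0f;
    }

    @Override
    public void close() {
      // Nothing to clean up; the stream is closed in nextKeyValue().
    }
  }
}

In the driver you would then call job.setInputFormatClass(WholeFileInputFormat.class) and point
FileInputFormat.addInputPath() at the shared folder, e.g. new Path("file:///mnt/shared/input").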
Yong

Date: Wed, 21 Aug 2013 17:39:10 +0530
Subject: running map tasks in remote node
From: rabmdu@gmail.com
To: user@hadoop.apache.org

Hello, 
Here is the newbie question of the day. For one of my use cases, I want to use Hadoop MapReduce
without HDFS. Here, I will have a text file containing a list of file names to process.
Assume that I have 10 lines (10 files to process) in the input text file and I wish to generate
10 map tasks and execute them in parallel on 10 nodes. I started with the basic Hadoop tutorial,
set up a single-node Hadoop cluster, and successfully tested the wordcount code.
Now, I took two machines, A (master) and B (slave), and did the configuration below on these
machines to set up a two-node cluster.
hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/tmp/hadoop-bala/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/tmp/hadoop-bala/dfs/data</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>A:9001</value>
  </property>
</configuration>

mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>A:9001</value>
  </property>
  <property>
    <name>mapreduce.tasktracker.map.tasks.maximum</name>
    <value>1</value>
  </property>
</configuration>

core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://A:9000</value>
  </property>
</configuration>
In both A and B, I have a file named ‘slaves’ with an entry ‘B’ in it, and another file
called ‘masters’ with an entry ‘A’.
I have kept my input file on A. I see the map method process the input file line by line,
but it is all processed on A. Ideally, I would expect that processing to take place on B.
Can anyone highlight where I am going wrong?

regards
rab