hadoop-common-user mailing list archives

From Gaurav Agarwal <gauravagarwa...@yahoo.com>
Subject Re: Hadoop 'wordcount' program hanging in the Reduce phase.
Date Thu, 08 Mar 2007 06:38:09 GMT

Problem resolved!!

This looks like a bug in version 0.12.0 (there is a thread in the
developer area about a race condition that results in a hung reduce
job). I moved back to 0.11.2 and the problem went away!
Thanks a lot to all of you, and especially to Jaya for pointing it out.

regards
gaurav


Gaurav Agarwal wrote:
> 
> Hi Everyone!
> I am a new user of Hadoop and am trying to set up a small cluster using Hadoop
> (Release Mar 02) on Ubuntu 6.10 (Edgy), but I am facing some issues doing
> that.
> 
> I am trying to run the Hadoop 'wordcount' example program, which comes
> bundled with it. I can run the program successfully on a single-node
> cluster (that is, using only my local machine). But when I try to run
> the same program on a cluster of two machines, the program hangs in the
> 'reduce' phase.
> 
> 
> Settings:
> 
> Master Node: 192.168.1.150 (dennis-laptop)
> Slave Node: 192.168.1.201 (traal)
> 
> User Account on both Master and Slave is named : Hadoop
> 
> Password-less ssh login to Slave from the Master is working.
> 
> JAVA_HOME is set appropriately in the hadoop-env.sh file on both
> Master/Slave.
> 
> MASTER
> 
> 1) conf/slaves
> localhost
> hadoop@192.168.1.201
> 
> 2) conf/master
> localhost
> 
> 3) conf/hadoop-site.xml
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> 
> <!-- Put site-specific property overrides in this file. -->
> 
> <configuration>
>     <property>
>         <name>fs.default.name</name>
>         <value>192.168.1.150:50000</value>
>     </property>
> 
>     <property>
>         <name>mapred.job.tracker</name>
>         <value>192.168.1.150:50001</value>
>     </property>
> 
>     <property>
>         <name>dfs.replication</name>
>         <value>2</value>
>     </property>
> </configuration>
> 
> SLAVE
> 
> 1) conf/slaves
> localhost
> 
> 2) conf/master
> hadoop@192.168.1.150
> 
> 3) conf/hadoop-site.xml
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> 
> <!-- Put site-specific property overrides in this file. -->
> 
> <configuration>
>     <property>
>         <name>fs.default.name</name>
>         <value>192.168.1.150:50000</value>
>     </property>
> 
>     <property>
>         <name>mapred.job.tracker</name>
>         <value>192.168.1.150:50001</value>
>     </property>
> 
>     <property>
>         <name>dfs.replication</name>
>         <value>2</value>
>     </property>
> </configuration>
> 
> 
> CONSOLE OUTPUT
> bin/hadoop jar hadoop-*-examples.jar wordcount -m 10 -r 2 input output
> 07/03/06 23:17:17 INFO mapred.InputFormatBase: Total input paths to
> process : 1
> 07/03/06 23:17:18 INFO mapred.JobClient: Running job: job_0001
> 07/03/06 23:17:19 INFO mapred.JobClient:  map 0% reduce 0%
> 07/03/06 23:17:29 INFO mapred.JobClient:  map 20% reduce 0%
> 07/03/06 23:17:30 INFO mapred.JobClient:  map 40% reduce 0%
> 07/03/06 23:17:32 INFO mapred.JobClient:  map 80% reduce 0%
> 07/03/06 23:17:33 INFO mapred.JobClient:  map 100% reduce 0%
> 07/03/06 23:17:42 INFO mapred.JobClient:  map 100% reduce 3%
> 07/03/06 23:17:43 INFO mapred.JobClient:  map 100% reduce 5%
> 07/03/06 23:17:44 INFO mapred.JobClient:  map 100% reduce 8%
> 07/03/06 23:17:52 INFO mapred.JobClient:  map 100% reduce 10%
> 07/03/06 23:17:53 INFO mapred.JobClient:  map 100% reduce 13%
> 07/03/06 23:18:03 INFO mapred.JobClient:  map 100% reduce 16%
> 
> 
> The only exception I can see from the log files is in the 'TaskTracker'
> log file:
> 
> 2007-03-06 23:17:32,214 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000000_0 Copying task_0001_m_000002_0 output from traal.
> 2007-03-06 23:17:32,221 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000000_0 Copying task_0001_m_000001_0 output from dennis-laptop.
> 2007-03-06 23:17:32,368 WARN org.apache.hadoop.mapred.TaskRunner: task_0001_r_000000_0 copy failed: task_0001_m_000002_0 from traal
> 2007-03-06 23:17:32,368 WARN org.apache.hadoop.mapred.TaskRunner: java.io.IOException: File /tmp/hadoop-hadoop/mapred/local/task_0001_r_000000_0/map_2.out-0 not created
>     at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.copyOutput(ReduceTaskRunner.java:301)
>     at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:262)
> 
> 2007-03-06 23:17:32,369 WARN org.apache.hadoop.mapred.TaskRunner: task_0001_r_000000_0 adding host traal to penalty box, next contact in 99 seconds
> 
> I am attaching the master log files just in case anyone wants to check
> them.
> 
> Any help will be greatly appreciated! 
> 
> -gaurav
> 
> 
> http://www.nabble.com/file/7013/hadoop-hadoop-tasktracker-dennis-laptop.log
> hadoop-hadoop-tasktracker-dennis-laptop.log
> http://www.nabble.com/file/7012/hadoop-hadoop-jobtracker-dennis-laptop.log
> hadoop-hadoop-jobtracker-dennis-laptop.log
> http://www.nabble.com/file/7011/hadoop-hadoop-namenode-dennis-laptop.log
> hadoop-hadoop-namenode-dennis-laptop.log
> http://www.nabble.com/file/7010/hadoop-hadoop-datanode-dennis-laptop.log
> hadoop-hadoop-datanode-dennis-laptop.log
> 
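
For anyone hitting the same symptom, the "copy failed" warnings quoted above are the thing to look for in the TaskTracker log. The following is only a rough Python sketch for pulling them out of a log file or a pasted excerpt. It is not part of Hadoop; the function name find_copy_failures and the regular expression are assumptions based solely on the log lines shown in this thread, and other Hadoop versions may format these messages differently.

```python
import re

# Pattern for the "copy failed" warning, modelled only on the excerpt quoted
# in this thread (assumption: other releases may word the message differently).
COPY_FAILED = re.compile(
    r"WARN org\.apache\.hadoop\.mapred\.TaskRunner:\s+"
    r"(?P<reduce>task_\d+_r_\d+_\d+) copy failed: "
    r"(?P<map>task_\d+_m_\d+_\d+) from (?P<host>\S+)"
)


def find_copy_failures(log_text):
    """Return (reduce_task, map_task, host) for each failed map-output copy."""
    # Drop leading "> " quote markers so excerpts pasted from mail also work.
    unquoted = re.sub(r"^>\s?", "", log_text, flags=re.MULTILINE)
    # Mail clients wrap long log lines; collapse all whitespace runs so a
    # message split across two lines still matches.
    flat = re.sub(r"\s+", " ", unquoted)
    return [
        (m.group("reduce"), m.group("map"), m.group("host"))
        for m in COPY_FAILED.finditer(flat)
    ]


if __name__ == "__main__":
    sample = (
        "> 2007-03-06 23:17:32,368 WARN org.apache.hadoop.mapred.TaskRunner:\n"
        "> task_0001_r_000000_0 copy failed: task_0001_m_000002_0 from traal\n"
    )
    for reduce_task, map_task, host in find_copy_failures(sample):
        print(reduce_task, "could not fetch", map_task, "from", host)
```

If the same host keeps showing up in the results while the reduce percentage stops advancing, that matches the hang described in this thread.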

-- 
View this message in context: http://www.nabble.com/Hadoop-%27wordcount%27-program-hanging-in-the-Reduce-phase.-tf3360661.html#a9369365
Sent from the Hadoop Users mailing list archive at Nabble.com.

