hadoop-common-user mailing list archives

From "Brian Wedel" <brian.we...@gmail.com>
Subject Re: Hadoop 'wordcount' program hanging in the Reduce phase.
Date Wed, 07 Mar 2007 19:57:53 GMT
I am experimenting on a small cluster as well (4 machines) and I had
success with the following configuration:

 - configuration files on both the master and slaves are the same
 - in the master/slave lists I only used the IP address (not
localhost) and omitted the user prefix (e.g. hadoop@)
 - in the fs.default.name configuration variable, use
hdfs://<host>:<port> (see the sketch below -- I don't know if this is
strictly necessary, but it seems you can specify other types of
filesystems and I'm not sure which is the default)
 - use the 0.12.0 release - I was using 0.11.2 and was getting some odd
errors that disappeared when I upgraded
 - I don't run a datanode daemon on the same machine as the namenode --
this was a problem when I was trying the hadoop-streaming contributed
package for scripting.  Not sure if it matters for the examples

This configuration worked for me.
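
For concreteness, here is a rough sketch of the layout I am describing,
using the IPs from your mail purely as placeholders (I have not tested
this exact setup, so adjust hosts, ports and file names to your tree):

conf/master (on the master -- master list, IP only, no user@):
192.168.1.150

conf/slaves (on the master -- slave list, IP only, no user@, and no
localhost since I keep the datanode off the namenode machine):
192.168.1.201

conf/hadoop-site.xml (relevant part, identical on master and slave):
<property>
     <name>fs.default.name</name>
     <value>hdfs://192.168.1.150:50000</value>
</property>
<property>
     <name>mapred.job.tracker</name>
     <value>192.168.1.150:50001</value>
</property>

After changing the configuration, stop and restart the daemons from the
master (bin/stop-all.sh, then bin/start-all.sh) so the new settings are
picked up.
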
-Brian

On 3/7/07, Gaurav Agarwal <gauravagarwal_4@yahoo.com> wrote:
>
> Hi Richard,
>
> I am facing this error very consistently. I have tried another nightly
> build (4 Mar), but that gave the same exception.
>
> thanks,
> gaurav
>
>
>
> Richard Yang-3 wrote:
> >
> > Hi Gaurav,
> >
> > Does this error always happen??
> > Our settings are similar.
> > Mine contains some error messages about IOExceptions: not able to obtain
> > certain blocks, not able to create a new block.  Although the program hung
> > at times, in most cases it was able to complete with correct results.
> > Btw, I am running the grep sample program on version 0.11.2.
> >
> > Best Regards
> >
> > Richard Yang
> > richardyang@richardyang.net
> > kusanagiyang@gmail.com
> >
> >
> > -----Original Message-----
> > From: Gaurav Agarwal [mailto:gauravagarwal_4@yahoo.com]
> > Sent: Wednesday, March 07, 2007 12:22 AM
> > To: hadoop-user@lucene.apache.org
> > Subject: Hadoop 'wordcount' program hanging in the Reduce phase.
> >
> >
> > Hi Everyone!
> > I am a new user to Hadoop and am trying to set up a small cluster using
> > Hadoop, but I am facing some issues doing that.
> >
> > I am trying to run the Hadoop 'wordcount' example program which comes
> > bundled with it. I am able to successfully run the program on a
> > single-node cluster (that is, using my local machine only). But when I try
> > to run the same program on a cluster of two machines, the program hangs in
> > the 'reduce' phase.
> >
> >
> > Settings:
> >
> > Master Node: 192.168.1.150 (dennis-laptop)
> > Slave Node: 192.168.1.201 (traal)
> >
> > The user account on both the Master and the Slave is named: hadoop
> >
> > Password-less ssh login to Slave from the Master is working.
> >
> > JAVA_HOME is set appropriately in the hadoop-env.sh file on both
> > Master/Slave.
> >
> > MASTER
> >
> > 1) conf/slaves
> > localhost
> > hadoop@192.168.1.201
> >
> > 2) conf/master
> > localhost
> >
> > 3) conf/hadoop-site.xml
> > <?xml version="1.0"?>
> > <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> >
> > <!-- Put site-specific property overrides in this file. -->
> >
> > <configuration>
> > <property>
> >          <name>fs.default.name</name>
> >          <value>192.168.1.150:50000</value>
> >     </property>
> >
> >     <property>
> >          <name>mapred.job.tracker</name>
> >          <value>192.168.1.150:50001</value>
> >      </property>
> >
> >     <property>
> >          <name>dfs.replication</name>
> >          <value>2</value>
> >     </property>
> > </configuration>
> >
> > SLAVE
> >
> > 1) conf/slaves
> > localhost
> >
> > 2) conf/master
> > hadoop@192.168.1.150
> >
> > 3) conf/hadoop-site.xml
> > <?xml version="1.0"?>
> > <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> >
> > <!-- Put site-specific property overrides in this file. -->
> >
> > <configuration>
> > <property>
> >          <name>fs.default.name</name>
> >          <value>192.168.1.150:50000</value>
> >     </property>
> >
> >     <property>
> >          <name>mapred.job.tracker</name>
> >          <value>192.168.1.150:50001</value>
> >      </property>
> >
> >     <property>
> >          <name>dfs.replication</name>
> >          <value>2</value>
> >     </property>
> > </configuration>
> >
> >
> > CONSOLE OUTPUT
> > bin/hadoop jar hadoop-*-examples.jar wordcount -m 10 -r 2 input output
> > 07/03/06 23:17:17 INFO mapred.InputFormatBase: Total input paths to process : 1
> > 07/03/06 23:17:18 INFO mapred.JobClient: Running job: job_0001
> > 07/03/06 23:17:19 INFO mapred.JobClient:  map 0% reduce 0%
> > 07/03/06 23:17:29 INFO mapred.JobClient:  map 20% reduce 0%
> > 07/03/06 23:17:30 INFO mapred.JobClient:  map 40% reduce 0%
> > 07/03/06 23:17:32 INFO mapred.JobClient:  map 80% reduce 0%
> > 07/03/06 23:17:33 INFO mapred.JobClient:  map 100% reduce 0%
> > 07/03/06 23:17:42 INFO mapred.JobClient:  map 100% reduce 3%
> > 07/03/06 23:17:43 INFO mapred.JobClient:  map 100% reduce 5%
> > 07/03/06 23:17:44 INFO mapred.JobClient:  map 100% reduce 8%
> > 07/03/06 23:17:52 INFO mapred.JobClient:  map 100% reduce 10%
> > 07/03/06 23:17:53 INFO mapred.JobClient:  map 100% reduce 13%
> > 07/03/06 23:18:03 INFO mapred.JobClient:  map 100% reduce 16%
> >
> >
> > The only exception I can see from the log files is in the 'TaskTracker'
> > log
> > file:
> >
> > 2007-03-06 23:17:32,214 INFO org.apache.hadoop.mapred.TaskRunner:
> > task_0001_r_000000_0 Copying task_0001_m_000002_0 output from traal.
> > 2007-03-06 23:17:32,221 INFO org.apache.hadoop.mapred.TaskRunner:
> > task_0001_r_000000_0 Copying task_0001_m_000001_0 output from
> > dennis-laptop.
> > 2007-03-06 23:17:32,368 WARN org.apache.hadoop.mapred.TaskRunner:
> > task_0001_r_000000_0 copy failed: task_0001_m_000002_0 from traal
> > 2007-03-06 23:17:32,368 WARN org.apache.hadoop.mapred.TaskRunner:
> > java.io.IOException: File
> > /tmp/hadoop-hadoop/mapred/local/task_0001_r_000000_0/map_2.out-0 not
> > created
> > at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.copyOutput(ReduceTaskRunner.java:301)
> > at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:262)
> >
> > 2007-03-06 23:17:32,369 WARN org.apache.hadoop.mapred.TaskRunner:
> > task_0001_r_000000_0 adding host traal to penalty box, next contact in 99
> > seconds
> >
> > I am attaching the master log files just in case anyone wants to check
> > them.
> >
> > Any help will be greatly appreciated!
> >
> > -gaurav
> >
> > http://www.nabble.com/file/7013/hadoop-hadoop-tasktracker-dennis-laptop.log
> > hadoop-hadoop-tasktracker-dennis-laptop.log
> > http://www.nabble.com/file/7012/hadoop-hadoop-jobtracker-dennis-laptop.log
> > hadoop-hadoop-jobtracker-dennis-laptop.log
> > http://www.nabble.com/file/7011/hadoop-hadoop-namenode-dennis-laptop.log
> > hadoop-hadoop-namenode-dennis-laptop.log
> > http://www.nabble.com/file/7010/hadoop-hadoop-datanode-dennis-laptop.log
> > hadoop-hadoop-datanode-dennis-laptop.log
