hadoop-common-user mailing list archives

From "Devaraj Das" <d...@yahoo-inc.com>
Subject RE: hadoop hang on reduce
Date Wed, 12 Sep 2007 01:39:44 GMT
This looks like a problem locating the map output file. You should also
check the tasktracker log (on the map side) for exceptions at around the
time (2007-09-11 14:28:07) when the reduce task reported the exception you
mentioned. In short, you should see exceptions in both the tasktracker log
and the task log at around the same time.
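
One way to line up the two logs is to pull out the WARN/ERROR entries that fall near that timestamp in each file. The following is only a minimal sketch: it assumes the log4j-style "date time,millis LEVEL ..." line prefix seen in the logs quoted below, and the function name `lines_near`, the 30-second window, and the level filter are hypothetical choices, not part of Hadoop.

```python
from datetime import datetime, timedelta

# Assumed line prefix, e.g. "2007-09-11 14:28:07,451 WARN org.apache..."
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def lines_near(lines, when, window_seconds=30, levels=("WARN", "ERROR")):
    """Return log entries within window_seconds of `when` at the given levels.

    Lines without a timestamp (such as stack trace frames) are kept only
    when they directly follow an entry that matched.
    """
    window = timedelta(seconds=window_seconds)
    hits = []
    keeping = False
    for raw in lines:
        line = raw.rstrip("\n")
        parts = line.split(" ", 3)
        stamp = None
        if len(parts) >= 3:
            try:
                stamp = datetime.strptime(parts[0] + " " + parts[1], TS_FORMAT)
            except ValueError:
                stamp = None  # not a timestamped entry
        if stamp is None:
            # Continuation of the previous entry; keep it if that entry matched.
            if keeping and line.strip():
                hits.append(line)
            continue
        keeping = abs(stamp - when) <= window and parts[2] in levels
        if keeping:
            hits.append(line)
    return hits


if __name__ == "__main__":
    import sys

    # Time of the reduce-side failure reported in the quoted task log.
    when = datetime.strptime("2007-09-11 14:28:07", "%Y-%m-%d %H:%M:%S")
    for path in sys.argv[1:]:
        with open(path) as fh:
            for hit in lines_near(fh, when):
                print(path + ": " + hit)
```

Run it with the tasktracker log and the task log as arguments; if both files print entries for the same few seconds, those are the ones to compare.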

> -----Original Message-----
> From: Xiaoguang Qi [mailto:xiq204@gmail.com] 
> Sent: Tuesday, September 11, 2007 11:46 AM
> To: hadoop-user@lucene.apache.org
> Subject: Re: hadoop hang on reduce
> 
> Thanks for your reply!
> 
> I looked at the log file as you suggested. Here's the error I found:
> 
> 2007-09-11 14:28:07,451 INFO org.apache.hadoop.mapred.ReduceTask:
> task_0002_r_000000_0 Got 3 known map output location(s); scheduling...
> 2007-09-11 14:28:07,452 INFO org.apache.hadoop.mapred.ReduceTask:
> task_0002_r_000000_0 Copying task_0002_m_000001_0 output from 
> (*machine name*).
> 2007-09-11 14:28:07,475 WARN org.apache.hadoop.mapred.ReduceTask:
> task_0002_r_000000_0 copy failed: task_0002_m_000001_0 from (*machine
> name*)
> 2007-09-11 14:28:07,477 WARN org.apache.hadoop.mapred.ReduceTask:
> java.io.IOException: Server returned HTTP response code: 500 for URL:
> http://(*machine name*):50060/mapOutput?map=task_0002_m_000001_0&reduce=0
>         at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1174)
>         at org.apache.hadoop.mapred.MapOutputLocation.getFile(MapOutputLocation.java:206)
>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:680)
>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:641)
> 
> 2007-09-11 14:28:07,480 INFO org.apache.hadoop.mapred.ReduceTask:
> task_0002_r_000000_0 Scheduled 1 of 3 known outputs (0 slow hosts and
> 2 dup hosts)
> 2007-09-11 14:28:07,480 WARN org.apache.hadoop.mapred.ReduceTask:
> task_0002_r_000000_0 adding host (*machine name*) to penalty 
> box, next contact in 278 seconds
> 2007-09-11 14:28:07,480 INFO org.apache.hadoop.mapred.ReduceTask:
> task_0002_r_000000_0 Need 3 map output(s)
> 
> 
> 
> 
> On 9/10/07, Devaraj Das <ddas@yahoo-inc.com> wrote:
> > Could you take a look at the task logs
> > $HADOOP_LOG_DIR/logs/<reduce-task-id>/syslog/part* . That will contain
> > info on what's going wrong. If it is consistently happening, there
> > most likely is some misconfig. Let us know what exceptions, etc. you
> > see there.
> >
> > > -----Original Message-----
> > > From: Xiaoguang Qi [mailto:xiq204@gmail.com]
> > > Sent: Thursday, September 06, 2007 8:51 PM
> > > To: hadoop-user@lucene.apache.org
> > > Subject: hadoop hang on reduce
> > >
> > > Hi, all --
> > >
> > > I was trying to configure hadoop to work on two machines. The dfs
> > > seems to work fine. But when I tried the 'grep' example in
> > > 'hadoop-0.13.1-examples.jar', it always hangs once the map tasks
> > > finish and the reduce tasks start. I thought this could be a
> > > network problem, so I reconfigured it to run on a single machine,
> > > but still in distributed mode.
> > > The problem remains. Here are the configuration files.
> > >
> > > ========== hadoop-site.xml ==========
> > > <?xml version="1.0"?>
> > > <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> > >
> > > <!-- Put site-specific property overrides in this file. -->
> > >
> > > <configuration>
> > >
> > >   <property>
> > >     <name>fs.default.name</name>
> > >     <value>(masked machine name):9000</value>
> > >   </property>
> > >
> > >   <property>
> > >     <name>mapred.job.tracker</name>
> > >     <value>(masked machine name):9001</value>
> > >   </property>
> > >
> > >   <property>
> > >     <name>dfs.replication</name>
> > >     <value>1</value>
> > >   </property>
> > >
> > >   <property>
> > >     <name>dfs.name.dir</name>
> > >     <value>dfs-space/dfs/name</value>
> > >   </property>
> > >
> > >   <property>
> > >     <name>dfs.data.dir</name>
> > >     <value>dfs-space/dfs/data</value>
> > >   </property>
> > >
> > >   <property>
> > >     <name>mapred.local.dir</name>
> > >     <value>dfs-space/mapred/local</value>
> > >   </property>
> > >
> > > </configuration>
> > >
> > >
> > > ========== mapred-default.xml ==========
> > > <?xml version="1.0"?>
> > > <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> > >
> > > <!-- Put mapred-specific property overrides in this file. -->
> > >
> > > <configuration>
> > >   <property>
> > >     <name>mapred.map.tasks</name>
> > >     <value>20</value>
> > >   </property>
> > >
> > >   <property>
> > >     <name>mapred.reduce.tasks</name>
> > >     <value>1</value>
> > >   </property>
> > > </configuration>
> > >
> > >
> > > When I run the following command:
> > > bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
> > >
> > > here's what the screen shows:
> > >
> > > 07/09/06 23:10:20 INFO mapred.FileInputFormat: Total input paths to process : 3
> > > 07/09/06 23:10:20 INFO mapred.JobClient: Running job: job_0001
> > > 07/09/06 23:10:21 INFO mapred.JobClient:  map 0% reduce 0%
> > > 07/09/06 23:10:32 INFO mapred.JobClient:  map 4% reduce 0%
> > > 07/09/06 23:10:33 INFO mapred.JobClient:  map 13% reduce 0%
> > > 07/09/06 23:10:34 INFO mapred.JobClient:  map 18% reduce 0%
> > > 07/09/06 23:10:35 INFO mapred.JobClient:  map 22% reduce 0%
> > > 07/09/06 23:10:36 INFO mapred.JobClient:  map 27% reduce 0%
> > > 07/09/06 23:10:37 INFO mapred.JobClient:  map 36% reduce 0%
> > > 07/09/06 23:10:39 INFO mapred.JobClient:  map 45% reduce 0%
> > > 07/09/06 23:10:40 INFO mapred.JobClient:  map 49% reduce 0%
> > > 07/09/06 23:10:41 INFO mapred.JobClient:  map 54% reduce 0%
> > > 07/09/06 23:10:42 INFO mapred.JobClient:  map 59% reduce 0%
> > > 07/09/06 23:10:43 INFO mapred.JobClient:  map 68% reduce 0%
> > > 07/09/06 23:10:45 INFO mapred.JobClient:  map 77% reduce 0%
> > > 07/09/06 23:10:47 INFO mapred.JobClient:  map 86% reduce 0%
> > > 07/09/06 23:10:49 INFO mapred.JobClient:  map 95% reduce 0%
> > > 07/09/06 23:10:50 INFO mapred.JobClient:  map 100% reduce 0%
> > >
> > > Then the program hung for a long time until I killed it.
> > > Here's what I found in the 'tasktracker' log file:
> > >
> > > ......
> > > 2007-09-06 22:54:52,569 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction: task_0001_m_000021_0
> > > 2007-09-06 22:54:53,942 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_m_000019_0 1.0% hdfs://(masked machine name):9000/user/(masked user name)/input/hadoop-default.xml:26068+1018
> > > 2007-09-06 22:54:53,944 INFO org.apache.hadoop.mapred.TaskTracker: Task task_0001_m_000019_0 is done.
> > > 2007-09-06 22:54:54,040 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_m_000021_0 1.0% hdfs://(masked machine name):9000/user/(masked user name)/input/hadoop-site.xml:0+178
> > > 2007-09-06 22:54:54,043 INFO org.apache.hadoop.mapred.TaskTracker: Task task_0001_m_000021_0 is done.
> > > 2007-09-06 22:54:54,059 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction: task_0001_r_000000_0
> > > 2007-09-06 22:54:55,935 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_r_000000_0 0.0% reduce > copy >
> > > 2007-09-06 22:54:56,939 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_r_000000_0 0.0% reduce > copy >
> > > 2007-09-06 22:54:57,942 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_r_000000_0 0.0% reduce > copy >
> > > 2007-09-06 22:54:58,947 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_r_000000_0 0.0% reduce > copy >
> > > ......
> > >
> > > The last line repeats until the end of the log file.
> > >
> > > Does anyone have an idea what the problem is? Any suggestions are
> > > appreciated!
> > >
> >
> >
> 

