hadoop-mapreduce-user mailing list archives

From Kai Ju Liu <ka...@tellapart.com>
Subject Re: MapReduce jobs hanging or failing near completion
Date Tue, 19 Jul 2011 19:41:50 GMT
Hi Marcos. The issue appears to be the following. A reduce task is unable to
fetch results from a map task on HDFS. The map task is re-run, but the new
attempt is unable to open a spill file that it needs in order to finish. Here
is the error from the second map attempt:

java.io.FileNotFoundException:
/mnt/hadoop/mapred/local/taskTracker/hadoop/jobcache/job_201107171642_0560/attempt_201107171642_0560_m_000292_1/output/spill0.out
	at org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:176)
	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:456)
	at org.apache.hadoop.mapred.Merger$Segment.init(Merger.java:205)
	at org.apache.hadoop.mapred.Merger$Segment.access$100(Merger.java:165)
	at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:418)
	at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:381)
	at org.apache.hadoop.mapred.Merger.merge(Merger.java:77)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1547)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1179)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
	at org.apache.hadoop.mapred.Child.main(Child.java:262)
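
To narrow this down on the affected node, something like the following might
help. It is only a minimal sketch and not part of Hadoop: the names
parse_spill_path and deepest_existing_dir are made up for illustration, and
the directory layout is assumed from the single trace above. It pulls the job
and attempt IDs out of the missing path and reports how far up the tree the
directories still exist, which could show whether only the spill file is gone
or the whole attempt or job directory was removed.

```python
import os
import re

# Layout assumed from the trace above (an assumption, not a Hadoop API):
# .../jobcache/<job id>/<attempt id>/output/<file name>
SPILL_RE = re.compile(
    r"/jobcache/(?P<job>job_\d+_\d+)"
    r"/(?P<attempt>attempt_\d+_\d+_[mr]_\d+_\d+)"
    r"/output/(?P<name>[^/]+)$"
)


def parse_spill_path(path):
    """Return the job ID, attempt ID, and file name found in a spill path."""
    match = SPILL_RE.search(path)
    if match is None:
        raise ValueError("path does not match the assumed layout: %s" % path)
    return match.groupdict()


def deepest_existing_dir(path):
    """Walk up from the file's directory to the first directory that exists."""
    current = os.path.dirname(path)
    while current and not os.path.isdir(current):
        parent = os.path.dirname(current)
        if parent == current:
            # Reached the top without finding an existing directory.
            return None
        current = parent
    return current or None


if __name__ == "__main__":
    missing = (
        "/mnt/hadoop/mapred/local/taskTracker/hadoop/jobcache/"
        "job_201107171642_0560/attempt_201107171642_0560_m_000292_1/"
        "output/spill0.out"
    )
    print(parse_spill_path(missing))
    print("deepest existing directory:", deepest_existing_dir(missing))
```

If the deepest existing directory turns out to be the jobcache directory or
above, the whole job directory is gone on that node; if it is the attempt's
output directory, only the spill file itself is missing.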

I have been having general difficulties with HDFS on EBS, which
pointed me in this direction. Does this sound like a possible
hypothesis to you? Thanks!

Kai Ju

P.S. I am migrating off of HDFS on EBS, so I will post back with
further results as soon as I have them.

On Thu, Jul 7, 2011 at 6:36 PM, Marcos Ortiz <mlortiz@uci.cu> wrote:

>
>
> On 7/7/2011 8:43 PM, Kai Ju Liu wrote:
>
>> Over the past week or two, I've run into an issue where MapReduce jobs
>> hang or fail near completion. The percent completion of both map and
>> reduce tasks is often reported as 100%, but the actual number of
>> completed tasks is less than the total number. It appears that either
>> tasks backtrack and need to be restarted or the last few reduce tasks
>> hang interminably on the copy step.
>>
>> In certain cases, the jobs actually complete. In other cases, I can't
>> wait long enough and have to kill the job manually.
>>
>> My Hadoop cluster is hosted in EC2 on instances of type c1.xlarge with 4
>> attached EBS volumes. The instances run Ubuntu 10.04.1 with the
>> 2.6.32-309-ec2 kernel, and I'm currently using Cloudera's CDH3u0
>> distribution. Has anyone experienced similar behavior in their clusters,
>> and if so, had any luck resolving it? Thanks!
>
> Can you post your NN and DN log files here?
> Regards
>
>> Kai Ju
>
> --
> Marcos Luís Ortíz Valmaseda
>  Software Engineer (UCI)
>  Linux User # 418229
>  http://marcosluis2186.posterous.com
>  http://twitter.com/marcosluis2186
>
>
