hadoop-common-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bryan Duxbury <br...@rapleaf.com>
Subject Super-long reduce task timeouts in hadoop-0.19.0
Date Fri, 20 Feb 2009 18:41:39 GMT
I noticed some really odd behavior today while reviewing the job  
history of some of our jobs. Our Ganglia graphs showed really long  
periods of inactivity across the entire cluster, which should  
definitely not be the case - we have a really long string of jobs in  
our workflow that should execute one after another. I figured out  
which jobs were running during those periods of inactivity, and  
discovered that almost all of them had 4-5 failed reduce tasks, with  
the reason for failure being something like:

Task attempt_200902061117_3382_r_000038_0 failed to report status for  
1282 seconds. Killing!

The actual timeout reported varies from 700-5000 seconds. Virtually  
all of our longer-running jobs were affected by this problem. The  
period of inactivity on the cluster seems to correspond to the amount  
of time the job waited for these reduce tasks to fail.

I checked out the tasktracker log for the machines with timed-out  
reduce tasks looking for something that might explain the problem,  
but the only thing I came up with that actually referenced the failed  
task was this log message, which was repeated many times:

2009-02-19 22:48:19,380 INFO org.apache.hadoop.mapred.TaskTracker:  
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find  
attempt_200902061117_3388_r_000066_0/output/file.out in any of the  
configured local directories

I'm not sure what this means; can anyone shed some light on this  

Further confusing the issue, on the affected machines, I looked in  
logs/userlogs/<task id>, and to my surprise, the directory and log  
files existed, and the syslog file seemed to contain logs of a  
perfectly good reduce task!

Overall, this seems like a pretty critical bug. It's consuming up to  
50% of the runtime of our jobs in some instances, killing our  
throughput. At the very least, it seems like the reduce task timeout  
period should be MUCH shorter than the current 10-20 minutes.


View raw message