hadoop-common-dev mailing list archives

From "Owen O'Malley (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-750) race condition on stalled map output fetches
Date Tue, 28 Nov 2006 00:36:22 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-750?page=comments#action_12453764 ] 
            
Owen O'Malley commented on HADOOP-750:
--------------------------------------

The thread call stacks look like:

Thread 1525 (Thread-1403):
  State: WAITING
  Blocked count: 126
  Waited count: 1
  Waiting on java.util.ArrayList@943dc4
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:474)
    org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:207)

Thread 102 (Thread-89):
  State: TIMED_WAITING
  Blocked count: 7
  Waited count: 0
  Stack:
    java.lang.Thread.sleep(Native Method)
    org.apache.hadoop.mapred.ReduceTaskRunner$MapCopyLeaseChecker.run(ReduceTaskRunner.java:303)

Thread 79 (Thread-66):
  State: WAITING
  Blocked count: 151
  Waited count: 26481
  Waiting on java.util.ArrayList@14eaec9
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:474)
    org.apache.hadoop.mapred.ReduceTaskRunner.getCopyResult(ReduceTaskRunner.java:527)
    org.apache.hadoop.mapred.ReduceTaskRunner.prepare(ReduceTaskRunner.java:453)
    org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:120)
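
For readers who want to capture a dump in a similar layout, the sketch below uses the standard java.lang.management API. It is only an illustration: the class name ThreadDumpSketch and its methods are hypothetical, and the message does not say how the dump above was actually produced.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of Hadoop: prints live threads in a layout
// similar to the stacks quoted above.
public class ThreadDumpSketch {

    /** Formats one thread: id, name, state, counters, lock and stack frames. */
    static String format(ThreadInfo info) {
        StringBuilder sb = new StringBuilder();
        sb.append("Thread ").append(info.getThreadId())
          .append(" (").append(info.getThreadName()).append("):\n");
        sb.append("  State: ").append(info.getThreadState()).append('\n');
        sb.append("  Blocked count: ").append(info.getBlockedCount()).append('\n');
        sb.append("  Waited count: ").append(info.getWaitedCount()).append('\n');
        // getLockName() is null unless the thread is blocked or waiting on an object.
        if (info.getLockName() != null) {
            sb.append("  Waiting on ").append(info.getLockName()).append('\n');
        }
        sb.append("  Stack:\n");
        for (StackTraceElement frame : info.getStackTrace()) {
            sb.append("    ").append(frame).append('\n');
        }
        return sb.toString();
    }

    /** Dumps every live thread; monitor and synchronizer details are skipped. */
    static String dumpAll() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : bean.dumpAllThreads(false, false)) {
            if (info != null) {
                sb.append(format(info)).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dumpAll());
    }
}
```

In a dump like this, a thread parked in Object.wait() shows up as WAITING together with the object it waits on, which is how the two different ArrayList monitors above can be told apart.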


> race condition on stalled map output fetches
> --------------------------------------------
>
>                 Key: HADOOP-750
>                 URL: http://issues.apache.org/jira/browse/HADOOP-750
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.8.0
>            Reporter: Owen O'Malley
>         Assigned To: Owen O'Malley
>             Fix For: 0.9.0
>
>
> I've seen reduces getting killed because of a race condition in the ReduceTaskRunner.
> In the logs it looks like:
> 2006-11-27 08:40:44,795 WARN org.apache.hadoop.mapred.TaskRunner: Map output copy stalled on http://kry2296.inktomisearch.com:7030/mapOutput?map=task_0001_m_015626_0
> ...
> 2006-11-27 09:16:41,361 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000658_0 Need 52 map output(s)
> 2006-11-27 09:16:41,361 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000658_0 Got 39 known map output location(s); scheduling...
> 2006-11-27 09:16:41,361 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000658_0 Scheduled 0 of 39 known outputs (0 slow hosts and 39 dup hosts)
> ...
> 2006-11-27 09:16:47,071 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_r_000658_0 0.3328575% reduce > copy (28679 of 28720 at 0.76 MB/s) >
> ...
> 2006-11-27 09:16:47,338 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000658_0 done copying task_0001_m_015462_0 output from node1
> ...
> 2006-11-27 09:36:51,398 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_r_000658_0: Task failed to report status for 1204 seconds. Killing.
> Basically, the handling of the stall has a race condition that leaves the fetcher in
> a bad state. At the end of the fetch, all of the tasks finish and their results never get
> handled. When the thread times out, all of the map output copiers are waiting for things to
> fetch and the prepare thread is waiting for results.
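
The stacks suggest a two-list handoff: copier threads wait on one list for work, and the prepare thread waits on another for results. The sketch below is a generic illustration of that wait/notify pattern, not the ReduceTaskRunner code; the class CopyHandoffSketch, its methods, the inFlight counter and the ID "map_0" are all hypothetical. It shows one defensive option, assuming the consumer can track outstanding work: use a timed wait and return when nothing is in flight, so a lost result cannot leave both sides waiting forever.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a scheduled-list / result-list handoff.
public class CopyHandoffSketch {
    private final List<String> scheduled = new ArrayList<String>();
    private final List<String> results = new ArrayList<String>();
    private int inFlight = 0; // copies scheduled but not yet reported; guarded by results

    /** Hands one map output ID to the copiers. */
    void schedule(String mapId) {
        synchronized (results) {
            inFlight++;
        }
        synchronized (scheduled) {
            scheduled.add(mapId);
            scheduled.notifyAll();
        }
    }

    /** Copier side: waits for work, then always reports a result. */
    void copyOne() throws InterruptedException {
        String mapId;
        synchronized (scheduled) {
            while (scheduled.isEmpty()) {
                scheduled.wait();
            }
            mapId = scheduled.remove(0);
        }
        synchronized (results) {
            results.add("copied " + mapId);
            inFlight--;
            results.notifyAll();
        }
    }

    /** Consumer side: returns a result, or null if none is pending or the wait times out. */
    String getResult(long timeoutMs) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        synchronized (results) {
            while (results.isEmpty()) {
                if (inFlight == 0) {
                    return null; // nothing outstanding, so waiting would never end
                }
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    return null;
                }
                results.wait(remaining);
            }
            return results.remove(0);
        }
    }

    /** Runs one copier and one consumer; returns the single result. */
    static String demo() {
        final CopyHandoffSketch sketch = new CopyHandoffSketch();
        Thread copier = new Thread(new Runnable() {
            public void run() {
                try {
                    sketch.copyOne();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        copier.start();
        sketch.schedule("map_0");
        try {
            String result = sketch.getResult(5000);
            copier.join();
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With an untimed wait and no in-flight check, a consumer that misses or discards a result would block in the same way as the prepare thread in the stack above, while idle copiers block on the other list.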

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
