hadoop-common-dev mailing list archives

From "Jothi Padmanabhan (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-5758) Task attempt stopped shuffling and hung the job
Date Mon, 11 May 2009 03:52:45 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707876#action_12707876 ]

Jothi Padmanabhan commented on HADOOP-5758:
-------------------------------------------

From the logs, it is clear that the reducers could not fetch the output of map attempt
attempt_200905051023_1155_m_000036. Both reducers (and possibly every other reducer) are waiting
on it. Grepping for this map output in the syslog shows:
{noformat}
/users/jothipn/Desktop $ grep attempt_200905051023_1155_m_000036 ds04-bad-reducer-syslog
2009-05-07 18:58:42,224 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring obsolete output of FAILED map-task: 'attempt_200905051023_1155_m_000036_0'
2009-05-07 18:58:42,228 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring obsolete output of KILLED map-task: 'attempt_200905051023_1155_m_000036_0'
{noformat}
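
For anyone who wants to repeat this check on other reducer syslogs, here is a minimal Python sketch of the same search. It is not part of Hadoop: the function name obsolete_map_attempts and the regular expression are my own, and the sketch assumes each ReduceTask message sits on a single line in the real syslog, as in the two sample lines taken from the grep output above.

```python
import re

# Pattern for the ReduceTask message shown in the grep output.
# Assumption: the message is on one line in the actual syslog file.
OBSOLETE = re.compile(
    r"Ignoring obsolete output of (?P<status>\w+) map-task: "
    r"'(?P<attempt>attempt_\d+_\d+_m_\d+_\d+)'"
)


def obsolete_map_attempts(lines, task_prefix):
    """Return (attempt_id, status) pairs for every attempt of one map task.

    task_prefix is the attempt id without the trailing attempt number,
    e.g. 'attempt_200905051023_1155_m_000036'.
    """
    found = []
    for line in lines:
        match = OBSOLETE.search(line)
        if match and match.group("attempt").startswith(task_prefix + "_"):
            found.append((match.group("attempt"), match.group("status")))
    return found


if __name__ == "__main__":
    # The two lines from the grep output, rejoined onto single lines.
    sample = [
        "2009-05-07 18:58:42,224 INFO org.apache.hadoop.mapred.ReduceTask: "
        "Ignoring obsolete output of FAILED map-task: "
        "'attempt_200905051023_1155_m_000036_0'",
        "2009-05-07 18:58:42,228 INFO org.apache.hadoop.mapred.ReduceTask: "
        "Ignoring obsolete output of KILLED map-task: "
        "'attempt_200905051023_1155_m_000036_0'",
    ]
    for attempt, status in obsolete_map_attempts(
        sample, "attempt_200905051023_1155_m_000036"
    ):
        print(attempt, status)
```

Listing the attempt id next to its status makes it easier to see which attempt each FAILED or KILLED message actually refers to.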


So attempt id 0 of this map task failed and attempt id 1 was killed. Something mysterious
is going on here.
If attempt id 1 was speculative, the framework would not have killed it, since the original attempt
had failed. If it was not speculative, was attempt id 1 killed explicitly? In either case, why did
the framework not try to re-execute this map somewhere else?

Could you let us know if speculative execution was turned on?
If, by any chance, you have the logs for the task attempts attempt_200905051023_1155_m_000036_0
and attempt_200905051023_1155_m_000036_1, could you attach them to this Jira?


> Task attempt stopped shuffling and hung the job
> -----------------------------------------------
>
>                 Key: HADOOP-5758
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5758
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.18.3
>            Reporter: Nathan Marz
>         Attachments: ds04-bad-reducer-syslog, ds31-bad-reducer-syslog
>
>
> I was running a job and one of the reducer task attempts got stuck during the shuffle phase. The percentage complete froze at 33.1%, and the logs for the attempt looked like:
> 2009-04-29 15:21:24,431 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200904250602_2468_r_000024_0: Got 0 new map-outputs & number of known map outputs is 0
> 2009-04-29 15:21:24,431 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200904250602_2468_r_000024_0 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
> 2009-04-29 15:22:24,580 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200904250602_2468_r_000024_0 Need another 1 map output(s) where 0 is already in progress
> 2009-04-29 15:22:24,581 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200904250602_2468_r_000024_0: Got 0 new map-outputs & number of known map outputs is 0
> 2009-04-29 15:22:24,581 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200904250602_2468_r_000024_0 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
> 2009-04-29 15:23:24,692 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200904250602_2468_r_000024_0 Need another 1 map output(s) where 0 is already in progress
> 2009-04-29 15:23:24,693 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200904250602_2468_r_000024_0: Got 0 new map-outputs & number of known map outputs is 0
> 2009-04-29 15:23:24,693 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200904250602_2468_r_000024_0 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
> 2009-04-29 15:24:24,718 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200904250602_2468_r_000024_0 Need another 1 map output(s) where 0 is already in progress
> 2009-04-29 15:24:24,718 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200904250602_2468_r_000024_0: Got 0 new map-outputs & number of known map outputs is 0
> 2009-04-29 15:24:24,719 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200904250602_2468_r_000024_0 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
> 2009-04-29 15:25:24,742 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200904250602_2468_r_000024_0 Need another 1 map output(s) where 0 is already in progress
> 2009-04-29 15:25:24,743 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200904250602_2468_r_000024_0: Got 0 new map-outputs & number of known map outputs is 0
> 2009-04-29 15:25:24,743 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200904250602_2468_r_000024_0 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
> The mappers and other reducers were long finished. When I manually killed the task attempt process after 20 minutes of seeing it frozen, it restarted on another machine and succeeded just fine.
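
The stall in the quoted log has a recognizable shape: the reducer reports that it still needs a map output with none in progress, and the next poll learns of no new outputs. The following Python sketch counts such idle polling rounds in a reducer syslog. It is only an illustration: count_idle_polls and both regular expressions are my own names, built from the message text quoted above, and the sketch assumes each message is on one line in the real file.

```python
import re

# Patterns built from the ReduceTask messages in the quoted log.
NEED = re.compile(
    r"Need another (\d+) map output\(s\) where (\d+) is already in progress"
)
GOT = re.compile(
    r"Got (\d+) new map-outputs & number of known map outputs is (\d+)"
)


def count_idle_polls(lines):
    """Count polls where outputs are still needed but none are known.

    A round is counted when a 'Need another N' line with N > 0 and
    nothing in progress is followed by a 'Got 0 new map-outputs' line
    that also reports 0 known map outputs.
    """
    idle = 0
    needs_more = False
    for line in lines:
        need = NEED.search(line)
        if need:
            needs_more = int(need.group(1)) > 0 and int(need.group(2)) == 0
            continue
        got = GOT.search(line)
        if got:
            if needs_more and got.group(1) == "0" and got.group(2) == "0":
                idle += 1
            needs_more = False
    return idle
```

A count that keeps growing while all mappers have finished would point to the situation described in this issue; what threshold counts as "hung" is a judgment call that the sketch leaves to the caller.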

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

