hadoop-mapreduce-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Arun C Murthy (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3184) Improve handling of fetch failures when a tasktracker is not responding on HTTP
Date Fri, 14 Oct 2011 01:38:12 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13127223#comment-13127223
] 

Arun C Murthy commented on MAPREDUCE-3184:
------------------------------------------

bq. Now we should discuss whether there are important use cases for the health script where
it would be problematic to make it call lostTT – any cases where admins want to use the
health script to prevent new scheduling but not fail the stuff that's currently running?

The point of the health-script was to quickly determine node faults. In that case, it seems
very reasonable to assume that you lose tasks from a 'bad node' and re-run them. I don't see
the point of trying to keep running tasks around when a node is deemed bad.
                
> Improve handling of fetch failures when a tasktracker is not responding on HTTP
> -------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3184
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3184
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: jobtracker
>    Affects Versions: 0.20.205.0
>            Reporter: Todd Lipcon
>
> On a 100 node cluster, we had an issue where one of the TaskTrackers was hit by MAPREDUCE-2386
and stopped responding to fetches. The behavior observed was the following:
> - every reducer would try to fetch the same map task, and fail after ~13 minutes.
> - At that point, all reducers would report this failed fetch to the JT for the same task,
and the task would be re-run.
> - Meanwhile, the reducers would move on to the next map task that ran on the TT, and
hang for another 13 minutes.
> The job essentially made no progress for hours, as each map task that ran on the bad
node was serially marked failed.
> To combat this issue, we should introduce a second type of failed fetch notification,
used when the TT does not respond at all (ie SocketTimeoutException, etc). These fetch failure
notifications should count against the TT at large, rather than a single task. If more than
half of the reducers report such an issue for a given TT, then all of the tasks from that
TT should be re-run.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

Mime
View raw message