hadoop-common-dev mailing list archives

From "Amar Kamat (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-6014) Improvements to Global Black-listing of TaskTrackers
Date Fri, 19 Jun 2009 05:49:07 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721662#action_12721662 ]

Amar Kamat commented on HADOOP-6014:

bq. Maybe as a first step, we can just treat the failures that were explicitly initiated by the TaskTracker differently, and not have the TaskTracker be penalized for those.
I think for now this will be a simple thing to do. A task can fail because of
# code issues (failure, e.g. buggy code)
# node issues (killed, e.g. a bad disk)
# a mismatch (killed-failure, e.g. insufficient memory)

In case #3, it's not the TaskTracker's fault, and hence we should be less aggressive in deciding on such failures.
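As a rough illustration of the distinction, here is a minimal sketch (the class and method names are hypothetical, not actual Hadoop code) of bucketing failures into the three categories so that only failures genuinely attributable to the node count against the TaskTracker:

```java
// Hypothetical sketch, not actual Hadoop code: classify why a task died and
// decide whether the failure should count toward the tracker's blacklist score.
class FailureClassifier {
    enum FailureKind {
        CODE_ISSUE,  // #1: failure, e.g. buggy user code
        NODE_ISSUE,  // #2: killed, e.g. a bad disk
        MISMATCH     // #3: killed-failure, e.g. insufficient memory
    }

    // Only node issues are the TaskTracker's fault; buggy code (#1) and
    // memory mismatches (#3) should not penalize the tracker.
    static boolean countsAgainstTracker(FailureKind kind) {
        return kind == FailureKind.NODE_ISSUE;
    }
}
```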

bq. I'd tend to agree with Jim that we should just use HADOOP-5478 and revert the cross-job
Cross-job blacklisting will still be required. Consider a case where a node's environment is messed up (all the basic apps, e.g. wc, sort etc., are missing). In such a case I don't think node scripts will help. The number of per-job task failures looks like the right metric to me.
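The cross-job mechanism described above could be sketched roughly as follows (class and method names are illustrative assumptions, not the actual Hadoop implementation): each job independently blacklists a tracker for its own tasks, and the tracker is globally blacklisted once enough jobs have done so.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not actual Hadoop code: global blacklisting driven by
// how many jobs have blacklisted a tracker. This catches a broken node
// environment (missing wc, sort, etc.) that a node health script would miss.
class CrossJobBlacklist {
    private final Map<String, Integer> jobBlacklists = new HashMap<>();
    private final int threshold; // e.g. the value of mapred.max.tracker.blacklists

    CrossJobBlacklist(int threshold) {
        this.threshold = threshold;
    }

    // Called whenever some job blacklists this tracker for its own tasks.
    void recordJobBlacklist(String trackerName) {
        jobBlacklists.merge(trackerName, 1, Integer::sum);
    }

    // A tracker is globally blacklisted once enough jobs have penalized it.
    boolean isGloballyBlacklisted(String trackerName) {
        return jobBlacklists.getOrDefault(trackerName, 0) >= threshold;
    }
}
```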

> Improvements to Global Black-listing of TaskTrackers
> ----------------------------------------------------
>                 Key: HADOOP-6014
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6014
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Arun C Murthy
>             Fix For: 0.21.0
> HADOOP-4305 added a global black-list of tasktrackers.
> We saw a scenario on one of our clusters where a few jobs caused a lot of tasktrackers to be blacklisted immediately. This was caused by a specific set of jobs (same user) whose tasks were shot down by the TaskTracker for being over the vmem limit of 2G. Each of these jobs had over 600 failures of the same kind. This resulted in each of these jobs black-listing some tasktrackers, which in itself is wrong since the failures had nothing to do with the node on which they occurred (i.e. they were due to high memory usage) and shouldn't have penalized the tasktracker. We clearly need to start treating system and user failures separately for black-listing etc. A DiskError is fatal and the tasktracker should probably be blacklisted immediately, while a task which was 'failed' for using more memory shouldn't count against the tasktracker at all.
> The other problem is that we never configured mapred.max.tracker.blacklists and continue to use the default value of 4. Furthermore, this config should really be a percentage of the cluster size and not a whole number.
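The percentage-based threshold suggested in the description could look roughly like this (a sketch under the assumption of a configurable percent value; the method and parameter names are hypothetical, not part of Hadoop):

```java
// Hypothetical sketch, not Hadoop's actual implementation: derive the blacklist
// threshold from a percentage of the cluster size instead of the fixed default
// of 4 used by mapred.max.tracker.blacklists.
class BlacklistThreshold {
    // percent is an assumed config value, e.g. 1.0 meaning 1% of the cluster.
    static int threshold(int clusterSize, double percent) {
        // Round up, but never drop below the old absolute default of 4 so
        // small clusters are not blacklisted too eagerly.
        return Math.max(4, (int) Math.ceil(clusterSize * percent / 100.0));
    }
}
```

On a 2000-node cluster with percent = 1.0 this yields a threshold of 20, while a 100-node cluster falls back to the old default of 4.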

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
