hadoop-common-dev mailing list archives

From "Hemanth Yamijala (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3184) HOD gracefully exclude "bad" nodes during ring formation
Date Tue, 27 May 2008 04:57:57 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12599960#action_12599960 ]

Hemanth Yamijala commented on HADOOP-3184:

Another approach is the following:

Mostly, HOD allocations fail if either the RingMaster or the JobTracker does not come up.
If the JobTracker does not come up, the hodring on that node can report a failure,
and another node which asks for the hadoop command can be asked to run the JT. If the RingMaster
does not come up, it's a bit more difficult, because that's what controls the whole process.
So, maybe in that case, the RingMaster should somehow bring up another instance of itself
on a different machine and then die gracefully.

I think the latter change would be quite involved. The former should be simpler.
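
The simpler of the two changes, handing the JobTracker to another node after a hodring reports a failure, could be sketched roughly as follows. This is only an illustration of the idea, not HOD code: the class `CommandAllocator`, its methods, and the command strings are all hypothetical names chosen for the example.

```python
class CommandAllocator:
    """Hypothetical stand-in for the RingMaster's bookkeeping of who runs the JT."""

    def __init__(self):
        self.jt_node = None        # node currently assigned the JobTracker
        self.failed_nodes = set()  # nodes whose JobTracker failed to come up

    def request_command(self, node):
        """Called when a hodring asks which hadoop command it should run."""
        if node in self.failed_nodes:
            # Assumption: a node that already failed is not retried.
            return "idle"
        if self.jt_node is None:
            self.jt_node = node
            return "jobtracker"
        return "tasktracker"

    def report_failure(self, node):
        """Called when a hodring reports that its JobTracker did not come up."""
        if node == self.jt_node:
            self.failed_nodes.add(node)
            self.jt_node = None    # the next node that asks will be given the JT
```

With this bookkeeping, the first node to ask is given the JobTracker; if it reports a failure, the next node that asks for a command is given the JobTracker instead.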

> HOD gracefully exclude "bad" nodes during ring formation
> --------------------------------------------------------
>                 Key: HADOOP-3184
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3184
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: contrib/hod
>            Reporter: Marco Nicosia
>            Assignee: Hemanth Yamijala
>             Fix For: 0.18.0
> HOD clusters sometimes fail to allocate due to a single "bad" node. During ring formation,
the entire ring should not be dependent upon every single node being good. Instead, it should
exclude any ring member that does not adequately join the ring in a specified amount
of time.
> This is a frequent HOD user issue (although not directly caused by HOD).
> Examples of bad nodes: Missing java, incorrect version of HOD or Hadoop, local name-cache
corrupt, slow network links, drives just beginning to fail, etc.
> Many of these conditions are known, and we can monitor for those separately, but this
enhancement would shield users from unknown failure conditions that we haven't yet anticipated.
This way, a user will get a cluster, instead of hanging indefinitely.
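
The timeout-based exclusion requested in the description above might look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the HOD implementation: the function `form_ring`, its parameters, and the `join_times` input format (node name mapped to the seconds it took to join, or `None` if it never joined) are hypothetical.

```python
def form_ring(join_times, timeout_secs, min_nodes=1):
    """Split nodes into ring members and excluded nodes based on a join deadline.

    join_times: dict mapping node name -> seconds taken to join, or None if
                the node never joined (hypothetical input format).
    """
    members = sorted(
        node for node, secs in join_times.items()
        if secs is not None and secs <= timeout_secs
    )
    excluded = sorted(node for node in join_times if node not in members)
    if len(members) < min_nodes:
        # Fail fast instead of hanging indefinitely.
        raise RuntimeError(
            "only %d of %d required nodes joined in time" % (len(members), min_nodes)
        )
    return members, excluded
```

The point of the design is that a slow or broken node only shrinks the ring; the allocation fails outright only when fewer than `min_nodes` nodes join before the deadline.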

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.