hadoop-common-dev mailing list archives

From "Hemanth Yamijala (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3184) HOD gracefully exclude "bad" nodes during ring formation
Date Tue, 27 May 2008 04:51:59 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12599958#action_12599958 ]

Hemanth Yamijala commented on HADOOP-3184:

Users seem to be facing two possible types of issues:

- HOD allocations fail (that is, the allocate command returns with a non-zero exit code)
due to some of the conditions mentioned above. Retrying doesn't help unless the condition
is rectified or the node that has the condition is removed from the resource manager's list.
This is particularly true in Torque, as it returns the same set of nodes in the same order,
and hence the failure condition is mostly repeated.
- HOD allocation hangs (without returning), again due to some of the conditions mentioned.

First, can you please confirm which of these is the bigger issue?

AFAIK, the second case is a Torque issue where we do not even get control to do anything.
We could attempt to fix the first one, maybe even outside of HOD: we could take a node
offline if HOD allocations fail on it a couple of times. That way the offending node is
removed automatically, and further attempts would work.
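
To make the idea concrete, here is a rough sketch of such a failure counter. It is only an
illustration, not anything that exists in HOD today: the class name, the threshold of 2, and
the offline_node callback are all hypothetical. On Torque the callback could, for example,
wrap "pbsnodes -o <node>", but how failures get attributed to a node is left open here.

```python
from collections import defaultdict


class BadNodeTracker:
    """Counts failed allocations per node and offlines repeat offenders.

    Hypothetical helper: `offline_node` is any callable that removes a node
    from the resource manager's list (assumed, not part of HOD).
    """

    def __init__(self, offline_node, threshold=2):
        self.offline_node = offline_node
        self.threshold = threshold          # "a couple of times"
        self.failures = defaultdict(int)    # node name -> failure count
        self.offlined = set()

    def record_failure(self, node):
        """Note a failed allocation on `node`; return True if it was offlined now."""
        if node in self.offlined:
            return False
        self.failures[node] += 1
        if self.failures[node] >= self.threshold:
            self.offline_node(node)
            self.offlined.add(node)
            return True
        return False

    def record_success(self, node):
        """A successful allocation clears the node's failure count."""
        self.failures.pop(node, None)


# Usage: collect offlined nodes in a list instead of calling the resource manager.
offlined = []
tracker = BadNodeTracker(offline_node=offlined.append, threshold=2)
first = tracker.record_failure("node17")    # below threshold, nothing happens
second = tracker.record_failure("node17")   # threshold reached, node is offlined
```

Resetting the count on success is a design choice so that a node with an occasional
transient failure is not taken out of the pool.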

> HOD gracefully exclude "bad" nodes during ring formation
> --------------------------------------------------------
>                 Key: HADOOP-3184
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3184
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: contrib/hod
>            Reporter: Marco Nicosia
>            Assignee: Hemanth Yamijala
>             Fix For: 0.18.0
> HOD clusters sometimes fail to allocate due to a single "bad" node. During ring formation,
> the entire ring should not be dependent upon every single node being good. Instead, it should
> exclude any ring member that does not adequately join the ring in a specified amount
> of time.
> This is a frequent HOD user issue (although not directly caused by HOD).
> Examples of bad nodes: Missing java, incorrect version of HOD or Hadoop, local name-cache
> corrupt, slow network links, drives just beginning to fail, etc.
> Many of these conditions are known, and we can monitor for those separately, but this
> enhancement would shield users from unknown failure conditions that we haven't yet anticipated.
> This way, a user will get a cluster, instead of hanging indefinitely.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
