hadoop-common-dev mailing list archives

From "Robert Chansler (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-3184) HOD gracefully exclude "bad" nodes during ring formation
Date Fri, 27 Jun 2008 21:19:45 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Chansler updated HADOOP-3184:
------------------------------------

    Release Note: Modified HOD to handle master (NameNode or JobTracker) failures on bad
nodes by trying to bring them up on another node in the ring. Introduced new property
ringmaster.max-master-failures to specify the maximum number of times a master is allowed
to fail.
      (was: Modified HOD to handle master (NameNode or JobTracker) failures on bad nodes by
trying to bring them up on another node in the ring. These retries are done a configured
number of times per master. The change is incompatible because a new required configuration
option is introduced: ringmaster.max-master-failures, which defines the maximum number of
times a master is allowed to fail.)
    Hadoop Flags: [Incompatible change, Reviewed]  (was: [Reviewed, Incompatible change])
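
The retry behaviour described in the release note can be sketched roughly as follows. This is
only an illustration, not the code from the attached patches: the function start_master, its
launch callback, and the max_master_failures parameter (standing in for the
ringmaster.max-master-failures property) are hypothetical names chosen for the example.

```python
def start_master(nodes, launch, max_master_failures):
    """Try to bring a master up on successive ring nodes.

    nodes: ordered list of candidate node names (hypothetical).
    launch: callable(node) -> bool, True if the master came up on that node.
    max_master_failures: stand-in for ringmaster.max-master-failures.
    Returns (node, failures) on success; raises RuntimeError otherwise.
    """
    failures = 0
    for node in nodes:
        if launch(node):
            return node, failures
        # The master failed on this node; count it and move to the next node.
        failures += 1
        if failures >= max_master_failures:
            break
    raise RuntimeError("master failed %d times; giving up" % failures)
```

With a limit of 3 and a launch callback that fails on the first two nodes, the sketch returns
the third node and a failure count of 2; with a limit of 1 it gives up after the first failure.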

> HOD gracefully exclude "bad" nodes during ring formation
> --------------------------------------------------------
>
>                 Key: HADOOP-3184
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3184
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: contrib/hod
>            Reporter: Marco Nicosia
>            Assignee: Hemanth Yamijala
>             Fix For: 0.18.0
>
>         Attachments: 3184.1.patch, 3184.2.patch
>
>
> HOD clusters sometimes fail to allocate due to a single "bad" node. During ring formation,
> the entire ring should not be dependent upon every single node being good. Instead, it should
> exclude any ring member that does not adequately join the ring in a specified amount
> of time.
> This is a frequent HOD user issue (although not directly caused by HOD).
> Examples of bad nodes: missing Java, incorrect version of HOD or Hadoop, local name-cache
> corrupt, slow network links, drives just beginning to fail, etc.
> Many of these conditions are known, and we can monitor for those separately, but this
> enhancement would shield users from unknown failure conditions that we haven't yet anticipated.
> This way, a user will get a cluster instead of the allocation hanging indefinitely.
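
As a rough illustration of the behaviour requested above (excluding nodes that do not join the
ring in time), here is a minimal sketch. It is not HOD code: form_ring, join_times, and
timeout_secs are hypothetical names, and real ring formation would poll live nodes rather than
read recorded join times.

```python
def form_ring(join_times, timeout_secs):
    """Split candidate nodes into ring members and excluded nodes.

    join_times: dict mapping node name -> seconds the node took to join,
                or None if it never joined (hypothetical input).
    timeout_secs: how long a node is given to join the ring.
    Returns (ring, excluded), both sorted by node name.
    """
    ring, excluded = [], []
    for node, elapsed in sorted(join_times.items()):
        if elapsed is not None and elapsed <= timeout_secs:
            ring.append(node)
        else:
            # Node never joined, or joined too late: leave it out of the ring.
            excluded.append(node)
    return ring, excluded
```

The point of the sketch is that a slow or missing node only shrinks the ring; it does not
block the allocation for everyone else.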

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

