hadoop-common-dev mailing list archives

From "Hemanth Yamijala (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3184) HOD gracefully exclude "bad" nodes during ring formation
Date Thu, 05 Jun 2008 11:09:45 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12602614#action_12602614 ]

Hemanth Yamijala commented on HADOOP-3184:

The attached patch solves the problem of cluster allocation failing due to a single bad JobTracker
node in the entire cluster. It does not handle ringmaster failures, which is much tougher
to solve at this point.

Description of the solution:

This patch builds on the solution of HADOOP-3464, where we introduced an RPC message (setHodRingErrors)
which a HodRing calls when it fails to launch the Hadoop daemons on a node (e.g.
because of a missing Hadoop installation). In HADOOP-3464, upon receiving this error, we checked whether the
error came while launching a master command (i.e. a NameNode or JobTracker command) and, if
so, we simply propagated it back to the client, which deallocated the cluster after displaying
the error message from the HodRing.

In this patch, we keep track of how many times such master commands have failed in a variable in
the service object. We also introduce a config variable, ringmaster.max-master-failures. The
RingMaster returns an error to the client only when the number of master command
failures exceeds the configured value. If the limit is not exceeded, the next HodRing that
asks for a command to launch is given the master command again.
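
To make the retry behaviour concrete, here is a minimal sketch in Python (the language HOD is written in). It is not the code from the patch: the class name MasterFailureTracker and its methods are hypothetical, and it assumes a single master command for simplicity.

```python
class MasterFailureTracker:
    """Hypothetical stand-in for the failure counter kept in the service object."""

    def __init__(self, max_master_failures):
        # Corresponds to the ringmaster.max-master-failures setting.
        self.max_master_failures = max_master_failures
        self.failures = 0
        self.master_assigned = False

    def next_command(self):
        """Return the kind of command the next HodRing asking for work should launch."""
        if not self.master_assigned:
            self.master_assigned = True
            return "master"
        return "slave"

    def report_master_failure(self):
        """Record a failed master launch.

        Returns True when the failure count exceeds the limit, meaning the
        error should be propagated to the client.
        """
        self.failures += 1
        if self.failures > self.max_master_failures:
            return True
        # Still within the limit: release the master command so that the
        # next HodRing asking for a command receives it again.
        self.master_assigned = False
        return False
```

With a limit of 2 in this sketch, the first two master failures are absorbed and the command is handed out again; only the third failure is reported to the client.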

The config variable ringmaster.max-master-failures is bounded by a function of the maximum
number of requested nodes, in case that number is smaller than the configured value. This ensures that
cluster allocation can fail if sufficient nodes are not available to bring up the masters.
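
The bounding can be sketched as follows. The exact function used by the patch is not stated here, so this example assumes, purely for illustration, that the bound is the number of requested nodes itself; the function name is hypothetical.

```python
def effective_max_master_failures(configured_max, requested_nodes):
    """Cap the configured failure limit by the number of requested nodes.

    ASSUMPTION: the real patch uses some function of the requested node
    count; plain min() against that count is only an illustrative choice.
    """
    return min(configured_max, requested_nodes)
```

Under this assumption, a request for 3 nodes with a configured limit of 5 gives an effective limit of 3, so allocation fails once there are no more nodes left to try instead of waiting for retries that can never happen.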

> HOD gracefully exclude "bad" nodes during ring formation
> --------------------------------------------------------
>                 Key: HADOOP-3184
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3184
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: contrib/hod
>            Reporter: Marco Nicosia
>            Assignee: Hemanth Yamijala
>             Fix For: 0.18.0
>         Attachments: 3184.1.patch
> HOD clusters sometimes fail to allocate due to a single "bad" node. During ring formation,
> the entire ring should not be dependent upon every single node being good. Instead, it should
> exclude any ring member that does not adequately join the ring in a specified amount
> of time.
> This is a frequent HOD user issue (although not directly caused by HOD).
> Examples of bad nodes: Missing java, incorrect version of HOD or Hadoop, local name-cache
> corrupt, slow network links, drives just beginning to fail, etc.
> Many of these conditions are known, and we can monitor for those separately, but this
> enhancement would shield users from unknown failure conditions that we haven't yet anticipated.
> This way, a user will get a cluster, instead of hanging indefinitely.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
