hadoop-common-dev mailing list archives

From "Wendy Chien (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-442) slaves file should include an 'exclude' section, to prevent "bad" datanodes and tasktrackers from disrupting a cluster
Date Thu, 01 Feb 2007 22:17:05 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12469597

Wendy Chien commented on HADOOP-442:

-I'll change the names to the ones Sameer proposed and make the default an empty string.

-The exclude/hosts files expect full hostnames, separated by whitespace (similar to
the slaves file).  I can refine this if it's too cumbersome. 
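
For illustration, here is a minimal sketch of reading such a whitespace-separated hostname file. The class and method names are hypothetical, not the actual patch code:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: reads a hosts/exclude file in which full hostnames
// are separated by arbitrary whitespace (like the slaves file).
public class HostsFileReader {
  public static Set<String> readHostList(Path file) throws IOException {
    Set<String> hosts = new LinkedHashSet<>();
    for (String line : Files.readAllLines(file)) {
      for (String token : line.trim().split("\\s+")) {
        if (!token.isEmpty()) {
          hosts.add(token);   // a full hostname, e.g. node17.example.com
        }
      }
    }
    return hosts;
  }
}
```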

-Some of us had a discussion about exclude and decommission.  The conclusion was that they
should be combined.

Below is the new design:

1. Namenode behavior in dealing with exclude/hosts lists.
  a. When the namenode starts up, it reads the hosts and exclude files.  When a node on the
exclude list registers with the namenode, we mark it to be decommissioned.  When nodes are
done being decommissioned, they are shut down.  Nodes not on the include list are not allowed
to register.
  b. When the namenode gets a refreshNodes command, it re-reads the hosts and exclude lists.
 If a node is added to the hosts list, it will be allowed to register.  If a node is
removed from the hosts list, any further communication is disallowed and it will
be asked to shut down.  If a node is added to the exclude list, it will start to be decommissioned.
 If a node is removed from the exclude list, the decommission process will be stopped.
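
The refreshNodes cases above could be sketched as a per-node decision based on membership in the two lists. This is a hypothetical sketch, not the patch itself, and it assumes an empty hosts list means "allow all", consistent with the empty-string default mentioned earlier:

```java
import java.util.Set;

// Hypothetical sketch of the per-node decision after a refreshNodes command
// re-reads the hosts (include) and exclude lists.
public class RefreshDecision {
  public enum Action { ALLOW, DISALLOW_AND_SHUTDOWN, START_DECOMMISSION, STOP_DECOMMISSION }

  public static Action decide(String host, Set<String> include, Set<String> exclude,
                              boolean decommissioning) {
    // Removed from the hosts list: refuse further communication, ask it to shut down.
    if (!include.isEmpty() && !include.contains(host)) {
      return Action.DISALLOW_AND_SHUTDOWN;
    }
    // On the exclude list: begin (or continue) decommissioning.
    if (exclude.contains(host)) {
      return Action.START_DECOMMISSION;
    }
    // Removed from the exclude list mid-decommission: stop the process.
    if (decommissioning) {
      return Action.STOP_DECOMMISSION;
    }
    // On the hosts list (or the list is empty, meaning allow all): normal operation.
    return Action.ALLOW;
  }
}
```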

2. Decommissioning a node behaves slightly differently from before.
  a. When a node is being decommissioned, we do not want to use its replicas as sources
for re-replication unless no replicas exist on non-decommissioned nodes.
  b. A new thread will periodically check whether a node is done being decommissioned.
 If it is, the node will be told to shut down on its next heartbeat.
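
The monitor described in 2b might look roughly like the following. All names here are hypothetical; the real patch would hook into the namenode's datanode bookkeeping rather than a standalone class:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of 2b: a periodic check flips a node from
// DECOMMISSION_IN_PROGRESS to DECOMMISSIONED once it has no blocks left
// to re-replicate; the next heartbeat from a DECOMMISSIONED node is
// answered with a shutdown command.
public class DecommissionMonitor {
  public enum State { DECOMMISSION_IN_PROGRESS, DECOMMISSIONED }
  public enum HeartbeatReply { CONTINUE, SHUTDOWN }

  private final Map<String, State> states = new ConcurrentHashMap<>();

  public void startDecommission(String node) {
    states.put(node, State.DECOMMISSION_IN_PROGRESS);
  }

  // Called periodically by the monitor thread for each decommissioning node.
  public void checkNode(String node, int blocksStillUnderReplicated) {
    if (states.get(node) == State.DECOMMISSION_IN_PROGRESS
        && blocksStillUnderReplicated == 0) {
      states.put(node, State.DECOMMISSIONED);   // done: safe to shut down
    }
  }

  // Called when the node heartbeats; a finished node is told to shut down.
  public HeartbeatReply onHeartbeat(String node) {
    return states.get(node) == State.DECOMMISSIONED
        ? HeartbeatReply.SHUTDOWN : HeartbeatReply.CONTINUE;
  }
}
```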

Comments welcome (and please let me know if I forgot anything).  

> slaves file should include an 'exclude' section, to prevent "bad" datanodes and tasktrackers from disrupting a cluster
> -----------------------------------------------------------------------------------------------------------------------
>                 Key: HADOOP-442
>                 URL: https://issues.apache.org/jira/browse/HADOOP-442
>             Project: Hadoop
>          Issue Type: Bug
>          Components: conf
>            Reporter: Yoram Arnon
>         Assigned To: Wendy Chien
> I recently had a few nodes go bad, such that they were inaccessible to ssh, but were
still running their java processes.
> Tasks that executed on them were failing, causing jobs to fail.
> I couldn't stop the java processes, because of the ssh issue, so I was helpless until
I could actually power down these nodes.
> Restarting the cluster doesn't help, even when removing the bad nodes from the slaves
file - they just reconnect and are accepted.
> While we plan to prevent tasks from launching on the same nodes over and over, what I'd
like is to be able to prevent rogue processes from connecting to the masters.
> Ideally, the slaves file will contain an 'exclude' section, which will list nodes that
shouldn't be accessed, and should be ignored if they try to connect. That would also help
in configuring the slaves file for a large cluster - I'd list the full range of machines in
the cluster, then list the ones that are down in the 'exclude' section.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
