hadoop-common-dev mailing list archives

From "dhruba borthakur (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-442) slaves file should include an 'exclude' section, to prevent "bad" datanodes and tasktrackers from disrupting a cluster
Date Mon, 12 Feb 2007 19:22:05 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472390 ]

dhruba borthakur commented on HADOOP-442:

1. It would be nice if the descriptions of dfs.hosts and dfs.hosts.exclude
   said "Full path name of file ..."

2. The FSNamesystem.close() function should have a dnthread.join() call.
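
The missing join() in point 2 is the standard interrupt-then-join shutdown pattern: a minimal sketch, not the actual FSNamesystem code. The field name dnthread comes from the comment; everything else here is an assumed stand-in.

```java
// Sketch of the shutdown pattern point 2 asks for: close() interrupts the
// decommission-monitor thread and then join()s it, so close() cannot return
// while the thread might still be touching namesystem state.
public class CloseSketch {
    Thread dnthread;                       // name taken from the comment
    private volatile boolean running = true;

    void start() {
        dnthread = new Thread(() -> {
            while (running) {
                try {
                    Thread.sleep(50);      // stand-in for the monitor's periodic work
                } catch (InterruptedException e) {
                    return;                // interrupted during close(): exit cleanly
                }
            }
        });
        dnthread.start();
    }

    void close() throws InterruptedException {
        running = false;
        dnthread.interrupt();
        dnthread.join();   // the missing call: wait for the thread to actually exit
    }
}
```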

3. Can we make FSNamesystem.refreshNodes() package private? i.e. remove the
   "public" keyword from its definition.

4. The method FSNamesystem.refreshNodes might need to be synchronized because
   it traverses the datanodeMap. However, the first line in this method (which
   invokes "hostReader.refresh") should preferably be outside this synchronization.
   It is good to read in the contents of the hosts file outside the global
   FSNamesystem lock.
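
The locking split described in point 4 can be sketched as follows. The class and method names are illustrative stand-ins rather than the real FSNamesystem implementation, and the trace list exists only to make the ordering of the two steps visible.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: refresh the host lists from disk first, without holding the global
// lock; take the lock only for the in-memory datanodeMap traversal.
public class RefreshSketch {
    final List<String> trace = new ArrayList<>();
    private final Object fsLock = new Object();   // stand-in for the FSNamesystem lock

    void refreshNodes() {
        // Step 1: file I/O (the hostReader.refresh equivalent) outside the lock.
        reloadHostFiles();

        // Step 2: only the traversal of in-memory state runs under the lock.
        synchronized (fsLock) {
            trace.add("scan datanodeMap under lock");
        }
    }

    private void reloadHostFiles() {
        trace.add("re-read hosts files outside lock");
    }
}
```

This keeps slow disk reads from blocking every other namesystem operation while still serializing the datanodeMap walk.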

5. The methods inExcludedHostsList() and inHostsList() could be unified if we
   pass in the specified list as a parameter to this unified method.
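
A minimal sketch of the unification proposed in point 5, assuming the usual dfs.hosts convention that an empty includes list admits every host while an empty excludes list excludes none. Apart from inHostsList and inExcludedHostsList, every name here is hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;

// Sketch: both membership checks reduce to one helper that is parameterized
// by the list to consult and by what an empty list means.
public class HostsCheck {
    private final Set<String> includes = new TreeSet<>();
    private final Set<String> excludes = new TreeSet<>();

    void addInclude(String host) { includes.add(host); }
    void addExclude(String host) { excludes.add(host); }

    // Unified helper: emptyMeansAll captures the asymmetry between the two
    // lists (an empty dfs.hosts allows everyone; an empty exclude bans no one).
    private static boolean inList(Set<String> hosts, String name, boolean emptyMeansAll) {
        if (hosts.isEmpty()) {
            return emptyMeansAll;
        }
        return hosts.contains(name);
    }

    public boolean inHostsList(String name) {
        return inList(includes, name, true);
    }

    public boolean inExcludedHostsList(String name) {
        return inList(excludes, name, false);
    }
}
```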

> slaves file should include an 'exclude' section, to prevent "bad" datanodes and tasktrackers from disrupting a cluster
> -----------------------------------------------------------------------------------------------------------------------
>                 Key: HADOOP-442
>                 URL: https://issues.apache.org/jira/browse/HADOOP-442
>             Project: Hadoop
>          Issue Type: Bug
>          Components: conf
>            Reporter: Yoram Arnon
>         Assigned To: Wendy Chien
>         Attachments: hadoop-442-8.patch
> I recently had a few nodes go bad, such that they were inaccessible to ssh but were still running their java processes.
> Tasks that executed on them were failing, causing jobs to fail.
> I couldn't stop the java processes, because of the ssh issue, so I was helpless until I could actually power down these nodes.
> Restarting the cluster doesn't help, even when removing the bad nodes from the slaves file - they just reconnect and are accepted.
> While we plan to prevent tasks from launching on the same nodes over and over, what I'd like is to be able to prevent rogue processes from connecting to the masters.
> Ideally, the slaves file would contain an 'exclude' section, which would list nodes that shouldn't be accessed, and should be ignored if they try to connect. That would also help in configuring the slaves file for a large cluster - I'd list the full range of machines in the cluster, then list the ones that are down in the 'exclude' section.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
