hadoop-common-dev mailing list archives

From "Wendy Chien (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-442) slaves file should include an 'exclude' section, to prevent "bad" datanodes and tasktrackers from disrupting a cluster
Date Fri, 16 Feb 2007 21:26:05 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wendy Chien updated HADOOP-442:

    Attachment: hadoop-442-11.patch

Thanks for looking over the patch, Dhruba! I've updated it to incorporate your comments.

1. TestDecommission.waitNodeState now waits for 1 second.
2. I do mean to check for DECOMMISSION_INPROGRESS to make sure the decommission began, but
I want to stop it before it finishes so I can test that commissioning a node works too.
3. refreshNodes now returns void.
4. UnregisteredDatanodeException was already there, but I also added DisallowedDatanodeException
to that clause.  I'm inclined to leave them together since they are similar.
6. added synchronized to verifyNodeRegistration, and removed it from start/stopDecommission.
7. removed the new code from pendingTransfers
8. I moved verifyNodeShutdown to FSNamesystem.  
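The wait-and-check pattern behind items 1 and 2 can be sketched as follows. This is a hypothetical illustration, not the actual TestDecommission code; the names (AdminState, waitNodeState) and the Supplier-based node stub are assumptions for the sketch.

```java
import java.util.function.Supplier;

// Hypothetical sketch of a poll-wait test helper in the style of
// TestDecommission.waitNodeState: poll a node's admin state once per
// second until it matches the wanted state or a timeout expires.
public class WaitNodeStateSketch {
    enum AdminState { NORMAL, DECOMMISSION_INPROGRESS, DECOMMISSIONED }

    // Returns true if the node reached the wanted state before the deadline.
    static boolean waitNodeState(Supplier<AdminState> node,
                                 AdminState wanted, long timeoutMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (node.get() == wanted) {
                return true;
            }
            Thread.sleep(1000);  // item 1: wait one second between checks
        }
        return node.get() == wanted;
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate a node whose decommission completes after about 2 seconds.
        long start = System.currentTimeMillis();
        Supplier<AdminState> node = () ->
            System.currentTimeMillis() - start < 2000
                ? AdminState.DECOMMISSION_INPROGRESS
                : AdminState.DECOMMISSIONED;

        // Item 2: first confirm the decommission actually began...
        boolean began =
            waitNodeState(node, AdminState.DECOMMISSION_INPROGRESS, 1000);
        // ...then wait for it to finish.
        boolean finished =
            waitNodeState(node, AdminState.DECOMMISSIONED, 5000);
        System.out.println(began && finished ? "ok" : "failed");
    }
}
```

Checking for DECOMMISSION_INPROGRESS first, as described in item 2, lets a test interrupt the process midway and verify that re-commissioning a node works as well.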

> slaves file should include an 'exclude' section, to prevent "bad" datanodes and tasktrackers from disrupting a cluster
> -----------------------------------------------------------------------------------------------------------------------
>                 Key: HADOOP-442
>                 URL: https://issues.apache.org/jira/browse/HADOOP-442
>             Project: Hadoop
>          Issue Type: Bug
>          Components: conf
>            Reporter: Yoram Arnon
>         Assigned To: Wendy Chien
>         Attachments: hadoop-442-10.patch, hadoop-442-11.patch, hadoop-442-8.patch
> I recently had a few nodes go bad, such that they were inaccessible to ssh, but were
> still running their java processes.
> Tasks that executed on them were failing, causing jobs to fail.
> I couldn't stop the java processes, because of the ssh issue, so I was helpless until
> I could actually power down these nodes.
> Restarting the cluster doesn't help, even when removing the bad nodes from the slaves
> file - they just reconnect and are accepted.
> While we plan to prevent tasks from launching on the same nodes over and over, what I'd
> like is to be able to prevent rogue processes from connecting to the masters.
> Ideally, the slaves file would contain an 'exclude' section, listing nodes that
> shouldn't be accessed and should be ignored if they try to connect. That would also help
> in configuring the slaves file for a large cluster - I'd list the full range of machines in
> the cluster, then list the ones that are down in the 'exclude' section.
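The core of the requested behavior - the master consulting an exclude list and refusing registration from listed hosts - can be sketched as below. This is a minimal illustration, not Hadoop's implementation; the class and exception names (ExcludeListSketch, DisallowedNodeException) are hypothetical, loosely modeled on the DisallowedDatanodeException and verifyNodeRegistration mentioned in the comments above.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: a master keeps a set of excluded hostnames and
// rejects any excluded node that tries to register with the cluster.
public class ExcludeListSketch {
    static class DisallowedNodeException extends Exception {
        DisallowedNodeException(String host) {
            super("Node " + host + " is in the exclude list");
        }
    }

    private final Set<String> excluded = new HashSet<>();

    // In a real cluster the exclude list would be re-read from a file
    // when an admin triggers a refresh; here entries are added directly.
    void exclude(String host) {
        excluded.add(host);
    }

    // Mirrors the idea behind verifyNodeRegistration: check the exclude
    // list before admitting a node, so rogue processes cannot rejoin.
    void verifyNodeRegistration(String host) throws DisallowedNodeException {
        if (excluded.contains(host)) {
            throw new DisallowedNodeException(host);
        }
    }

    public static void main(String[] args) {
        ExcludeListSketch cluster = new ExcludeListSketch();
        cluster.exclude("badnode01");
        try {
            cluster.verifyNodeRegistration("goodnode01"); // accepted
            cluster.verifyNodeRegistration("badnode01");  // rejected
            System.out.println("unreachable");
        } catch (DisallowedNodeException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Rejecting the node at registration time, rather than merely omitting it from the slaves file, is what stops a still-running rogue process from reconnecting after a cluster restart.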

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
