hadoop-common-dev mailing list archives

From "Raghu Angadi (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-4061) Large number of decommission freezes the Namenode
Date Thu, 20 Nov 2008 22:14:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649501#action_12649501 ]

Raghu Angadi commented on HADOOP-4061:

> Deleting already decommissioned blocks as Raghu proposes is also not very good. Until
the node is shut down its blocks can be accessed for read. We don't want to change that.

The proposal is to delete it *after* the block is properly replicated. That seems like the
right thing to do and a must for scalability. The main CPU cost here comes from keeping
all these excess blocks around.

I don't see any use in keeping over-replicated blocks. If it were useful, then we should not
delete any over-replicated block unless there is no space left.
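To make the intent concrete, here is a minimal sketch (names are illustrative, not actual Hadoop APIs): a decommissioned node's replica only becomes removable "excess" once the block has reached its target replication on other live nodes; until then it still serves reads.

{noformat}
// Hypothetical sketch, not the HADOOP-4061 patch. The threshold check is
// the whole point: a decommissioned replica is deletable only after the
// block is properly replicated elsewhere.
public class ExcessReplicaSketch {

    static boolean isRemovableExcess(int liveReplicasElsewhere,
                                     int targetReplication) {
        // Until replication elsewhere catches up, the decommissioned copy
        // still serves reads and must be kept.
        return liveReplicasElsewhere >= targetReplication;
    }

    public static void main(String[] args) {
        System.out.println(isRemovableExcess(2, 3)); // false: still needed
        System.out.println(isRemovableExcess(3, 3)); // true: safe to delete
    }
}
{noformat}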

Regarding the patch, the loop should count 5 decommissioned nodes rather than any 5 nodes. Otherwise,
if you have 2000 nodes, then each decommissioned node would be checked only once in 16 hours or
so (even if you decommission just one).
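The counting distinction can be sketched as follows (a hypothetical simplification, not the actual patch or FSNamesystem code): only nodes that are actually decommissioning count toward the per-pass limit, so a single decommissioning node buried in a large cluster is still found on every pass.

{noformat}
// Hypothetical sketch: stop after 'limit' *decommissioning* nodes have
// been checked, not after 'limit' arbitrary nodes.
import java.util.ArrayList;
import java.util.List;

public class DecommissionCheckSketch {
    enum AdminState { NORMAL, DECOMMISSION_INPROGRESS, DECOMMISSIONED }

    static class Node {
        final String name;
        AdminState state;
        Node(String name, AdminState state) { this.name = name; this.state = state; }
    }

    /** Returns the names of decommissioning nodes examined in one pass. */
    static List<String> checkDecommission(List<Node> datanodes, int limit) {
        List<String> checked = new ArrayList<>();
        int decommChecked = 0;
        for (Node node : datanodes) {
            if (decommChecked >= limit) break;
            if (node.state == AdminState.DECOMMISSION_INPROGRESS) {
                checked.add(node.name); // stands in for checkDecommissionStateInternal(node)
                decommChecked++;        // only decommissioning nodes count toward the limit
            }
            // NORMAL nodes are skipped without consuming the limit
        }
        return checked;
    }

    public static void main(String[] args) {
        List<Node> cluster = new ArrayList<>();
        for (int i = 0; i < 2000; i++) {
            // one decommissioning node buried in a 2000-node cluster
            cluster.add(new Node("dn" + i,
                i == 1500 ? AdminState.DECOMMISSION_INPROGRESS : AdminState.NORMAL));
        }
        System.out.println(checkDecommission(cluster, 5)); // prints [dn1500]
    }
}
{noformat}

If the limit counted any 5 nodes, the same pass would usually examine 5 NORMAL nodes and miss dn1500 entirely.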

> Large number of decommission freezes the Namenode
> -------------------------------------------------
>                 Key: HADOOP-4061
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4061
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.17.2
>            Reporter: Koji Noguchi
>            Assignee: Tsz Wo (Nicholas), SZE
>         Attachments: 4061_20081119.patch
> On a 1900-node cluster, we tried decommissioning 400 nodes with 30k blocks each. The other
1500 nodes were almost empty.
> When decommission started, namenode's queue overflowed every 6 minutes.
> Looking at the CPU usage, it showed that every 5 minutes the org.apache.hadoop.dfs.FSNamesystem$DecommissionedMonitor
thread was taking 100% of the CPU for 1 minute, causing the queue to overflow.
> {noformat}
>   public synchronized void decommissionedDatanodeCheck() {
>     for (Iterator<DatanodeDescriptor> it = datanodeMap.values().iterator();
>          it.hasNext();) {
>       DatanodeDescriptor node = it.next();
>       checkDecommissionStateInternal(node);
>     }
>   }
> {noformat}

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
