hadoop-common-dev mailing list archives

From "Tsz Wo (Nicholas), SZE (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-4061) Large number of decommission freezes the Namenode
Date Thu, 20 Nov 2008 22:46:44 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-4061:

    Attachment: 4061_20081120.patch


Thanks, Konstantin and Raghu!

- This is not true.  dfs.namenode.decommission.interval is always in seconds, not minutes.
 Only the default value is changed.  Could you check it again?

- I am going to do the code refactoring in the next step.  As mentioned before, I am going
to improve decommission performance.  For now, I will rename the class to DecommissionManager
and make Monitor an inner class.

- I would rather rename it to i for iterator and declare it inside the for-loop header.
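
To make the points above concrete, here is a rough sketch of the shape being described: a DecommissionManager with Monitor as an inner class, an interval given in seconds, and an iterator named i declared in the for-loop header. It is not the attached patch. Node, add(), check() and getRecheckIntervalMs() are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Sketch only: a manager class that owns a periodic Monitor as an inner class. */
class DecommissionManager {
  /** Hypothetical stand-in for a datanode record. */
  static class Node {
    final String name;
    boolean checked = false;
    Node(String name) { this.name = name; }
  }

  private final List<Node> nodes = new ArrayList<>();

  synchronized void add(Node node) {
    nodes.add(node);
  }

  /** Periodically re-checks the nodes held by the enclosing manager. */
  class Monitor implements Runnable {
    private final long recheckIntervalMs;

    /** The interval is given in seconds and converted once to milliseconds. */
    Monitor(int recheckIntervalSeconds) {
      this.recheckIntervalMs = TimeUnit.SECONDS.toMillis(recheckIntervalSeconds);
    }

    long getRecheckIntervalMs() {
      return recheckIntervalMs;
    }

    @Override
    public void run() {
      while (!Thread.currentThread().isInterrupted()) {
        try {
          check();
          Thread.sleep(recheckIntervalMs);
        } catch (InterruptedException e) {
          // Restore the interrupt flag so the loop condition ends the thread.
          Thread.currentThread().interrupt();
        }
      }
    }

    /** One pass over the nodes; the iterator i lives only in the loop header. */
    void check() {
      synchronized (DecommissionManager.this) {
        for (Iterator<Node> i = nodes.iterator(); i.hasNext(); ) {
          Node node = i.next();
          node.checked = true; // placeholder for the real per-node state check
        }
      }
    }
  }
}
```

The inner class can reach the manager's node list and lock directly, which is the main reason to nest Monitor rather than keep it as a separate top-level class.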

Let me create a new issue for improving decommission performance and discuss it there.  I believe
we need more thought.

@Raghu> the loop should count 5 decommissioned nodes

Yes, we should count nodes with decommission in progress.
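
One way to read this is that the per-pass limit applies to nodes with decommission in progress, not to every node visited. The sketch below shows that idea under stated assumptions. BoundedDecommissionCheck, State, Node, pendingBlocks and checkOnce() are hypothetical names, and the limit of 5 used in the usage note comes from the quoted comment, not from a confirmed setting.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: each pass stops after a fixed number of in-progress nodes. */
class BoundedDecommissionCheck {
  enum State { NORMAL, DECOMMISSION_INPROGRESS, DECOMMISSIONED }

  /** Hypothetical stand-in for a datanode record. */
  static class Node {
    State state;
    int pendingBlocks; // blocks still waiting to be replicated elsewhere
    Node(State state, int pendingBlocks) {
      this.state = state;
      this.pendingBlocks = pendingBlocks;
    }
  }

  private final List<Node> nodes;
  private final int nodesPerPass;
  private int next = 0; // index where the previous pass stopped

  BoundedDecommissionCheck(List<Node> nodes, int nodesPerPass) {
    this.nodes = new ArrayList<>(nodes);
    this.nodesPerPass = nodesPerPass;
  }

  /**
   * Runs one pass and returns how many in-progress nodes were examined.
   * Only nodes in DECOMMISSION_INPROGRESS count toward the limit; other
   * nodes are skipped cheaply. A pass visits each node at most once.
   */
  synchronized int checkOnce() {
    int counted = 0;
    int visited = 0;
    while (visited < nodes.size() && counted < nodesPerPass) {
      Node node = nodes.get(next);
      next = (next + 1) % nodes.size();
      visited++;
      if (node.state == State.DECOMMISSION_INPROGRESS) {
        counted++;
        if (node.pendingBlocks == 0) {
          node.state = State.DECOMMISSIONED;
        }
      }
    }
    return counted;
  }
}
```

With 8 in-progress nodes and a limit of 5, the first call examines 5 of them and the next call picks up the remaining 3, so the lock is held only briefly on each pass instead of for a full scan of expensive checks.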

> Large number of decommission freezes the Namenode
> -------------------------------------------------
>                 Key: HADOOP-4061
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4061
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.17.2
>            Reporter: Koji Noguchi
>            Assignee: Tsz Wo (Nicholas), SZE
>         Attachments: 4061_20081119.patch, 4061_20081120.patch
> On a 1900-node cluster, we tried decommissioning 400 nodes with 30k blocks each. The other
> 1500 nodes were almost empty.
> When decommission started, namenode's queue overflowed every 6 minutes.
> The CPU usage showed that every 5 minutes the org.apache.hadoop.dfs.FSNamesystem$DecommissionedMonitor
> thread was taking 100% of the CPU for 1 minute, causing the queue to overflow.
> {noformat}
>   public synchronized void decommissionedDatanodeCheck() {
>     for (Iterator<DatanodeDescriptor> it = datanodeMap.values().iterator();
>          it.hasNext();) {
>       DatanodeDescriptor node = it.next();
>       checkDecommissionStateInternal(node);
>     }
>   }
> {noformat}

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
