From: "Konstantin Shvachko (JIRA)"
To: core-dev@hadoop.apache.org
Reply-To: core-dev@hadoop.apache.org
Date: Thu, 20 Nov 2008 12:26:44 -0800 (PST)
Message-ID: <1125091392.1227212804866.JavaMail.jira@brutus>
In-Reply-To: <930258889.1220465144195.JavaMail.jira@brutus>
Subject: [jira] Commented: (HADOOP-4061) Large number of decommission freezes the Namenode
Mailing-List: contact core-dev-help@hadoop.apache.org; run by ezmlm

    [ https://issues.apache.org/jira/browse/HADOOP-4061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649468#action_12649468 ]

Konstantin Shvachko commented on HADOOP-4061:
---------------------------------------------

- You changed the time units of {{dfs.namenode.decommission.interval}} from minutes to seconds. Is this going to be a problem for those who already use this config variable? If they set it to 5 (currently meaning 5 minutes), then after your patch the check will run every 5 seconds.
- Do we want to introduce a {{DecommissionMonitor decommissionManager}} member in {{FSNamesystem}}? Then we would be able to move all decommissioning logic into {{DecommissionMonitor}} (or a manager), which is probably partly the goal of this patch.
-- {{FSNamesystem.checkDecommissionStateInternal()}} should be moved into {{DecommissionMonitor}};
-- same for {{startDecommission()}} and {{stopDecommission()}}.
- In {{isReplicationInProgress()}}, could you please rename {{decommissionBlocks}} to {{nodeBlocks}}? It has nothing to do with decommission and is confusing.

I think this throttling approach will solve the problem for now, but it is not ideal. Say, if you have 500,000 blocks rather than 30,000, then you will have to reconfigure the throttler to scan even fewer nodes.

Deleting already decommissioned blocks, as Raghu proposes, is also not very good. Until the node is shut down, its blocks can still be accessed for reads, and we don't want to change that.

I would rather go with an approach that counts down a decommissioning node's remaining blocks as they are replicated. Then there is no need to scan all blocks to verify that the node is decommissioned; just check the counter. We can keep the total block scan as a sanity check in {{stopDecommission()}}. The counter would also be a good indicator of how much decommissioning progress has been made at any moment. We should create a separate jira for these changes.
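The counter-based approach suggested above could look roughly like the following sketch. All class and method names here are illustrative, not Hadoop's actual API: the idea is simply to track the number of under-replicated blocks per decommissioning node, so that checking completion is O(1) instead of a full block scan.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of counter-based decommission tracking.
class DecommissionProgress {
    // node -> number of blocks still waiting to reach full replication
    private final Map<String, Integer> pendingBlocks = new HashMap<>();

    // Initialize the counter once, when decommissioning of the node starts.
    void startDecommission(String node, int blockCount) {
        pendingBlocks.put(node, blockCount);
    }

    // Called when one of the node's blocks finishes replicating elsewhere.
    void blockReplicated(String node) {
        pendingBlocks.computeIfPresent(node, (n, c) -> c - 1);
    }

    // O(1) check instead of scanning every block on the node.
    boolean isDecommissioned(String node) {
        Integer c = pendingBlocks.get(node);
        return c != null && c <= 0;
    }
}
```

A full block scan would then only run once, inside {{stopDecommission()}}, as the sanity check mentioned above.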
> Large number of decommission freezes the Namenode
> -------------------------------------------------
>
>                 Key: HADOOP-4061
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4061
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.17.2
>            Reporter: Koji Noguchi
>            Assignee: Tsz Wo (Nicholas), SZE
>         Attachments: 4061_20081119.patch
>
>
> On a 1900-node cluster, we tried decommissioning 400 nodes with 30k blocks each. The other 1500 nodes were almost empty.
> When decommission started, the namenode's queue overflowed every 6 minutes.
> Looking at the CPU usage, it showed that every 5 minutes the org.apache.hadoop.dfs.FSNamesystem$DecommissionedMonitor thread was taking 100% of the CPU for 1 minute, causing the queue to overflow.
> {noformat}
> public synchronized void decommissionedDatanodeCheck() {
>   for (Iterator it = datanodeMap.values().iterator();
>        it.hasNext();) {
>     DatanodeDescriptor node = it.next();
>     checkDecommissionStateInternal(node);
>   }
> }
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
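The loop quoted in the issue description scans every datanode in one synchronized pass, which is what pins the CPU. The throttling idea discussed in the comments can be sketched as bounding the work per monitor run and resuming from where the previous run stopped. This is an illustrative sketch with hypothetical names, not the actual patch:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a throttled decommission check: each run examines at
// most nodesPerCheck datanodes, so the lock is held only briefly per run.
class ThrottledDecommissionCheck {
    private final List<String> nodes;   // stand-in for datanodeMap values
    private final int nodesPerCheck;
    private int cursor = 0;             // where the next run resumes

    ThrottledDecommissionCheck(List<String> nodes, int nodesPerCheck) {
        this.nodes = nodes;
        this.nodesPerCheck = nodesPerCheck;
    }

    // One run of the periodic monitor: bounded work instead of a full scan.
    // Returns the nodes checked in this run (for illustration).
    List<String> runOnce() {
        List<String> checked = new ArrayList<>();
        for (int i = 0; i < nodesPerCheck && !nodes.isEmpty(); i++) {
            checked.add(nodes.get(cursor));
            cursor = (cursor + 1) % nodes.size();
        }
        return checked;
    }
}
```

As the comment above notes, this trades latency for throughput: with many more blocks per node, the per-run bound has to be retuned, which is why the counter-based alternative was proposed.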