Date: Thu, 16 Jul 2015 19:02:10 +0000 (UTC)
From: "Andrew Wang (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Commented] (HDFS-8776) Decom manager should not be active on standby

[ https://issues.apache.org/jira/browse/HDFS-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630185#comment-14630185 ]

Andrew Wang commented on HDFS-8776:
-----------------------------------

Related question to Ming's: this means node state could flip back to decom-in-progress from decommissioned on failover, right? I guess this can happen anyway, but it is less likely in the current world.

I'm also curious whether this is ameliorated by the decom manager rewrite in HDFS-7411, since in-progress scans are incremental there. We could also invest some more effort in making it fully incremental if that would help.
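To illustrate why an incremental scan helps with the lock-hold problem described in the issue, here is a minimal sketch, not the HDFS-7411 code. Every name in it is hypothetical: `IncrementalDecomScanSketch`, `blocksPerTick`, and `isUnderReplicated` are illustrative, a plain `ReentrantReadWriteLock` stands in for the namesystem lock, and the under-replication check is a placeholder. Each tick checks at most `blocksPerTick` blocks under the write lock and then releases it, so heartbeats and IBRs can be processed between ticks.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class IncrementalDecomScanSketch {
    // Stand-in for the namesystem lock (hypothetical, not the HDFS class).
    private final ReentrantReadWriteLock nsLock = new ReentrantReadWriteLock();
    // Blocks on the decommissioning node that have not been checked yet.
    private final Deque<Long> pending = new ArrayDeque<>();
    // Upper bound on work done per lock acquisition (hypothetical parameter).
    private final int blocksPerTick;
    // Blocks found to still need replication.
    private final List<Long> needsReplication = new ArrayList<>();

    public IncrementalDecomScanSketch(List<Long> blocks, int blocksPerTick) {
        this.pending.addAll(blocks);
        this.blocksPerTick = blocksPerTick;
    }

    // One monitor tick: check at most blocksPerTick blocks while holding the lock,
    // then release it. Returns true once every block has been checked.
    public boolean tick() {
        nsLock.writeLock().lock();
        try {
            for (int i = 0; i < blocksPerTick && !pending.isEmpty(); i++) {
                long blockId = pending.poll();
                if (isUnderReplicated(blockId)) {
                    needsReplication.add(blockId);
                }
            }
            return pending.isEmpty();
        } finally {
            nsLock.writeLock().unlock();
        }
    }

    // Placeholder check for the sketch: treats even block IDs as under-replicated.
    private boolean isUnderReplicated(long blockId) {
        return blockId % 2 == 0;
    }

    public List<Long> getNeedsReplication() {
        return needsReplication;
    }

    public static void main(String[] args) {
        List<Long> blocks = new ArrayList<>();
        for (long b = 1; b <= 10; b++) {
            blocks.add(b);
        }
        IncrementalDecomScanSketch scan = new IncrementalDecomScanSketch(blocks, 4);
        int ticks = 0;
        boolean done = false;
        while (!done) {
            done = scan.tick();
            ticks++;
        }
        // prints "ticks=3 flagged=[2, 4, 6, 8, 10]"
        System.out.println("ticks=" + ticks + " flagged=" + scan.getNeedsReplication());
    }
}
```

The trade-off is latency for lock fairness: the whole node takes several ticks to scan, but no single tick holds the lock long enough to starve heartbeats.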
> Decom manager should not be active on standby
> ---------------------------------------------
>
>                 Key: HDFS-8776
>                 URL: https://issues.apache.org/jira/browse/HDFS-8776
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.0
>            Reporter: Daryn Sharp
>            Assignee: Daryn Sharp
>
> The decommission manager should not be actively processing on the standby.
> The decom manager goes through the costly computation of determining that every block on the node requires replication, yet doesn't queue them for replication because it's in standby. The decom manager holds the namesystem write lock, causing DNs to time out on heartbeats or IBRs; the NN purges the call queue of timed-out clients and processes some heartbeats/IBRs before the decom manager locks up the namesystem again. Nodes attempting to register will be sending full BRs, which are more costly to send and discard than a heartbeat.
> If a failover is required, the standby will likely have to struggle very hard to avoid GC while "catching up" on its queued IBRs as DNs continue to fill the call queue and time out.
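The fix the issue asks for amounts to a guard: skip the decommission scan unless the NameNode is active, since on standby the results are never queued for replication. A minimal sketch under that assumption follows; `StandbyAwareDecomMonitorSketch` and its local `HaState` enum are hypothetical stand-ins, not the actual HDFS classes or the committed patch. Note the consequence raised in the comment above: after failover the newly active NameNode has to start its scan from scratch, so a node it previously reported as decommissioned may transiently show as decom-in-progress again.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class StandbyAwareDecomMonitorSketch implements Runnable {
    // Hypothetical HA state, defined locally so the sketch is self-contained.
    public enum HaState { ACTIVE, STANDBY }

    private final Supplier<HaState> haState;
    private final Runnable scanWork;
    private final AtomicInteger skippedTicks = new AtomicInteger();

    public StandbyAwareDecomMonitorSketch(Supplier<HaState> haState, Runnable scanWork) {
        this.haState = haState;
        this.scanWork = scanWork;
    }

    @Override
    public void run() {
        // On standby nothing is queued for replication, so a scan would only take the
        // lock and compute throwaway results; skip this tick instead.
        if (haState.get() != HaState.ACTIVE) {
            skippedTicks.incrementAndGet();
            return;
        }
        scanWork.run();
    }

    public int getSkippedTicks() {
        return skippedTicks.get();
    }

    public static void main(String[] args) {
        AtomicInteger scans = new AtomicInteger();
        HaState[] state = { HaState.STANDBY };
        StandbyAwareDecomMonitorSketch monitor =
            new StandbyAwareDecomMonitorSketch(() -> state[0], scans::incrementAndGet);
        monitor.run();              // standby: tick is skipped
        state[0] = HaState.ACTIVE;  // simulate failover to this NameNode
        monitor.run();              // active: scan runs
        // prints "scans=1 skipped=1"
        System.out.println("scans=" + scans.get() + " skipped=" + monitor.getSkippedTicks());
    }
}
```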