Date: Mon, 23 Nov 2015 21:00:12 +0000 (UTC)
From: "Kihwal Lee (JIRA)"
To: hdfs-issues@hadoop.apache.org
Reply-To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Commented] (HDFS-8676) Delayed rolling upgrade finalization can cause heartbeat expiration and write failures

    [ https://issues.apache.org/jira/browse/HDFS-8676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15023052#comment-15023052 ]

Kihwal Lee commented on HDFS-8676:
----------------------------------

bq. Does this issue exist in 2.6.x? Should this be backported to branch-2.6?

yes & yes, imo.

> Delayed rolling upgrade finalization can cause heartbeat expiration and write failures
> --------------------------------------------------------------------------------------
>
>                 Key: HDFS-8676
>                 URL: https://issues.apache.org/jira/browse/HDFS-8676
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Kihwal Lee
>            Assignee: Walter Su
>            Priority: Critical
>             Fix For: 3.0.0, 2.7.2
>
>         Attachments: HDFS-8676.01.patch, HDFS-8676.02.patch
>
>
> In big busy clusters where the deletion rate is also high, a lot of blocks can pile up in the datanode trash directories until an upgrade is finalized. When it is finally finalized, the deletion of trash is done in the service actor thread's context synchronously. This blocks the heartbeat and can cause heartbeat expiration.
> We have seen a namenode losing hundreds of nodes after a delayed upgrade finalization. The deletion of trash directories should be made asynchronous.
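
For illustration only, here is a minimal sketch of what "made asynchronous" could look like: the thread that sends heartbeats hands the trash directories to a background executor and returns right away. This is not the attached patch and not DataNode code. The class AsyncTrashCleaner and its finalizeUpgrade method are hypothetical names, and the sketch assumes trash is plain directories on the local filesystem.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Stream;

/** Hypothetical sketch; names are illustrative and not taken from HDFS. */
final class AsyncTrashCleaner implements AutoCloseable {

    // One daemon thread: deletions are serialized and never run on the caller.
    private final ExecutorService deleter = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "trash-deleter");
        t.setDaemon(true);
        return t;
    });

    /**
     * Intended to be called from the heartbeat thread. Only queues the work,
     * so the caller returns without waiting for any file to be deleted.
     */
    Future<?> finalizeUpgrade(List<Path> trashDirs) {
        return deleter.submit(() -> {
            for (Path dir : trashDirs) {
                deleteRecursively(dir);
            }
        });
    }

    private static void deleteRecursively(Path root) {
        if (!Files.exists(root)) {
            return;
        }
        try (Stream<Path> walk = Files.walk(root)) {
            // Reverse order visits children before their parent directory.
            walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.deleteIfExists(p);
                } catch (IOException e) {
                    System.err.println("Could not delete " + p + ": " + e);
                }
            });
        } catch (IOException e) {
            System.err.println("Could not walk " + root + ": " + e);
        }
    }

    /** Stops accepting new work; already queued deletions still run. */
    @Override
    public void close() {
        deleter.shutdown();
    }
}
```

A caller that needs to know when cleanup has finished can wait on the returned Future from some other thread. The heartbeat loop itself would ignore it and continue with the next heartbeat.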