Date: Mon, 24 Oct 2016 21:20:58 +0000 (UTC)
From: "Arpit Agarwal (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Commented] (HDFS-10857) Rolling upgrade can make data unavailable when the cluster has many failed volumes

    [ https://issues.apache.org/jira/browse/HDFS-10857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15603256#comment-15603256 ]

Arpit Agarwal commented on HDFS-10857:
--------------------------------------

Looks like {{checkDiskError}} should get the DataNode object
lock for the {{dataDirs}} modification to avoid a potential race with {{refreshVolumes}}.

> Rolling upgrade can make data unavailable when the cluster has many failed volumes
> ----------------------------------------------------------------------------------
>
>                 Key: HDFS-10857
>                 URL: https://issues.apache.org/jira/browse/HDFS-10857
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.6.4
>            Reporter: Kihwal Lee
>            Priority: Critical
>         Attachments: HDFS-10857.branch-2.6.patch
>
>
> When the marker file or trash dir is created or removed during the heartbeat response processing, an {{IOException}} is thrown if the operation is attempted on a failed volume. This stops processing of the remaining storage directories and of any DNA commands that were part of the heartbeat response.
> While this is happening, the block token key update does not happen and all read and write requests start to fail, until the upgrade is finalized and the DN receives a new key. All it takes is one failed volume. If there are three such nodes in the cluster, it is very likely that some blocks cannot be read. Unlike the common missing-block scenarios, the NN is not aware of the problem, although the effect is the same.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
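
The locking suggested in the comment can be illustrated with a minimal sketch. This is not the actual Hadoop code or the attached patch: {{DataNodeSketch}}, its method signatures, and the use of a plain string list for {{dataDirs}} are assumptions made only to show the idea that both code paths modify the shared list while holding the same object lock.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the DataNode; NOT the real Hadoop class.
class DataNodeSketch {
    // Stand-in for the shared dataDirs state that both methods modify.
    private final List<String> dataDirs = new ArrayList<>();

    DataNodeSketch(List<String> initialDirs) {
        dataDirs.addAll(initialDirs);
    }

    // Stand-in for refreshVolumes: replaces the directory list
    // while holding the object lock.
    synchronized void refreshVolumes(List<String> newDirs) {
        dataDirs.clear();
        dataDirs.addAll(newDirs);
    }

    // Stand-in for checkDiskError: any slow disk checking would run
    // before this point, outside the lock; only the modification of
    // dataDirs takes the same object lock as refreshVolumes.
    void checkDiskError(List<String> failedDirs) {
        synchronized (this) {
            dataDirs.removeAll(failedDirs);
        }
    }

    // Returns a copy so callers never see the list mid-update.
    synchronized List<String> snapshot() {
        return new ArrayList<>(dataDirs);
    }
}
```

Because both modifications are guarded by the same monitor, a concurrent {{refreshVolumes}} call in this sketch cannot interleave with the removal of failed directories. Whether the real {{checkDiskError}} can safely take the DataNode lock at that point depends on the surrounding code, which is not shown in this message.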