Date: Tue, 9 Feb 2016 18:55:18 +0000 (UTC)
From: "Akira AJISAKA (JIRA)"
To: hdfs-issues@hadoop.apache.org
Reply-To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Updated] (HDFS-9752) Permanent write failures may happen to slow writers during datanode rolling upgrades

     [ https://issues.apache.org/jira/browse/HDFS-9752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Akira AJISAKA updated HDFS-9752:
--------------------------------
        Resolution: Fixed
      Hadoop Flags: Reviewed
     Fix Version/s: 2.6.5
                    2.7.3
                    2.8.0
            Status: Resolved  (was: Patch Available)

Committed the branch-2.7 patch. Thanks [~walter.k.su] for the contribution.

> Permanent write failures may happen to slow writers during datanode rolling upgrades
> -------------------------------------------------------------------------------------
>
>                 Key: HDFS-9752
>                 URL: https://issues.apache.org/jira/browse/HDFS-9752
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Kihwal Lee
>            Assignee: Walter Su
>            Priority: Critical
>             Fix For: 2.8.0, 2.7.3, 2.6.5
>
>       Attachments: HDFS-9752-branch-2.6.03.patch, HDFS-9752-branch-2.7.03.patch, HDFS-9752.01.patch, HDFS-9752.02.patch, HDFS-9752.03.patch, HdfsWriter.java
>
>
> When datanodes are being upgraded, an out-of-band ack is sent upstream and the client performs a pipeline recovery. The client may hit this multiple times as more nodes are upgraded. This normally does not cause any issue, but if the client holds the stream open without writing any data during this time, a permanent write failure can occur.
> This is because there is a limit of 5 recovery attempts for the same packet, tracked by the "last acked sequence number". Since the empty heartbeat packets for an idle output stream do not increment the sequence number, the write will fail after the client sees 5 pipeline breakages caused by datanode upgrades.
> This check/limit was added to avoid spinning until running out of nodes in the cluster due to corruption or other irrecoverable conditions. The datanode upgrade-restart should be excluded from the count.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
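For illustration, below is a minimal, hypothetical Java sketch of the retry-limit behaviour described in the issue. The class and member names (PipelineRecoveryTracker, onPacketAcked, onPipelineFailure) are invented for this example and are not the actual DFSOutputStream internals; the sketch only shows why a counter keyed to the last acked sequence number trips for an idle writer whose heartbeats never advance that sequence number.

// Hypothetical, simplified model of the retry limit described above.
// Names are illustrative only, not Hadoop's real client classes.
import java.io.IOException;

class PipelineRecoveryTracker {
    private static final int MAX_RECOVERY_ATTEMPTS = 5; // limit mentioned in the issue

    private long lastAckedSeqno = -1;          // seqno of the last acknowledged data packet
    private int recoveryAttemptsForSeqno = 0;  // recoveries seen without progress

    /** Called when a data packet (not a heartbeat) is acknowledged. */
    void onPacketAcked(long seqno) {
        if (seqno > lastAckedSeqno) {
            lastAckedSeqno = seqno;
            recoveryAttemptsForSeqno = 0; // progress was made, reset the counter
        }
    }

    /**
     * Called when the pipeline breaks, e.g. a datanode restarts for an upgrade.
     * Heartbeat packets on an idle stream never advance lastAckedSeqno, so
     * repeated upgrade restarts keep incrementing the same counter until the
     * write fails permanently.
     */
    void onPipelineFailure() throws IOException {
        recoveryAttemptsForSeqno++;
        if (recoveryAttemptsForSeqno > MAX_RECOVERY_ATTEMPTS) {
            throw new IOException("Giving up after " + MAX_RECOVERY_ATTEMPTS
                + " recovery attempts for seqno " + lastAckedSeqno);
        }
        // otherwise: rebuild the pipeline and continue writing
    }
}

The fix committed here excludes upgrade-triggered restarts from that count, so an idle writer is not penalised for recoveries it did not cause.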