Date: Thu, 26 May 2016 08:06:12 +0000 (UTC)
From: "Yongjun Zhang (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Updated] (HDFS-6937) Another issue in handling checksum errors in write pipeline

[ https://issues.apache.org/jira/browse/HDFS-6937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yongjun Zhang updated HDFS-6937:
--------------------------------
    Attachment: HDFS-6937.001.patch

> Another issue in handling checksum errors in write pipeline
> -----------------------------------------------------------
>
>                 Key: HDFS-6937
>                 URL: https://issues.apache.org/jira/browse/HDFS-6937
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, hdfs-client
>    Affects Versions: 2.5.0
>            Reporter: Yongjun Zhang
>            Assignee: Yongjun Zhang
>         Attachments: HDFS-6937.001.patch
>
>
> Given a write pipeline:
> DN1 -> DN2 -> DN3
> DN3 detects a checksum error and terminates; DN2 truncates its replica to the ACKed size. Then a new pipeline is attempted as
> DN1 -> DN2 -> DN4
> DN4 detects a checksum error again. Later, when DN4 is replaced with DN5 (and so on), the pipeline fails for the same reason. This led to the observation that DN2's data is corrupted.
> The software currently truncates DN2's replica to the ACKed size after DN3 terminates, but it doesn't check the correctness of the data already written to disk.
> So intuitively, a solution would be: when the downstream DN (DN3 here) finds a checksum error, it propagates this info back to the upstream DN (DN2 here); DN2 then checks the correctness of the data already written to disk and truncates the replica to MIN(correctDataSize, ACKedSize).
> This issue is similar to what was reported in HDFS-3875, and the truncation at DN2 was actually introduced as part of the HDFS-3875 solution.
> Filing this jira for the issue reported here. HDFS-3875 was filed by [~tlipcon], and he proposed something similar there.
> {quote}
> if the tail node in the pipeline detects a checksum error, then it returns a special error code back up the pipeline indicating this (rather than just disconnecting)
> if a non-tail node receives this error code, then it immediately scans its own block on disk (from the beginning up through the last acked length). If it detects a corruption on its local copy, then it should assume that it is the faulty one, rather than the downstream neighbor.
> If it detects no corruption, then the faulty node is either the downstream mirror or the network link between the two, and the current behavior is reasonable.
> {quote}
> Thanks.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org
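
The MIN(correctDataSize, ACKedSize) rule proposed in the description can be illustrated with a small standalone sketch. It is not taken from HDFS-6937.001.patch or from any HDFS code: the class ReplicaTruncationSketch, its method names, the in-memory byte array standing in for the on-disk replica, and the use of plain CRC32 with one checksum per fixed-size chunk are all assumptions made for illustration. The sketch scans chunks from the start of the block up through the ACKed length, stops at the first chunk whose checksum does not match, and truncates to the smaller of the verified length and the ACKed length.

```java
import java.util.zip.CRC32;

// Hypothetical illustration only; none of these names exist in HDFS.
final class ReplicaTruncationSketch {

    // CRC32 over data[off, off + len).
    static long crcOf(byte[] data, int off, int len) {
        CRC32 crc = new CRC32();
        crc.update(data, off, len);
        return crc.getValue();
    }

    // One checksum per chunk; stands in for the checksums stored with a replica.
    static long[] checksums(byte[] data, int bytesPerChecksum) {
        int chunks = (data.length + bytesPerChecksum - 1) / bytesPerChecksum;
        long[] sums = new long[chunks];
        for (int i = 0; i < chunks; i++) {
            int off = i * bytesPerChecksum;
            int len = Math.min(bytesPerChecksum, data.length - off);
            sums[i] = crcOf(data, off, len);
        }
        return sums;
    }

    // Number of leading bytes whose chunks all verify. Only chunks that
    // start before ackedLength are scanned; the scan stops at the first mismatch.
    static long correctDataSize(byte[] onDisk, long[] expected,
                                int bytesPerChecksum, long ackedLength) {
        long good = 0;
        for (int i = 0; i < expected.length; i++) {
            int off = i * bytesPerChecksum;
            if (off >= onDisk.length || off >= ackedLength) {
                break;
            }
            int len = Math.min(bytesPerChecksum, onDisk.length - off);
            if (crcOf(onDisk, off, len) != expected[i]) {
                break; // first corrupt chunk: everything from here on is suspect
            }
            good = off + len;
        }
        return good;
    }

    // MIN(correctDataSize, ACKedSize), as proposed in the description.
    static long safeTruncateLength(byte[] onDisk, long[] expected,
                                   int bytesPerChecksum, long ackedLength) {
        return Math.min(
            correctDataSize(onDisk, expected, bytesPerChecksum, ackedLength),
            ackedLength);
    }
}
```

Under these assumptions, with 512-byte chunks, a 1536-byte replica whose second chunk is corrupted and an ACKed length of 1200 would be truncated to 512 bytes, while an uncorrupted replica would be truncated to the ACKed length of 1200 as before. Note that the sketch is conservative at chunk granularity: a corruption anywhere in a chunk that overlaps the ACKed range discards that whole chunk.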