Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 60BE2200BDC for ; Wed, 14 Dec 2016 23:21:00 +0100 (CET) Received: by cust-asf.ponee.io (Postfix) id 5F512160B34; Wed, 14 Dec 2016 22:21:00 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id A89D4160B0D for ; Wed, 14 Dec 2016 23:20:59 +0100 (CET) Received: (qmail 58379 invoked by uid 500); 14 Dec 2016 22:20:58 -0000 Mailing-List: contact hdfs-issues-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list hdfs-issues@hadoop.apache.org Received: (qmail 58324 invoked by uid 99); 14 Dec 2016 22:20:58 -0000 Received: from arcas.apache.org (HELO arcas) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 14 Dec 2016 22:20:58 +0000 Received: from arcas.apache.org (localhost [127.0.0.1]) by arcas (Postfix) with ESMTP id 918652C03E6 for ; Wed, 14 Dec 2016 22:20:58 +0000 (UTC) Date: Wed, 14 Dec 2016 22:20:58 +0000 (UTC) From: "Wei-Chiu Chuang (JIRA)" To: hdfs-issues@hadoop.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Updated] (HDFS-11160) VolumeScanner reports write-in-progress replicas as corrupt incorrectly MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 archived-at: Wed, 14 Dec 2016 22:21:00 -0000 [ https://issues.apache.org/jira/browse/HDFS-11160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei-Chiu Chuang updated HDFS-11160: ----------------------------------- Attachment: HDFS-11160.007.patch Thanks [~xiaochen] Yes indeed looks like I can remove the redundancy in test code. Submit patch v007 for precommit check. > VolumeScanner reports write-in-progress replicas as corrupt incorrectly > ----------------------------------------------------------------------- > > Key: HDFS-11160 > URL: https://issues.apache.org/jira/browse/HDFS-11160 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode > Environment: CDH5.7.4 > Reporter: Wei-Chiu Chuang > Assignee: Wei-Chiu Chuang > Attachments: HDFS-11160.001.patch, HDFS-11160.002.patch, HDFS-11160.003.patch, HDFS-11160.004.patch, HDFS-11160.005.patch, HDFS-11160.006.patch, HDFS-11160.007.patch, HDFS-11160.reproduce.patch > > > Due to a race condition initially reported in HDFS-6804, VolumeScanner may erroneously detect good replicas as corrupt. This is serious because in some cases it results in data loss if all replicas are declared corrupt. This bug is especially prominent when there are a lot of append requests via HttpFs/WebHDFS. > We are investigating an incidence that caused very high block corruption rate in a relatively small cluster. Initially, we thought HDFS-11056 is to blame. However, after applying HDFS-11056, we are still seeing VolumeScanner reporting corrupt replicas. > It turns out that if a replica is being appended while VolumeScanner is scanning it, VolumeScanner may use the new checksum to compare against old data, causing checksum mismatch. > I have a unit test to reproduce the error. Will attach later. A quick and simple fix is to hold FsDatasetImpl lock and read from disk the checksum. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org