Date: Tue, 8 Aug 2017 23:58:00 +0000 (UTC)
From: "Wei-Chiu Chuang (JIRA)" <jira@apache.org>
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Commented] (HDFS-11160) VolumeScanner reports write-in-progress replicas as corrupt incorrectly

    [ https://issues.apache.org/jira/browse/HDFS-11160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16119207#comment-16119207 ]

Wei-Chiu Chuang commented on HDFS-11160:
----------------------------------------

Hi [~wheat9], an alternative approach is to add a retry on the client side, so that if the client encounters a checksum error it retries the read, eliminating the false positive caused by the race condition. I don't mind reverting it from the 2.8 branch if that makes large Hadoop operators less concerned about the release.
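A minimal sketch of the client-side retry idea, assuming a hypothetical readWithRetry helper around FSDataInputStream; this is not DFSClient's actual retry logic, only an illustration of re-attempting a positional read when a ChecksumException might be a false positive caused by a concurrent append:

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ChecksumException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumRetryExample {

  // Hypothetical helper: re-attempt a positional read a few times when a
  // ChecksumException surfaces, on the assumption that a transient race
  // (e.g. an in-progress append) produced a false positive.
  static int readWithRetry(FSDataInputStream in, long pos, byte[] buf,
                           int maxAttempts) throws IOException {
    IOException last = new IOException("maxAttempts must be >= 1");
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        // Positional read; does not move the stream's current offset.
        return in.read(pos, buf, 0, buf.length);
      } catch (ChecksumException ce) {
        last = ce;  // retry: the replica may have been mid-append
      }
    }
    throw last;  // still failing after retries: likely genuine corruption
  }

  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    try (FSDataInputStream in = fs.open(new Path(args[0]))) {
      byte[] buf = new byte[4096];
      int n = readWithRetry(in, 0L, buf, 3);
      System.out.println("read " + n + " bytes");
    }
  }
}
{code}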
> VolumeScanner reports write-in-progress replicas as corrupt incorrectly
> ------------------------------------------------------------------------
>
>                 Key: HDFS-11160
>                 URL: https://issues.apache.org/jira/browse/HDFS-11160
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>         Environment: CDH5.7.4
>            Reporter: Wei-Chiu Chuang
>            Assignee: Wei-Chiu Chuang
>             Fix For: 2.8.0, 2.7.4, 3.0.0-alpha2
>
>         Attachments: HDFS-11160.001.patch, HDFS-11160.002.patch, HDFS-11160.003.patch, HDFS-11160.004.patch, HDFS-11160.005.patch, HDFS-11160.006.patch, HDFS-11160.007.patch, HDFS-11160.008.patch, HDFS-11160.branch-2.patch, HDFS-11160.reproduce.patch
>
>
> Due to a race condition initially reported in HDFS-6804, VolumeScanner may erroneously flag good replicas as corrupt. This is serious because it can lead to data loss if all replicas of a block are declared corrupt. The bug is especially prominent when there are many append requests via HttpFs/WebHDFS.
> We are investigating an incident that caused a very high block corruption rate in a relatively small cluster. Initially we thought HDFS-11056 was to blame, but even after applying HDFS-11056 we still saw VolumeScanner report corrupt replicas.
> It turns out that if a replica is being appended to while VolumeScanner is scanning it, VolumeScanner may compare the new checksum against the old data, causing a checksum mismatch.
> I have a unit test that reproduces the error and will attach it later. A quick and simple fix is to hold the FsDatasetImpl lock while reading the checksum from disk.
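An illustrative sketch of the "hold the lock while reading the checksum" idea from the description. The names ReplicaOnDisk, loadLastChecksumFromDisk, and datasetLock are hypothetical stand-ins, not the actual FsDatasetImpl API or the attached patch; the point is only that the scanner snapshots the on-disk checksum and the matching visible length under the same lock an appender must hold, so it never pairs a new checksum with old data:

{code:java}
import java.io.IOException;

public class ScannerChecksumSnapshot {

  // Immutable view of a replica taken at a single point in time.
  static class Snapshot {
    final long visibleLength;
    final byte[] lastChunkChecksum;
    Snapshot(long len, byte[] sum) {
      visibleLength = len;
      lastChunkChecksum = sum;
    }
  }

  // Hypothetical abstraction over an on-disk replica.
  interface ReplicaOnDisk {
    long getVisibleLength();
    byte[] loadLastChecksumFromDisk() throws IOException;
  }

  // Stands in for the FsDatasetImpl lock in this sketch.
  private final Object datasetLock = new Object();

  Snapshot snapshotForScan(ReplicaOnDisk replica) throws IOException {
    synchronized (datasetLock) {
      // Both reads happen atomically with respect to an in-progress append,
      // which must also hold datasetLock while updating data and checksum.
      long len = replica.getVisibleLength();
      byte[] sum = replica.loadLastChecksumFromDisk();
      return new Snapshot(len, sum);
    }
  }
}
{code}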