From: Brahma Reddy Battula
To: hdfs-dev@hadoop.apache.org
Subject: Issue in handling checksum errors in write pipeline
Date: Sat, 30 Jul 2016 11:03:59 +0000

Hello,

We came across an issue where a write fails even though 7 DataNodes are available, due to a network fault at one datanode which is LAST_IN_PIPELINE. It is similar to HDFS-6937.

Scenario (DN3 has a network fault, min replication = 2):

Write pipeline: DN1 -> DN2 -> DN3  =>  DN3 gives an ERROR_CHECKSUM ack,
and so DN2 is marked as bad.

DN1 -> DN4 -> DN3  =>  DN3 gives an ERROR_CHECKSUM ack, and so DN4 is marked as bad.

... and so on (each time, DN3 is LAST_IN_PIPELINE) ... This continues until there are no more datanodes left to construct the pipeline.

We are thinking we can handle it like this: instead of throwing an IOException for an ERROR_CHECKSUM ack from downstream, we can send back the pipeline ack, and on the client side replace both DN2 and DN3 with new nodes, since we can't decide which one has the network problem.

Please give your views on the possible fix.

--Brahma Reddy Battula
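P.S. To make the proposal concrete, here is a minimal, self-contained sketch of the replace-both policy (plain Java for illustration only; this is not actual DFSOutputStream/DataStreamer code, and the class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the recovery idea above: when the downstream ack reports
// ERROR_CHECKSUM at position i, exclude BOTH the reporting node and its
// immediate upstream neighbour, since the client cannot tell which of
// the two is the one with the network problem.
public class PipelineRecoverySketch {

    /**
     * Returns the nodes to exclude from the next pipeline attempt.
     * pipeline   - datanodes in write order, e.g. [DN1, DN2, DN3]
     * errorIndex - index of the node that reported ERROR_CHECKSUM
     */
    static List<String> nodesToExclude(List<String> pipeline, int errorIndex) {
        List<String> excluded = new ArrayList<>();
        if (errorIndex > 0) {
            // Upstream neighbour may have corrupted the data in transit.
            excluded.add(pipeline.get(errorIndex - 1));
        }
        // The reporter itself may have the faulty NIC (the DN3 case above).
        excluded.add(pipeline.get(errorIndex));
        return excluded;
    }

    public static void main(String[] args) {
        List<String> pipeline = List.of("DN1", "DN2", "DN3");
        // DN3 (index 2) reports ERROR_CHECKSUM: exclude DN2 and DN3,
        // instead of only DN2 as the current behaviour does.
        System.out.println(nodesToExclude(pipeline, 2)); // prints [DN2, DN3]
    }
}
```

With this, the second attempt would be built from DN1 plus two fresh nodes, so a single datanode with a bad NIC can no longer exhaust the whole cluster one replacement at a time.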