Date: Tue, 6 Feb 2018 16:11:00 +0000 (UTC)
From: "Kihwal Lee (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Commented] (HDFS-13111) Close recovery may incorrectly mark blocks corrupt

    [ https://issues.apache.org/jira/browse/HDFS-13111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16354076#comment-16354076 ]

Kihwal Lee commented on HDFS-13111:
-----------------------------------

Here are the relevant log lines from the latest example. This node was under a heavy I/O load.
{noformat}
2018-02-06 00:00:03,413 [DataXceiver for client DFSClient_XXX at /1.2.3.4:57710] INFO Receiving BP-YYY:blk_7654321_1234567 src: /1.2.3.4:57710 dest: /1.2.3.5:1004
2018-02-06 00:09:58,840 [DataXceiver for client DFSClient_XXX at /1.2.3.4:57710] WARN Slow BlockReceiver write data to disk cost:462ms (threshold=300ms)
2018-02-06 00:10:40,148 [DataXceiver for client DFSClient_XXX at /1.2.3.4:57710] WARN Slow BlockReceiver write data to disk cost:11155ms (threshold=300ms)
2018-02-06 00:10:46,053 [DataXceiver for client DFSClient_XXX at /1.2.3.4:57710] WARN Slow BlockReceiver write data to disk cost:1577ms (threshold=300ms)
2018-02-06 00:11:02,376 [DataXceiver for client DFSClient_XXX at /1.2.3.4:57710] WARN Slow BlockReceiver write data to disk cost:327ms (threshold=300ms)
2018-02-06 00:11:53,064 [DataXceiver for client DFSClient_XXX at /1.2.3.4:40532] INFO Receiving BP-YYY:blk_7654321_1234567 src: /1.2.3.4:40532 dest: /1.2.3.5:1004
2018-02-06 00:12:09,782 [DataXceiver for client DFSClient_XXX at /1.2.3.4:40532] INFO Recover failed close BP-YYY:blk_7654321_1234567
2018-02-06 00:12:13,081 [DataXceiver for client DFSClient_XXX at /1.2.3.7:46522] INFO Receiving BP-YYY:blk_7654321_1234567 src: /1.2.3.7:46522 dest: /1.2.3.5:1004
2018-02-06 00:12:13,081 [DataXceiver for client DFSClient_XXX at /1.2.3.7:46522] INFO Recover failed close BP-YYY:blk_7654321_1234567
2018-02-06 00:12:17,276 [DataXceiver for client DFSClient_XXX at /1.2.3.4:40532] WARN Lock held time above threshold: lock identifier: org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl lockHeldTimeMs=7492 ms. Suppressed 0 lock warnings. The stack trace is: java.lang.Thread.getStackTrace(Thread.java:1556) ... // it was recoverClose()
2018-02-06 00:12:17,276 [DataXceiver for client DFSClient_XXX at /1.2.3.7:46522] INFO Received BP-YYY:blk_7654321_1135832806836 src: /1.2.3.7:46522 dest: /1.2.3.5:1004 of size xx
2018-02-06 00:12:20,103 [DataXceiver for client DFSClient_XXX at /1.2.3.4:40532] INFO Received BP-YYY:blk_7654321_1135832805246 src: /1.2.3.4:40532 dest: /1.2.3.5:1004 of size xx
2018-02-06 00:12:38,353 [PacketResponder: BP-YYY:blk_7654321_1234567, type=LAST_IN_PIPELINE] INFO DataNode.clienttrace: src: /1.2.3.4:57710, dest: /1.2.3.5:1004, bytes: 134217728, op: HDFS_WRITE, cliID: DFSClient_XXX, offset: 0, srvID: ZZZ, blockid: BP-YYY:blk_7654321_1234567, duration: looong
{noformat}

Note the client port numbers, which identify the individual writer threads. After two "successful" {{recoverClose()}} calls, the original writer comes around and also declares success. This must be what caused the reported generation stamp to go backward; the replica on disk actually had the latest one. This clearly illustrates that it is wrong to time out waiting for the writer to terminate and then continue with the recovery anyway. (A minimal sketch of this race follows the quoted issue summary below.)

> Close recovery may incorrectly mark blocks corrupt
> --------------------------------------------------
>
>                 Key: HDFS-13111
>                 URL: https://issues.apache.org/jira/browse/HDFS-13111
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.8.0
>            Reporter: Daryn Sharp
>            Priority: Critical
>
> Close recovery can leave a block marked corrupt until the next FBR arrives from one of the DNs. The reason is unclear, but it has happened multiple times when a DN has I/O-saturated disks.
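To make the race concrete, here is a minimal, hypothetical sketch of what "time out on writer termination and continue" looks like. The class, field, and thread names are made up (this is not the actual DataNode code path), and the timings and generation stamps are borrowed from the log above only for illustration.

{code:java}
// Hypothetical, self-contained sketch -- not the Hadoop code. It illustrates the
// race described above: close recovery interrupts the writer, waits only a bounded
// time, and then proceeds even though the writer is still alive and will report
// success later with a stale generation stamp.
public class CloseRecoveryRaceSketch {

    // Stand-in for the replica state that the dataset lock guards.
    private static final Object DATASET_LOCK = new Object();
    private static volatile long replicaGenStamp = 1234567L; // genstamp from the log above

    public static void main(String[] args) throws InterruptedException {
        // Original writer: a flush to a saturated disk. Plain file I/O is not
        // interruptible, so the interrupt below only sets a flag and the thread
        // keeps "writing" until the slow I/O completes.
        Thread writer = new Thread(() -> {
            long end = System.currentTimeMillis() + 5_000; // stands in for the ~11s slow flush
            while (System.currentTimeMillis() < end) {
                // busy "write" loop standing in for uninterruptible disk I/O
            }
            synchronized (DATASET_LOCK) {
                System.out.println("original writer also reports success, genstamp 1234567");
            }
        }, "original-writer");
        writer.start();

        // Close-recovery path: interrupt the writer and wait only a bounded time.
        writer.interrupt();
        writer.join(1_000); // bounded wait -- the step the comment argues against
        if (writer.isAlive()) {
            // The writer has NOT terminated, but recovery continues anyway and bumps
            // the genstamp, mirroring the "Recover failed close" lines above.
            synchronized (DATASET_LOCK) {
                replicaGenStamp = 1135832805246L;
                System.out.println("recovery reports success, genstamp " + replicaGenStamp);
            }
        }

        writer.join(); // eventually the original writer finishes and "succeeds" too,
                       // so the last reported genstamp is older than the on-disk one
    }
}
{code}

If the recovery path instead waited until the writer had really terminated (an unbounded {{join()}} in this sketch) before touching the replica, only one of the two paths could report success for the block, which is the behaviour the comment argues for.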