Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 0E90A200D2C for ; Sun, 15 Oct 2017 04:48:10 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id 0D056160BE6; Sun, 15 Oct 2017 02:48:10 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 2B27E1609D5 for ; Sun, 15 Oct 2017 04:48:09 +0200 (CEST) Received: (qmail 96440 invoked by uid 500); 15 Oct 2017 02:48:08 -0000 Mailing-List: contact hdfs-issues-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list hdfs-issues@hadoop.apache.org Received: (qmail 96428 invoked by uid 99); 15 Oct 2017 02:48:08 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd3-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Sun, 15 Oct 2017 02:48:08 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd3-us-west.apache.org (ASF Mail Server at spamd3-us-west.apache.org) with ESMTP id 5D1F6180860 for ; Sun, 15 Oct 2017 02:48:07 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd3-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: -100.002 X-Spam-Level: X-Spam-Status: No, score=-100.002 tagged_above=-999 required=6.31 tests=[RP_MATCHES_RCVD=-0.001, SPF_PASS=-0.001, USER_IN_WHITELIST=-100] autolearn=disabled Received: from mx1-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd3-us-west.apache.org [10.40.0.10]) (amavisd-new, port 10024) with ESMTP id JBH-PIYzMZvn for ; Sun, 15 Oct 2017 02:48:05 +0000 (UTC) Received: from mailrelay1-us-west.apache.org (mailrelay1-us-west.apache.org [209.188.14.139]) by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with ESMTP id A6EBB5FC55 for ; Sun, 15 Oct 2017 02:48:04 +0000 (UTC) Received: from jira-lw-us.apache.org (unknown [207.244.88.139]) by mailrelay1-us-west.apache.org (ASF Mail Server at mailrelay1-us-west.apache.org) with ESMTP id 384BCE0635 for ; Sun, 15 Oct 2017 02:48:02 +0000 (UTC) Received: from jira-lw-us.apache.org (localhost [127.0.0.1]) by jira-lw-us.apache.org (ASF Mail Server at jira-lw-us.apache.org) with ESMTP id 192212140B for ; Sun, 15 Oct 2017 02:48:01 +0000 (UTC) Date: Sun, 15 Oct 2017 02:48:00 +0000 (UTC) From: "Xiao Chen (JIRA)" To: hdfs-issues@hadoop.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (HDFS-12642) Log block and datanode details in BlockRecoveryWorker MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 archived-at: Sun, 15 Oct 2017 02:48:10 -0000 [ https://issues.apache.org/jira/browse/HDFS-12642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16204993#comment-16204993 ] Xiao Chen commented on HDFS-12642: ---------------------------------- Thanks Yongjun for the review. Patch 2 addressed the comments. New example logs for convenience: {noformat} 2017-10-14 19:43:25,345 [org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@40bdcea2] INFO datanode.DataNode (BlockRecoveryWorker.java:logRecoverBlock(549)) - BlockRecoveryWorker: NameNode at localhost/127.0.0.1:52885 calls recoverBlock(BP-342842897-10.0.0.51-1508035399504:blk_1073741827_1003, targets=[DatanodeInfoWithStorage[127.0.0.1:52896,null,null], DatanodeInfoWithStorage[127.0.0.1:52901,null,null], DatanodeInfoWithStorage[127.0.0.1:52888,null,null]], newGenerationStamp=1004, newBlock=null, isStriped=false) 2017-10-14 19:43:25,368 [org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@40bdcea2] INFO datanode.DataNode (BlockRecoveryWorker.java:syncBlock(200)) - BlockRecoveryWorker: block=BP-342842897-10.0.0.51-1508035399504:blk_1073741827_1003 (length=952), isTruncateRecovery=false, syncList=[block:blk_1073741827_1003[numBytes=952,originalReplicaState=FINALIZED] node:DatanodeInfoWithStorage[127.0.0.1:52896,null,null], block:blk_1073741827_1003[numBytes=952,originalReplicaState=FINALIZED] node:DatanodeInfoWithStorage[127.0.0.1:52901,null,null], block:blk_1073741827_1003[numBytes=952,originalReplicaState=FINALIZED] node:DatanodeInfoWithStorage[127.0.0.1:52888,null,null]] 2017-10-14 19:43:25,369 [org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@40bdcea2] INFO datanode.DataNode (BlockRecoveryWorker.java:syncBlock(291)) - BlockRecoveryWorker: block=BP-342842897-10.0.0.51-1508035399504:blk_1073741827_1003 (length=952), bestState=FINALIZED, newBlock=BP-342842897-10.0.0.51-1508035399504:blk_1073741827_1004 (length=952), participatingList=[block:blk_1073741827_1003[numBytes=952,originalReplicaState=FINALIZED] node:DatanodeInfoWithStorage[127.0.0.1:52896,null,null], block:blk_1073741827_1003[numBytes=952,originalReplicaState=FINALIZED] node:DatanodeInfoWithStorage[127.0.0.1:52901,null,null], block:blk_1073741827_1003[numBytes=952,originalReplicaState=FINALIZED] node:DatanodeInfoWithStorage[127.0.0.1:52888,null,null]] {noformat} > Log block and datanode details in BlockRecoveryWorker > ----------------------------------------------------- > > Key: HDFS-12642 > URL: https://issues.apache.org/jira/browse/HDFS-12642 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode > Reporter: Xiao Chen > Assignee: Xiao Chen > Attachments: HDFS-12642.01.patch > > > In a recent investigation, we have seen a weird block recovery issue, which is difficult to reach to a conclusion because of insufficient logs. > For the most critical part of the events, we see block recovery failed to {{commitBlockSynchronization]} on the NN, due to the block not closed. This leaves the file as open forever (for 1+ months). > The reason the block was not closed on NN, was because it is configured with {{dfs.namenode.replication.min}} =2, and only 1 replica was with the latest genstamp. > We were not able to tell why only 1 replica is on latest genstamp. > From the primary node of the recovery (ps2204), {{initReplicaRecoveryImpl}} was called on each of the 7 DNs the block were ever placed. All DNs but ps2204 and ps3765 failed because of genstamp comparison - that's expected. ps2204 and ps3765 have gone past the comparison (since no exceptions from their logs), but {{updateReplicaUnderRecovery}} only appeared to be called on ps3765. > This jira is to propose we log more details when {{BlockRecoveryWorker}} is about to call {{updateReplicaUnderRecovery}} on the DataNodes, so this could be figured out in the future. > {noformat} > $ grep "updateReplica:" ps2204.dn.log > $ grep "updateReplica:" ps3765.dn.log > hadoop-hdfs-datanode-ps3765.log.2:{"@timestamp":"2017-09-13T00:56:20.933Z","source_host":"ps3765.example.com","file":"FsDatasetImpl.java","method":"updateReplicaUnderRecovery","level":"INFO","line_number":"2512","thread_name":"IPC Server handler 6 on 50020","@version":1,"logger_name":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","message":"updateReplica: BP-550436645-17.142.147.13-1438988035284:blk_2172795728_1106150312, recoveryId=1107074793, length=65024, replica=ReplicaUnderRecovery, blk_2172795728_1106150312, RUR > $ grep "initReplicaRecovery:" ps2204.dn.log > hadoop-hdfs-datanode-ps2204.log.1:{"@timestamp":"2017-09-13T00:56:20.691Z","source_host":"ps2204.example.com","file":"FsDatasetImpl.java","method":"initReplicaRecoveryImpl","level":"INFO","line_number":"2441","thread_name":"org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@5ae3cb26","@version":1,"logger_name":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","message":"initReplicaRecovery: blk_2172795728_1106150312, recoveryId=1107074793, replica=ReplicaWaitingToBeRecovered, blk_2172795728_1106150312, RWR > hadoop-hdfs-datanode-ps2204.log.1:{"@timestamp":"2017-09-13T00:56:20.691Z","source_host":"ps2204.example.com","file":"FsDatasetImpl.java","method":"initReplicaRecoveryImpl","level":"INFO","line_number":"2497","thread_name":"org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@5ae3cb26","@version":1,"logger_name":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","message":"initReplicaRecovery: changing replica state for blk_2172795728_1106150312 from RWR to RUR","class":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","mdc":{}} > $ grep "initReplicaRecovery:" ps3765.dn.log > hadoop-hdfs-datanode-ps3765.log.2:{"@timestamp":"2017-09-13T00:56:20.457Z","source_host":"ps3765.example.com","file":"FsDatasetImpl.java","method":"initReplicaRecoveryImpl","level":"INFO","line_number":"2441","thread_name":"IPC Server handler 5 on 50020","@version":1,"logger_name":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","message":"initReplicaRecovery: blk_2172795728_1106150312, recoveryId=1107074793, replica=ReplicaBeingWritten, blk_2172795728_1106150312, RBW > hadoop-hdfs-datanode-ps3765.log.2:{"@timestamp":"2017-09-13T00:56:20.457Z","source_host":"ps3765.example.com","file":"FsDatasetImpl.java","method":"initReplicaRecoveryImpl","level":"INFO","line_number":"2441","thread_name":"IPC Server handler 5 on 50020","@version":1,"logger_name":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","message":"initReplicaRecovery: blk_2172795728_1106150312, recoveryId=1107074793, replica=ReplicaBeingWritten, blk_2172795728_1106150312, RBW > hadoop-hdfs-datanode-ps3765.log.2:{"@timestamp":"2017-09-13T00:56:20.457Z","source_host":"ps3765.example.com","file":"FsDatasetImpl.java","method":"initReplicaRecoveryImpl","level":"INFO","line_number":"2497","thread_name":"IPC Server handler 5 on 50020","@version":1,"logger_name":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","message":"initReplicaRecovery: changing replica state for blk_2172795728_1106150312 from RBW to RUR","class":"org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl","mdc":{}} > {noformat} > P.S. HDFS-11499 was once suspected, but non-conclusive since we don't have all NN logs to know about the decommission. -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org