Rushabh S Shah created HDFS-8245:
------------------------------------

Summary: Standby namenode doesn't process DELETED_BLOCK if the add block request is in edit log.
Key: HDFS-8245
URL: https://issues.apache.org/jira/browse/HDFS-8245
Project: Hadoop HDFS
Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Rushabh S Shah

The following series of events happened on the Standby Namenode (SNN):

2015-04-09 07:47:21,735 \[Edit log tailer] INFO ha.EditLogTailer: Triggering log roll on remote NameNode Active Namenode (ANN)
2015-04-09 07:58:01,858 \[Edit log tailer] INFO ha.EditLogTailer: Triggering log roll on remote NameNode ANN

The following series of events happened on the Active Namenode (ANN):

2015-04-09 07:47:21,747 \[IPC Server handler 99 on 8020] INFO namenode.FSNamesystem: Roll Edit Log from Standby NN (SNN)
2015-04-09 07:58:01,868 \[IPC Server handler 18 on 8020] INFO namenode.FSNamesystem: Roll Edit Log from SNN

The following series of events happened on the datanode {color:red}datanodeA{color}:

2015-04-09 07:52:15,817 \[DataXceiver for client DFSClient_attempt_1428022041757_102831_r_000107_0_1139131345_1 at /xxxx:51078 \[Receiving block BP-595383232-xxxx-1360869396230:blk_1570321882_1102029183867]] INFO datanode.DataNode: Receiving BP-595383232-xxxx-1360869396230:blk_1570321882_1102029183867 src: /client:51078 dest: /{color:red}datanodeA:1004{color}
2015-04-09 07:52:15,969 \[PacketResponder: BP-595383232-xxxx-1360869396230:blk_1570321882_1102029183867, type=HAS_DOWNSTREAM_IN_PIPELINE] INFO DataNode.clienttrace: src: /client:51078, dest: /{color:red}datanodeA:1004{color}, bytes: 20, op: HDFS_WRITE, cliID: DFSClient_attempt_1428022041757_102831_r_000107_0_1139131345_1, offset: 0, srvID: 356a8a98-826f-446d-8f4c-ce288c1f0a75, blockid: BP-595383232-xxxx-1360869396230:blk_1570321882_1102029183867, duration: 148948385
2015-04-09 07:52:15,969 \[PacketResponder: BP-595383232-xxxx-1360869396230:blk_1570321882_1102029183867, type=HAS_DOWNSTREAM_IN_PIPELINE] INFO datanode.DataNode: PacketResponder: BP-595383232-xxxx-1360869396230:blk_1570321882_1102029183867, type=HAS_DOWNSTREAM_IN_PIPELINE terminating
2015-04-09 07:52:25,970 \[DataXceiver for client /{color:red} {color}:52827 \[Copying block BP-595383232-xxxx-1360869396230:blk_1570321882_1102029183867]] INFO datanode.DataNode: Copied BP-595383232-xxxx-1360869396230:blk_1570321882_1102029183867 to <{color:red}datanodeB{color}>:52827
2015-04-09 07:52:28,187 \[DataNode: heartbeating to ANN:8020] INFO impl.FsDatasetAsyncDiskService: Scheduling blk_1570321882_1102029183867 file /blk_1570321882 for deletion
2015-04-09 07:52:28,188 \[Async disk worker #1482 for volume ] INFO impl.FsDatasetAsyncDiskService: Deleted BP-595383232-xxxx-1360869396230 blk_1570321882_1102029183867 file /blk_1570321882
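For reference, each of these replica state changes is reported to both namenodes through incremental block reports (RECEIVED_BLOCK when the write finishes, DELETED_BLOCK after the async deletion). The following is only a simplified sketch of that notification path, with invented class and method names rather than the actual DataNode/BPServiceActor code, to make the analysis below easier to follow:

{code:java}
// Simplified, hypothetical sketch of a datanode notifying every namenode
// (active and standby) about replica state changes. All names here are
// illustrative only; the real DataNode does this through its BPServiceActor
// threads and the DatanodeProtocol.
import java.util.List;

enum ReportedBlockStatus { RECEIVING_BLOCK, RECEIVED_BLOCK, DELETED_BLOCK }

interface NamenodeEndpoint {
  // One call per state change; both namenodes get the same notification.
  void reportBlock(String blockPoolId, long blockId, long genStamp,
                   ReportedBlockStatus status);
}

class ReplicaStateNotifier {
  private final List<NamenodeEndpoint> namenodes; // active + standby
  private final String blockPoolId;

  ReplicaStateNotifier(List<NamenodeEndpoint> namenodes, String blockPoolId) {
    this.namenodes = namenodes;
    this.blockPoolId = blockPoolId;
  }

  /** Called when a block (e.g. blk_1570321882) finishes being written. */
  void onBlockReceived(long blockId, long genStamp) {
    notifyNamenodes(blockId, genStamp, ReportedBlockStatus.RECEIVED_BLOCK);
  }

  /** Called after the async disk service deletes the local replica. */
  void onBlockDeleted(long blockId, long genStamp) {
    notifyNamenodes(blockId, genStamp, ReportedBlockStatus.DELETED_BLOCK);
  }

  private void notifyNamenodes(long blockId, long genStamp,
                               ReportedBlockStatus status) {
    for (NamenodeEndpoint nn : namenodes) {
      nn.reportBlock(blockPoolId, blockId, genStamp, status);
    }
  }
}
{code}

The point is just that the standby hears about the deletion directly from {color:red}datanodeA{color} well before it tails the edits that create the block.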
Then we failed over for an upgrade, and the standby became active. When we ran an ls command on this file, we got the following exception:

15/04/09 22:07:39 WARN hdfs.BlockReaderFactory: I/O error constructing remote block reader.
java.io.IOException: Got error for OP_READ_BLOCK, self=/client:32947, remote={color:red}datanodeA:1004{color}, for file , for pool BP-595383232-xxxx-1360869396230 block 1570321882_1102029183867
at org.apache.hadoop.hdfs.RemoteBlockReader2.checkSuccess(RemoteBlockReader2.java:445)
at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:410)
at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:815)
at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:693)
at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:351)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:576)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:800)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:847)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:78)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:52)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:112)
at org.apache.hadoop.fs.shell.CopyCommands$Merge.processArguments(CopyCommands.java:97)
at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:190)
at org.apache.hadoop.fs.shell.Command.run(Command.java:154)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:287)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:340)

Eventually this request was served by {color:red}datanodeB{color} on the second attempt:

15/04/09 22:07:39 INFO hdfs.DFSClient: Successfully connected to {color:red}datanodeB:1004{color} for BP-595383232-xxxx-1360869396230:blk_1570321882_1102029183867

On further analysis of this problem, I found that when {color:red}datanodeA{color} notified the namenodes after deleting that block, the namenode that was standby at the time (and became active after the failover) did not queue the delete request in PendingDataNodeMessages. So when it read the edits containing the add block request, it did not know that the block on {color:red}datanodeA{color} had already been deleted.
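To make the failure mode concrete, here is a minimal, hypothetical model of the standby-side bookkeeping described above. It is not the actual BlockManager/PendingDataNodeMessages code; the class, field, and method names are invented for illustration only:

{code:java}
import java.util.*;

// Hypothetical, simplified model of a standby namenode that learns about
// blocks from the edit log and about replica locations from datanode reports.
class StandbyBlockMap {
  enum Status { RECEIVED_BLOCK, DELETED_BLOCK }

  static final class QueuedMessage {
    final String datanode; final long blockId; final Status status;
    QueuedMessage(String dn, long id, Status s) { datanode = dn; blockId = id; status = s; }
  }

  // Blocks the standby already knows about (their add-block edit has been tailed).
  private final Map<Long, Set<String>> blockLocations = new HashMap<>();
  // Stand-in for PendingDataNodeMessages: reports that arrived "too early".
  private final List<QueuedMessage> pending = new ArrayList<>();

  /** Incremental block report from a datanode. */
  void onBlockReport(String datanode, long blockId, Status status) {
    if (!blockLocations.containsKey(blockId)) {
      if (status == Status.RECEIVED_BLOCK) {
        pending.add(new QueuedMessage(datanode, blockId, status));
      } else {
        // Behavior reported in this issue: a DELETED_BLOCK for a block whose
        // add is still only in the untailed edit log is dropped instead of
        // queued, so the deletion is lost. Queuing it here as well would
        // avoid the stale location after failover.
        return;
      }
    } else {
      apply(datanode, blockId, status);
    }
  }

  /** Called after the edit log tailer applies the add-block op. */
  void onAddBlockEditApplied(long blockId) {
    blockLocations.putIfAbsent(blockId, new HashSet<>());
    Iterator<QueuedMessage> it = pending.iterator();
    while (it.hasNext()) {
      QueuedMessage m = it.next();
      if (m.blockId == blockId) { apply(m.datanode, m.blockId, m.status); it.remove(); }
    }
  }

  private void apply(String datanode, long blockId, Status status) {
    Set<String> locs = blockLocations.computeIfAbsent(blockId, k -> new HashSet<>());
    if (status == Status.RECEIVED_BLOCK) locs.add(datanode); else locs.remove(datanode);
  }

  Set<String> locations(long blockId) {
    return blockLocations.getOrDefault(blockId, Collections.emptySet());
  }
}
{code}

Replaying the sequence from the logs against this model (RECEIVED_BLOCK from {color:red}datanodeA{color}, DELETED_BLOCK from {color:red}datanodeA{color}, then the add-block edit) leaves {color:red}datanodeA{color} in locations(), which is exactly the stale replica location the client was sent to after the failover.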