Return-Path: X-Original-To: apmail-hadoop-hdfs-issues-archive@minotaur.apache.org Delivered-To: apmail-hadoop-hdfs-issues-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 788D918C1F for ; Wed, 6 Jan 2016 18:11:40 +0000 (UTC) Received: (qmail 79032 invoked by uid 500); 6 Jan 2016 18:11:40 -0000 Delivered-To: apmail-hadoop-hdfs-issues-archive@hadoop.apache.org Received: (qmail 78979 invoked by uid 500); 6 Jan 2016 18:11:40 -0000 Mailing-List: contact hdfs-issues-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: hdfs-issues@hadoop.apache.org Delivered-To: mailing list hdfs-issues@hadoop.apache.org Received: (qmail 78959 invoked by uid 99); 6 Jan 2016 18:11:40 -0000 Received: from arcas.apache.org (HELO arcas) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 06 Jan 2016 18:11:40 +0000 Received: from arcas.apache.org (localhost [127.0.0.1]) by arcas (Postfix) with ESMTP id D45C82C1F5A for ; Wed, 6 Jan 2016 18:11:39 +0000 (UTC) Date: Wed, 6 Jan 2016 18:11:39 +0000 (UTC) From: "Wei-Chiu Chuang (JIRA)" To: hdfs-issues@hadoop.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Updated] (HDFS-9619) DataNode sometimes can not find blockpool for the correct namenode MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/HDFS-9619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei-Chiu Chuang updated HDFS-9619: ---------------------------------- Labels: test (was: ) > DataNode sometimes can not find blockpool for the correct namenode > ------------------------------------------------------------------ > > Key: HDFS-9619 > URL: https://issues.apache.org/jira/browse/HDFS-9619 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, test > Affects Versions: 3.0.0 > Environment: Jenkins > Reporter: Wei-Chiu Chuang > Assignee: Wei-Chiu Chuang > Labels: test > Attachments: HDFS-9619.001.patch > > > We sometimes see {{TestBalancerWithMultipleNameNodes.testBalancer}} failed to replicate a file, because a data node is excluded. > {noformat} > File /tmp.txt could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and 1 node(s) are excluded in this operation. > at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1745) > at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:299) > at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2390) > at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:797) > at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:500) > at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:637) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2305) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2301) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1705) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2299) > {noformat} > Relevent logs suggest root cause is due to block pool not found. > {noformat} > 2016-01-03 22:11:43,174 [DataXceiver for client DFSClient_NONMAPREDUCE_849671738_1 at /127.0.0.1:47318 [Receiving block BP-1927700312-172.26.2.1-1451887902222:blk_1073741825_1001]] ERROR datanode.DataNode (DataXceiver.java:run(280)) - host0.foo.com:49997:DataXceiver error processing WRITE_BLOCK operation src: /127.0.0.1:47318 dst: /127.0.0.1:49997 > java.io.IOException: Non existent blockpool BP-1927700312-172.26.2.1-1451887902222 > at org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset.getMap(SimulatedFSDataset.java:583) > at org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset.createTemporary(SimulatedFSDataset.java:955) > at org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset.createRbw(SimulatedFSDataset.java:941) > at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.(BlockReceiver.java:203) > at org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1235) > at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:678) > at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:166) > at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:103) > at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:253) > at java.lang.Thread.run(Thread.java:745) > {noformat} > For a bit more context, this test starts a cluster with two name nodes and one data node. The block pools are added, but one of them is not found after added. The root cause is due to an undetected concurrent access in a hash map in SimulatedFSDataset (two block pools are added simultaneously). I added some logs to print blockMap, and saw a few ConcurrentModificationExceptions. The solution would be to use a thread safe class instead, like ConcurrentHashMap. -- This message was sent by Atlassian JIRA (v6.3.4#6332)