Mailing-List: contact hdfs-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hdfs-issues@hadoop.apache.org
Date: Tue, 9 Sep 2014 21:16:30 +0000 (UTC)
From: "Allen Wittenauer (JIRA)" <jira@apache.org>
To: hdfs-issues@hadoop.apache.org
Message-ID: <JIRA.12631413.1360317484000.31499.1410297390580@Atlassian.JIRA>
In-Reply-To: <JIRA.12631413.1360317484000@Atlassian.JIRA>
References: <JIRA.12631413.1360317484000@Atlassian.JIRA>
 <JIRA.12631413.1360317484529@arcas>
Subject: [jira] [Updated] (HDFS-4482) ReplicationMonitor thread can exit
 with NPE due to the race between delete and replication of same file.
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


     [ https://issues.apache.org/jira/browse/HDFS-4482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Allen Wittenauer updated HDFS-4482:
-----------------------------------
    Fix Version/s:     (was: 3.0.0)

> ReplicationMonitor thread can exit with NPE due to the race between delete and replication of same file.
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-4482
>                 URL: https://issues.apache.org/jira/browse/HDFS-4482
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 3.0.0, 2.0.1-alpha
>            Reporter: Uma Maheswara Rao G
>            Assignee: Uma Maheswara Rao G
>            Priority: Blocker
>             Fix For: 2.0.5-alpha, 0.23.10
>
>         Attachments: HDFS-4482-1.patch, HDFS-4482.patch, HDFS-4482.patch
>
>
> Trace:
> {noformat}
> java.lang.NullPointerException
> 	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.getFullPathName(FSDirectory.java:1442)
> 	at org.apache.hadoop.hdfs.server.namenode.INode.getFullPathName(INode.java:269)
> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.getName(INodeFile.java:163)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy.chooseTarget(BlockPlacementPolicy.java:131)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1157)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1063)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3085)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3047)
> 	at java.lang.Thread.run(Thread.java:619)
> {noformat}
> What I am seeing here is:
> 1) Create a file and write it with 2 DNs.
> 2) Close the file.
> 3) Kill one DN
> 4) Let replication start.
>   Info:
>     {code}
>       // choose replication targets: NOT HOLDING THE GLOBAL LOCK
>       // It is costly to extract the filename for which chooseTargets is called,
>       // so for now we pass in the block collection itself.
>       rw.targets = blockplacement.chooseTarget(rw.bc,
>           rw.additionalReplRequired, rw.srcNode, rw.liveReplicaNodes,
>           excludedNodes, rw.block.getNumBytes());
>     {code}
> Here the target is chosen outside the global lock. Inside chooseTarget, we try to get the source path from the block collection (which is simply an INodeFile here).
> See the code for FSDirectory#getFullPathName.
> There, a first loop increments the depth for as long as the inode has a parent, and a later loop iterates again and accesses the parent a second time.
> 5) If the client deletes the file before the second loop in FSDirectory#getFullPathName starts, the parent will have been set to null. Accessing the parent at that point can cause an NPE, because the access is not made under the lock.
> [~brahmareddy] reported this issue.
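
The race described in step 5 can be illustrated with a small, self-contained sketch. It is not the Hadoop source: the Node class, twoPassPath, singlePassPath, and the betweenPasses hook are hypothetical names introduced only for illustration. The hook stands in for a concurrent delete so that the failure is reproduced deterministically. The single-pass variant is only one possible way to avoid the NPE and is not necessarily what the attached patches do.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class FullPathRaceSketch {
    /** Hypothetical minimal inode: a name plus a mutable parent link. */
    static final class Node {
        final String name;
        volatile Node parent;

        Node(String name, Node parent) {
            this.name = name;
            this.parent = parent;
        }
    }

    /**
     * Two-pass walk following the pattern described above: count the depth,
     * then walk the parent chain again. betweenPasses simulates a delete
     * that happens between the two loops (an assumption for the demo).
     */
    static String twoPassPath(Node node, Runnable betweenPasses) {
        int depth = 0;
        for (Node n = node; n != null; n = n.parent) {
            depth++;
        }
        betweenPasses.run();
        String[] names = new String[depth];
        Node cur = node;
        for (int i = 0; i < depth; i++) {
            // Throws NullPointerException once cur is null, which happens
            // when the chain became shorter than the depth counted earlier.
            names[depth - i - 1] = cur.name;
            cur = cur.parent;
        }
        return String.join("/", names);
    }

    /** Single-pass walk: each parent link is read once and null-checked. */
    static String singlePassPath(Node node) {
        Deque<String> names = new ArrayDeque<>();
        for (Node n = node; n != null; n = n.parent) {
            names.addFirst(n.name);
        }
        return String.join("/", names);
    }

    public static void main(String[] args) {
        Node root = new Node("", null);
        Node dir = new Node("dir", root);
        Node file = new Node("file", dir);

        System.out.println(twoPassPath(file, () -> { }));
        try {
            // Simulated delete: detach the file from its parent mid-walk.
            twoPassPath(file, () -> { file.parent = null; });
        } catch (NullPointerException e) {
            System.out.println("NPE after simulated delete");
        }
        System.out.println(singlePassPath(file));
    }
}
```

The single-pass walk cannot dereference a null link, but without the lock it can still return a partial or stale path for a file that is being deleted, so callers would need to tolerate that result.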


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)