hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Uma Maheswara Rao G (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-4482) ReplicationMonitor thread can exit with NPE due to the race between delete and replication of same file.
Date Wed, 20 Feb 2013 01:24:13 GMT

    [ https://issues.apache.org/jira/browse/HDFS-4482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13581840#comment-13581840

Uma Maheswara Rao G commented on HDFS-4482:

Hi Nicholas, Thanks a lot for taking a look!
Yes, I have seen that. Why I did not add null check there is, the code patch of listCorruptFileBlocks
is under read lock already. But we are returning null in second loop check case to ensure
if there are any concurrent modifications to inode. But there should not be any concurrency
here because this code already in namesystem lock right? am i missing?
    try {
> ReplicationMonitor thread can exit with NPE due to the race between delete and replication
of same file.
> --------------------------------------------------------------------------------------------------------
>                 Key: HDFS-4482
>                 URL: https://issues.apache.org/jira/browse/HDFS-4482
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 3.0.0, 2.0.1-alpha
>            Reporter: Uma Maheswara Rao G
>            Assignee: Uma Maheswara Rao G
>            Priority: Blocker
>         Attachments: HDFS-4482.patch, HDFS-4482.patch
> Trace:
> {noformat}
> java.lang.NullPointerException
> 	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.getFullPathName(FSDirectory.java:1442)
> 	at org.apache.hadoop.hdfs.server.namenode.INode.getFullPathName(INode.java:269)
> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.getName(INodeFile.java:163)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy.chooseTarget(BlockPlacementPolicy.java:131)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1157)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1063)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3085)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3047)
> 	at java.lang.Thread.run(Thread.java:619)
> {noformat}
> What I am seeing here is:
> 1) create a file and write with 2 DNS
> 2) Close the file.
> 3) Kill one DN
> 4) Let replication start.
>   Info:
>     {code}
>  // choose replication targets: NOT HOLDING THE GLOBAL LOCK
>       // It is costly to extract the filename for which chooseTargets is called,
>       // so for now we pass in the block collection itself.
>       rw.targets = blockplacement.chooseTarget(rw.bc,
>           rw.additionalReplRequired, rw.srcNode, rw.liveReplicaNodes,
>           excludedNodes, rw.block.getNumBytes());{code}
> Here we are choosing target outside the global lock. Inside we will try to get the src
path from blockCollection(nothing but INodeFile here).
> see the code for FSDirectory#getFullPathName
>  Here it is incrementing the depth until it has parent. and Later it will iterate and
access parent again in next loop.
> 5) before going to secnd loop in FSDirectory#getFullPathName, if file is deleted by client
then that parent would have been set as null. So, here accessing the parent can cause NPE
because it is not under lock.
> [~brahmareddy] reported this issue.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

View raw message