hadoop-hdfs-issues mailing list archives

From "Uma Maheswara Rao G (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-4482) ReplicationMonitor thread can exit with NPE due to the race between delete and replication of same file.
Date Fri, 08 Feb 2013 10:01:13 GMT

     [ https://issues.apache.org/jira/browse/HDFS-4482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uma Maheswara Rao G updated HDFS-4482:
--------------------------------------

    Description: 
Trace:

{noformat}
java.lang.NullPointerException
	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.getFullPathName(FSDirectory.java:1442)
	at org.apache.hadoop.hdfs.server.namenode.INode.getFullPathName(INode.java:269)
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.getName(INodeFile.java:163)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy.chooseTarget(BlockPlacementPolicy.java:131)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1157)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1063)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3085)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3047)
	at java.lang.Thread.run(Thread.java:619)

{noformat}

What I am seeing here is:

1) Create a file and write it with 2 DNs.
2) Close the file.
3) Kill one DN.
4) Let replication start.
  Info:
    {code}
 // choose replication targets: NOT HOLDING THE GLOBAL LOCK
      // It is costly to extract the filename for which chooseTargets is called,
      // so for now we pass in the block collection itself.
      rw.targets = blockplacement.chooseTarget(rw.bc,
          rw.additionalReplRequired, rw.srcNode, rw.liveReplicaNodes,
          excludedNodes, rw.block.getNumBytes());
{code}
Here we choose the targets outside the global lock. Inside chooseTarget, we try to get the src path
from the blockCollection (which is nothing but the INodeFile here).

See the code for FSDirectory#getFullPathName.
The first loop increments the depth for as long as the inode has a parent. A second loop then
iterates and accesses the parent links again.

5) Before the second loop in FSDirectory#getFullPathName runs, if the file is deleted by the client,
its parent will have been set to null. Accessing the parent can then cause an NPE,
because the access is not under the lock.
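
To make the race in step 5 concrete, here is a minimal, self-contained sketch. It is not the Hadoop code: Node, twoPassPath, singlePassPath and the betweenPasses hook are hypothetical names, and the hook only simulates a client delete landing between the two loops so that the failure is deterministic. The single-pass variant is one possible way to avoid the NPE, not necessarily the fix that should go into FSDirectory.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class FullPathRaceSketch {
    /** Hypothetical stand-in for an INode: a name plus a parent link that a delete sets to null. */
    static final class Node {
        final String name;
        volatile Node parent;

        Node(String name, Node parent) {
            this.name = name;
            this.parent = parent;
        }
    }

    /**
     * Two-pass pattern: count the depth first, then walk the parent links again.
     * betweenPasses simulates a concurrent delete that happens between the loops.
     */
    static String twoPassPath(Node inode, Runnable betweenPasses) {
        int depth = 0;
        for (Node i = inode; i != null; i = i.parent) {
            depth++;
        }
        betweenPasses.run();
        String[] names = new String[depth];
        Node cur = inode;
        for (int k = depth - 1; k >= 0; k--) {
            names[k] = cur.name; // NPE if the chain is now shorter than depth
            cur = cur.parent;
        }
        return String.join("/", names);
    }

    /**
     * Single-pass alternative: each parent link is read exactly once, so a
     * concurrent delete cannot cause an NPE. The result may be a partial path
     * for a detached node, which the caller has to tolerate.
     */
    static String singlePassPath(Node inode) {
        Deque<String> names = new ArrayDeque<>();
        for (Node i = inode; i != null; i = i.parent) {
            names.addFirst(i.name);
        }
        return String.join("/", names);
    }

    public static void main(String[] args) {
        Node root = new Node("", null);
        Node dir = new Node("dir", root);
        Node file = new Node("f", dir);

        System.out.println(twoPassPath(file, () -> { }));
        try {
            // Simulated delete: the file is detached from its parent between the loops.
            twoPassPath(file, () -> file.parent = null);
        } catch (NullPointerException e) {
            System.out.println("NPE in second loop, as in the trace");
        }
        System.out.println(singlePassPath(file));
    }
}
```

In the real NameNode the delete runs on another thread, so the window is timing dependent; the hook above only forces the interleaving described in step 5.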


  was:
Trace:

{noformat}
java.lang.NullPointerException
	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.getFullPathName(FSDirectory.java:1442)
	at org.apache.hadoop.hdfs.server.namenode.INode.getFullPathName(INode.java:269)
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.getName(INodeFile.java:163)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy.chooseTarget(BlockPlacementPolicy.java:131)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1157)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1063)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3085)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3047)
	at java.lang.Thread.run(Thread.java:619)

{noformat}

What I am seeing here is:

1) create a file and write with 2 DNS
2) Close the file.
3) Kill one DN
4) Lat replication start.
  Info:
    {code}
 // choose replication targets: NOT HOLDING THE GLOBAL LOCK
      // It is costly to extract the filename for which chooseTargets is called,
      // so for now we pass in the block collection itself.
      rw.targets = blockplacement.chooseTarget(rw.bc,
          rw.additionalReplRequired, rw.srcNode, rw.liveReplicaNodes,
          excludedNodes, rw.block.getNumBytes());
{code}
Here we are choosing target outside the global lock. Inside we will try to get the src path
from blockCollection(nothing but INodeFile here).

see the code for FSDirectory#getFullPathName
 Here it is incrementing the depth until it has parent. and Later it will iterate and access
parent again in next loop.

Between this if file is deleted by client then that parent would have been set as null. So,
here accessing the parent can cause NPE because it is not under lock.


2) 

    
> ReplicationMonitor thread can exit with NPE due to the race between delete and replication
of same file.
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-4482
>                 URL: https://issues.apache.org/jira/browse/HDFS-4482
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Uma Maheswara Rao G
>            Priority: Blocker
>
> Trace:
> {noformat}
> java.lang.NullPointerException
> 	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.getFullPathName(FSDirectory.java:1442)
> 	at org.apache.hadoop.hdfs.server.namenode.INode.getFullPathName(INode.java:269)
> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.getName(INodeFile.java:163)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy.chooseTarget(BlockPlacementPolicy.java:131)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1157)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1063)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3085)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3047)
> 	at java.lang.Thread.run(Thread.java:619)
> {noformat}
> What I am seeing here is:
> 1) Create a file and write it with 2 DNs.
> 2) Close the file.
> 3) Kill one DN.
> 4) Let replication start.
>   Info:
>     {code}
>  // choose replication targets: NOT HOLDING THE GLOBAL LOCK
>       // It is costly to extract the filename for which chooseTargets is called,
>       // so for now we pass in the block collection itself.
>       rw.targets = blockplacement.chooseTarget(rw.bc,
>           rw.additionalReplRequired, rw.srcNode, rw.liveReplicaNodes,
>           excludedNodes, rw.block.getNumBytes());
> {code}
> Here we choose the targets outside the global lock. Inside chooseTarget, we try to get the src path
> from the blockCollection (which is nothing but the INodeFile here).
> See the code for FSDirectory#getFullPathName.
> The first loop increments the depth for as long as the inode has a parent. A second loop then
> iterates and accesses the parent links again.
> 5) Before the second loop in FSDirectory#getFullPathName runs, if the file is deleted by the client,
> its parent will have been set to null. Accessing the parent can then cause an NPE,
> because the access is not under the lock.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
