hadoop-hdfs-issues mailing list archives

From "Eli Collins (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3936) MiniDFSCluster shutdown may fail due to BlocksMap#getBlockCollection NPE
Date Fri, 14 Sep 2012 21:06:09 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13456139#comment-13456139 ]

Eli Collins commented on HDFS-3936:

@Colin, the exception here is not unexpected, so asserting on IE here would mean shutdown

@Todd, BM#updatedNeededReplications is the only place this patch swallows the IOE. Do you think
it should be propagated to all callers? The IE comes from the interrupt in BM#close, which
subsequently swallows the IE, so it seemed equivalent. I could add Thread.currentThread().interrupt()
so we throw an IE again, but that will just get swallowed, right?
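
For illustration, a minimal sketch of the Thread.currentThread().interrupt() idiom mentioned
above: the IE is swallowed, but the interrupt flag is restored so a later blocking call sees
it. The class and the awaitQuietly helper are hypothetical and are not part of the patch.

```java
import java.util.concurrent.CountDownLatch;

class InterruptRestoreSketch {
    // Hypothetical helper: waits on a latch. If interrupted, it swallows the
    // InterruptedException but restores the thread's interrupt flag.
    static boolean awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
            return true;
        } catch (InterruptedException e) {
            // await() cleared the flag when it threw; set it again for callers.
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Simulate an interrupt delivered before the wait starts.
        Thread.currentThread().interrupt();
        boolean completed = awaitQuietly(new CountDownLatch(1));
        System.out.println("completed=" + completed
            + ", interrupted=" + Thread.currentThread().isInterrupted());
    }
}
```

Whether restoring the flag helps depends on whether any caller further up actually checks it;
if every caller swallows the IE too, the effect is the same as not restoring it.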

The top-level RPC methods and test util methods turn the IE into an IOE. Do you think those should
be preserved as IE as well? IIUC the RPC code will marshal it into an IOE anyway.
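
As a sketch of what "turning the IE into an IOE" can look like, the following wraps an
InterruptedException in java.io.InterruptedIOException (a standard IOException subclass) and
keeps the interrupt flag set. The class and method names are hypothetical; the actual RPC and
test util methods may convert the exception differently.

```java
import java.io.IOException;
import java.io.InterruptedIOException;

class InterruptToIoSketch {
    // Hypothetical method whose signature only allows IOException, as many
    // RPC-facing methods do. An interrupt is reported as an IOException subtype.
    static void sleepOrThrowIo(long millis) throws IOException {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            // Keep the interrupt visible to callers that check the flag.
            Thread.currentThread().interrupt();
            InterruptedIOException ioe =
                new InterruptedIOException("interrupted while waiting");
            ioe.initCause(e); // preserve the original IE as the cause
            throw ioe;
        }
    }
}
```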

While looping TestDFSClientRetries I found a related issue. Interrupting out of the RM lock
fixes the issue where the BM does not actually exit and races with the replication monitor
(since it now gets interrupted), but client RPCs can still race with NN shutdown. After fixing
this, TestDFSClientRetries eventually fails with:

  Exception 0: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RemoteException):
        at org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap.getBlockCollection(BlocksMap.java:101)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.getBlockCollection(BlockManager.java:2947)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.isValidBlock(FSNamesystem.java:4477)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.allocateBlock(FSNamesystem.java:2460)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2221)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:476)

The NN should really stop the RPC server and drain all RPCs before shutting down the FSN,
BM etc. I think that should be punted to a separate change. With the following change, this test
passes when looped for 10 hours, because it only races on NN#addBlock.
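
The "stop the RPC server and drain all RPCs first" ordering could be sketched as below. This is
not NN code: a plain ExecutorService stands in for the RPC handler pool, a map stands in for
the state owned by the FSN/BM, and all names are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class DrainBeforeCloseSketch {
    private final ExecutorService handlers = Executors.newFixedThreadPool(4);
    private final AtomicInteger served = new AtomicInteger();
    // Stand-in for state that is dropped (set to null) on close.
    private volatile Map<Long, String> blocks = new ConcurrentHashMap<>();

    /** Stand-in for an RPC such as addBlock; runs on a handler thread. */
    void submitRequest(long blockId) {
        handlers.execute(() -> {
            // Would throw an NPE if the state had been dropped while in flight.
            blocks.put(blockId, "file-" + blockId);
            served.incrementAndGet();
        });
    }

    /** Stops accepting requests, waits for in-flight ones, then drops state. */
    boolean shutdown(long timeoutMillis) throws InterruptedException {
        handlers.shutdown(); // new requests are rejected from here on
        boolean drained = handlers.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        if (drained) {
            blocks = null; // safe: no handler can still be touching it
        }
        return drained;
    }

    int served() { return served.get(); }

    boolean isClosed() { return blocks == null; }
}
```

Because the state is only dropped after the pool has drained, no request can observe the
closed state, which is the property the isClosed() check approximates for addBlock.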

+  private boolean isClosed() {
+    return blocks == null;
+  }
   BlockCollection getBlockCollection(Block b) {
+    if (isClosed()) {
+      return null; // This call raced with close
+    }
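
A standalone sketch of the same guard, assuming the backing field can be nulled by another
thread between the check and the lookup (the real BlocksMap may be protected by the FSN lock,
in which case this does not apply). Reading the field once into a local avoids that window.
ClosedGuardSketch and its String-keyed map are hypothetical stand-ins, not the HDFS classes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class ClosedGuardSketch {
    // Stand-in for the backing map; set to null on close.
    private volatile Map<String, String> blocks = new ConcurrentHashMap<>();

    void put(String block, String collection) {
        Map<String, String> snapshot = blocks;
        if (snapshot != null) {
            snapshot.put(block, collection);
        }
    }

    String getBlockCollection(String block) {
        Map<String, String> snapshot = blocks; // single read of the field
        if (snapshot == null) {
            return null; // this call raced with close
        }
        return snapshot.get(block);
    }

    void close() {
        blocks = null;
    }
}
```
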
> MiniDFSCluster shutdown may fail due to BlocksMap#getBlockCollection NPE
> ------------------------------------------------------------------------
>                 Key: HDFS-3936
>                 URL: https://issues.apache.org/jira/browse/HDFS-3936
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.0.0-alpha
>            Reporter: Eli Collins
>            Assignee: Eli Collins
>         Attachments: hdfs-3936.txt
> Looks like HDFS-3664 didn't fix the whole issue: the added join times out because
> the thread closing the BM (FSN#stopCommonServices) holds the FSN lock while closing the BM,
> and the BM is blocked uninterruptibly trying to acquire the FSN lock.
> {noformat}
> 2012-09-13 18:54:12,526 FATAL hdfs.MiniDFSCluster (MiniDFSCluster.java:shutdown(1355)) - Test resulted in an unexpected exit
> org.apache.hadoop.util.ExitUtil$ExitException: Fatal exception with message null
> stack trace
> java.lang.NullPointerException
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap.getBlockCollection(BlocksMap.java:101)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1132)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1107)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3061)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3023)
> 	at java.lang.Thread.run(Thread.java:662)
> {noformat}
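
The lock interaction described in the quoted issue can be reproduced in miniature. In this
sketch (hypothetical names, a plain ReentrantLock standing in for the FSN lock), the closing
thread holds the lock, interrupts the worker, and joins it. The join only succeeds because the
worker uses lockInterruptibly(); with a plain lock() the worker would stay blocked until the
closer released the lock, and the join would time out.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

class InterruptibleLockSketch {
    /** Returns true if the worker exited while the caller still held the lock. */
    static boolean closeWhileHoldingLock() throws InterruptedException {
        ReentrantLock lock = new ReentrantLock(); // stand-in for the FSN lock
        CountDownLatch started = new CountDownLatch(1);

        Thread worker = new Thread(() -> {
            started.countDown();
            try {
                lock.lockInterruptibly(); // responds to interrupt while waiting
                try {
                    // periodic work would go here
                } finally {
                    lock.unlock();
                }
            } catch (InterruptedException e) {
                // interrupted while waiting for the lock: exit the thread
            }
        });

        lock.lock(); // the closing thread holds the lock for the whole close
        try {
            worker.start();
            started.await();
            worker.interrupt();
            worker.join(5000L);
            return !worker.isAlive();
        } finally {
            lock.unlock();
        }
    }
}
```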

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
