hadoop-hdfs-issues mailing list archives

From "Karthik Kambatla (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3787) BlockManager#close races with ReplicationMonitor#run
Date Fri, 10 Aug 2012 01:21:19 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13432364#comment-13432364 ]

Karthik Kambatla commented on HDFS-3787:
----------------------------------------

Thanks Andy. The patch looks like it should fix the race.

However, I wonder whether there could ever be a case where the ReplicationMonitor is interrupted
but the blocksMap should not be closed. To avoid changing the semantics (I am not sure the
patch really does change them), how about the following:
{code}
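  public void close() {
    if (replicationThread != null) {
      // Ask the ReplicationMonitor to stop, then wait for it to exit.
      replicationThread.interrupt();
      try {
        replicationThread.join();
      } catch (InterruptedException ie) {
        // Ignored: fall through to the cleanup below.
      } finally {
        // Tear down shared state only after the join attempt.
        if (pendingReplications != null) pendingReplications.stop();
        blocksMap.close();
        datanodeManager.close();
      }
    }
  }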
{code}

In addition, could we conservatively call pendingReplications.stop() in ReplicationMonitor
as well?
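
For illustration, here is a minimal, self-contained sketch of the interrupt, join, then close ordering proposed above. It is not the HDFS code: JoinBeforeCloseSketch, monitorLoop, workerFailure, and the "some-block" key are hypothetical stand-ins, and the map merely plays the role of blocksMap. The one deliberate difference from the snippet above is that the catch block restores the caller's interrupt status instead of ignoring it.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-in for BlockManager; none of these names are HDFS APIs. */
class JoinBeforeCloseSketch {
  // Stand-in for blocksMap. Nulled on close, mirroring a close() that drops its blocks.
  private volatile Map<String, String> blocks = new ConcurrentHashMap<>();

  // Any runtime exception seen by the worker, so callers can check for the race.
  final AtomicReference<Throwable> workerFailure = new AtomicReference<>();

  // Stand-in for the ReplicationMonitor thread.
  private final Thread worker = new Thread(this::monitorLoop);

  private void monitorLoop() {
    try {
      while (!Thread.currentThread().isInterrupted()) {
        blocks.get("some-block"); // NullPointerException if blocks were nulled mid-loop
        Thread.sleep(1);
      }
    } catch (InterruptedException e) {
      // Interrupted during sleep: treat it as a shutdown request and exit.
    } catch (RuntimeException e) {
      workerFailure.set(e);
    }
  }

  void start() {
    worker.start();
  }

  void close() {
    worker.interrupt();
    try {
      worker.join(); // wait until the worker can no longer touch blocks
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt(); // preserve the caller's interrupt status
    } finally {
      blocks = null; // race-free only if the join above completed
    }
  }

  boolean isWorkerAlive() {
    return worker.isAlive();
  }

  public static void main(String[] args) {
    JoinBeforeCloseSketch sketch = new JoinBeforeCloseSketch();
    sketch.start();
    sketch.close();
    System.out.println("worker alive: " + sketch.isWorkerAlive());
    System.out.println("worker failure: " + sketch.workerFailure.get());
  }
}
```

One caveat the sketch makes visible: if the thread calling close() is itself interrupted while waiting in join(), the finally block still runs and the worker may still be active, so that edge case keeps a small window for the original race.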
                
> BlockManager#close races with ReplicationMonitor#run
> ----------------------------------------------------
>
>                 Key: HDFS-3787
>                 URL: https://issues.apache.org/jira/browse/HDFS-3787
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 2.0.0-alpha
>            Reporter: Andy Isaacson
>            Assignee: Andy Isaacson
>            Priority: Minor
>         Attachments: hdfs-3787.txt
>
>
> We saw {{TestDirectoryScanner}} fail during shutdown:
> {code}
> 2012-08-09 12:17:19,844 WARN  datanode.DataNode (BPServiceActor.java:run(683)) - Ending block pool service for: Block pool BP-610123021-172.29.121.238-1344539835759 (storage id DS-1581877160-172.29.121.238-43609-1344539837880) service to localhost/127.0.0.1:40012
> ...
> 2012-08-09 12:17:19,876 FATAL blockmanagement.BlockManager (BlockManager.java:run(3039)) - ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap.getBlockCollection(BlocksMap.java:101)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1141)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1116)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3070)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3032)
> 	at java.lang.Thread.run(Thread.java:662)
> {code}
> Inspecting the code, it appears that {{BlockManager#close -> BlocksMap#close}} can set {{blocks}} to {{null}} while {{computeDatanodeWork}} is running.
> The fix seems simple -- have {{close}} just set an exit flag, and have {{ReplicationMonitor#run}} call {{BlocksMap#close}}.
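
The exit-flag fix described in the issue can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the attached patch: ExitFlagSketch, shouldRun, resourceClosed, runOnce, and the "some-block" key are hypothetical names, and the map stands in for the blocks held by BlocksMap. The point is that only the worker thread ever releases the shared state, so close() cannot pull it out from under a running iteration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the exit-flag variant; none of these names are HDFS APIs. */
class ExitFlagSketch implements Runnable {
  // The exit flag: close() only flips this and never touches blocks.
  private volatile boolean shouldRun = true;
  private volatile boolean resourceClosed = false;

  // Stand-in for the blocks map; once the worker starts, only the worker touches it.
  private Map<String, String> blocks = new ConcurrentHashMap<>();

  @Override
  public void run() {
    try {
      while (shouldRun) {
        blocks.get("some-block"); // never null here: only this thread clears it
        Thread.sleep(1);
      }
    } catch (InterruptedException e) {
      // Interrupted during sleep: treat it as a shutdown request.
    } finally {
      // The worker releases the shared state itself, after its last iteration.
      blocks = null;
      resourceClosed = true;
    }
  }

  void close() {
    shouldRun = false;
  }

  boolean isResourceClosed() {
    return resourceClosed;
  }

  /** Starts a worker, requests shutdown, waits, and reports whether the worker closed the resource. */
  static boolean runOnce() {
    ExitFlagSketch sketch = new ExitFlagSketch();
    Thread worker = new Thread(sketch);
    worker.start();
    sketch.close();
    try {
      worker.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    return sketch.isResourceClosed();
  }

  public static void main(String[] args) {
    System.out.println("closed by worker: " + runOnce());
  }
}
```

A trade-off of this shape is that close() returns before the resource has actually been released, so a caller that needs that guarantee still has to join the worker, as runOnce does here.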

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
