hadoop-hdfs-issues mailing list archives

From "Andy Isaacson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3787) BlockManager#close races with ReplicationMonitor#run
Date Fri, 10 Aug 2012 01:29:19 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13432367#comment-13432367 ]

Andy Isaacson commented on HDFS-3787:

That seems reasonable to me, and it keeps the closing logic in close(), where it logically
belongs.  However, it means that a hung replicationThread will hang close() as well if
we do an unbounded join.

How about {{join(3000);}}, followed by a finally block?  If the join times out, we can assume
the thread is hung, in which case it doesn't matter that we close racily.
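
To make the suggestion concrete, here is a minimal sketch of a bounded join followed by a finally block. It is not the actual BlockManager code: the class name, the {{running}} flag, the {{interrupt()}} call, and the {{blocks}} field standing in for the BlocksMap state are all assumptions made for illustration. Only the 3000 ms timeout comes from the comment above.

```java
// Hypothetical sketch, not the real BlockManager: shows close() doing a bounded
// join on the monitor thread and then cleaning up in a finally block.
class BoundedJoinClose {
    private static final long JOIN_TIMEOUT_MS = 3000; // the value proposed as join(3000)

    private volatile boolean running = true;        // assumed exit flag read by the monitor loop
    private volatile Object blocks = new Object();  // stands in for the BlocksMap state
    private final Thread replicationThread =
        new Thread(this::monitorLoop, "replication-monitor-sketch");

    void start() {
        replicationThread.setDaemon(true); // a hung monitor should not keep the JVM alive
        replicationThread.start();
    }

    private void monitorLoop() {
        while (running) {
            try {
                Thread.sleep(10); // placeholder for the real replication work
            } catch (InterruptedException e) {
                return; // interrupted by close(): exit promptly
            }
        }
    }

    void close() {
        running = false;
        replicationThread.interrupt(); // assumption: wake the thread if it is sleeping
        try {
            // Bounded wait: a hung monitor delays close() by at most JOIN_TIMEOUT_MS.
            replicationThread.join(JOIN_TIMEOUT_MS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
        } finally {
            // Runs whether the join succeeded, timed out, or was interrupted.
            // After a timeout this is still racy, but the thread is presumed hung.
            blocks = null; // stands in for BlocksMap#close
        }
    }

    boolean isClosed() { return blocks == null; }

    boolean monitorAlive() { return replicationThread.isAlive(); }

    public static void main(String[] args) {
        BoundedJoinClose bm = new BoundedJoinClose();
        bm.start();
        bm.close();
        System.out.println("closed=" + bm.isClosed() + " monitorAlive=" + bm.monitorAlive());
    }
}
```

In the common case the monitor exits quickly and the cleanup happens only after the thread is gone, so there is no race. The racy close remains possible only on the timeout path.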
> BlockManager#close races with ReplicationMonitor#run
> ----------------------------------------------------
>                 Key: HDFS-3787
>                 URL: https://issues.apache.org/jira/browse/HDFS-3787
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 2.0.0-alpha
>            Reporter: Andy Isaacson
>            Assignee: Andy Isaacson
>            Priority: Minor
>         Attachments: hdfs-3787.txt
> We saw {{TestDirectoryScanner}} fail during shutdown:
> {code}
> 2012-08-09 12:17:19,844 WARN  datanode.DataNode (BPServiceActor.java:run(683)) - Ending block pool service for: Block pool BP-610123021- (storage id DS-1581877160- service to localhost/
> ...
> 2012-08-09 12:17:19,876 FATAL blockmanagement.BlockManager (BlockManager.java:run(3039)) - ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap.getBlockCollection(BlocksMap.java:101)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1141)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1116)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3070)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3032)
> 	at java.lang.Thread.run(Thread.java:662)
> {code}
> Inspecting the code, it appears that {{BlockManager#close -> BlocksMap#close}} can set {{blocks}} to {{null}} while {{computeDatanodeWork}} is running.
> The fix seems simple -- have {{close}} just set an exit flag, and have {{ReplicationMonitor#run}} call {{BlocksMap#close}}.
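
For comparison, the fix proposed in the issue description (close only sets an exit flag, and the monitor thread performs the cleanup itself) might look roughly like the sketch below. All names here ({{MonitorOwnsClose}}, {{shouldRun}}, {{awaitExit}}, and the {{blocks}} field standing in for the BlocksMap state) are hypothetical and are not taken from the Hadoop source.

```java
// Hypothetical sketch of the exit-flag approach: only the monitor thread ever
// nulls out `blocks`, so its own work can never observe a half-closed map.
class MonitorOwnsClose {
    private volatile boolean shouldRun = true;       // assumed exit flag set by close()
    private volatile Object blocks = new Object();   // stands in for the BlocksMap state
    private volatile boolean sawNullBlocks = false;  // records whether the race was ever observed
    private final Thread monitor = new Thread(this::run, "replication-monitor-sketch");

    void start() {
        monitor.setDaemon(true);
        monitor.start();
    }

    private void run() {
        try {
            while (shouldRun) {
                // Placeholder for the real work; it reads `blocks` on every round.
                if (blocks == null) {
                    sawNullBlocks = true; // would be the NullPointerException in the report
                }
                Thread.sleep(5);
            }
        } catch (InterruptedException e) {
            // Fall through to the cleanup below.
        } finally {
            blocks = null; // stands in for BlocksMap#close, run on the monitor thread
        }
    }

    // close() only requests shutdown; it does not touch `blocks` itself.
    void close() {
        shouldRun = false;
    }

    // Test helper: wait up to timeoutMs for the monitor to finish its cleanup.
    boolean awaitExit(long timeoutMs) {
        try {
            monitor.join(timeoutMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !monitor.isAlive();
    }

    boolean isClosed() { return blocks == null; }

    boolean raceObserved() { return sawNullBlocks; }

    public static void main(String[] args) {
        MonitorOwnsClose m = new MonitorOwnsClose();
        m.start();
        m.close();
        boolean exited = m.awaitExit(1000);
        System.out.println("exited=" + exited + " closed=" + m.isClosed()
            + " raceObserved=" + m.raceObserved());
    }
}
```

The trade-off in this variant is that close() returns before the cleanup has necessarily run, and a hung monitor never cleans up at all.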
