hadoop-hdfs-issues mailing list archives

From "Daryn Sharp (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets
Date Thu, 19 Oct 2017 17:43:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16211413#comment-16211413 ]

Daryn Sharp commented on HDFS-12638:
------------------------------------

bq. Yes, I think our code should bear with such orphan blocks, instead of failing the NN with
NPE like this. At least.
See below, they aren't really orphaned.  I think it's correct for the NN to crash if the namesystem
data structures are corrupted.

bq. I assume when the snapshot gets deleted, these blocks will be also removed from the blocks
map. But before that, we need to live with such orphaned blocks
To the block manager, replication monitor, etc., these copy-on-truncate blocks are not (supposed
to be) special.  My prior point, stated another way, is that a block is not orphaned if it's in
a snapshot diff.  INodes are not orphaned when they are only referenced via a snapshot diff.  A block
in the blocks map should not be referencing an inode that is not in the inodes map.  Direct namespace
accessibility is irrelevant to the block/inode/map linkages being correct.
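
To make the linkage invariant above concrete, here is a minimal sketch of the check it implies.
This is a toy model, not NameNode code: the class LinkageCheck, the method findDanglingBlocks,
and the plain maps of IDs are all hypothetical stand-ins for the blocks map and inodes map.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical toy model: block ID -> owning inode ID, plus the set of known inode IDs.
// None of these names exist in Hadoop; they only illustrate the invariant.
public class LinkageCheck {

    /** Returns the IDs of blocks whose owning inode ID is missing from the inode set. */
    public static List<Long> findDanglingBlocks(Map<Long, Long> blockToInode,
                                                Set<Long> inodeIds) {
        List<Long> dangling = new ArrayList<>();
        for (Map.Entry<Long, Long> entry : blockToInode.entrySet()) {
            if (!inodeIds.contains(entry.getValue())) {
                dangling.add(entry.getKey());
            }
        }
        return dangling;
    }

    public static void main(String[] args) {
        // Inode 1001 is reachable in the live namespace; inode 1002 is assumed to be
        // referenced only via a snapshot diff. Both are still in the inode set.
        Set<Long> inodeIds = new HashSet<>();
        inodeIds.add(1001L);
        inodeIds.add(1002L);

        // TreeMap keeps the iteration order deterministic for this demo.
        Map<Long, Long> blockToInode = new TreeMap<>();
        blockToInode.put(1L, 1001L); // ordinary block
        blockToInode.put(2L, 1002L); // copy-on-truncate block held by a snapshot diff
        blockToInode.put(3L, 1003L); // broken linkage: inode 1003 is not in the inode set

        // Only block 3 violates the invariant; block 2 is not orphaned.
        System.out.println("Dangling blocks: " + findDanglingBlocks(blockToInode, inodeIds));
    }
}
```

In this model the snapshot-only block passes the check because its inode is still in the inode
set; only a block pointing at a missing inode is reported, which is the corruption case.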

We need to fix the bug, not mask it.

> NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-12638
>                 URL: https://issues.apache.org/jira/browse/HDFS-12638
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>    Affects Versions: 2.8.2
>            Reporter: Jiandan Yang 
>         Attachments: HDFS-12638-branch-2.8.2.001.patch
>
>
> The active NameNode exits due to an NPE. I can confirm that the BlockCollection passed in when
> creating ReplicationWork is null, but I do not know why it is null. Looking at the history,
> I found that [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] removed the check for whether
> BlockCollection is null.
> NN logs are as following:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
>         at org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
>         at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
