hadoop-hdfs-issues mailing list archives

From "Weiwei Yang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets
Date Thu, 19 Oct 2017 02:41:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16210498#comment-16210498 ]

Weiwei Yang commented on HDFS-12638:

Hi [~daryn]

bq. it sounds like you want to avoid the symptom (NPE) 

Yes. At the very least, I think our code should tolerate such orphaned blocks instead of
failing the NN with an NPE like this.

bq. rather than address the orphaned block (root cause)?

The test case explains where these orphaned blocks come from:

||Step#||Step Operation||Explanation||
|1|Create a file|This creates a few blocks, e.g. B1|
|2|Run rolling upgrade (or create a snapshot)|This is the condition that triggers copy-on-truncate|
|3|Truncate the file to a smaller size|The "copy-on-truncate" scheme is used: it creates a new block B2, copies the required bytes from B1 to B2, and updates the inode reference to B2|
|4|Delete this file|This deletes the inode reference and removes B2 from the blocks map, leaving B1 behind, so that when we read the snapshot again it can still find its original block B1|
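
The four steps above can be walked through with a small toy model. This is only a sketch: {{OrphanBlockSketch}} and {{ToyBlocksMap}} are hypothetical names, not HDFS classes, and the exact point at which B1 loses its owner is an assumption made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model only: these classes are hypothetical, not the HDFS BlocksMap or INodeFile. */
public class OrphanBlockSketch {

    /** Maps a block name to the live file that owns it; a null value models "no owner". */
    static final class ToyBlocksMap {
        private final Map<String, String> ownerByBlock = new HashMap<>();

        void put(String block, String owner) { ownerByBlock.put(block, owner); }
        void remove(String block) { ownerByBlock.remove(block); }
        boolean contains(String block) { return ownerByBlock.containsKey(block); }
        String ownerOf(String block) { return ownerByBlock.get(block); }
    }

    static ToyBlocksMap runScenario() {
        ToyBlocksMap blocks = new ToyBlocksMap();

        // Step 1: create a file; its data lives in block B1.
        blocks.put("B1", "/file");

        // Step 2: a snapshot (or rolling upgrade) now requires B1 to stay readable.

        // Step 3: copy-on-truncate creates B2 and the file now points at B2.
        blocks.put("B2", "/file");

        // Step 4: delete the file. B2 leaves the map; B1 stays for the snapshot
        // but (assumption of this sketch) no live file owns it any more.
        blocks.remove("B2");
        blocks.put("B1", null);

        return blocks;
    }

    public static void main(String[] args) {
        ToyBlocksMap blocks = runScenario();
        System.out.println("B1 present: " + blocks.contains("B1"));
        System.out.println("B1 owner:   " + blocks.ownerOf("B1"));
        System.out.println("B2 present: " + blocks.contains("B2"));
    }
}
```

In this model B1 ends up present in the map with a null owner, which is the state any code iterating over the map would have to handle.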

Please see also {{FSDirTruncateOp#shouldCopyOnTruncate}}. I have read the truncate design
doc, and this seems to be the intended behavior. We cannot delete the old block; otherwise the
snapshot won't be able to read it anymore. I assume that when the snapshot gets deleted, these
blocks will also be removed from the blocks map. Until then, we need to live with such orphaned
blocks. Any thoughts on this?
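
To make "living with" orphaned blocks concrete, here is a minimal sketch of a scheduler loop that skips blocks with no owning collection instead of dereferencing a null. {{OrphanTolerantScheduler}}, {{ToyBlock}} and {{selectSchedulable}} are hypothetical names for illustration; this is not the actual {{BlockManager}} code, nor necessarily the patch attached to this issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; ToyBlock is not an HDFS class. */
public class OrphanTolerantScheduler {

    static final class ToyBlock {
        final String name;
        final String collection; // null models a missing BlockCollection

        ToyBlock(String name, String collection) {
            this.name = name;
            this.collection = collection;
        }
    }

    /** Returns the names of blocks that can be scheduled, skipping orphaned ones. */
    static List<String> selectSchedulable(List<ToyBlock> candidates) {
        List<String> schedulable = new ArrayList<>();
        for (ToyBlock block : candidates) {
            if (block.collection == null) {
                // Orphaned block: skip it rather than let an NPE kill the thread.
                continue;
            }
            schedulable.add(block.name);
        }
        return schedulable;
    }

    public static void main(String[] args) {
        List<ToyBlock> candidates = Arrays.asList(
            new ToyBlock("B1", null),      // orphan kept alive for a snapshot
            new ToyBlock("B3", "/other")); // block owned by a live file
        System.out.println(selectSchedulable(candidates));
    }
}
```

The design choice here is to treat a null owner as an expected state to skip over, rather than as an invariant violation.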


> NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets
> -----------------------------------------------------------------------------------------------------------
>                 Key: HDFS-12638
>                 URL: https://issues.apache.org/jira/browse/HDFS-12638
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>    Affects Versions: 2.8.2
>            Reporter: Jiandan Yang 
>         Attachments: HDFS-12638-branch-
> The active NameNode exited due to an NPE. I can confirm that the BlockCollection passed in when creating ReplicationWork is null, but I do not know why it is null. Looking at the history, I found that [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] removed the check for whether the BlockCollection is null.
> NN logs are as following:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
>         at org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
>         at java.lang.Thread.run(Thread.java:834)
> {code}

