hadoop-hdfs-issues mailing list archives

From "Weiwei Yang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets
Date Mon, 16 Oct 2017 13:13:05 GMT

    [ https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16205861#comment-16205861
] 

Weiwei Yang commented on HDFS-12638:
------------------------------------

Hi [~yangjiandan]

Thanks for narrowing down the root cause and providing a test case. I believe this problem
occurs whenever truncate runs in *copy-on-truncate* mode, e.g. during a rolling upgrade, while
an upgrade is not finalized, or when the file is in a snapshot. This code path creates a new
block for the truncation while the old block is left in the blocks map. When the file is
deleted, the old block becomes an orphan block.
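
To make the sequence above concrete, here is a toy model of it. This is a sketch only, not HDFS code: the class {{CopyOnTruncateSketch}}, its two maps and all method names are hypothetical stand-ins, and the real blocks map and truncate path are considerably more involved.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model (not HDFS code): shows how a block can outlive its file. */
final class CopyOnTruncateSketch {
    // Hypothetical blocks map: block ID -> owning file name.
    // A null value means the block has no owner, i.e. it is an orphan.
    private final Map<Long, String> blocksMap = new HashMap<>();
    // Hypothetical namespace: file name -> the file's current block ID.
    private final Map<String, Long> files = new HashMap<>();

    void createFile(String name, long blockId) {
        files.put(name, blockId);
        blocksMap.put(blockId, name);
    }

    /** Copy-on-truncate: the file moves to a new block, the old entry stays. */
    void truncate(String name, long newBlockId) {
        files.put(name, newBlockId);
        blocksMap.put(newBlockId, name);
        // The old block is deliberately left in blocksMap.
    }

    /** Removes the file and its current block; older blocks lose their owner. */
    void delete(String name) {
        Long current = files.remove(name);
        if (current != null) {
            blocksMap.remove(current);
        }
        for (Map.Entry<Long, String> e : blocksMap.entrySet()) {
            if (name.equals(e.getValue())) {
                e.setValue(null); // still in the map, but no owner any more
            }
        }
    }

    /** True if the block is still tracked but has no owning file. */
    boolean isOrphan(long blockId) {
        return blocksMap.containsKey(blockId) && blocksMap.get(blockId) == null;
    }

    public static void main(String[] args) {
        CopyOnTruncateSketch nn = new CopyOnTruncateSketch();
        nn.createFile("/f", 1L);
        nn.truncate("/f", 2L);
        nn.delete("/f");
        System.out.println("block 1 orphaned: " + nn.isOrphan(1L));
    }
}
```

In this model, any later code that looks up the owner of block 1 and dereferences it without a null check would hit an NPE.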

Further, I read quite a few JIRAs about similar problems, such as HDFS-7611, HDFS-8113 and HDFS-4867.
It looks like the way we deal with such blocks (where a block can reasonably be an orphan) is to
simply add a check to avoid the NPE. For example, in {{BlockManager#dumpBlockMeta}}:

{code}
if (block instanceof BlockInfo) {
  BlockCollection bc = getBlockCollection((BlockInfo)block);
  String fileName = (bc == null) ? "[orphaned]" : bc.getName();
  out.print(fileName + ": ");
}
{code}

Most places already handle the case like this, so I would suggest using a similar fix to
resolve this issue. A few suggestions:

# Add a check in {{BlockManager#scheduleReplication}} to avoid the NPE
# Review the call in {{BlockManager#chooseExcessReplicates}}; most likely it needs a check too
# Add a check in {{NamenodeFsck}} to fix the NPE when running {{fsck -blockId}} against an orphan block
# Add a javadoc note that {{BlockManager#getBlockCollection}} might return null
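
As a rough illustration of the first and last suggestions, the guard could look like the sketch below. The class {{OrphanGuardSketch}} and its nested {{Owner}} type are hypothetical stand-ins for {{BlockManager}} and {{BlockCollection}}, and the signatures are simplified assumptions rather than the real ones.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified sketch of a null-owner guard; not the real BlockManager. */
final class OrphanGuardSketch {
    /** Hypothetical stand-in for BlockCollection. */
    static final class Owner {
        private final String name;
        Owner(String name) { this.name = name; }
        String getName() { return name; }
    }

    // Hypothetical blocks map: block ID -> owner, null for an orphan block.
    private final Map<Long, Owner> blocksMap = new HashMap<>();

    void put(long blockId, Owner owner) {
        blocksMap.put(blockId, owner);
    }

    /**
     * @return the owning collection, or null if the block is unknown or
     *         orphaned; callers must check for null before using the result.
     */
    Owner getBlockCollection(long blockId) {
        return blocksMap.get(blockId);
    }

    /** Skips the block (returns false) instead of dereferencing a null owner. */
    boolean scheduleReplication(long blockId) {
        Owner bc = getBlockCollection(blockId);
        if (bc == null) {
            return false; // orphan block: nothing to replicate
        }
        // Stands in for the real work that needs the owner.
        return !bc.getName().isEmpty();
    }

    public static void main(String[] args) {
        OrphanGuardSketch bm = new OrphanGuardSketch();
        bm.put(1L, new Owner("/a"));
        bm.put(2L, null);
        System.out.println("owned block scheduled: " + bm.scheduleReplication(1L));
        System.out.println("orphan block scheduled: " + bm.scheduleReplication(2L));
    }
}
```

Returning early keeps the monitor thread alive when it meets an orphan block; whether the orphan should also be cleaned up is a separate question.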

Please let me know if this makes sense, [~yangjiandan], [~kihwal], [~daryn].





> NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-12638
>                 URL: https://issues.apache.org/jira/browse/HDFS-12638
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>    Affects Versions: 2.8.2
>            Reporter: Jiandan Yang 
>         Attachments: HDFS-12638-branch-2.8.2.001.patch
>
>
> The active NameNode exits due to an NPE. I can confirm that the BlockCollection passed in when
> creating ReplicationWork is null, but I do not know why it is null. Looking at the history,
> I found that [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] removed the check for
> whether BlockCollection is null.
> The NN logs are as follows:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] org.apache.hadoop.hdfs.server.blockmanagement.BlockManager:
ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
>         at org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
>         at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org

