hadoop-hdfs-issues mailing list archives

From "Walter Su (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-9958) BlockManager#createLocatedBlocks can throw NPE for corruptBlocks on failed storages.
Date Wed, 20 Apr 2016 05:52:25 GMT

    [ https://issues.apache.org/jira/browse/HDFS-9958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249312#comment-15249312 ]

Walter Su commented on HDFS-9958:

bq. we fix countNodes().corruptReplicas() to return the number after going through all storages
(irrespective of their state) that have the corruptNodes (in this case), since numNodes() is
storage state agnostic.
I think having {{countNodes(blk)}} go through all storages is unnecessary. I also think {{numMachines}}
should only include NORMAL and READ_ONLY storages, so {{createLocatedBlock(..)}} does not need to go
through all storages either:
    if (numMachines > 0) {
      for(DatanodeStorageInfo storage : blocksMap.getStorages(blk)) {
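
One way to read the suggestion above is sketched below in Java. This is only an illustration under stated assumptions: {{ServableStorageSketch}}, {{Storage}}, {{SERVABLE}} and {{selectMachines}} are hypothetical names, the {{State}} enum is a stand-in (the real HDFS constant for READ_ONLY may be named differently), and the case where every replica is corrupt is ignored. Collecting into a list first means the array size and the number of populated entries agree by construction.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not the real BlockManager code.
final class ServableStorageSketch {
  // Stand-in for the storage state; constant names are assumptions.
  enum State { NORMAL, READ_ONLY, FAILED }

  static final class Storage {
    final String id;
    final State state;
    final boolean corrupt;

    Storage(String id, State state, boolean corrupt) {
      this.id = id;
      this.state = state;
      this.corrupt = corrupt;
    }
  }

  // States the comment proposes to count towards numMachines.
  static final Set<State> SERVABLE = EnumSet.of(State.NORMAL, State.READ_ONLY);

  // Keeps only non-corrupt replicas on servable storages, so the returned
  // array can never contain an unpopulated (null) slot.
  static Storage[] selectMachines(List<Storage> storages) {
    List<Storage> machines = new ArrayList<>();
    for (Storage s : storages) {
      if (SERVABLE.contains(s.state) && !s.corrupt) {
        machines.add(s);
      }
    }
    return machines.toArray(new Storage[0]);
  }

  public static void main(String[] args) {
    List<Storage> storages = Arrays.asList(
        new Storage("s1", State.NORMAL, false),
        new Storage("s2", State.READ_ONLY, false),
        new Storage("s3", State.FAILED, true));
    System.out.println("machines=" + selectMachines(storages).length);
  }
}
```

With the three storages in {{main}}, the failed storage holding the corrupt replica is never counted, so two machines are returned.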

Btw, unrelated to this topic: I think {{findAndMarkBlockAsCorrupt(..)}} shouldn't
support adding blk to the map if the storage is not found.

ping [~jingzhao] to check if he has any comment.

> BlockManager#createLocatedBlocks can throw NPE for corruptBlocks on failed storages.
> ------------------------------------------------------------------------------------
>                 Key: HDFS-9958
>                 URL: https://issues.apache.org/jira/browse/HDFS-9958
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.7.2
>            Reporter: Kuhu Shukla
>            Assignee: Kuhu Shukla
>         Attachments: HDFS-9958-Test-v1.txt, HDFS-9958.001.patch, HDFS-9958.002.patch
> In a scenario where the corrupt replica is on a failed storage, there is a race, before the
replica is taken out of blocksMap, that causes a LocatedBlock to be created from a {{machines}}
array element that is not populated.
> Following is the root cause,
> {code}
> final int numCorruptNodes = countNodes(blk).corruptReplicas();
> {code}
> countNodes only looks at nodes whose storage state is NORMAL, so when the corrupt replica
is on a failed storage, numCorruptNodes is zero.
> {code}
> final int numNodes = blocksMap.numNodes(blk);
> {code}
> However, numNodes counts all nodes/storages irrespective of the state of the storage.
Therefore numMachines will include such (failed) nodes. The assert fails only if the
JVM has assertions enabled; otherwise the code goes ahead and tries to create a LocatedBlock
object from an element that was never put in the {{machines}} array.
> Here is the stack trace:
> {code}
> java.lang.NullPointerException
> 	at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeStorageInfo.toDatanodeInfos(DatanodeStorageInfo.java:45)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeStorageInfo.toDatanodeInfos(DatanodeStorageInfo.java:40)
> 	at org.apache.hadoop.hdfs.protocol.LocatedBlock.<init>(LocatedBlock.java:84)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlock(BlockManager.java:878)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlock(BlockManager.java:826)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlockList(BlockManager.java:799)
> 	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlocks(BlockManager.java:899)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1849)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:588)
> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:415)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
> {code}
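
The counting mismatch described in the issue can be reproduced in isolation. The Java sketch below is a simplified model, not the actual BlockManager code: {{CorruptCountSketch}}, {{Replica}}, {{buildMachines}} and the other names are hypothetical, and it assumes {{numMachines}} is derived as {{numNodes - numCorruptNodes}}, which the description implies but does not show.

```java
import java.util.Arrays;
import java.util.List;

// Simplified model of the counting mismatch. Every type and method here is a
// hypothetical stand-in, not the real HDFS BlockManager code.
final class CorruptCountSketch {
  enum State { NORMAL, FAILED }

  static final class Replica {
    final String node;
    final State state;
    final boolean corrupt;

    Replica(String node, State state, boolean corrupt) {
      this.node = node;
      this.state = state;
      this.corrupt = corrupt;
    }
  }

  // Like the described countNodes(blk).corruptReplicas(): replicas on
  // non-NORMAL storages are skipped.
  static int corruptOnNormalStorages(List<Replica> replicas) {
    int n = 0;
    for (Replica r : replicas) {
      if (r.state == State.NORMAL && r.corrupt) {
        n++;
      }
    }
    return n;
  }

  static Replica[] buildMachines(List<Replica> replicas) {
    int numNodes = replicas.size();                // state-agnostic count
    int numCorruptNodes = corruptOnNormalStorages(replicas);
    int numMachines = numNodes - numCorruptNodes;  // assumed derivation
    Replica[] machines = new Replica[numMachines];
    int j = 0;
    for (Replica r : replicas) {
      if (!r.corrupt) {
        machines[j++] = r;                         // corrupt replicas are left out
      }
    }
    return machines;                               // may end with null slots
  }

  static int countNullSlots(Replica[] machines) {
    int nulls = 0;
    for (Replica r : machines) {
      if (r == null) {
        nulls++;
      }
    }
    return nulls;
  }

  public static void main(String[] args) {
    List<Replica> replicas = Arrays.asList(
        new Replica("dn1", State.NORMAL, false),
        new Replica("dn2", State.NORMAL, false),
        new Replica("dn3", State.FAILED, true));   // corrupt replica on failed storage
    Replica[] machines = buildMachines(replicas);
    System.out.println("slots=" + machines.length
        + ", unpopulated=" + countNullSlots(machines));
  }
}
```

Running {{main}} prints {{slots=3, unpopulated=1}}: the array is sized for three machines but only two are populated, and dereferencing the remaining null slot is the kind of access that would raise the NullPointerException seen in the stack trace.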

