hadoop-hdfs-issues mailing list archives

From "Wei-Chiu Chuang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-9361) Default block placement policy causes TestReplaceDataNodeOnFailure to fail intermittently
Date Mon, 02 Nov 2015 23:06:28 GMT

     [ https://issues.apache.org/jira/browse/HDFS-9361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei-Chiu Chuang updated HDFS-9361:
----------------------------------
    Description: 
TestReplaceDatanodeOnFailure sometimes fails (see HDFS-6101).
(For background: the test case sets up a cluster with three data nodes, adds two
more data nodes, removes one data node, and verifies that clients correctly recover from
the failure and re-establish three replicas.)

I traced it down and found that sometimes a client ends up with a pipeline of only two data
nodes, one fewer than the test case requires, even though the test case is configured
to always replace failed nodes.
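
For reference, the scenario roughly looks like the following (a minimal sketch, not the actual test source; it assumes MiniDFSCluster and the standard dfs.client.block.write.replace-datanode-on-failure.* client settings):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.hdfs.MiniDFSCluster;

// Minimal sketch of the test scenario, not the actual test source.
Configuration conf = new HdfsConfiguration();
// Ask the client to always replace a failed datanode in the write pipeline.
conf.setBoolean("dfs.client.block.write.replace-datanode-on-failure.enable", true);
conf.set("dfs.client.block.write.replace-datanode-on-failure.policy", "ALWAYS");

MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf)
    .numDataNodes(3)   // start with three datanodes
    .build();
try {
  cluster.waitActive();
  // Add two more datanodes, then stop one, while a client keeps writing.
  cluster.startDataNodes(conf, 2, true, null, null);
  cluster.stopDataNode(0);
  // The client should detect the failure, ask the namenode for a
  // replacement node, and end up with a three-node pipeline again.
} finally {
  cluster.shutdown();
}
{code}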

Digging into the log, I saw:
{noformat}
2015-11-02 12:07:38,634 [IPC Server handler 8 on 50673] WARN  blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(355)) - Failed to place enough replicas, still in need of 1 to reach 3 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true)
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy$NotEnoughReplicasException:
[
Node /rack0/127.0.0.1:32931 [
  Datanode 127.0.0.1:32931 is not chosen since the rack has too many chosen nodes .
]
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:723)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRemoteRack(BlockPlacementPolicyDefault.java:624)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:429)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:342)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:220)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:105)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:120)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1727)
        at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:299)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2457)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:796)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:500)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:637)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:976)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2305)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2301)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2299)
{noformat}

So from the log, it seems the placement policy causes pipeline selection to give up on that
data node.
I wonder whether this is appropriate. If the load factor exceeds a certain threshold,
but the file still has fewer replicas than required, should the policy accept the pipeline
as is, or should it attempt to acquire more replicas?
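
To make the question concrete, this is my understanding of the check that rejects the node, written as a simplified, self-contained paraphrase (the names are mine, not the actual BlockPlacementPolicyDefault code):
{code:java}
import java.util.List;

// Simplified paraphrase of the per-rack cap that appears to reject the
// candidate; this is my reading of the behavior, not the actual HDFS code.
static boolean rackHasCapacity(String candidateRack, List<String> chosenRacks,
                               int maxNodesPerRack) {
  int onSameRack = 0;
  for (String rack : chosenRacks) {
    if (rack.equals(candidateRack)) {
      onSameRack++;
    }
  }
  // When the cap is reached the candidate is skipped, and chooseRandom
  // eventually throws NotEnoughReplicasException -- which matches the
  // "rack has too many chosen nodes" line in the log above.
  return onSameRack < maxNodesPerRack;
}
{code}
If I read getMaxNodesPerRack correctly, the cap is derived from the replica count and the number of racks, and in this trace the rejection happens inside chooseRemoteRack, i.e. the remote rack had already reached its quota.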

I am filing this JIRA for discussion. I am not very familiar with block placement, so my
hypothesis may well be wrong.


> Default block placement policy causes TestReplaceDataNodeOnFailure to fail intermittently
> -----------------------------------------------------------------------------------------
>
>                 Key: HDFS-9361
>                 URL: https://issues.apache.org/jira/browse/HDFS-9361
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: HDFS
>            Reporter: Wei-Chiu Chuang
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
