hadoop-common-issues mailing list archives

From "Ajay Kumar (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HADOOP-15317) Improve NetworkTopology chooseRandom's loop
Date Thu, 29 Mar 2018 02:37:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-15317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418341#comment-16418341 ]

Ajay Kumar edited comment on HADOOP-15317 at 3/29/18 2:36 AM:
--------------------------------------------------------------

[~xiaochen], patch v5 looks good. One question:
 * For parameter {{numInScopeNodes}} in {{NetworkTopology#chooseRandom}} (L564), we are passing
the total number of nodes in the parent node, including excluded nodes. If this is intentional, then we
should rename it to {{totalAvailableNodes}} and update its definition as well. For example,
in the case below we have 2 {{excludedNodes}} out of a total of 5, so {{availableNodes}} is 3. {{numInScopeNodes}} should
be 3 as well, since by its definition it does not count excluded nodes. Please correct me if I
am missing something here.

!Screen Shot 2018-03-28 at 7.23.32 PM.png!
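
To make the arithmetic in the example above concrete, here is a minimal sketch in plain Java. It does not use the Hadoop API: the class {{ScopeCountSketch}}, the method {{countAvailable}}, and the node names are hypothetical and exist only for illustration. It shows the two candidate meanings of the count: the total number of nodes in the scope (5) versus the number left after removing excluded nodes (3).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not the NetworkTopology implementation.
class ScopeCountSketch {

    /** Counts the nodes in scope that are NOT in the excluded set. */
    static int countAvailable(List<String> nodesInScope, Set<String> excluded) {
        int available = 0;
        for (String node : nodesInScope) {
            if (!excluded.contains(node)) {
                available++;
            }
        }
        return available;
    }

    public static void main(String[] args) {
        // 5 nodes under the parent, 2 of them excluded (names are made up).
        List<String> scope = Arrays.asList("dn1", "dn2", "dn3", "dn4", "dn5");
        Set<String> excluded = new HashSet<>(Arrays.asList("dn2", "dn4"));

        // Total in the parent node, including excluded nodes.
        System.out.println("total in scope = " + scope.size());                  // 5
        // Count after removing excluded nodes.
        System.out.println("available = " + countAvailable(scope, excluded));    // 3
    }
}
```

Whichever of the two values the parameter is meant to carry, its name and Javadoc should say so explicitly.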


was (Author: ajayydv):
[~xiaochen], patch v5 looks good. One question:
* For parameter {{numInScopeNodes}} in {{NetworkTopology#chooseRandom}} L564 we are passing
total no of nodes in parent node including excluded nodes. If this is intentional than we
should rename this to {{totalAvailableNodes}} and update its definition as well. For example
in below case, we have 2 {{excludedNodes}} out of total 5, so {{availableNodes}} are 3. So
{{numInScopeNodes}} seems to be 3 as well as by its definition we are excluding excluded nodes.
Please correct me if I might be missing something here.
 
 !Screen Shot 2018-03-28 at 7.23.32 PM.png! 

> Improve NetworkTopology chooseRandom's loop
> -------------------------------------------
>
>                 Key: HADOOP-15317
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15317
>             Project: Hadoop Common
>          Issue Type: Bug
>            Reporter: Xiao Chen
>            Assignee: Xiao Chen
>            Priority: Major
>         Attachments: HADOOP-15317.01.patch, HADOOP-15317.02.patch, HADOOP-15317.03.patch, HADOOP-15317.04.patch, HADOOP-15317.05.patch, Screen Shot 2018-03-28 at 7.23.32 PM.png
>
>
> Recently we found a postmortem case where the ANN seems to be in an infinite loop. From the logs, it seems it had just gone through a rolling restart, and DNs were getting registered.
> Later the NN became unresponsive, and from the stacktrace it is inside a do-while loop inside {{NetworkTopology#chooseRandom}} - part of what's done in HDFS-10320.
> Going through the code and logs, I'm not able to come up with any theory (I thought about incorrect locking, or the Node object being modified outside of NetworkTopology; both seem impossible) for why this is happening, but we should eliminate this loop.
> stacktrace:
> {noformat}
>  Stack:
> java.util.HashMap.hash(HashMap.java:338)
> java.util.HashMap.containsKey(HashMap.java:595)
> java.util.HashSet.contains(HashSet.java:203)
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:786)
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:732)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseDataNode(BlockPlacementPolicyDefault.java:757)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:692)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:666)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:573)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:461)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:368)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:243)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:115)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4AdditionalDatanode(BlockManager.java:1596)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3599)
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:717)
> {noformat}
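
For readers unfamiliar with the pattern, the sketch below shows one generic way to pick a random non-excluded node without a retry loop: filter the candidates first, then draw a single random index. It is an assumption-laden illustration in plain Java, not the code in the attached patches and not the Hadoop API. The class {{BoundedChooseSketch}}, the method {{chooseRandomBounded}}, and the use of {{String}} node names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical illustration only; not the NetworkTopology implementation.
class BoundedChooseSketch {

    /**
     * Returns a uniformly random node from nodesInScope that is not excluded,
     * or null if every node is excluded. Runs in a single pass plus one
     * random draw, so it cannot spin regardless of the inputs.
     */
    static String chooseRandomBounded(List<String> nodesInScope,
                                      Set<String> excluded,
                                      Random random) {
        List<String> candidates = new ArrayList<>();
        for (String node : nodesInScope) {
            if (!excluded.contains(node)) {
                candidates.add(node);
            }
        }
        if (candidates.isEmpty()) {
            return null; // nothing left to choose from
        }
        return candidates.get(random.nextInt(candidates.size()));
    }

    public static void main(String[] args) {
        List<String> scope = Arrays.asList("dn1", "dn2", "dn3", "dn4", "dn5");
        Set<String> excluded = new HashSet<>(Arrays.asList("dn2", "dn4"));
        // Prints one of dn1, dn3, dn5.
        System.out.println(chooseRandomBounded(scope, excluded, new Random()));
    }
}
```

The trade-off is an O(n) filtering pass on every call. A retry loop that redraws until it hits a non-excluded node avoids that pass, but it terminates only if the assumed number of available nodes is accurate, which is why the meaning of that count matters.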


