Date: Fri, 13 Sep 2013 09:05:52 +0000 (UTC)
From: "Nikola Vujic (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Updated] (HDFS-5184) BlockPlacementPolicyWithNodeGroup does not work correctly when avoidStaleNodes is true

    [ https://issues.apache.org/jira/browse/HDFS-5184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nikola Vujic updated HDFS-5184:
-------------------------------

Description:

If avoidStaleNodes is true, choosing targets is potentially done in two attempts. If the first attempt does not find enough targets to place all replicas, a second attempt is invoked which is allowed to use stale nodes in order to find the remaining targets. This second attempt breaks the node-group rule that no two replicas may be placed in the same node group.

Invocation of the second attempt looks like this:

{code}
DatanodeDescriptor chooseTarget(excludedNodes, ...) {
  // snapshot of the excluded nodes as they were before the first attempt
  oldExcludedNodes = new HashMap<Node, Node>(excludedNodes);
  // ... first attempt ...
  // if we did not find enough targets, retry without avoiding stale nodes
  if (avoidStaleNodes) {
    // only the chosen targets are carried over; nodes that the first
    // attempt excluded for node-group reasons are lost
    for (Node node : results) {
      oldExcludedNodes.put(node, node);
    }
    numOfReplicas = totalReplicasExpected - results.size();
    return chooseTarget(numOfReplicas, writer, oldExcludedNodes, blocksize,
        maxNodesPerRack, results, false);
  }
}
{code}

So all nodes excluded during the first attempt that are neither in oldExcludedNodes nor in results are ignored, and the second invocation of chooseTarget runs with an incomplete set of excluded nodes. For example, given the following topology:

dn1 -> /d1/r1/n1
dn2 -> /d1/r1/n1
dn3 -> /d1/r1/n2
dn4 -> /d1/r1/n2

if we want to choose 3 targets with avoidStaleNodes=true, the first attempt will choose only 2 targets because there are only two node groups. Say it chooses dn1 and dn3. Those two nodes are then added to oldExcludedNodes, and that set is used as the excluded set in the second attempt. The set is incomplete: dn2 and dn4, which the first attempt excluded because they share node groups with dn1 and dn3, are missing from it. The second attempt is therefore free to select dn2 and dn4, which node-group awareness should forbid, yet that is exactly what happens in the current code!
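To make the failure mode concrete, here is a minimal, self-contained sketch that replays the scenario above. The class and all names in it are hypothetical; it only mimics the two selection passes and is not the HDFS implementation:

{code}
import java.util.*;

/** Hypothetical sketch, not HDFS code: shows how re-seeding the excluded
 *  set from the results alone loses the first pass's node-group exclusions. */
public class NodeGroupBugSketch {
  public static void main(String[] args) {
    // node -> node group, matching the topology above
    Map<String, String> group = new LinkedHashMap<>();
    group.put("dn1", "/d1/r1/n1");
    group.put("dn2", "/d1/r1/n1");
    group.put("dn3", "/d1/r1/n2");
    group.put("dn4", "/d1/r1/n2");

    int wanted = 3;
    List<String> results = new ArrayList<>();
    Set<String> excluded = new HashSet<>();

    // First attempt: node-group aware, so choosing a node excludes its whole
    // node group. Ends with results=[dn1, dn3] and all four nodes excluded.
    for (String dn : group.keySet()) {
      if (results.size() == wanted) break;
      if (excluded.contains(dn)) continue;
      results.add(dn);
      for (String other : group.keySet()) {
        if (group.get(other).equals(group.get(dn))) excluded.add(other);
      }
    }

    // Second attempt, as in the buggy code: the excluded set is rebuilt from
    // the pre-first-attempt snapshot (empty here) plus the results only, so
    // dn2 and dn4 look available again.
    Set<String> oldExcluded = new HashSet<>(results);
    for (String dn : group.keySet()) {
      if (results.size() == wanted) break;
      if (oldExcluded.contains(dn)) continue;
      results.add(dn); // picks dn2, which shares a node group with dn1
    }

    System.out.println(results); // prints [dn1, dn3, dn2]
  }
}
{code}

The second pass happily returns dn2 even though dn1 already occupies node group /d1/r1/n1, which is precisely the violation described above.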
Repro (a sketch of the one-line change follows this list):
- Add CONF.setBoolean(DFSConfigKeys.DFS_NAMENODE_AVOID_STALE_DATANODE_FOR_WRITE_KEY, true); to TestReplicationPolicyWithNodeGroup.
- testChooseMoreTargetsThanNodeGroups() should then fail.
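Assuming the flag is set where the test class builds its configuration (the exact surrounding setup code in TestReplicationPolicyWithNodeGroup may differ), the change is just:

{code}
// In TestReplicationPolicyWithNodeGroup's configuration setup (sketch;
// surrounding code abbreviated):
CONF.setBoolean(
    DFSConfigKeys.DFS_NAMENODE_AVOID_STALE_DATANODE_FOR_WRITE_KEY, true);
// With stale-node avoidance on, chooseTarget takes the two-attempt path,
// and the incomplete excluded set lets the second attempt place two
// replicas in one node group, so testChooseMoreTargetsThanNodeGroups()
// fails its node-group assertions.
{code}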
> BlockPlacementPolicyWithNodeGroup does not work correctly when avoidStaleNodes is true
> ---------------------------------------------------------------------------------------
>
>                 Key: HDFS-5184
>                 URL: https://issues.apache.org/jira/browse/HDFS-5184
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Nikola Vujic
>            Priority: Minor

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira