From: "Ruyue Ma (JIRA)"
To: hdfs-issues@hadoop.apache.org
Date: Fri, 18 Sep 2009 00:08:57 -0700 (PDT)
Subject: [jira] Commented: (HDFS-630) In DFSOutputStream.nextBlockOutputStream(), the client can exclude specific datanodes when locating the next block.
Message-ID: <1458673616.1253257737688.JavaMail.jira@brutus>
In-Reply-To: <794054798.1253257617910.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/HDFS-630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757019#action_12757019 ]

Ruyue Ma commented on HDFS-630:
-------------------------------

Ruyue Ma added a comment - 20/Jul/09 11:32 PM

to: dhruba borthakur

> This is not related to HDFS-4379. Let me explain why.
>
> The problem is actually related to HDFS-xxx. The namenode waits for 10 minutes after losing heartbeats from a datanode before declaring it dead. During those 10 minutes, the NN is free to choose the dead datanode as a possible replica for a newly allocated block.
>
> If, during a write, the dfsclient sees that a block replica location for a newly allocated block is not connectable, it re-requests the NN for a fresh set of replica locations for the block. It tries this dfs.client.block.write.retries times (default 3), sleeping 6 seconds between each retry (see DFSClient.nextBlockOutputStream).
>
> This setting works well when you have a reasonably sized cluster; if you have only 4 datanodes in the cluster, every retry picks the dead datanode and the above logic bails out.
>
> One solution is to change the value of dfs.client.block.write.retries to a much larger value, say 200 or so. Better still, increase the number of nodes in your cluster.
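A minimal sketch of that stopgap on the writing client, assuming the client builds its own Configuration object before opening the FileSystem (the property name is the one quoted in the code below; the value 200 is just the example figure from the comment):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    // Sketch only: raise the per-block-allocation retry count for this client.
    Configuration conf = new Configuration();
    conf.setInt("dfs.client.block.write.retries", 200);  // default is 3
    FileSystem fs = FileSystem.get(conf);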
Our modification: when getting block locations from the namenode, we give the NN the excluded datanodes. The list of dead datanodes is only for one block allocation.

+++ hadoop-new/src/hdfs/org/apache/hadoop/hdfs/DFSClient.java	2009-07-20 00:19:03.000000000 +0800
@@ -2734,6 +2734,7 @@
     LocatedBlock lb = null;
     boolean retry = false;
     DatanodeInfo[] nodes;
+    DatanodeInfo[] exludedNodes = null;
     int count = conf.getInt("dfs.client.block.write.retries", 3);
     boolean success;
     do {
@@ -2745,7 +2746,7 @@
       success = false;

       long startTime = System.currentTimeMillis();
-      lb = locateFollowingBlock(startTime);
+      lb = locateFollowingBlock(startTime, exludedNodes);
       block = lb.getBlock();
       nodes = lb.getLocations();
@@ -2755,6 +2756,19 @@
       success = createBlockOutputStream(nodes, clientName, false);
       if (!success) {
+
+        LOG.info("Excluding node: " + nodes[errorIndex]);
+        // Mark datanode as excluded
+        DatanodeInfo errorNode = nodes[errorIndex];
+        if (exludedNodes != null) {
+          DatanodeInfo[] newExcludedNodes = new DatanodeInfo[exludedNodes.length + 1];
+          System.arraycopy(exludedNodes, 0, newExcludedNodes, 0, exludedNodes.length);
+          newExcludedNodes[exludedNodes.length] = errorNode;
+          exludedNodes = newExcludedNodes;
+        } else {
+          exludedNodes = new DatanodeInfo[] { errorNode };
+        }
+
         LOG.info("Abandoning block " + block);
         namenode.abandonBlock(block, src, clientName);
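The hunks above only cover the client-side bookkeeping; the overloaded locateFollowingBlock(startTime, exludedNodes) they call is not part of the quoted patch. A minimal sketch of what that overload could look like, assuming the namenode.addBlock() RPC is also extended to accept the excluded-node list (that signature is an assumption here, not something shown in the patch):

    // Sketch only, inside DFSClient.DFSOutputStream. Assumes ClientProtocol.addBlock()
    // has been extended with an excludedNodes parameter, which the quoted patch does not show.
    private LocatedBlock locateFollowingBlock(long start, DatanodeInfo[] excludedNodes)
        throws IOException {
      // 'start' is used for timeout bookkeeping in the real client; omitted in this sketch.
      int retries = conf.getInt("dfs.client.block.write.retries", 3);
      long sleeptime = 6000;  // the 6-second back-off described in the comment above
      while (true) {
        try {
          // Hand the NN the datanodes this client could not connect to, so they are
          // skipped when the replicas for the new block are chosen.
          return namenode.addBlock(src, clientName, excludedNodes);
        } catch (RemoteException e) {
          if (--retries == 0) {
            throw e;
          }
          LOG.info("Retrying block allocation: " + e.getMessage());
          try {
            Thread.sleep(sleeptime);
          } catch (InterruptedException ie) {
            // interrupted while backing off; retry immediately
          }
        }
      }
    }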
dhruba borthakur added a comment - 22/Jul/09 07:14 AM

Hi Ruyue, your option of excluding specific datanodes (specified by the client) sounds reasonable. This might help in the case of network partitioning, where a specific client loses access to a set of datanodes while those datanodes are alive and well and able to send heartbeats to the namenode. Can you please create a separate JIRA for your proposed fix and attach your patch there? Thanks.

> In DFSOutputStream.nextBlockOutputStream(), the client can exclude specific datanodes when locating the next block.
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-630
>                 URL: https://issues.apache.org/jira/browse/HDFS-630
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: hdfs client
>    Affects Versions: 0.20.1, 0.21.0
>            Reporter: Ruyue Ma
>            Assignee: Ruyue Ma
>            Priority: Minor
>             Fix For: 0.21.0
>
>
> Created from HDFS-200.
> If, during a write, the dfsclient sees that a block replica location for a newly allocated block is not connectable, it re-requests the NN for a fresh set of replica locations for the block. It tries this dfs.client.block.write.retries times (default 3), sleeping 6 seconds between each retry (see DFSClient.nextBlockOutputStream).
> This setting works well when you have a reasonably sized cluster; if you have only a few datanodes in the cluster, every retry may pick the dead datanode and the above logic bails out.
> Our solution: when getting block locations from the namenode, we give the NN the excluded datanodes. The list of dead datanodes is only for one block allocation.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.