From: "Ruyue Ma (JIRA)"
To: hdfs-issues@hadoop.apache.org
Date: Fri, 18 Sep 2009 00:08:57 -0700 (PDT)
Subject: [jira] Commented: (HDFS-630) In DFSOutputStream.nextBlockOutputStream(), the client can exclude specific datanodes when locating the next block.
Message-ID: <1458673616.1253257737688.JavaMail.jira@brutus>
In-Reply-To: <794054798.1253257617910.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/HDFS-630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757019#action_12757019 ]

Ruyue Ma commented on HDFS-630:
-------------------------------

Ruyue Ma added a comment - 20/Jul/09 11:32 PM

to: dhruba borthakur

> This is not related to HDFS-4379. Let me explain why.
>
> The problem is actually related to HDFS-xxx. The namenode waits for 10 minutes after losing heartbeats from a datanode before declaring it dead. During those 10 minutes, the NN is free to choose the dead datanode as a possible replica for a newly allocated block.
>
> If, during a write, the dfsclient sees that a block replica location for a newly allocated block is not connectable, it re-requests the NN for a fresh set of replica locations for the block. It tries this dfs.client.block.write.retries times (default 3), sleeping 6 seconds between each retry (see DFSClient.nextBlockOutputStream).
>
> This setting works well when you have a reasonably sized cluster; if you have only 4 datanodes in the cluster, every retry picks the dead datanode and the above logic bails out.
>
> One solution is to change the value of dfs.client.block.write.retries to a much larger value, say 200 or so. Better still, increase the number of nodes in your cluster.
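A minimal sketch of that stopgap on the writing client, assuming the client builds its own Configuration object before opening the FileSystem (the property name is the one quoted in the code below; the value 200 is just the example figure from the comment):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    // Sketch only: raise the per-block-allocation retry count for this client.
    Configuration conf = new Configuration();
    conf.setInt("dfs.client.block.write.retries", 200);  // default is 3
    FileSystem fs = FileSystem.get(conf);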
Our modification: when getting block locations from the namenode, we give the NN the excluded datanodes. The list of dead datanodes is only for one block allocation.

+++ hadoop-new/src/hdfs/org/apache/hadoop/hdfs/DFSClient.java	2009-07-20 00:19:03.000000000 +0800
@@ -2734,6 +2734,7 @@
     LocatedBlock lb = null;
     boolean retry = false;
     DatanodeInfo[] nodes;
+    DatanodeInfo[] exludedNodes = null;
     int count = conf.getInt("dfs.client.block.write.retries", 3);
     boolean success;
     do {
@@ -2745,7 +2746,7 @@
       success = false;

       long startTime = System.currentTimeMillis();
-      lb = locateFollowingBlock(startTime);
+      lb = locateFollowingBlock(startTime, exludedNodes);
       block = lb.getBlock();
       nodes = lb.getLocations();
@@ -2755,6 +2756,19 @@
       success = createBlockOutputStream(nodes, clientName, false);
       if (!success) {
+
+        LOG.info("Excluding node: " + nodes[errorIndex]);
+        // Mark datanode as excluded
+        DatanodeInfo errorNode = nodes[errorIndex];
+        if (exludedNodes != null) {
+          DatanodeInfo[] newExcludedNodes = new DatanodeInfo[exludedNodes.length + 1];
+          System.arraycopy(exludedNodes, 0, newExcludedNodes, 0, exludedNodes.length);
+          newExcludedNodes[exludedNodes.length] = errorNode;
+          exludedNodes = newExcludedNodes;
+        } else {
+          exludedNodes = new DatanodeInfo[] { errorNode };
+        }
+
         LOG.info("Abandoning block " + block);
         namenode.abandonBlock(block, src, clientName);
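The hunks above only cover the client-side bookkeeping; the overloaded locateFollowingBlock(startTime, exludedNodes) they call is not part of the quoted patch. A minimal sketch of what that overload could look like, assuming the namenode.addBlock() RPC is also extended to accept the excluded-node list (that signature is an assumption here, not something shown in the patch):

    // Sketch only, inside DFSClient.DFSOutputStream. Assumes ClientProtocol.addBlock()
    // has been extended with an excludedNodes parameter, which the quoted patch does not show.
    private LocatedBlock locateFollowingBlock(long start, DatanodeInfo[] excludedNodes)
        throws IOException {
      // 'start' is used for timeout bookkeeping in the real client; omitted in this sketch.
      int retries = conf.getInt("dfs.client.block.write.retries", 3);
      long sleeptime = 6000;  // the 6-second back-off described in the comment above
      while (true) {
        try {
          // Hand the NN the datanodes this client could not connect to, so they are
          // skipped when the replicas for the new block are chosen.
          return namenode.addBlock(src, clientName, excludedNodes);
        } catch (RemoteException e) {
          if (--retries == 0) {
            throw e;
          }
          LOG.info("Retrying block allocation: " + e.getMessage());
          try {
            Thread.sleep(sleeptime);
          } catch (InterruptedException ie) {
            // interrupted while backing off; retry immediately
          }
        }
      }
    }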
dhruba borthakur added a comment - 22/Jul/09 07:14 AM

Hi Ruyue, your option of excluding specific datanodes (specified by the client) sounds reasonable. This might help in the case of network partitioning, where a specific client loses access to a set of datanodes while those datanodes are alive and well and able to send heartbeats to the namenode. Can you please create a separate JIRA for your proposed fix and attach your patch there? Thanks.

> In DFSOutputStream.nextBlockOutputStream(), the client can exclude specific datanodes when locating the next block.
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-630
>                 URL: https://issues.apache.org/jira/browse/HDFS-630
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: hdfs client
>    Affects Versions: 0.20.1, 0.21.0
>            Reporter: Ruyue Ma
>            Assignee: Ruyue Ma
>            Priority: Minor
>             Fix For: 0.21.0
>
>
> Created from HDFS-200.
> If, during a write, the dfsclient sees that a block replica location for a newly allocated block is not connectable, it re-requests the NN for a fresh set of replica locations for the block. It tries this dfs.client.block.write.retries times (default 3), sleeping 6 seconds between each retry (see DFSClient.nextBlockOutputStream).
> This setting works well when you have a reasonably sized cluster; if you have only a few datanodes in the cluster, every retry may pick the dead datanode and the above logic bails out.
> Our solution: when getting block locations from the namenode, we give the NN the excluded datanodes. The list of dead datanodes is only for one block allocation.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.