hadoop-hdfs-issues mailing list archives

From "Yongjun Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-8960) DFS client says "no more good datanodes being available to try" on a single drive failure
Date Fri, 28 Aug 2015 07:27:45 GMT

    [ https://issues.apache.org/jira/browse/HDFS-8960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14718146#comment-14718146 ]

Yongjun Zhang commented on HDFS-8960:
-------------------------------------

Yes, it's trying to do pipeline recovery; see below:

{code}
[yzhang@localhost Downloads]$ grep -B 3 blk_1073817519  r12s16-datanode.log | grep firstbadlink
15/08/23 07:21:49 INFO datanode.DataNode: Datanode 2 got response for connect ack  from downstream datanode with firstbadlink as 172.24.32.5:10110
15/08/23 07:21:49 INFO datanode.DataNode: Datanode 2 forwarding connect ack to upstream firstbadlink is 172.24.32.5:10110
15/08/23 07:21:52 INFO datanode.DataNode: Datanode 2 got response for connect ack  from downstream datanode with firstbadlink as 172.24.32.1:10110
15/08/23 07:21:52 INFO datanode.DataNode: Datanode 2 forwarding connect ack to upstream firstbadlink is 172.24.32.1:10110
15/08/23 07:21:55 INFO datanode.DataNode: Datanode 2 got response for connect ack  from downstream datanode with firstbadlink as 172.24.32.6:10110
15/08/23 07:21:55 INFO datanode.DataNode: Datanode 2 forwarding connect ack to upstream firstbadlink is 172.24.32.6:10110
15/08/23 07:21:58 INFO datanode.DataNode: Datanode 2 got response for connect ack  from downstream datanode with firstbadlink as 172.24.32.8:10110
15/08/23 07:21:58 INFO datanode.DataNode: Datanode 2 forwarding connect ack to upstream firstbadlink is 172.24.32.8:10110
15/08/23 07:22:01 INFO datanode.DataNode: Datanode 2 got response for connect ack  from downstream datanode with firstbadlink as 172.24.32.14:10110
15/08/23 07:22:01 INFO datanode.DataNode: Datanode 2 forwarding connect ack to upstream firstbadlink is 172.24.32.14:10110
15/08/23 07:22:04 INFO datanode.DataNode: Datanode 2 got response for connect ack  from downstream datanode with firstbadlink as 172.24.32.2:10110
15/08/23 07:22:04 INFO datanode.DataNode: Datanode 2 forwarding connect ack to upstream firstbadlink is 172.24.32.2:10110
15/08/23 07:22:07 INFO datanode.DataNode: Datanode 2 got response for connect ack  from downstream datanode with firstbadlink as 172.24.32.9:10110
15/08/23 07:22:07 INFO datanode.DataNode: Datanode 2 forwarding connect ack to upstream firstbadlink is 172.24.32.9:10110
15/08/23 07:22:10 INFO datanode.DataNode: Datanode 2 got response for connect ack  from downstream datanode with firstbadlink as 172.24.32.3:10110
15/08/23 07:22:10 INFO datanode.DataNode: Datanode 2 forwarding connect ack to upstream firstbadlink is 172.24.32.3:10110
15/08/23 07:22:13 INFO datanode.DataNode: Datanode 2 got response for connect ack  from downstream datanode with firstbadlink as 172.24.32.7:10110
15/08/23 07:22:13 INFO datanode.DataNode: Datanode 2 forwarding connect ack to upstream firstbadlink is 172.24.32.7:10110
15/08/23 07:22:16 INFO datanode.DataNode: Datanode 2 got response for connect ack  from downstream datanode with firstbadlink as 172.24.32.10:10110
15/08/23 07:22:16 INFO datanode.DataNode: Datanode 2 forwarding connect ack to upstream firstbadlink is 172.24.32.10:10110
15/08/23 07:22:19 INFO datanode.DataNode: Datanode 2 got response for connect ack  from downstream datanode with firstbadlink as 172.24.32.12:10110
15/08/23 07:22:19 INFO datanode.DataNode: Datanode 2 forwarding connect ack to upstream firstbadlink is 172.24.32.12:10110
15/08/23 07:22:23 INFO datanode.DataNode: Datanode 2 got response for connect ack  from downstream datanode with firstbadlink as 172.24.32.11:10110
15/08/23 07:22:23 INFO datanode.DataNode: Datanode 2 forwarding connect ack to upstream firstbadlink is 172.24.32.11:10110
15/08/23 07:22:26 INFO datanode.DataNode: Datanode 2 got response for connect ack  from downstream datanode with firstbadlink as 172.24.32.15:10110
15/08/23 07:22:26 INFO datanode.DataNode: Datanode 2 forwarding connect ack to upstream firstbadlink is 172.24.32.15:10110
{code}

As it happens, the DN log you uploaded is from r12s13, which is not one of the nodes in the grepped messages (per your report, r12s13 is the last node in the initial pipeline), while r12s16 is the source node.
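
In case it helps with picking which logs to attach, a quick check like the following (a minimal sketch, assuming the DN logs are pulled to one machine and named the same way as the attachments here) would show which DN logs mention this block at all:

{code}
# Minimal sketch (bash): count how many lines in each DN log mention the block
# from this report. Log names follow the attachment naming used in this issue.
for log in r12s*-datanode.log; do
  echo "== $log: $(grep -c 'blk_1073817519' "$log") line(s) mentioning blk_1073817519 =="
done
{code}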

Would you please upload a few more DN logs?

Thanks.





> DFS client says "no more good datanodes being available to try" on a single drive failure
> -----------------------------------------------------------------------------------------
>
>                 Key: HDFS-8960
>                 URL: https://issues.apache.org/jira/browse/HDFS-8960
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs-client
>    Affects Versions: 2.7.1
>         Environment: openjdk version "1.8.0_45-internal"
> OpenJDK Runtime Environment (build 1.8.0_45-internal-b14)
> OpenJDK 64-Bit Server VM (build 25.45-b02, mixed mode)
>            Reporter: Benoit Sigoure
>         Attachments: blk_1073817519_77099.log, r12s13-datanode.log, r12s16-datanode.log
>
>
> Since we upgraded to 2.7.1 we regularly see single-drive failures cause widespread problems at the HBase level (with the default 3x replication target).
> Here's an example.  This HBase RegionServer is r12s16 (172.24.32.16) and is writing its WAL to [172.24.32.16:10110, 172.24.32.8:10110, 172.24.32.13:10110] as can be seen by the following occasional messages:
> {code}
> 2015-08-23 06:28:40,272 INFO  [sync.3] wal.FSHLog: Slow sync cost: 123 ms, current pipeline: [172.24.32.16:10110, 172.24.32.8:10110, 172.24.32.13:10110]
> {code}
> A bit later, the second node in the pipeline above is going to experience an HDD failure.
> {code}
> 2015-08-23 07:21:58,720 WARN  [DataStreamer for file /hbase/WALs/r12s16.sjc.aristanetworks.com,9104,1439917659071/r12s16.sjc.aristanetworks.com%2C9104%2C1439917659071.default.1440314434998 block BP-1466258523-172.24.32.1-1437768622582:blk_1073817519_77099] hdfs.DFSClient: Error Recovery for block BP-1466258523-172.24.32.1-1437768622582:blk_1073817519_77099 in pipeline 172.24.32.16:10110, 172.24.32.13:10110, 172.24.32.8:10110: bad datanode 172.24.32.8:10110
> {code}
> And then HBase will go like "omg I can't write to my WAL, let me commit suicide".
> {code}
> 2015-08-23 07:22:26,060 FATAL [regionserver/r12s16.sjc.aristanetworks.com/172.24.32.16:9104.append-pool1-t1] wal.FSHLog: Could not append. Requesting close of wal
> java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[172.24.32.16:10110, 172.24.32.13:10110], original=[172.24.32.16:10110, 172.24.32.13:10110]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:969)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1035)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1184)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:933)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:487)
> {code}
> Whereas this should be mostly a non-event as the DFS client should just drop the bad replica from the write pipeline.
> This is a small cluster but has 16 DNs so the failed DN in the pipeline should be easily replaced.  I didn't set {{dfs.client.block.write.replace-datanode-on-failure.policy}} (so it's still {{DEFAULT}}) and didn't set {{dfs.client.block.write.replace-datanode-on-failure.enable}} (so it's still {{true}}).
> I don't see anything noteworthy in the NN log around the time of the failure; it just seems like the DFS client gave up, or threw an exception back to HBase that it wasn't throwing before, or something else, and that made this single drive failure lethal.
> We've occasionally been "unlucky" enough to have a single-drive failure cause multiple RegionServers to commit suicide because they had their WALs on that drive.
> We upgraded from 2.7.0 about a month ago, and I'm not sure whether we were seeing this with 2.7 or not – prior to that we were running in a quite different environment, but this is a fairly new deployment.
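
For reference, the replacement policy named in the exception above is a client-side setting (valid values are NEVER, DEFAULT and ALWAYS, alongside the separate {{dfs.client.block.write.replace-datanode-on-failure.enable}} switch). A minimal sketch of overriding it for a single client run, assuming the client goes through the standard Hadoop generic options parser (hypothetical values, not a recommendation from this thread):

{code}
# Hypothetical per-client override of the datanode replacement behaviour;
# hadoop fs accepts -D generic options because it runs through ToolRunner.
hadoop fs \
  -Ddfs.client.block.write.replace-datanode-on-failure.enable=true \
  -Ddfs.client.block.write.replace-datanode-on-failure.policy=ALWAYS \
  -put localfile.txt /tmp/localfile.txt
{code}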



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
