hadoop-hdfs-dev mailing list archives

From "Lars Hofhansl (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (HDFS-4455) Datanode sometimes gives up permanently on Namenode in HA setup
Date Sun, 18 May 2014 06:16:16 GMT

     [ https://issues.apache.org/jira/browse/HDFS-4455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars Hofhansl resolved HDFS-4455.
---------------------------------

    Resolution: Implemented

> Datanode sometimes gives up permanently on Namenode in HA setup
> ---------------------------------------------------------------
>
>                 Key: HDFS-4455
>                 URL: https://issues.apache.org/jira/browse/HDFS-4455
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, ha
>    Affects Versions: 2.0.2-alpha
>            Reporter: Lars Hofhansl
>            Assignee: Juan Yu
>            Priority: Critical
>
> Today we got ourselves into a situation where we hard killed the cluster (kill -9 across
the board on all processes) and upon restarting all DNs would permanently give up on one of the
NNs in our two NN HA setup (using QJM).
> The HA setup is correct (prior to this we failed over the NNs many times for testing).
Bouncing the DNs resolved the problem.
> In the logs I see this exception:
> {code}
> 2013-01-29 23:32:49,461 FATAL datanode.DataNode - Initialization failed for block pool
Block pool BP-1852726028-<ip>-1358813649047 (storage id DS-60505003-<ip>-50010-1353106051747)
service to <host>/<ip>:8020
> java.io.IOException: Failed on local exception: java.io.IOException: Response is null.;
Host Details : local host is: "<host>/<ip>"; destination host is: "<host>":8020;
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:759)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1164)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
>         at $Proxy10.registerDatanode(Unknown Source)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
>         at $Proxy10.registerDatanode(Unknown Source)
>         at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:149)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:619)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:221)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:661)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: java.io.IOException: Response is null.
>         at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:885)
>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:813)
> 2013-01-29 23:32:49,463 WARN  datanode.DataNode - Ending block pool service for: Block
pool BP-1852726028-<ip>-1358813649047 (storage id DS-60505003-<ip>-50010-1353106051747)
service to <host>/<ip>:8020
> {code}
> So somehow in BPServiceActor.connectToNNAndHandshake() we made it all the way to register().
It then failed in bpNamenode.registerDatanode(bpRegistration) with an IOException, which is not
caught and causes the block pool service to fail as a whole.
> No doubt that was caused by one of the NNs being in a weird state. While that happened the
active NN claimed that the FS was corrupted and stayed in safe mode, and DNs only registered
with the standby NN. Failing over to the 2nd NN and then restarting the first NN and failing
back did not change that.
> No amount of bouncing/failing over the HA NNs would have the DNs reconnect to one of the
NNs.
> In BPServiceActor.register(), should we catch IOException instead of SocketTimeoutException?
That way it would continue to retry and eventually connect to the NN.
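
The difference between the two catch clauses can be illustrated with a small standalone sketch. This is not the actual BPServiceActor code or the committed fix. The class RegisterRetrySketch, the Registrar interface, and the maxAttempts/sleepMillis parameters are all hypothetical names made up for illustration, and the loop is bounded only so the example terminates. It assumes the registration call sits in a retry loop and shows that catching IOException (the superclass of SocketTimeoutException) keeps the loop alive when the call fails with "Response is null.":

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

// Hypothetical sketch, not Hadoop code: retry registration on any IOException.
class RegisterRetrySketch {

    /** Hypothetical stand-in for the registerDatanode RPC to one NN. */
    interface Registrar {
        void register() throws IOException;
    }

    /**
     * Calls register() until it succeeds or maxAttempts is reached.
     * Returns the number of attempts used; rethrows the last failure
     * if every attempt failed.
     */
    static int registerWithRetry(Registrar nn, int maxAttempts, long sleepMillis)
            throws IOException, InterruptedException {
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                nn.register();
                return attempt;
            } catch (IOException e) {
                // Catching IOException also covers SocketTimeoutException,
                // so both timeouts and other RPC failures are retried.
                last = e;
                Thread.sleep(sleepMillis);
            }
        }
        throw last != null ? last : new IOException("no attempts made");
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated NN: a timeout first, then a generic IOException, then success.
        int attempts = registerWithRetry(() -> {
            calls[0]++;
            if (calls[0] == 1) throw new SocketTimeoutException("timed out");
            if (calls[0] == 2) throw new IOException("Response is null.");
        }, 10, 10L);
        System.out.println("registered after " + attempts + " attempts");
    }
}
```

With a catch clause limited to SocketTimeoutException, the second simulated failure would propagate out of the loop, which corresponds to the "Initialization failed for block pool" path in the log above.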



--
This message was sent by Atlassian JIRA
(v6.2#6252)
