hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-3031) HA: Error (failed to close file) when uploading large file + kill active NN + manual failover
Date Sun, 13 May 2012 02:14:48 GMT

     [ https://issues.apache.org/jira/browse/HDFS-3031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HDFS-3031:
------------------------------

    Attachment: hdfs-3031.txt

Attached patch implements idempotence on the complete() RPC as well.

The approach is similar to the earlier addBlock() change but simpler, because the client reliably
sends the last block as part of this RPC. Let me know if the new comment block isn't clear enough.
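
To make the idea concrete, here is a minimal, self-contained sketch of the retry-idempotence pattern described above. It is illustrative only, not the patch itself: the FileCloser class, its complete() method, and the lastBlockId bookkeeping are invented for the example.

import java.util.HashMap;
import java.util.Map;

// Illustrative only: models why complete() can be idempotent when the client
// includes the last block in the call. If a previous attempt already closed
// the file and the reported last block matches what was recorded, a retried
// complete() is answered with success instead of an error.
public class FileCloser {
  // file path -> id of the last block the file was closed with
  private final Map<String, Long> closedFiles = new HashMap<String, Long>();

  public synchronized boolean complete(String path, long lastBlockId) {
    Long recorded = closedFiles.get(path);
    if (recorded != null) {
      // Retried call after a failover: succeed only if the last block matches
      // what the earlier, successful attempt recorded.
      return recorded == lastBlockId;
    }
    // First close: remember the last block the file was closed with.
    closedFiles.put(path, lastBlockId);
    return true;
  }

  public static void main(String[] args) {
    FileCloser nn = new FileCloser();
    System.out.println(nn.complete("/user/schu/f", 42L)); // true: first close
    System.out.println(nn.complete("/user/schu/f", 42L)); // true: idempotent retry
    System.out.println(nn.complete("/user/schu/f", 43L)); // false: mismatched last block
  }
}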

I also removed the debug log message from DFSOutputStream that I accidentally left in.

The reason for the test changes in MiniDFSCluster is that, without them, the cluster.transitionToActive()
call used to fail back in doTestWriteOverFailoverWithDnFail was failing: the IPC client cache was
holding a connection to the restarted node, so issuing the transitionToActive() RPC over that stale
connection threw an EOFException.

I think if other tests want to explicitly test the RPC call, they should use the normal APIs
like HAAdmin or FailoverController.
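
For context, here is a hedged sketch of the failover sequence this kind of test drives through MiniDFSCluster. It assumes the HA test utilities on this branch (MiniDFSNNTopology.simpleHATopology(), HATestUtil.configureFailoverFs()); exact names and signatures may differ between versions, and the actual large-file write in the middle is elided.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.apache.hadoop.hdfs.MiniDFSNNTopology;
import org.apache.hadoop.hdfs.server.namenode.ha.HATestUtil;

public class WriteOverFailoverSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf)
        .nnTopology(MiniDFSNNTopology.simpleHATopology())
        .numDataNodes(3)
        .build();
    try {
      cluster.waitActive();
      cluster.transitionToActive(0);            // NN0 starts out active
      FileSystem fs = HATestUtil.configureFailoverFs(cluster, conf);

      // ... start writing a large file against the active NN here ...

      cluster.shutdownNameNode(0);              // simulate killing the active NN
      cluster.transitionToActive(1);            // manual failover to NN1

      // With idempotent addBlock()/complete(), the retried RPCs go to NN1
      // and the client can still close its file successfully.
      fs.create(new Path("/after-failover")).close();
    } finally {
      cluster.shutdown();
    }
  }
}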
                
> HA: Error (failed to close file) when uploading large file + kill active NN + manual failover
> ---------------------------------------------------------------------------------------------
>
>                 Key: HDFS-3031
>                 URL: https://issues.apache.org/jira/browse/HDFS-3031
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 0.24.0
>            Reporter: Stephen Chu
>            Assignee: Todd Lipcon
>         Attachments: hdfs-3031.txt, hdfs-3031.txt, hdfs-3031.txt, hdfs-3031.txt, styx01_killNNfailover, styx01_uploadLargeFile
>
>
> I executed section 3.4 of Todd's HA test plan. https://issues.apache.org/jira/browse/HDFS-1623
> 1. A large file upload is started.
> 2. While the file is being uploaded, the administrator kills the first NN and performs a failover.
> 3. After the file finishes being uploaded, it is verified for correct length and contents.
> For the test, I have a vm_template styx01:/home/schu/centos64-2-5.5.qcow2. styx01 hosted the active NN and styx02 hosted the standby NN.
> In the log files I attached, you can see that on styx01 I began file upload.
> hadoop fs -put centos64-2.5.5.qcow2
> After waiting several seconds, I kill -9'd the active NN on styx01 and manually failed over to the NN on styx02. I ran into the exception below. (rest of the stacktrace in the attached file styx01_uploadLargeFile)
> 12/02/29 14:12:52 WARN retry.RetryInvocationHandler: A failover has occurred since the start of this method invocation attempt.
> put: Failed on local exception: java.io.EOFException; Host Details : local host is: "styx01.sf.cloudera.com/172.29.5.192"; destination host is: "styx01.sf.cloudera.com":12020;
> 12/02/29 14:12:52 ERROR hdfs.DFSClient: Failed to close file /user/schu/centos64-2-5.5.qcow2._COPYING_
> java.io.IOException: Failed on local exception: java.io.EOFException; Host Details : local host is: "styx01.sf.cloudera.com/172.29.5.192"; destination host is: "styx01.sf.cloudera.com":12020;
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1145)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:188)
>         at $Proxy9.addBlock(Unknown Source)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:302)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
>         at $Proxy10.addBlock(Unknown Source)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1097)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:973)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:455)
> Caused by: java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:375)
>         at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:830)
>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:762)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
