Date: Fri, 11 May 2012 19:02:50 +0000 (UTC)
From: "Aaron T. Myers (JIRA)"
To: hdfs-issues@hadoop.apache.org
Message-ID: <1363510967.55686.1336762970785.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <80372957.5224.1330555797131.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (HDFS-3031) HA: Error (failed to close file) when uploading large file + kill active NN + manual failover

    [ https://issues.apache.org/jira/browse/HDFS-3031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273511#comment-13273511 ]

Aaron T. Myers commented on HDFS-3031:
--------------------------------------

Patch looks pretty good to me, Todd. A few comments:

# Did you intend to leave that INFO log in DFSOutputStream? It seems to me like it should either be removed or lowered to debug or trace.
# Why make the changes in MiniDFSCluster? Those seem unrelated to this issue, and they have the potential to cause unintended side effects in the tests, since some tests might be relying on the RPC behavior which this change cuts out of the call stack.

I really like the new test testIdempotentAllocateBlock, and thanks for converting TestFileAppend3 to JUnit 4-style.

> HA: Error (failed to close file) when uploading large file + kill active NN + manual failover
> ---------------------------------------------------------------------------------------------
>
>                 Key: HDFS-3031
>                 URL: https://issues.apache.org/jira/browse/HDFS-3031
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 0.24.0
>            Reporter: Stephen Chu
>            Assignee: Todd Lipcon
>         Attachments: hdfs-3031.txt, hdfs-3031.txt, hdfs-3031.txt, styx01_killNNfailover, styx01_uploadLargeFile
>
>
> I executed section 3.4 of Todd's HA test plan. https://issues.apache.org/jira/browse/HDFS-1623
> 1. A large file upload is started.
> 2. While the file is being uploaded, the administrator kills the first NN and performs a failover.
> 3. After the file finishes being uploaded, it is verified for correct length and contents.
> For the test, I have a vm_template styx01:/home/schu/centos64-2-5.5.qcow2. styx01 hosted the active NN and styx02 hosted the standby NN.
> In the log files I attached, you can see that on styx01 I began file upload.
> hadoop fs -put centos64-2-5.5.qcow2
> After waiting several seconds, I kill -9'd the active NN on styx01 and manually failed over to the NN on styx02. I ran into the exception below. (rest of the stacktrace in the attached file styx01_uploadLargeFile)
> 12/02/29 14:12:52 WARN retry.RetryInvocationHandler: A failover has occurred since the start of this method invocation attempt.
> put: Failed on local exception: java.io.EOFException; Host Details : local host is: "styx01.sf.cloudera.com/172.29.5.192"; destination host is: ""styx01.sf.cloudera.com":12020;
> 12/02/29 14:12:52 ERROR hdfs.DFSClient: Failed to close file /user/schu/centos64-2-5.5.qcow2._COPYING_
> java.io.IOException: Failed on local exception: java.io.EOFException; Host Details : local host is: "styx01.sf.cloudera.com/172.29.5.192"; destination host is: ""styx01.sf.cloudera.com":12020;
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1145)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:188)
>         at $Proxy9.addBlock(Unknown Source)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:302)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
>         at $Proxy10.addBlock(Unknown Source)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1097)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:973)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:455)
> Caused by: java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:375)
>         at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:830)
>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:762)
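
Editorial illustration, not part of the original message or of the HDFS-3031 patch: the stack trace above shows an addBlock call being cut off by the failover, and the test name testIdempotentAllocateBlock suggests the fix is about letting the client safely repeat that call. The sketch below is one possible reading of that idea, under stated assumptions. The class ToyBlockAllocator, its methods, and the rule "a caller naming the second-to-last block is treated as a retry" are all hypothetical and are not taken from the Hadoop source.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Hypothetical toy stand-in for namenode block allocation; NOT the HDFS implementation. */
class ToyBlockAllocator {
    private final Map<String, List<Long>> blocksByFile = new HashMap<>();
    private long nextBlockId = 1;

    /**
     * Allocates the block that follows previousBlockId in the given file.
     * A null previousBlockId means "allocate the first block".
     * Assumption: if the caller names the second-to-last block, the earlier
     * response was lost, so the existing last block is returned again.
     */
    synchronized long addBlock(String path, Long previousBlockId) {
        List<Long> blocks = blocksByFile.computeIfAbsent(path, p -> new ArrayList<>());
        int n = blocks.size();
        Long last = (n == 0) ? null : blocks.get(n - 1);
        Long secondToLast = (n < 2) ? null : blocks.get(n - 2);

        if (Objects.equals(previousBlockId, last)) {
            // Normal case: the caller has seen the current last block.
            long id = nextBlockId++;
            blocks.add(id);
            return id;
        }
        if (n >= 1 && Objects.equals(previousBlockId, secondToLast)) {
            // Retry case: hand back the block allocated by the lost call.
            return last;
        }
        throw new IllegalStateException("unexpected previous block for " + path);
    }

    /** Number of blocks currently recorded for the file. */
    synchronized int blockCount(String path) {
        List<Long> blocks = blocksByFile.get(path);
        return (blocks == null) ? 0 : blocks.size();
    }

    public static void main(String[] args) {
        ToyBlockAllocator nn = new ToyBlockAllocator();
        long first = nn.addBlock("/user/example/file", null);
        // Simulate a client that never saw the response and retries after a failover.
        long retried = nn.addBlock("/user/example/file", null);
        System.out.println("same block on retry: " + (first == retried));
        System.out.println("blocks in file: " + nn.blockCount("/user/example/file"));
    }
}
```

In this toy model a repeated call does not add a second block to the file, which is the property a client needs when a retry handler re-sends a request whose response was lost.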