Date: Fri, 16 Nov 2012 21:57:12 +0000 (UTC)
From: "stack (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Created] (HDFS-4203) After recoverFileLease, datanode gets stuck complaining block '...has out of date GS ....may already be committed'

stack created HDFS-4203:
---------------------------

             Summary: After recoverFileLease, datanode gets stuck complaining block '...has out of date GS ....may already be committed'
                 Key: HDFS-4203
                 URL: https://issues.apache.org/jira/browse/HDFS-4203
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: name-node
    Affects Versions: 1.1.0
            Reporter: stack


After calling recoverFileLease, an append to a file gets stuck retrying this:

{code}
2012-11-16 13:06:14,298 DEBUG [IPC Server handler 2 on 53224] namenode.PendingReplicationBlocks(92): Removing pending replication for blockblk_-3222397051272483489_1006
2012-11-16 13:06:43,881 WARN [DataStreamer for file /hbase/hlog/hlog.dat.2 block blk_-3222397051272483489_1003] hdfs.DFSClient$DFSOutputStream(3216): Error Recovery for block blk_-3222397051272483489_1003 bad datanode[0] 127.0.0.1:53228
2012-11-16 13:06:43,881 WARN [DataStreamer for file /hbase/hlog/hlog.dat.2 block blk_-3222397051272483489_1003] hdfs.DFSClient$DFSOutputStream(3267): Error Recovery for block blk_-3222397051272483489_1003 in pipeline 127.0.0.1:53228, 127.0.0.1:53231: bad datanode 127.0.0.1:53228
2012-11-16 13:06:43,884 INFO [IPC Server handler 1 on 53233] datanode.DataNode(2123): Client calls recoverBlock(block=blk_-3222397051272483489_1003, targets=[127.0.0.1:53231])
2012-11-16 13:06:43,884 DEBUG [IPC Server handler 1 on 53233] datanode.FSDataset(2143): Interrupting active writer threads for block blk_-3222397051272483489_1006
2012-11-16 13:06:43,884 DEBUG [IPC Server handler 1 on 53233] datanode.FSDataset(2159): getBlockMetaDataInfo successful block=blk_-3222397051272483489_1006 length 120559 genstamp 1006
2012-11-16 13:06:43,884 DEBUG [IPC Server handler 1 on 53233] datanode.DataNode(2039): block=blk_-3222397051272483489_1003, (length=120559), syncList=[BlockRecord(info=BlockRecoveryInfo(block=blk_-3222397051272483489_1006 wasRecoveredOnStartup=false) node=127.0.0.1:53231)], closeFile=false
2012-11-16 13:06:43,885 INFO [IPC Server handler 2 on 53224] namenode.FSNamesystem(5468): blk_-3222397051272483489_1003 has out of date GS 1003 found 1006, may already be committed
2012-11-16 13:06:43,885 ERROR [IPC Server handler 2 on 53224] security.UserGroupInformation(1139): PriviledgedActionException as:stack cause:java.io.IOException: blk_-3222397051272483489_1003 has out of date GS 1003 found 1006, may already be committed
2012-11-16 13:06:43,885 ERROR [IPC Server handler 1 on 53233] security.UserGroupInformation(1139): PriviledgedActionException as:blk_-3222397051272483489_1003 cause:org.apache.hadoop.ipc.RemoteException: java.io.IOException: blk_-3222397051272483489_1003 has out of date GS 1003 found 1006, may already be committed
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.nextGenerationStampForBlock(FSNamesystem.java:5469)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.nextGenerationStamp(NameNode.java:781)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:578)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387)
2012-11-16 13:06:43,886 WARN [DataStreamer for file /hbase/hlog/hlog.dat.2 block blk_-3222397051272483489_1003] hdfs.DFSClient$DFSOutputStream(3292): Failed recovery attempt #1 from primary datanode 127.0.0.1:53231
org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.ipc.RemoteException: java.io.IOException: blk_-3222397051272483489_1003 has out of date GS 1003 found 1006, may already be committed
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.nextGenerationStampForBlock(FSNamesystem.java:5469)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.nextGenerationStamp(NameNode.java:781)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:578)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387)
	at org.apache.hadoop.ipc.Client.call(Client.java:1092)
	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
	at $Proxy12.nextGenerationStamp(Unknown Source)
	at org.apache.hadoop.hdfs.server.datanode.DataNode.syncBlock(DataNode.java:2059)
	at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:2027)
	at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:2107)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:578)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387)
	at org.apache.hadoop.ipc.Client.call(Client.java:1092)
	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
	at $Proxy17.recoverBlock(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3290)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2300(DFSClient.java:2754)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2958)
{code}

Eventually the DFSClient fails with:

{code}
2012-11-16 13:06:53,945 WARN [ZombieLastLogWriterRegionServer] hdfs.DFSClient$DFSOutputStream(3918): Error while syncing
java.io.IOException: Error Recovery for block blk_-3222397051272483489_1003 failed because recovery from primary datanode 127.0.0.1:53228 failed 6 times. Pipeline was 127.0.0.1:53228. Aborting...
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3326)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2300(DFSClient.java:2754)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2958)
{code}

It is good that the client eventually comes back, so this issue is not the end of the world, but it seems uninterruptible up in the client while it is doing the above retrying. It would just be nice if the DN gave up quicker when the lease is gone.

Looking in the code, it seems that in DataNode#syncBlock we make an explicit call against the NN, nextGenerationStamp (see the stack trace above), and it is this call that is failing. Should we be checking that we still hold the file lease first?

I see this issue in 1.1 and 1.0. It looks like it is also in trunk, though I've not tried it there.

If someone makes a suggestion on what to do here, I can try making a patch. Thanks.
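For reference, a rough, untested sketch of the scenario above, assuming the branch-1 DistributedFileSystem#recoverLease and FSDataOutputStream#sync APIs. The class, method, and path names here are purely illustrative: one "zombie" process keeps syncing an open WAL-style file while a second process recovers the lease out from under it, roughly what HBase's FSHDFSUtils#recoverFileLease does during log splitting.

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class LeaseRecoveryRepro {
  static final Path LOG = new Path("/hbase/hlog/hlog.dat.2");

  /** Process A: a "zombie" writer that keeps writing and syncing, never closing. */
  static void writer(Configuration conf) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    FSDataOutputStream out = fs.create(LOG);
    while (true) {
      out.write("edit".getBytes());
      // Once the lease has been recovered by process B, this sync kicks off
      // pipeline error recovery, and the DataStreamer loops on
      // "has out of date GS ... may already be committed".
      out.sync();
      Thread.sleep(100);
    }
  }

  /** Process B: recovers the lease out from under the writer, roughly what
   *  HBase's FSHDFSUtils#recoverFileLease does during log splitting. */
  static void recoverer(Configuration conf) throws Exception {
    DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
    dfs.recoverLease(LOG);  // bumps the block's generation stamp (1003 -> 1006 above)
  }
}
{code}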
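And a sketch of the "give up quicker" idea, purely as illustration: detect the "may already be committed" response and treat it as non-retriable instead of burning through the recovery retry budget. Whether such a check belongs in DFSClient#processDatanodeError or in DataNode#syncBlock is exactly the open question above; the helper name below is made up, and a real patch would presumably use a dedicated exception type rather than matching the message string.

{code}
import java.io.IOException;
import org.apache.hadoop.ipc.RemoteException;

final class StaleGenerationStampCheck {
  /**
   * True if block recovery failed because the NN has already handed out a newer
   * generation stamp (i.e. someone else recovered the lease), in which case
   * retrying with our stale GS can never succeed and the recovery loop could
   * abort immediately.
   */
  static boolean isStaleGenerationStamp(IOException e) {
    IOException cause = (e instanceof RemoteException)
        ? ((RemoteException) e).unwrapRemoteException()
        : e;
    String msg = cause.getMessage();
    // Matches the FSNamesystem message quoted in the log above; string matching
    // is only for illustration.
    return msg != null && msg.contains("may already be committed");
  }
}
{code}

If this returned true, the caller could bail out on the first attempt rather than retrying six times and then having the client retry on top of that.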