hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-4203) After recoverFileLease, datanode gets stuck complaining block '...has out of date GS ....may already be committed'
Date Fri, 16 Nov 2012 22:33:12 GMT

    [ https://issues.apache.org/jira/browse/HDFS-4203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13499204#comment-13499204 ]

Todd Lipcon commented on HDFS-4203:
-----------------------------------

A few options here, I think:

1) When we call recoverBlock for a client, we pass the client's lease holder name to the DN. Then, when the DN asks for a new generation stamp, it passes the lease holder along to the NN, which can reject the request if the caller is not the current lease holder. This would likely be difficult to do in branch-1 without breaking compatibility, but could be done in branch-2 given that we have protobufs.
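
A rough sketch of how the lease holder could be threaded through the two RPC hops for option 1; both declarations below are hypothetical additions for illustration, not existing branch-1 signatures:

{code}
import java.io.IOException;

import org.apache.hadoop.hdfs.protocol.Block;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
import org.apache.hadoop.hdfs.protocol.LocatedBlock;

// Hypothetical protocol additions for option 1; neither method exists today.
// On branch-1, adding parameters like this means a protocol version bump,
// which is the compatibility problem mentioned above; on branch-2 the lease
// holder could instead be an optional field in the protobuf messages.
interface LeaseAwareRecoverySketch {

  // Client -> primary DN: the recovery request carries the lease holder name.
  LocatedBlock recoverBlock(Block block, boolean keepLength,
      DatanodeInfo[] targets, String leaseHolder) throws IOException;

  // Primary DN -> NN: the NN only hands out a new generation stamp if
  // leaseHolder still owns the lease on the file the block belongs to.
  long nextGenerationStamp(Block block, String leaseHolder) throws IOException;
}
{code}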

2) In the client, if the error message we get back from the recovery attempt says the generation stamp is out of date, then no number of retries is going to help, so we can stop retrying immediately at that point.
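
For option 2, the client-side change could be as small as classifying the error before scheduling another retry. A minimal sketch, keying off the message text of the RemoteException shown in the log below; the helper name and where exactly it would hook into DFSOutputStream#processDatanodeError are assumptions:

{code}
import java.io.IOException;

// Hypothetical helper for option 2: if recovery failed because our generation
// stamp is already stale, no retry of recoverBlock can ever succeed, so the
// streamer should surface the failure immediately instead of looping.
final class RecoveryErrorsSketch {
  private RecoveryErrorsSketch() {}

  static boolean isStaleGenerationStamp(IOException e) {
    String msg = e.getMessage();
    return msg != null
        && msg.contains("has out of date GS")
        && msg.contains("may already be committed");
  }
}

// Roughly, inside the existing recovery retry loop:
//   try {
//     primary.recoverBlock(...);
//   } catch (IOException e) {
//     if (RecoveryErrorsSketch.isStaleGenerationStamp(e)) {
//       throw e;   // fail fast: no number of retries will help
//     }
//     // otherwise fall through to the existing retry/backoff handling
//   }
{code}

Matching on the message string is brittle; a dedicated exception type (or the explicit lease check from option 1) would be a cleaner signal if compatibility allows.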
                
> After recoverFileLease, datanode gets stuck complaining block '...has out of date GS ....may already be committed'
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-4203
>                 URL: https://issues.apache.org/jira/browse/HDFS-4203
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 1.1.0
>            Reporter: stack
>
> After calling recoverFileLease, an append to a file gets stuck retrying this:
> {code}
> 2012-11-16 13:06:14,298 DEBUG [IPC Server handler 2 on 53224] namenode.PendingReplicationBlocks(92): Removing pending replication for blockblk_-3222397051272483489_1006
> 2012-11-16 13:06:43,881 WARN  [DataStreamer for file /hbase/hlog/hlog.dat.2 block blk_-3222397051272483489_1003] hdfs.DFSClient$DFSOutputStream(3216): Error Recovery for block blk_-3222397051272483489_1003 bad datanode[0] 127.0.0.1:53228
> 2012-11-16 13:06:43,881 WARN  [DataStreamer for file /hbase/hlog/hlog.dat.2 block blk_-3222397051272483489_1003] hdfs.DFSClient$DFSOutputStream(3267): Error Recovery for block blk_-3222397051272483489_1003 in pipeline 127.0.0.1:53228, 127.0.0.1:53231: bad datanode 127.0.0.1:53228
> 2012-11-16 13:06:43,884 INFO  [IPC Server handler 1 on 53233] datanode.DataNode(2123): Client calls recoverBlock(block=blk_-3222397051272483489_1003, targets=[127.0.0.1:53231])
> 2012-11-16 13:06:43,884 DEBUG [IPC Server handler 1 on 53233] datanode.FSDataset(2143): Interrupting active writer threads for block blk_-3222397051272483489_1006
> 2012-11-16 13:06:43,884 DEBUG [IPC Server handler 1 on 53233] datanode.FSDataset(2159): getBlockMetaDataInfo successful block=blk_-3222397051272483489_1006 length 120559 genstamp 1006
> 2012-11-16 13:06:43,884 DEBUG [IPC Server handler 1 on 53233] datanode.DataNode(2039): block=blk_-3222397051272483489_1003, (length=120559), syncList=[BlockRecord(info=BlockRecoveryInfo(block=blk_-3222397051272483489_1006 wasRecoveredOnStartup=false) node=127.0.0.1:53231)], closeFile=false
> 2012-11-16 13:06:43,885 INFO  [IPC Server handler 2 on 53224] namenode.FSNamesystem(5468): blk_-3222397051272483489_1003 has out of date GS 1003 found 1006, may already be committed
> 2012-11-16 13:06:43,885 ERROR [IPC Server handler 2 on 53224] security.UserGroupInformation(1139): PriviledgedActionException as:stack cause:java.io.IOException: blk_-3222397051272483489_1003 has out of date GS 1003 found 1006, may already be committed
> 2012-11-16 13:06:43,885 ERROR [IPC Server handler 1 on 53233] security.UserGroupInformation(1139): PriviledgedActionException as:blk_-3222397051272483489_1003 cause:org.apache.hadoop.ipc.RemoteException: java.io.IOException: blk_-3222397051272483489_1003 has out of date GS 1003 found 1006, may already be committed
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.nextGenerationStampForBlock(FSNamesystem.java:5469)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNode.nextGenerationStamp(NameNode.java:781)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:578)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:396)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387)
> 2012-11-16 13:06:43,886 WARN  [DataStreamer for file /hbase/hlog/hlog.dat.2 block blk_-3222397051272483489_1003] hdfs.DFSClient$DFSOutputStream(3292): Failed recovery attempt #1 from primary datanode 127.0.0.1:53231
> org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.ipc.RemoteException: java.io.IOException: blk_-3222397051272483489_1003 has out of date GS 1003 found 1006, may already be committed
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.nextGenerationStampForBlock(FSNamesystem.java:5469)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNode.nextGenerationStamp(NameNode.java:781)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:578)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:396)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387)
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1092)
> 	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
> 	at $Proxy12.nextGenerationStamp(Unknown Source)
> 	at org.apache.hadoop.hdfs.server.datanode.DataNode.syncBlock(DataNode.java:2059)
> 	at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:2027)
> 	at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:2107)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:578)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:396)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387)
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1092)
> 	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
> 	at $Proxy17.recoverBlock(Unknown Source)
> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3290)
> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2300(DFSClient.java:2754)
> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2958)
> {code}
> Eventually the DFSClient fails with:
> {code}
> 2012-11-16 13:06:53,945 WARN  [ZombieLastLogWriterRegionServer] hdfs.DFSClient$DFSOutputStream(3918): Error while syncing
> java.io.IOException: Error Recovery for block blk_-3222397051272483489_1003 failed  because recovery from primary datanode 127.0.0.1:53228 failed 6 times.  Pipeline was 127.0.0.1:53228. Aborting...
> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3326)
> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2300(DFSClient.java:2754)
> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2958)
> {code}
> It is good that it eventually comes back, so this issue is not the end of the world.  The client seems uninterruptible while it is doing the above retrying.  It would just be nice if the DN gave up quicker when the lease is gone.
> Looking at the code, it seems that in DataNode#syncBlock we make an explicit call against the NN to get a new generation stamp, and it is this call that is failing.  Should we be checking that we have the file lease first?
> I see this issue in 1.1 and 1.0.  It looks like it is also in trunk; I've not tried it.
> If someone makes a suggestion on what to do here, I can try making a patch.  Thanks.
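
Regarding the question at the end of the description about checking the file lease before asking the NN for a new generation stamp: combined with option 1 above, the NN-side check might read roughly like the following inside FSNamesystem. This is only a sketch; the extra leaseHolder parameter, the fileForBlock() lookup, and the delegate call at the end are assumptions, not branch-1 code.

{code}
// Sketch only: what FSNamesystem#nextGenerationStampForBlock could do once
// the lease holder travels with the recovery request (option 1 above).
// Helper names here are illustrative.
synchronized long nextGenerationStampForBlock(Block block, String leaseHolder)
    throws IOException {
  // Hypothetical lookup of the under-construction file this block belongs to.
  INodeFileUnderConstruction pendingFile = fileForBlock(block);
  if (!pendingFile.getClientName().equals(leaseHolder)) {
    // The caller's lease is gone (e.g. recoverFileLease handed it to another
    // client), so reject with an error the client can treat as permanent
    // instead of the generic "out of date GS ... may already be committed".
    throw new LeaseExpiredException("Lease on " + pendingFile
        + " is no longer held by " + leaseHolder);
  }
  // Delegate to whatever allocates the next stamp today (call name approximate).
  return nextGenerationStamp();
}
{code}

With a check like this in place, option 2's client-side fast-fail becomes straightforward as well, since the client sees a clearly non-retriable error.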

