hadoop-hdfs-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Keith Turner (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HDFS-8406) Lease recovery continually failed
Date Thu, 14 May 2015 23:01:00 GMT
Keith Turner created HDFS-8406:
----------------------------------

             Summary: Lease recovery continually failed
                 Key: HDFS-8406
                 URL: https://issues.apache.org/jira/browse/HDFS-8406
             Project: Hadoop HDFS
          Issue Type: Bug
    Affects Versions: 2.6.0
            Reporter: Keith Turner


While testing Accumulo on a cluster and killing processes, I ran into a situation where the
lease on an accumulo write ahead log in HDFS could not be recovered.   Even restarting HDFS
and Accumulo would not fix the problem.

The following message was seen in an Accumulo tablet server log immediately before the tablet
server was killed.

{noformat}
2015-05-14 17:12:37,466 [hdfs.DFSClient] WARN : DFSOutputStream ResponseProcessor exception
 for block BP-802741494-10.1.5.6-1431557089849:blk_1073932823_192060
java.io.IOException: Bad response ERROR for block BP-802741494-10.1.5.6-1431557089849:blk_1073932823_192060
from datanode 10.1.5.9:50010
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:897)
2015-05-14 17:12:37,466 [hdfs.DFSClient] WARN : Error Recovery for block BP-802741494-10.1.5.6-1431557089849:blk_1073932823_192060
in pipeline 10.1.5.55:50010, 10.1.5.9:5
{noformat}

Before recovering data from a write ahead log, the Accumulo master attempts to recover the
lease.   This repeatedly failed with messages like the following.

{noformat}
2015-05-14 17:14:54,301 [recovery.HadoopLogCloser] WARN : Error recovering lease on hdfs://10.1.5.6:10000/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException):
failed to create file /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 for
DFSClient_NONMAPREDUCE_950713214_16 for client 10.1.5.158 because pendingCreates is non-null
but no leases found.
{noformat}

Below is some info from the NN logs for the problematic file.

{noformat}
[ec2-user@leader2 logs]$ grep 3a731759-3594-4535-8086-245 hadoop-ec2-user-namenode-leader2.log

2015-05-14 17:10:46,299 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2.
BP-802741494-10.1.5.6-1431557089849 blk_1073932823_192060{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW],
ReplicaUnderConstruction[[DISK]DS-6efec702-3f1f-4ec0-a31f-de947e7e6097:NORMAL:10.1.5.9:50010|RBW],
ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW]]}
2015-05-14 17:10:46,628 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* fsync: /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2
for DFSClient_NONMAPREDUCE_-1128465883_16
2015-05-14 17:14:49,288 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: recoverLease:
[Lease.  Holder: DFSClient_NONMAPREDUCE_-1128465883_16, pendingcreates: 1], src=/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2
from client DFSClient_NONMAPREDUCE_-1128465883_16
2015-05-14 17:14:49,288 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering
[Lease.  Holder: DFSClient_NONMAPREDUCE_-1128465883_16, pendingcreates: 1], src=/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2
2015-05-14 17:14:49,289 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.internalReleaseLease:
File /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 has not been closed.
Lease recovery is in progress. RecoveryId = 192257 for block blk_1073932823_192060{blockUCState=UNDER_RECOVERY,
primaryNodeIndex=2, replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW],
ReplicaUnderConstruction[[DISK]DS-6efec702-3f1f-4ec0-a31f-de947e7e6097:NORMAL:10.1.5.9:50010|RBW],
ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW]]}
java.lang.IllegalStateException: Failed to finalize INodeFile 3a731759-3594-4535-8086-245eed7cd4c2
since blocks[0] is non-complete, where blocks=[blk_1073932823_192257{blockUCState=COMMITTED,
primaryNodeIndex=2, replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW],
ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW]]}].
2015-05-14 17:14:54,292 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 10000,
call org.apache.hadoop.hdfs.protocol.ClientProtocol.recoverLease from 10.1.5.158:53784 Call#529
Retry#0: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file
/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 for DFSClient_NONMAPREDUCE_950713214_16
for client 10.1.5.158 because pendingCreates is non-null but no leases found.
2015-05-14 17:14:54,319 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.append: failed
to create file /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 for DFSClient_NONMAPREDUCE_950713214_16
for client 10.1.5.158 because pendingCreates is non-null but no leases found.
2015-05-14 17:14:54,320 INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 10000,
call org.apache.hadoop.hdfs.protocol.ClientProtocol.append from 10.1.5.158:53784 Call#530
Retry#0: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file
/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 for DFSClient_NONMAPREDUCE_950713214_16
for client 10.1.5.158 because pendingCreates is non-null but no leases found.
{noformat}

Below is expanded info from the NN logs about the exception mentioned above.

{noformat}
2015-05-14 17:14:51,704 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 on 10000,
call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.commitBlockSynchronization from
10.1.5.45:35687 Call#66737 Retry#0

java.lang.IllegalStateException: Failed to finalize INodeFile 3a731759-3594-4535-8086-245eed7cd4c2
since blocks[0] is non-complete, where blocks=[blk_1073932823_192257{blockUCState=COMMITTED,
primaryNodeIndex=2, replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW],
ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW]]}].
        at com.google.common.base.Preconditions.checkState(Preconditions.java:172)
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.assertAllBlocksComplete(INodeFile.java:214)
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.toCompleteFile(INodeFile.java:201)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.finalizeINodeFileUnderConstruction(FSNamesystem.java:4663)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.closeFileCommitBlocks(FSNamesystem.java:4878)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.commitBlockSynchronization(FSNamesystem.java:4842)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.commitBlockSynchronization(NameNodeRpcServer.java:734)
        at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.commitBlockSynchronization(DatanodeProtocolServerSideTranslatorPB.java:270)
        at org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:26394)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
{noformat}

Below is the output of running fsck on the file.

{noformat}
[ec2-user@leader2 sbin]$ hdfs fsck /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2
-files -blocks -locations -openforwrite
Connecting to namenode via http://leader2:50070
FSCK started by ec2-user (auth:SIMPLE) from /10.1.5.6 for path /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2
at Thu May 14 19:03:21 UTC 2015
/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 541965482 bytes, 1 block(s),
OPENFORWRITE:  Under replicated BP-802741494-10.1.5.6-1431557089849:blk_1073932823_192257{blockUCState=COMMITTED,
primaryNodeIndex=2, replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW],
ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW],
ReplicaUnderConstruction[[DISK]DS-6efec702-3f1f-4ec0-a31f-de947e7e6097:NORMAL:10.1.5.9:50010|RWR],
ReplicaUnderConstruction[[DISK]DS-b42bf32b-dc72-4be5-bf33-e1ee73825420:NORMAL:10.1.5.131:50010|RBW]]}.
Target Replicas is 3 but found 2 replica(s).
0. BP-802741494-10.1.5.6-1431557089849:blk_1073932823_192257{blockUCState=COMMITTED, primaryNodeIndex=2,
replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW],
ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW],
ReplicaUnderConstruction[[DISK]DS-6efec702-3f1f-4ec0-a31f-de947e7e6097:NORMAL:10.1.5.9:50010|RWR],
ReplicaUnderConstruction[[DISK]DS-b42bf32b-dc72-4be5-bf33-e1ee73825420:NORMAL:10.1.5.131:50010|RBW]]}
len=541965482 repl=2 [10.1.5.55:50010, 10.1.5.45:50010, 10.1.5.9:50010, 10.1.5.131:50010]
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message