hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bogdan Raducanu <lrd...@gmail.com>
Subject hdfs: weird lease expiration issue
Date Fri, 21 Aug 2015 17:05:45 GMT
I have an application that continuously appends to an hdfs file and keeps
it open a long time.
At some point the application crashed and left the file open.
It was then restarted and it resumed normal operation, completing some
writes (appending to the file). But, an hour after the crash it experienced
write errors because its lease was removed.
Digging in the NN log I found this weird behavior.

Events timeline:

1. 15:25: application crashes

2. 15:28: application restarted, writing doesn't start immediately

3. 15:37 first write

4. some more writes

5. new block needed: 2015-08-11 15:52:59,223 INFO
org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: /path/to/my/file.
... blk_1079708083_9361735{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]...|RBW],
ReplicaUnderConstruction[[DISK]...|RBW],
ReplicaUnderConstruction[[DISK]...|RBW]]}

6. some more writes; application uses hsync so we can see the writes in the
nn log: 2015-08-11 15:52:59,234 INFO org.apache.hadoop.hdfs.StateChange:
BLOCK* fsync: /path/to/my/file for DFSClient_NONMAPREDUCE_-1953764790_1

7. 60 minutes after crash: 2015-08-11 16:25:18,397 INFO
org.apache.hadoop.hdfs.server.namenode.LeaseManager: [Lease.  Holder:
DFSClient_NONMAPREDUCE_830713991_1, pendingcreates: 1] has expired hard
limit

8. 2015-08-11 16:25:18,398 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease.
Holder: DFSClient_NONMAPREDUCE_830713991_1, pendingcreates: 1],
src=/path/to/my/file

9. 2015-08-11 16:25:18,398 INFO BlockStateChange: BLOCK*
blk_1079708083_9361735{blockUCState=UNDER_RECOVERY, primaryNodeIndex=0,
replicas=[ReplicaUnderConstruction[[DISK]...|RBW],
ReplicaUnderConstruction[[DISK]...|RBW],
ReplicaUnderConstruction[[DISK]...|RBW]]} recovery started,
primary=ReplicaUnderConstruction[[DISK]...|RBW]

10. 2015-08-11 16:25:18,398 WARN org.apache.hadoop.hdfs.StateChange: DIR*
NameSystem.internalReleaseLease: File /path/to/my/file has not been closed.
Lease recovery is in progress. RecoveryId = 9361840 for block
blk_1079708083_9361735{blockUCState=UNDER_RECOVERY, primaryNodeIndex=0,
replicas=...

So, somehow the crashed client's lease remained and an hour after the crash
it was removed.
This happened even though during this hour another client obtained a lease
and appended to the file.
Also, there is no "startFile: recover lease" log message when the new
client opens the file. It is like the old lease is not seen until the 1
hour hard limit expires.
Any idea how this could happen? This is on a distribution based on 2.6.0,
with HA

Mime
View raw message