hadoop-hdfs-issues mailing list archives

From "Wei-Chiu Chuang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-11219) Lease is leaked if block has no replica and length > 0
Date Wed, 07 Dec 2016 19:33:58 GMT

     [ https://issues.apache.org/jira/browse/HDFS-11219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei-Chiu Chuang updated HDFS-11219:
-----------------------------------
    Description: 
I am not sure how this started, because part of the NameNode log has rolled.

Background: an HBase cluster's WAL file got corrupted while it was being split. The last block went missing, but the NameNode thinks the length of the last block is 83 bytes.

When I tried to recover the lease manually via {{hdfs debug -recoverLease}}, the NameNode was unable to recover it and logged the following:

{noformat}
    2016-10-12 16:39:32,885 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering
[Lease. Holder: DFSClient_NONMAPREDUCE_751860335_1, pendingcreates: 3], src=/hbase/WALs/hadoopdev6.example.com,60020,1465915112353-splitting/hadoopdev6.example.com%2C60020%2C1465915112353.null0.1471340470256
    2016-10-12 16:39:32,885 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: recoverLease:
[Lease. Holder: DFSClient_NONMAPREDUCE_751860335_1, pendingcreates: 3], src=/hbase/WALs/hadoopdev6.example.com,60020,1465915112353-splitting/hadoopdev6.example.com%2C60020%2C1465915112353.null0.1471340470256
from client DFSClient_NONMAPREDUCE_751860335_1

    2016-10-12 16:39:32,885 DEBUG org.apache.hadoop.ipc.Server: IPC Server handler 0 on 8020:
org.apache.hadoop.hdfs.protocol.ClientProtocol.recoverLease from 192.168.4.86:35276 Call#708331
Retry#0 for RpcKind RPC_PROTOCOL_BUFFER
    2016-10-12 16:39:32,885 DEBUG org.apache.hadoop.ipc.Server: got #708331
    2016-10-12 16:39:32,886 WARN BlockStateChange: BLOCK* BlockInfoUnderConstruction.initLeaseRecovery:
No blocks found, lease removed.
    2016-10-12 16:39:32,886 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.internalReleaseLease:
File /hbase/WALs/hadoopdev6.example.com,60020,1465915112353-splitting/hadoopdev6.example.com%2C60020%2C1465915112353.null0.1471340470256
has not been closed. Lease recovery is in progress. RecoveryId = 1099610544470 for block blk_1074528903_1099610239882
{blockUCState=UNDER_RECOVERY, primary NodeIndex=-1, replicas=[]}

    2016-10-12 16:39:32,887 DEBUG org.apache.hadoop.ipc.Server: IPC Server handler 0 on 8020:
responding to org.apache.hadoop.hdfs.protocol.ClientProtocol.recoverLease from 192.168.4.86:35276
Call#708331 Retry#0 Wrote 36 bytes.
    2016-10-12 16:39:32,887 DEBUG org.apache.hadoop.ipc.Server: IPC Server handler 0 on 8020:
responding to org.apache.hadoop.hdfs.protocol.ClientProtocol.recoverLease from 192.168.4.86:35276
Call#708331 Retry#0
    2016-10-12 16:39:32,887 DEBUG org.apache.hadoop.ipc.Server: Served: recoverLease queueTime=
0 procesingTime= 2
{noformat}

Although the log says {{No blocks found, lease removed.}}, the lease is actually not removed and is preserved in the fsimage. The file has remained open for write for a couple of months. In the meantime, HBase region servers were restarted multiple times, but that didn't clean up the lease either.
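
For reference, the same check can be driven from the Java client API. Below is a minimal sketch (the class name is a placeholder and the path is taken from the command line, not from this cluster) that calls {{DistributedFileSystem#recoverLease}} and {{DistributedFileSystem#isFileClosed}} on the affected file; consistent with the behaviour above, {{recoverLease}} would keep returning false because the NameNode never releases the lease.

{noformat}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class RecoverLeaseCheck {
  public static void main(String[] args) throws Exception {
    // Placeholder: pass the affected WAL path as the first argument.
    Path file = new Path(args[0]);
    Configuration conf = new Configuration();
    DistributedFileSystem dfs =
        (DistributedFileSystem) FileSystem.get(file.toUri(), conf);

    // recoverLease() returns true only once the lease has been released
    // and the file closed; a stuck lease shows up as repeated false here.
    boolean closed = dfs.recoverLease(file);
    System.out.println("recoverLease returned: " + closed);
    System.out.println("isFileClosed: " + dfs.isFileClosed(file));
  }
}
{noformat}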

Initially, I thought this was HDFS-10763. But after checking the error message on HDFS-10763, this is definitely not the same problem, because my case does not produce this error message:
{noformat}
Failed to RECOVER_LEASE /xyz/xyz for user1 on
 0.0.0.1 because the file is under construction but no leases found.
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:295
{noformat}

I looked at the patch for HDFS-8344, but it doesn't seem to address the case where the block has no replicas (replica count = 0).
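
For context, the {{No blocks found, lease removed.}} warning appears to come from the block-recovery initialization in {{BlockInfoUnderConstruction}} when the under-construction block has an empty replica list. The snippet below is an illustrative paraphrase of that zero-replica branch, not the verbatim Hadoop source: it only logs the warning, nothing actually removes the lease, and no primary DataNode is selected, which matches the {{primary NodeIndex=-1, replicas=[]}} state in the log above.

{noformat}
// Illustrative paraphrase, not verbatim Hadoop source: the suspected shape of
// the NameNode's block-recovery initialization for an under-construction block.
void initializeBlockRecovery(long recoveryId) {
  setBlockUCState(BlockUCState.UNDER_RECOVERY);
  blockRecoveryId = recoveryId;
  if (replicas.size() == 0) {
    // Matches the WARN seen above; despite the wording, no lease bookkeeping
    // happens here, so the lease stays with the original holder.
    NameNode.blockStateChangeLog.warn("BLOCK* BlockInfoUnderConstruction"
        + ".initLeaseRecovery: No blocks found, lease removed.");
  }
  // With an empty replica list, the primary-selection loop below never runs,
  // so primaryNodeIndex stays at -1 and no DataNode is asked to recover the block.
  for (ReplicaUnderConstruction replica : replicas) {
    // ... pick a primary replica and schedule block recovery on it ...
  }
}
{noformat}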

On some DataNodes I saw attempts to recover blocks 83 bytes in length, but because the NameNode log has rolled, I can't be certain whether these are the last blocks of the split WAL files. Some of these blocks were deleted soon thereafter, so maybe there is a race condition in which the NameNode deleted the blocks but didn't delete the file.

{noformat}
2016-09-19 11:54:41,093 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
initReplicaRecovery: blk_1074562914_1099610357708, recoveryId=1099610357849, replica=ReplicaWaitingToBeRecovered,
blk_1074562914_1099610357708, RWR
  getNumBytes()     = 83
  getBytesOnDisk()  = 83
  getVisibleLength()= -1
  getVolume()       = /data/dfs/dn/current
  getBlockFile()    = /data/dfs/dn/current/BP-1231527650-25.109.5.133-1392914565955/current/rbw/blk_1074562914
{noformat}

> Lease is leaked if block has no replica and length > 0
> ------------------------------------------------------
>
>                 Key: HDFS-11219
>                 URL: https://issues.apache.org/jira/browse/HDFS-11219
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>            Reporter: Wei-Chiu Chuang
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

