hadoop-hdfs-issues mailing list archives

From "Kihwal Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7342) Lease Recovery doesn't happen some times
Date Mon, 24 Nov 2014 16:27:14 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223105#comment-14223105 ]

Kihwal Lee commented on HDFS-7342:
----------------------------------

bq. How about scheduling replication during the lease recovery for such penultimate blocks
with at least one replica available to satisfy min-replication, then go ahead for lease recovery.
Till now this situation might not have been experienced, as minReplication itself was 1 by
default.

Let's first think about the meaning of min-replication. It is the level of degradation that
is allowed before being considered critical in terms of data durability. Falling below this
level does not necessarily mean a failure (i.e. data not available) unless min-replica is
1. For synchronous or semi-synchronous operations such as {{addBlock()}} and {{complete()}},
this is *enforced* to maintain the healthy steady state. Clients also do their best to meet
this, but any failures on datanodes between finalizing a block and sending the IBR are beyond
their control.  For asynchronous recovery activities such as lease recovery and replication,
min-replica should be advisory.  Since replication is already doing the right thing, let's
focus on lease recovery.

Dealing with COMMITTED blocks is simpler. Being committed means the client believed that enough
replicas were finalized. If the lease has expired, the block can simply turn into
COMPLETE. If it has at least one live replica, it will be replicated soon after closing the
file. If it doesn't, the block will be considered missing.  I think it is better to report
the committed but missing data early rather than hiding it in the infinite lease recovery
cycle.  Also, recovery will be faster this way.  If all blocks in a file are in either complete
or committed state, lease recovery may force-complete all committed blocks and close the file.
The rest will be up to the replication monitor.
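
To make the proposal concrete, here is a minimal sketch of that rule. It uses a hypothetical, simplified model: {{CommittedBlockRecoverySketch}}, {{Block}}, {{State}}, {{tryCloseOnLeaseExpiry()}} and {{countMissing()}} are illustrative names, not the actual NameNode code.

```java
import java.util.List;

// Hypothetical, simplified model for illustration; NOT the real NameNode classes.
final class CommittedBlockRecoverySketch {
    enum State { UNDER_CONSTRUCTION, COMMITTED, COMPLETE }

    static final class Block {
        State state;
        final int liveReplicas;

        Block(State state, int liveReplicas) {
            this.state = state;
            this.liveReplicas = liveReplicas;
        }
    }

    /**
     * On lease expiry: if every block is COMPLETE or COMMITTED, force-complete
     * the COMMITTED ones and report that the file can be closed. Returns false
     * (and changes nothing) if any block still needs block recovery.
     */
    static boolean tryCloseOnLeaseExpiry(List<Block> blocks) {
        for (Block b : blocks) {
            if (b.state == State.UNDER_CONSTRUCTION) {
                return false; // must go through block recovery first
            }
        }
        for (Block b : blocks) {
            if (b.state == State.COMMITTED) {
                b.state = State.COMPLETE; // min-replication is only advisory here
            }
        }
        return true;
    }

    /** Blocks with no live replica are surfaced as missing instead of hidden. */
    static int countMissing(List<Block> blocks) {
        int missing = 0;
        for (Block b : blocks) {
            if (b.liveReplicas == 0) {
                missing++;
            }
        }
        return missing;
    }
}
```

In this sketch a committed block with zero live replicas is still completed, so it is reported as missing right away, and anything with at least one replica is left to the replication monitor.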

{{recoverLeaseInternal()}} and {{internalReleaseLease()}} will need to distinguish
on-demand recovery from normal lease expiration.  For on-demand recovery, we might want
it to fail if there are no live replicas, as a file lease is normally recovered for a subsequent
append or copy (read). If there is no data, those operations will fail.
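
A rough sketch of how the two triggers could be told apart follows. The {{Trigger}} and {{Outcome}} enums and the {{decide()}} method are hypothetical names made up for illustration; the real {{recoverLeaseInternal()}} and {{internalReleaseLease()}} signatures are not shown here.

```java
// Hypothetical decision helper for illustration; NOT the real FSNamesystem API.
final class RecoveryTriggerSketch {
    enum Trigger { ON_DEMAND, LEASE_EXPIRED }

    enum Outcome { CLOSED, CLOSED_WITH_MISSING_BLOCKS, FAILED_NO_LIVE_REPLICA }

    /**
     * liveReplicasPerBlock holds the live replica count of each block in the file.
     * On-demand recovery fails when some block has no live replica, since the
     * caller's follow-up append or read would fail anyway. Normal lease
     * expiration closes the file and lets the missing block be reported.
     */
    static Outcome decide(Trigger trigger, int[] liveReplicasPerBlock) {
        boolean anyMissing = false;
        for (int live : liveReplicasPerBlock) {
            if (live == 0) {
                anyMissing = true;
                break;
            }
        }
        if (!anyMissing) {
            return Outcome.CLOSED;
        }
        return trigger == Trigger.ON_DEMAND
                ? Outcome.FAILED_NO_LIVE_REPLICA
                : Outcome.CLOSED_WITH_MISSING_BLOCKS;
    }
}
```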

For recovering blocks in the UNDER_CONSTRUCTION state, we can make {{commitBlockSynchronization()}}
force a commit when there is at least one replica, ignoring min-replication. This will allow
the recovery to make progress and eventually the file to be closed if there is at least one
replica per block. Then the blocks can be replicated.  This is far better than getting stuck
in recovery.
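
The relaxed check could look roughly like the sketch below. {{ForceCommitSketch}}, {{canCommit()}} and the {{duringRecovery}} flag are hypothetical, chosen only to show the idea; this is not the existing {{commitBlockSynchronization()}} code.

```java
// Hypothetical check for illustration; NOT the real commitBlockSynchronization().
final class ForceCommitSketch {
    /**
     * Synchronous client paths keep enforcing minReplication. During block
     * recovery a single surviving replica is enough to commit, so recovery
     * can make progress and replication can repair the block afterwards.
     */
    static boolean canCommit(int reportedReplicas, int minReplication, boolean duringRecovery) {
        int required = duringRecovery ? 1 : minReplication;
        return reportedReplicas >= required;
    }
}
```

With {{minReplication}} set to 2 and one surviving replica, this check rejects the commit on the normal path but accepts it during recovery; with zero replicas it rejects in both cases.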

> Lease Recovery doesn't happen some times
> ----------------------------------------
>
>                 Key: HDFS-7342
>                 URL: https://issues.apache.org/jira/browse/HDFS-7342
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.0.0-alpha
>            Reporter: Ravi Prakash
>            Assignee: Ravi Prakash
>         Attachments: HDFS-7342.1.patch, HDFS-7342.2.patch, HDFS-7342.3.patch
>
>
> In some cases, LeaseManager tries to recover a lease, but is not able to. HDFS-4882 describes
> a possibility of that. We should fix this



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
