hadoop-hdfs-issues mailing list archives

From "Nicolas Fraison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-10220) A large number of expired leases can make namenode unresponsive and cause failover
Date Thu, 19 May 2016 07:51:13 GMT

    [ https://issues.apache.org/jira/browse/HDFS-10220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15290640#comment-15290640 ]

Nicolas Fraison commented on HDFS-10220:

[~yzhangal] this is how I understood [~daryn]'s comment: have the release run every 2s for up to 5ms; if the release is not finished after 5ms, reduce the sleep to 500ms before the next release. Once a release completes within 5ms, go back to the 2s sleep.
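In loop form, that adaptive back-off would look something like the sketch below. This is only an illustration of the idea, not the actual LeaseManager.Monitor code; the class, the LeaseSource interface, and the constants are hypothetical.

```java
// Sketch of the adaptive sleep described above: release expired leases
// for at most 5ms per pass, then sleep 2s if we finished in time or
// 500ms if we had to stop early. All names here are illustrative.
public class LeaseReleaseLoop {
    static final long NORMAL_SLEEP_MS = 2000; // regular check interval
    static final long CATCHUP_SLEEP_MS = 500; // shorter sleep when behind
    static final long MAX_HOLD_MS = 5;        // max time to hold the lock

    private final LeaseSource leases;

    public LeaseReleaseLoop(LeaseSource leases) {
        this.leases = leases;
    }

    /** Releases expired leases for at most budgetMs; returns true if all were released. */
    boolean releaseForUpTo(long budgetMs) {
        long deadline = System.currentTimeMillis() + budgetMs;
        while (leases.hasExpired()) {
            leases.releaseOne();
            if (System.currentTimeMillis() >= deadline) {
                // Budget exhausted: report whether anything is left over.
                return !leases.hasExpired();
            }
        }
        return true; // everything released within the budget
    }

    /** One monitor iteration: release under a time budget, then back off adaptively. */
    public void runOnce() throws InterruptedException {
        boolean done = releaseForUpTo(MAX_HOLD_MS);
        Thread.sleep(done ? NORMAL_SLEEP_MS : CATCHUP_SLEEP_MS);
    }

    /** Minimal hypothetical abstraction over the set of expired leases. */
    interface LeaseSource {
        boolean hasExpired();
        void releaseOne();
    }
}
```

The point of the shorter catch-up sleep is that a large backlog drains in many small locked bursts instead of one long lock hold.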
I agree that this doesn't happen in normal operation, but on our two quite big clusters (40PB each), with around 100k jobs running per day on each and lots of distcp jobs syncing data between them, I sometimes see (2 to 4 times a month) quite big batches of path releases (around 200k to 400k) that could affect the HDFS layer (we have applied this patch, so it is no longer the case). Even if the issues are always caused by jobs misbehaving, I think the namenode should protect itself here to avoid affecting the whole cluster.
In the first patch this was a configurable parameter, but there was a consensus to make it a constant. So I would like feedback from others on this point, to avoid doing and redoing the work.

Thanks for your feedback

> A large number of expired leases can make namenode unresponsive and cause failover
> ----------------------------------------------------------------------------------
>                 Key: HDFS-10220
>                 URL: https://issues.apache.org/jira/browse/HDFS-10220
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>            Reporter: Nicolas Fraison
>            Assignee: Nicolas Fraison
>            Priority: Minor
>         Attachments: HADOOP-10220.001.patch, HADOOP-10220.002.patch, HADOOP-10220.003.patch,
HADOOP-10220.004.patch, HADOOP-10220.005.patch, HADOOP-10220.006.patch, threaddump_zkfc.txt
> I have faced a namenode failover due to an unresponsive namenode detected by the zkfc, with lots of WARN messages (5 million) like this one:
> _org.apache.hadoop.hdfs.StateChange: BLOCK* internalReleaseLease: All existing blocks
are COMPLETE, lease removed, file closed._
> In the thread dump taken by the zkfc there are lots of threads blocked on a lock.
> Looking at the code, a lock is taken by the LeaseManager.Monitor when leases must be released. Due to the really big number of leases to release, the namenode took too long to release them, blocking all other tasks and making the zkfc think the namenode was unavailable/stuck.
> The idea of this patch is to limit the number of leases released each time we check for leases, so the lock won't be held for too long a period.
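The cap described in the issue can be sketched as follows. This is a simplified illustration of the technique, not the actual HDFS-10220 patch: the class name, the queue-based stand-in for the lease set, and the cap value are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Simplified illustration of capping how many leases are released per
// check, so the lock is never held for one huge batch. The cap value
// and names are hypothetical, not the actual LeaseManager code.
public class CappedLeaseChecker {
    static final int MAX_LEASES_PER_CHECK = 1000; // hypothetical cap

    /** Releases at most MAX_LEASES_PER_CHECK leases; returns how many were released. */
    static int checkLeases(Queue<String> expiredLeases) {
        int released = 0;
        // In the namenode this loop would run under the namesystem write
        // lock; the cap bounds how long that lock is held in one pass.
        while (!expiredLeases.isEmpty() && released < MAX_LEASES_PER_CHECK) {
            expiredLeases.poll(); // stand-in for releasing one lease
            released++;
        }
        return released; // any leftovers are handled on the next check
    }
}
```

With a cap like this, a backlog of 400k expired leases is drained over many short lock acquisitions instead of one multi-second hold that trips the zkfc health check.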

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org
