Date: Fri, 13 May 2016 22:22:13 +0000 (UTC)
From: "Yongjun Zhang (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Commented] (HDFS-10220) A large number of expired leases can make namenode unresponsive and cause failover

    [ https://issues.apache.org/jira/browse/HDFS-10220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15283237#comment-15283237 ]

Yongjun Zhang commented on HDFS-10220:
--------------------------------------

I happen to see this jira now.

Hi [~daryn], by saying "if it broke out early then perhaps it could sleep for less than 2s", do you suggest dynamically adjusting the lease check interval?

I agree with [~kihwal]'s comment that "I don't think this kind of mass lease recoveries are normal". I wonder if we could just make both MAX_LOCK_HOLD_TO_RELEASE_LEASE_MS and the lease check interval configurable parameters instead of fixed numbers. With these configs in place, operators could tune their way out of the abnormal situation. I know we have many configs already, though.
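As a rough sketch of what I mean (this is not the actual LeaseManager code; the class name and the config keys in the comments are hypothetical, for illustration only), the monitor loop with both knobs configurable could look like:

{code:java}
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.locks.ReentrantLock;

/** Simplified stand-in for LeaseManager.Monitor; not HDFS code. */
public class LeaseMonitorSketch {
  // Hypothetical config keys (illustrative only):
  //   dfs.namenode.lease-recheck-interval-ms          e.g. default 2000
  //   dfs.namenode.max-lock-hold-to-release-lease-ms  e.g. default 25
  private final long maxLockHoldToReleaseLeaseMs;
  private final long leaseRecheckIntervalMs;

  // Stand-in for the namesystem write lock that blocks other operations.
  private final ReentrantLock fsLock = new ReentrantLock();
  private final Queue<String> expiredLeases = new ArrayDeque<>();

  public LeaseMonitorSketch(long maxLockHoldMs, long recheckIntervalMs) {
    this.maxLockHoldToReleaseLeaseMs = maxLockHoldMs;
    this.leaseRecheckIntervalMs = recheckIntervalMs;
  }

  /** One monitor pass: release expired leases, but cap the lock hold time. */
  void checkLeases() {
    fsLock.lock();
    try {
      long start = System.currentTimeMillis();
      while (!expiredLeases.isEmpty()) {
        releaseLease(expiredLeases.poll());
        if (System.currentTimeMillis() - start > maxLockHoldToReleaseLeaseMs) {
          // Break out early so other namenode operations (and the zkfc
          // health check) are not starved; the rest is done next pass.
          break;
        }
      }
    } finally {
      fsLock.unlock();
    }
  }

  private void releaseLease(String leaseHolder) {
    // Placeholder for internalReleaseLease(): close the file, log, etc.
    System.out.println("released lease of " + leaseHolder);
  }

  /** The monitor thread: one pass, then sleep for the configured interval. */
  void run() throws InterruptedException {
    while (true) {
      checkLeases();
      Thread.sleep(leaseRecheckIntervalMs);
    }
  }

  public static void main(String[] args) {
    LeaseMonitorSketch monitor = new LeaseMonitorSketch(25, 2000);
    for (int i = 0; i < 5; i++) {
      monitor.expiredLeases.add("client-" + i);
    }
    monitor.checkLeases(); // one pass; releases all five well under the cap
  }
}
{code}

With something like this, a cluster hitting mass lease expiration could raise the lock-hold cap or shorten the recheck interval without a code change.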
What do you guys think? Thanks.

> A large number of expired leases can make namenode unresponsive and cause failover
> ----------------------------------------------------------------------------------
>
>                 Key: HDFS-10220
>                 URL: https://issues.apache.org/jira/browse/HDFS-10220
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>            Reporter: Nicolas Fraison
>            Assignee: Nicolas Fraison
>            Priority: Minor
>         Attachments: HADOOP-10220.001.patch, HADOOP-10220.002.patch, HADOOP-10220.003.patch, HADOOP-10220.004.patch, HADOOP-10220.005.patch, HADOOP-10220.006.patch, threaddump_zkfc.txt
>
> I faced a namenode failover: the zkfc detected the namenode as unresponsive, and the namenode logged lots of WARN messages (5 million) like this one:
> _org.apache.hadoop.hdfs.StateChange: BLOCK* internalReleaseLease: All existing blocks are COMPLETE, lease removed, file closed._
> In the thread dump taken by the zkfc, many threads are blocked waiting on a lock.
> Looking at the code, a lock is taken by the LeaseManager.Monitor when leases must be released. Because of the very large number of leases to release, the namenode held the lock for too long, blocking all other tasks and making the zkfc think the namenode was unavailable/stuck.
> The idea of this patch is to limit the number of leases released on each check, so the lock is not held for too long a time period.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org