Date: Wed, 1 Jun 2016 07:23:59 +0000 (UTC)
From: "Nicolas Fraison (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Updated] (HDFS-10220) A large number of expired leases can make namenode unresponsive and cause failover
Mailing-List: contact hdfs-issues-help@hadoop.apache.org; run by ezmlm
archived-at: Wed, 01 Jun 2016 07:24:01 -0000

     [ https://issues.apache.org/jira/browse/HDFS-10220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicolas Fraison updated HDFS-10220:
-----------------------------------
    Attachment: HADOOP-10220.007.patch

> A large number of expired leases can make namenode unresponsive and cause failover
> ----------------------------------------------------------------------------------
>
>                 Key: HDFS-10220
>                 URL: https://issues.apache.org/jira/browse/HDFS-10220
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>            Reporter: Nicolas Fraison
>            Assignee: Nicolas Fraison
>            Priority: Minor
>         Attachments: HADOOP-10220.001.patch, HADOOP-10220.002.patch, HADOOP-10220.003.patch, HADOOP-10220.004.patch, HADOOP-10220.005.patch, HADOOP-10220.006.patch, HADOOP-10220.007.patch, threaddump_zkfc.txt
>
>
> I faced a namenode failover after the zkfc detected the namenode as unresponsive. The namenode log contained a very large number of WARN messages (5 million) like this one:
> _org.apache.hadoop.hdfs.StateChange: BLOCK* internalReleaseLease: All existing blocks are COMPLETE, lease removed, file closed._
> The thread dump taken by the zkfc shows many threads blocked waiting on a lock.
> Looking at the code, the LeaseManager.Monitor takes a lock whenever leases must be released. Because the number of leases to release was so large, the namenode took too long to release them, which blocked all other tasks and made the zkfc think the namenode was unavailable or stuck.
> The idea of this patch is to limit the number of leases released each time we check for leases, so that the lock is not held for too long.
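
To illustrate the idea described in the issue, here is a minimal Java sketch of a lease check that releases at most a fixed number of expired leases per lock acquisition. It is not the code from the attached patches and not the real LeaseManager API: the class name `BoundedLeaseRelease`, the `maxReleasesPerCheck` limit, and the queue of lease holders are all hypothetical, and a plain `ReentrantLock` stands in for the namenode lock.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch only; names and structure are assumptions, not HDFS code.
class BoundedLeaseRelease {
    // Stands in for the lock that other namenode operations also need.
    private final ReentrantLock lock = new ReentrantLock();
    // Expired leases waiting to be released, identified by holder name.
    private final Deque<String> expiredLeases = new ArrayDeque<>();
    // Assumed tuning knob: upper bound on releases per check.
    private final int maxReleasesPerCheck;

    BoundedLeaseRelease(int maxReleasesPerCheck) {
        if (maxReleasesPerCheck <= 0) {
            throw new IllegalArgumentException("maxReleasesPerCheck must be positive");
        }
        this.maxReleasesPerCheck = maxReleasesPerCheck;
    }

    void addExpiredLease(String holder) {
        lock.lock();
        try {
            expiredLeases.addLast(holder);
        } finally {
            lock.unlock();
        }
    }

    // Releases at most maxReleasesPerCheck leases while holding the lock,
    // then gives the lock back so other threads can make progress.
    // Returns the number of leases released in this pass.
    int checkLeases() {
        lock.lock();
        try {
            int released = 0;
            while (released < maxReleasesPerCheck && !expiredLeases.isEmpty()) {
                expiredLeases.pollFirst(); // placeholder for the real release work
                released++;
            }
            return released;
        } finally {
            lock.unlock();
        }
    }

    int pending() {
        lock.lock();
        try {
            return expiredLeases.size();
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        BoundedLeaseRelease monitor = new BoundedLeaseRelease(2);
        for (int i = 0; i < 5; i++) {
            monitor.addExpiredLease("client-" + i);
        }
        int pass = 0;
        while (monitor.pending() > 0) {
            int n = monitor.checkLeases();
            pass++;
            System.out.println("pass " + pass + ": released " + n
                    + ", pending " + monitor.pending());
        }
    }
}
```

With a limit of 2 and five expired leases, the backlog drains over three passes (2, 2, then 1), and the lock is dropped between passes. A limit based on elapsed time rather than a count would be another way to bound how long the lock is held; the sketch uses a count only because that is what the description above mentions.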