hadoop-hdfs-issues mailing list archives

From "Nicolas Fraison (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HDFS-10220) Namenode failover due to too long locking in LeaseManager.Monitor
Date Mon, 28 Mar 2016 13:12:25 GMT
Nicolas Fraison created HDFS-10220:

             Summary: Namenode failover due to too long locking in LeaseManager.Monitor
                 Key: HDFS-10220
                 URL: https://issues.apache.org/jira/browse/HDFS-10220
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
            Reporter: Nicolas Fraison
            Priority: Minor

I experienced a namenode failover after the zkfc detected the namenode as unresponsive. The
namenode log contained a very large number of WARN messages (5 million) like this one:
_org.apache.hadoop.hdfs.StateChange: BLOCK* internalReleaseLease: All existing blocks are
COMPLETE, lease removed, file closed._

The thread dump taken by the zkfc shows many threads blocked on a lock.

Looking at the code, LeaseManager.Monitor takes a lock when leases must be released. Because
a very large number of leases had to be released, the namenode took too long to release them,
blocking all other tasks and making the zkfc think that the namenode was unavailable or
stuck.

The idea of this patch is to limit the number of leases released each time we check for
leases, so that the lock is not held for too long.
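
To illustrate the idea, here is a minimal, self-contained sketch of a bounded release loop.
It is not the actual patch and does not use the real LeaseManager code: the class
BoundedLeaseRelease, the maxReleasesPerCheck limit, the queue of lease holders and the
ReentrantLock standing in for the namesystem lock are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.locks.ReentrantLock;

/** Illustrative stand-in for a lease monitor; all names here are hypothetical. */
public class BoundedLeaseRelease {
    // Assumed cap on how many leases one check may release while holding the lock.
    private final int maxReleasesPerCheck;
    // Stand-in for the lock that blocks other namenode operations.
    private final ReentrantLock lock = new ReentrantLock();
    // Expired leases waiting to be released, identified by holder name.
    private final Queue<String> expiredLeases = new ArrayDeque<>();

    public BoundedLeaseRelease(int maxReleasesPerCheck) {
        if (maxReleasesPerCheck <= 0) {
            throw new IllegalArgumentException("maxReleasesPerCheck must be positive");
        }
        this.maxReleasesPerCheck = maxReleasesPerCheck;
    }

    public void addExpiredLease(String holder) {
        lock.lock();
        try {
            expiredLeases.add(holder);
        } finally {
            lock.unlock();
        }
    }

    /**
     * Releases at most maxReleasesPerCheck leases under the lock, then returns
     * so that other threads get a chance to acquire it before the next check.
     */
    public int checkLeases() {
        lock.lock();
        try {
            int released = 0;
            while (released < maxReleasesPerCheck && !expiredLeases.isEmpty()) {
                expiredLeases.poll(); // the real release work would happen here
                released++;
            }
            return released;
        } finally {
            lock.unlock();
        }
    }

    public int pending() {
        lock.lock();
        try {
            return expiredLeases.size();
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        BoundedLeaseRelease monitor = new BoundedLeaseRelease(1000);
        for (int i = 0; i < 2500; i++) {
            monitor.addExpiredLease("client-" + i);
        }
        int passes = 0;
        while (monitor.pending() > 0) {
            monitor.checkLeases(); // lock is dropped between passes
            passes++;
        }
        System.out.println("passes needed: " + passes);
    }
}
```

With a cap like this, a backlog of expired leases is drained over several checks instead of
in one long critical section, at the cost of some leases being released a little later.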

This message was sent by Atlassian JIRA