hadoop-hdfs-issues mailing list archives

From "Walter Su (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-10220) Namenode failover due to too long loking in LeaseManager.Monitor
Date Wed, 27 Apr 2016 02:04:13 GMT

    [ https://issues.apache.org/jira/browse/HDFS-10220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15259384#comment-15259384 ]

Walter Su commented on HDFS-10220:
----------------------------------

bq. I think it add some readability and also because it is used twice.
I only took a quick look last time. Yeah, I'm OK with that.
I found another problem while going through the details:
{code}
    while(!sortedLeases.isEmpty() && sortedLeases.peek().expiredHardLimit()
      && !isMaxLockHoldToReleaseLease(start)) {
      // The lease is removed from the queue here, before its files are handled.
      Lease leaseToCheck = sortedLeases.poll();
      ...
      Collection<Long> files = leaseToCheck.getFiles();
      ...
      for(Long id : leaseINodeIds) {
        ...
        } finally {
          filesLeasesChecked++;
          if (isMaxLockHoldToReleaseLease(start)) {
            LOG.debug("Breaking out of checkLeases() after " +
                filesLeasesChecked + " file leases checked.");
            // Exits only the inner for-loop; the lease is already out of the queue.
            break;
          }
        }
      }
{code}
You can't just break out of the inner for-loop: {{leaseToCheck}} has already been polled out of the
queue, so its remaining files are skipped. As a result, some files won't be closed.
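
One possible way around this, sketched below, is to only {{peek()}} at the lease and remove it from the queue once all of its files have been handled. This is a minimal, self-contained illustration and not the actual patch: the {{LeaseCheckSketch}} class, its nested {{Lease}} type, {{newQueue()}} and the integer {{budget}} are all hypothetical stand-ins (the budget replaces the time-based {{isMaxLockHoldToReleaseLease(start)}} check so the example is deterministic).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch; none of these types are the real HDFS LeaseManager classes.
final class LeaseCheckSketch {

  // Simplified lease: a holder name plus the ids of files still to be closed.
  static final class Lease {
    final String holder;
    final Deque<Long> files;

    Lease(String holder, Long... fileIds) {
      this.holder = holder;
      this.files = new ArrayDeque<>(Arrays.asList(fileIds));
    }
  }

  // Queue ordered by holder name (an assumption made only for this example).
  static PriorityQueue<Lease> newQueue() {
    return new PriorityQueue<>(Comparator.comparing((Lease l) -> l.holder));
  }

  // Closes at most `budget` files per call. A lease leaves the queue only
  // after all of its files are closed, so an early exit loses nothing.
  static List<Long> checkLeases(PriorityQueue<Lease> sortedLeases, int budget) {
    List<Long> closed = new ArrayList<>();
    while (!sortedLeases.isEmpty() && closed.size() < budget) {
      Lease lease = sortedLeases.peek(); // peek: keep the lease queued for now
      while (!lease.files.isEmpty() && closed.size() < budget) {
        closed.add(lease.files.poll()); // stands in for closing one file
      }
      if (lease.files.isEmpty()) {
        sortedLeases.poll(); // fully processed, safe to remove
      }
    }
    return closed;
  }

  public static void main(String[] args) {
    PriorityQueue<Lease> queue = newQueue();
    queue.add(new Lease("a", 1L, 2L, 3L));
    queue.add(new Lease("b", 4L, 5L));
    // Each pass is bounded; leftover files are picked up by the next pass.
    while (!queue.isEmpty()) {
      System.out.println("closed " + checkLeases(queue, 2));
    }
  }
}
```

With a budget of two, the first pass stops in the middle of lease "a", but the lease stays in the queue and its last file is closed on the next pass. Re-adding the polled lease to the queue before breaking would be another option.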

> Namenode failover due to too long loking in LeaseManager.Monitor
> ----------------------------------------------------------------
>
>                 Key: HDFS-10220
>                 URL: https://issues.apache.org/jira/browse/HDFS-10220
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>            Reporter: Nicolas Fraison
>            Assignee: Nicolas Fraison
>            Priority: Minor
>         Attachments: HADOOP-10220.001.patch, HADOOP-10220.002.patch, HADOOP-10220.003.patch,
HADOOP-10220.004.patch, HADOOP-10220.005.patch, threaddump_zkfc.txt
>
>
> I have faced a namenode failover due to an unresponsive namenode detected by the zkfc, with
lots of WARN messages (5 million) like this one:
> _org.apache.hadoop.hdfs.StateChange: BLOCK* internalReleaseLease: All existing blocks
are COMPLETE, lease removed, file closed._
> On the thread dump taken by the zkfc there are lots of threads blocked on a lock.
> Looking at the code, there is a lock taken by the LeaseManager.Monitor when some leases
must be released. Due to the very large number of leases to be released, the namenode took
too long to release them, blocking all other tasks and making the zkfc think that
the namenode was not available/stuck.
> The idea of this patch is to limit the number of leases released each time we check for
leases, so the lock won't be held for too long.
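
The idea described above can be illustrated with a small sketch that caps the work done per lock acquisition. It is only an illustration under stated assumptions, not the patch itself: {{BoundedReleaseSketch}}, {{releaseSome()}} and {{maxPerPass}} are hypothetical names, a plain {{ReentrantLock}} stands in for the namesystem lock, and strings stand in for leases.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collection;
import java.util.Queue;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of bounding the work done while a shared lock is held.
final class BoundedReleaseSketch {
  private final ReentrantLock lock = new ReentrantLock(); // stand-in for the real lock
  private final Queue<String> expiredLeases;
  private final int maxPerPass;

  BoundedReleaseSketch(Collection<String> expired, int maxPerPass) {
    this.expiredLeases = new ArrayDeque<>(expired);
    this.maxPerPass = maxPerPass;
  }

  // Releases at most maxPerPass leases, then gives the lock back so that
  // other threads can make progress before the next pass.
  int releaseSome() {
    lock.lock();
    try {
      int released = 0;
      while (released < maxPerPass && !expiredLeases.isEmpty()) {
        expiredLeases.poll(); // stands in for releasing one lease
        released++;
      }
      return released;
    } finally {
      lock.unlock();
    }
  }

  boolean isLocked() {
    return lock.isLocked();
  }

  public static void main(String[] args) {
    BoundedReleaseSketch monitor =
        new BoundedReleaseSketch(Arrays.asList("l1", "l2", "l3", "l4", "l5"), 2);
    int released;
    while ((released = monitor.releaseSome()) > 0) {
      System.out.println("released " + released + ", lock held: " + monitor.isLocked());
    }
  }
}
```

Each call does a bounded amount of work, so the time the lock is held no longer grows with the total number of expired leases; the remaining leases are handled by later passes.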



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
