From: "Erik Krogen (JIRA)"
To: hdfs-issues@hadoop.apache.org
Date: Fri, 7 Sep 2018 21:58:00 +0000 (UTC)
Subject: [jira] [Comment Edited] (HDFS-13904) ContentSummary does not always respect processing limit, resulting in long lock acquisitions

    [ https://issues.apache.org/jira/browse/HDFS-13904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16607707#comment-16607707 ]

Erik Krogen edited comment on HDFS-13904 at 9/7/18 9:57 PM:
------------------------------------------------------------

Thanks [~arpitagarwal] and [~kihwal] for taking a look!

{quote}
Do you have snapshots? We don't.
{quote}
No snapshots!

{quote}
I wonder what is actually causing the long lock in your case. ... We have long getContentSummary calls made periodically and they can take close to a minute. But we do not hit this kind of long locking problems.
{quote}
The directory in question had around 100M entries. But, you make good points.
To describe more fully the situation under which we observed this behavior:
* There was a job periodically calling {{getContentSummary}} on this huge directory. No long lock warnings were seen.
* The NameNode was restarted.
* Now, consistently, every {{contentSummary}} call on that directory caused the lock hold warning.
* The NameNode was restarted again.
* The long lock holds were no longer seen.

So I agree that it is an odd case, and probably only triggered under certain circumstances...

{quote}The change may help if the concurrent long reads are all getContentSummary calls.{quote}
Are there other long read operations besides {{getContentSummary}}? {{listStatus}} is already bounded by an even smaller limit.

I'll see if I can replicate this more deterministically to test the efficacy of the patch, perhaps with Dynamometer...


> ContentSummary does not always respect processing limit, resulting in long lock acquisitions
> ---------------------------------------------------------------------------------------------
>
>                 Key: HDFS-13904
>                 URL: https://issues.apache.org/jira/browse/HDFS-13904
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs, namenode
>            Reporter: Erik Krogen
>            Assignee: Erik Krogen
>            Priority: Major
>
> HDFS-4995 added a config {{dfs.content-summary.limit}} which allows an administrator to set a limit on the number of entries processed during a single acquisition of the {{FSNamesystemLock}} during the creation of a content summary. This is useful to prevent very long (multiple-second) pauses on the NameNode when {{getContentSummary}} is called on large directories.
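
For context, here is a minimal, hypothetical sketch (not part of the issue) of the client-side call that triggers this NameNode-side computation; the path is made up:

{code}
// Hedged illustration only: the directory path is hypothetical.
// While the NameNode serves this call, dfs.content-summary.limit bounds how
// many entries it processes per acquisition of the FSNamesystemLock.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ContentSummaryClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    try (FileSystem fs = FileSystem.get(conf)) {
      ContentSummary cs = fs.getContentSummary(new Path("/data/huge-dir"));
      System.out.println("files=" + cs.getFileCount()
          + " dirs=" + cs.getDirectoryCount()
          + " bytes=" + cs.getLength());
    }
  }
}
{code}

The {{hdfs dfs -count}} shell command exercises the same RPC.
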
> However, even on versions with HDFS-4995, we have seen warnings like:
> {code}
> INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem read lock held for 9398 ms via
> java.lang.Thread.getStackTrace(Thread.java:1552)
> org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:950)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.readUnlock(FSNamesystemLock.java:188)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.readUnlock(FSNamesystem.java:1486)
> org.apache.hadoop.hdfs.server.namenode.ContentSummaryComputationContext.yield(ContentSummaryComputationContext.java:109)
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeDirectoryContentSummary(INodeDirectory.java:679)
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeContentSummary(INodeDirectory.java:642)
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeDirectoryContentSummary(INodeDirectory.java:656)
> {code}
> happen quite consistently when {{getContentSummary}} was called on a large directory on a heavily-loaded NameNode. Such long pauses completely destroy the performance of the NameNode. We have the limit set to its default of 5000; if it were respected, clearly there would not be a 10-second pause.
>
> The current {{yield()}} code within {{ContentSummaryComputationContext}} looks like:
> {code}
>   public boolean yield() {
>     // Are we set up to do this?
>     if (limitPerRun <= 0 || dir == null || fsn == null) {
>       return false;
>     }
>
>     // Have we reached the limit?
>     long currentCount = counts.getFileCount() +
>         counts.getSymlinkCount() +
>         counts.getDirectoryCount() +
>         counts.getSnapshotableDirectoryCount();
>     if (currentCount <= nextCountLimit) {
>       return false;
>     }
>
>     // Update the next limit
>     nextCountLimit = currentCount + limitPerRun;
>
>     boolean hadDirReadLock = dir.hasReadLock();
>     boolean hadDirWriteLock = dir.hasWriteLock();
>     boolean hadFsnReadLock = fsn.hasReadLock();
>     boolean hadFsnWriteLock = fsn.hasWriteLock();
>
>     // sanity check.
>     if (!hadDirReadLock || !hadFsnReadLock || hadDirWriteLock ||
>         hadFsnWriteLock || dir.getReadHoldCount() != 1 ||
>         fsn.getReadHoldCount() != 1) {
>       // cannot relinquish
>       return false;
>     }
>
>     // unlock
>     dir.readUnlock();
>     fsn.readUnlock("contentSummary");
>
>     try {
>       Thread.sleep(sleepMilliSec, sleepNanoSec);
>     } catch (InterruptedException ie) {
>     } finally {
>       // reacquire
>       fsn.readLock();
>       dir.readLock();
>     }
>
>     yieldCount++;
>
>     return true;
>   }
> {code}
> We believe that this check in particular is the culprit:
> {code}
>     if (!hadDirReadLock || !hadFsnReadLock || hadDirWriteLock ||
>         hadFsnWriteLock || dir.getReadHoldCount() != 1 ||
>         fsn.getReadHoldCount() != 1) {
>       // cannot relinquish
>       return false;
>     }
> {code}
> The content summary computation will only relinquish the lock if it is currently the _only_ holder of the lock. Given the high volume of read requests on a heavily loaded NameNode, especially when unfair locking is enabled, it is likely that there is another holder of the read lock performing some short-lived operation. By refusing to give up the lock in this case, the content summary computation ends up never relinquishing the lock.
>
> We propose to simply remove the readHoldCount checks from this {{yield()}}. This should alleviate the case described above by giving up the read lock and allowing other short-lived operations to complete (while the content summary thread sleeps) so that the lock can finally be given up completely.
> This has the drawback that sometimes the content summary may give up the lock unnecessarily, if the read lock still has not been released by the other holders by the time the thread continues again. The only negative impact of this is to make some large content summary operations slightly slower, with the tradeoff of reducing the NameNode-wide performance impact.
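
For illustration only, a rough sketch of what the relaxed sanity check inside {{yield()}} might look like under this proposal; this is not the actual patch, and the rest of the method is assumed unchanged from the snippet quoted above:

{code}
    // Sketch of the proposal: still require that this thread holds the dir and
    // fsn read locks (and neither write lock), but drop the
    // getReadHoldCount() != 1 conditions so that the presence of other readers
    // no longer prevents the computation from yielding.
    if (!hadDirReadLock || !hadFsnReadLock || hadDirWriteLock ||
        hadFsnWriteLock) {
      // cannot relinquish
      return false;
    }
{code}

With other readers allowed to be present at yield time, the computation can still drop its own hold and sleep, letting short-lived operations drain so that the lock can eventually be released entirely.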