Date: Sat, 15 Feb 2014 22:09:19 +0000 (UTC)
From: "Chris Nauroth (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Commented] (HDFS-5957) Provide support for different mmap cache retention policies in ShortCircuitCache.

    [ https://issues.apache.org/jira/browse/HDFS-5957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13902547#comment-13902547 ]

Chris Nauroth commented on HDFS-5957:
-------------------------------------

Here are some additional details on the scenario that prompted filing this issue. Thanks to [~gopalv] for sharing them.

Gopal has a YARN application that performs strictly sequential reads of HDFS files. The application may rapidly iterate through a large number of blocks: each block contains a small metadata header, and based on the contents of that header, the application can often decide that there is nothing relevant in the rest of the block. When that happens, the application seeks all the way past the block. Gopal estimates that this code could feasibly scan through ~100 HDFS blocks in ~10 seconds.

This usage pattern, in combination with zero-copy read, causes retention of a large number of memory-mapped regions in the {{ShortCircuitCache}}. Eventually, YARN's resource check kills the container process for exceeding the enforced physical memory bounds. The asynchronous nature of our {{munmap}} calls was surprising to Gopal, who had carefully calculated his memory usage to stay under YARN's resource checks. As a workaround, I advised Gopal to lower {{dfs.client.mmap.cache.timeout.ms}} so that the {{munmap}} happens more quickly. (A rough sketch of the read pattern and the workaround follows the list below.)

A better solution would be to provide support in the HDFS client for a caching policy that fits this usage pattern. Two possibilities are:
# LRU bounded by a client-specified maximum memory size. (Note this is maximum memory size, not number of files or number of blocks, because block counts and block sizes can differ.)
# Do not cache at all. Effectively, there is only one memory-mapped region alive at a time. The sequential read usage pattern described above always results in a cache miss anyway, so a cache adds no value.

I don't propose removing the current time-triggered threshold, because I think it's still valid for other use cases. I only propose adding support for new policies.
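To make that read pattern concrete, here is a rough sketch of the kind of scan loop involved, written against the zero-copy read API ({{FSDataInputStream#read(ByteBufferPool, int, EnumSet)}} and {{releaseBuffer}}). The path, header length, and relevance test are placeholders, it assumes short-circuit and zero-copy reads are already configured, and it also shows the {{dfs.client.mmap.cache.timeout.ms}} workaround mentioned above.

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.EnumSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.ReadOption;
import org.apache.hadoop.io.ElasticByteBufferPool;

public class SequentialBlockScan {

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // Workaround described above: shrink the mmap cache timeout so munmap runs sooner.
    conf.setLong("dfs.client.mmap.cache.timeout.ms", 1000L);

    Path path = new Path(args[0]);      // placeholder: the HDFS file to scan
    final int headerLen = 4 * 1024;     // placeholder: size of the per-block metadata header

    FileSystem fs = FileSystem.get(conf);
    long blockSize = fs.getFileStatus(path).getBlockSize();
    long fileLen = fs.getFileStatus(path).getLen();
    ElasticByteBufferPool pool = new ElasticByteBufferPool();

    FSDataInputStream in = fs.open(path);
    try {
      for (long blockStart = 0; blockStart < fileLen; blockStart += blockSize) {
        in.seek(blockStart);
        // Zero-copy read: with short-circuit local reads this hands back a slice of an
        // mmap'ed region, and the ShortCircuitCache retains the mapping after release.
        ByteBuffer header = in.read(pool, headerLen, EnumSet.of(ReadOption.SKIP_CHECKSUMS));
        if (header == null) {
          break;  // end of file
        }
        boolean relevant = isRelevant(header);  // application-specific metadata check
        in.releaseBuffer(header);
        if (relevant) {
          // ... read the rest of this block ...
        }
        // Otherwise the loop seeks straight past the block to the next header.
      }
    } finally {
      in.close();
    }
  }

  private static boolean isRelevant(ByteBuffer header) {
    return false;  // placeholder for the application's relevance test
  }
}
{code}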
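To make policy #1 concrete, here is a hypothetical sketch of a cache bounded by total mapped bytes. It is not the existing {{ShortCircuitCache}} code, and the class and method names are made up for illustration; note that eviction unmaps on the calling thread, which ties into the synchronous cleanup point below.

{code:java}
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical sketch only: an mmap cache bounded by total mapped bytes
 * (not entry count), evicting least-recently-used regions and unmapping
 * them synchronously on the caller's thread.
 */
public class BoundedMmapLru<K, R> {

  /** Hook that performs the actual munmap for a released region. */
  public interface Unmapper<T> {
    void unmap(T region);
  }

  private static final class Entry<V> {
    final V region;
    final long length;
    Entry(V region, long length) {
      this.region = region;
      this.length = length;
    }
  }

  private final long maxMappedBytes;   // client-specified budget in bytes
  private final Unmapper<R> unmapper;
  private long mappedBytes = 0;
  // Access-ordered LinkedHashMap iterates eldest-first, giving LRU order.
  private final LinkedHashMap<K, Entry<R>> lru =
      new LinkedHashMap<K, Entry<R>>(16, 0.75f, true);

  public BoundedMmapLru(long maxMappedBytes, Unmapper<R> unmapper) {
    this.maxMappedBytes = maxMappedBytes;
    this.unmapper = unmapper;
  }

  /** Cache a mapped region, evicting LRU entries until the byte budget is met. */
  public synchronized void put(K key, R region, long length) {
    Entry<R> prev = lru.put(key, new Entry<R>(region, length));
    if (prev != null) {
      // Replacing a mapping for the same key: release the old region immediately.
      unmapper.unmap(prev.region);
      mappedBytes -= prev.length;
    }
    mappedBytes += length;
    Iterator<Map.Entry<K, Entry<R>>> it = lru.entrySet().iterator();
    while (mappedBytes > maxMappedBytes && it.hasNext()) {
      Map.Entry<K, Entry<R>> eldest = it.next();
      if (eldest.getKey().equals(key)) {
        break;  // never evict the entry that was just inserted
      }
      // munmap runs here, on the calling thread, not on a background cleaner.
      unmapper.unmap(eldest.getValue().region);
      mappedBytes -= eldest.getValue().length;
      it.remove();
    }
  }

  /** Look up a cached region, refreshing its LRU position; null on a miss. */
  public synchronized R get(K key) {
    Entry<R> e = lru.get(key);
    return e == null ? null : e.region;
  }
}
{code}

Expressing the bound in bytes rather than entries means files with differing block sizes all count against the same budget, per #1 above. Policy #2 (no caching) is effectively the degenerate case where the budget admits only a single region at a time.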
In addition to the caching policy itself, I want to propose a way to run the {{munmap}} calls synchronously with the caller instead of on a background thread. This would be a better fit for clients who want deterministic resource cleanup: right now, we have no way to guarantee that the OS will schedule the {{CacheCleaner}} thread ahead of YARN's resource check thread. This isn't a proposal to remove support for the background thread, only to add support for synchronous {{munmap}}.

I think you could also make an argument that YARN shouldn't count these memory-mapped regions towards the container process's RSS. It's really the DataNode process that owns that memory, and clients who {{mmap}} the same region shouldn't get penalized. Let's address that part separately, though.


> Provide support for different mmap cache retention policies in ShortCircuitCache.
> ----------------------------------------------------------------------------------
>
>                 Key: HDFS-5957
>                 URL: https://issues.apache.org/jira/browse/HDFS-5957
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs-client
>    Affects Versions: 2.3.0
>            Reporter: Chris Nauroth
>
> Currently, the {{ShortCircuitCache}} retains {{mmap}} regions for reuse by multiple reads of the same block or by multiple threads. The eventual {{munmap}} executes on a background thread after an expiration period. Some client usage patterns would prefer strict bounds on this cache and deterministic cleanup by calling {{munmap}}. This issue proposes additional support for different caching policies that better fit these usage patterns.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)