hadoop-yarn-dev mailing list archives

From Varun Vasudev <vvasu...@apache.org>
Subject Re: ProcFsBasedProcessTree and clean pages in smaps
Date Fri, 05 Feb 2016 14:10:51 GMT
Hi Jan,

YARN-1856 was recently committed; it allows admins to use cgroups instead of the ProcfsBasedProcessTree
monitoring. Would that solve your problem? Note, however, that it requires the use of the LinuxContainerExecutor.
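
If it helps, below is a minimal yarn-site.xml sketch of the LinuxContainerExecutor prerequisites as I understand them. The property that actually switches memory monitoring to cgroups is the one introduced by YARN-1856 itself, so please take its exact name and default from that JIRA rather than from this sketch; the group value is just a placeholder for your own NodeManager group.

<!-- Sketch only: LinuxContainerExecutor + cgroups resource handler prerequisites. -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.group</name>
  <!-- placeholder: the group the NodeManager and container-executor binary run under -->
  <value>hadoop</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>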

-Varun



On 2/5/16, 6:45 PM, "Jan Lukavský" <jan.lukavsky@firma.seznam.cz> wrote:

>Hi Chris,
>
>thanks for your reply. As far as I can see, newer Linux kernels show 
>the locked memory in the "Locked" field.
>
>If I mmap a file and mlock it, I see the following in the 'smaps' file:
>
>7efd20aeb000-7efd2172b000 r--p 00000000 103:04 1870                      
>/tmp/file.bin
>Size:              12544 kB
>Rss:               12544 kB
>Pss:               12544 kB
>Shared_Clean:          0 kB
>Shared_Dirty:          0 kB
>Private_Clean:     12544 kB
>Private_Dirty:         0 kB
>Referenced:        12544 kB
>Anonymous:             0 kB
>AnonHugePages:         0 kB
>Swap:                  0 kB
>KernelPageSize:        4 kB
>MMUPageSize:           4 kB
>Locked:            12544 kB
>
>...
># uname -a
>Linux XXXXXX 3.2.0-4-amd64 #1 SMP Debian 3.2.68-1+deb7u3 x86_64 GNU/Linux
>
>If I do this on an older kernel (2.6.x), the Locked field is missing.
>
>I can make a patch for ProcfsBasedProcessTree that counts the "Locked" 
>pages instead of the "Private_Clean" pages (based on a configuration 
>option). The question is: should even more changes be made to the way 
>the memory footprint is calculated? For instance, I believe the kernel 
>can write even dirty pages back to disk (if they are backed by a file), 
>making them clean so that it can free them later. Should I open a JIRA 
>to have some discussion on this topic?
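>
>To make the idea concrete, here is a minimal, self-contained sketch (not 
>the actual ProcfsBasedProcessTree code, and without the configuration 
>option yet) of summing the "Locked" field across a process's smaps 
>entries; on older kernels without the field the sum simply stays zero:
>
>import java.io.IOException;
>import java.nio.file.Files;
>import java.nio.file.Paths;
>
>public class LockedPages {
>  /** Sums the "Locked:" lines (in kB) of /proc/<pid>/smaps. */
>  static long lockedKb(String pid) throws IOException {
>    long totalKb = 0;
>    for (String line : Files.readAllLines(Paths.get("/proc", pid, "smaps"))) {
>      if (line.startsWith("Locked:")) {
>        // Line format: "Locked:            12544 kB"
>        totalKb += Long.parseLong(line.split("\\s+")[1]);
>      }
>    }
>    return totalKb;
>  }
>
>  public static void main(String[] args) throws IOException {
>    String pid = args.length > 0 ? args[0] : "self";
>    System.out.println("Locked: " + lockedKb(pid) + " kB");
>  }
>}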
>
>Regards,
>  Jan
>
>
>On 02/04/2016 07:20 PM, Chris Nauroth wrote:
>> Hello Jan,
>>
>> I am moving this thread from user@hadoop.apache.org to
>> yarn-dev@hadoop.apache.org, since it's less a question of general usage
>> and more a question of internal code implementation details and possible
>> enhancements.
>>
>> I think the issue is that it's not guaranteed in the general case that
>> Private_Clean pages are easily evictable from page cache by the kernel.
>> For example, if the pages have been pinned into RAM by calling mlock [1],
>> then the kernel cannot evict them.  Since YARN can execute any code
>> submitted by an application, including possibly code that calls mlock, it
>> takes a cautious approach and assumes that these pages must be counted
>> towards the process footprint.  Although your Spark use case won't mlock
>> the pages (I assume), YARN doesn't have a way to identify this.
>>
>> Perhaps there is room for improvement here.  If there is a reliable way to
>> count only mlock'ed pages, then perhaps that behavior could be added as
>> another option in ProcfsBasedProcessTree.  Off the top of my head, I can't
>> think of a reliable way to do this, and I can't research it further
>> immediately.  Do others on the thread have ideas?
>>
>> --Chris Nauroth
>>
>> [1] http://linux.die.net/man/2/mlock
>>
>>
>>
>>
>> On 2/4/16, 5:11 AM, "Jan Lukavský" <jan.lukavsky@firma.seznam.cz> wrote:
>>
>>> Hello,
>>>
>>> I have a question about the way LinuxResourceCalculatorPlugin calculates
>>> the memory consumed by a process tree (it is calculated via the
>>> ProcfsBasedProcessTree class). When we enable disk caching in Apache
>>> Spark jobs run on a YARN cluster, the node manager starts to kill the
>>> containers while they read the cached data, because of "Container is
>>> running beyond memory limits ...". The reason is that even if we enable
>>> parsing of the smaps file
>>> (yarn.nodemanager.container-monitor.procfs-tree.smaps-based-rss.enabled),
>>> ProcfsBasedProcessTree counts mmapped read-only pages as consumed
>>> by the process tree, while Spark uses FileChannel.map(MapMode.READ_ONLY)
>>> to read the cached data. The JVM then consumes *a lot* more memory than
>>> the configured heap size (and this cannot really be controlled), but this
>>> memory is IMO not really consumed by the process; the kernel can reclaim
>>> these pages if needed. My question is: is there any explicit reason
>>> why "Private_Clean" pages are counted as consumed by the process tree? I
>>> patched ProcfsBasedProcessTree not to count them, but I don't
>>> know if this is the "correct" solution.
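>>>
>>> For illustration, a minimal standalone reproduction of the mapping 
>>> pattern (not Spark's actual code; the path and the sleep are only for 
>>> the experiment, and it assumes the file is smaller than 2 GB): it maps 
>>> a file read-only and touches every page, which then shows up as 
>>> file-backed Rss/Private_Clean in smaps even though the Java heap stays 
>>> small:
>>>
>>> import java.io.RandomAccessFile;
>>> import java.nio.MappedByteBuffer;
>>> import java.nio.channels.FileChannel;
>>> import java.nio.channels.FileChannel.MapMode;
>>>
>>> public class MapReadOnly {
>>>   public static void main(String[] args) throws Exception {
>>>     // Hypothetical path; any large existing file works.
>>>     try (RandomAccessFile raf = new RandomAccessFile("/tmp/file.bin", "r");
>>>          FileChannel ch = raf.getChannel()) {
>>>       MappedByteBuffer buf = ch.map(MapMode.READ_ONLY, 0, ch.size());
>>>       long sum = 0;
>>>       // Touch every page so it is faulted in and counted in Rss,
>>>       // although none of this memory lives on the Java heap.
>>>       for (int pos = 0; pos < buf.limit(); pos += 4096) {
>>>         sum += buf.get(pos);
>>>       }
>>>       System.out.println("checksum=" + sum
>>>           + "; now inspect /proc/<pid>/smaps of this JVM");
>>>       Thread.sleep(60000); // keep the mapping alive for inspection
>>>     }
>>>   }
>>> }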
>>>
>>> Thanks for opinions,
>>>   cheers,
>>>   Jan
>>>
>>>
>

