cassandra-commits mailing list archives

From "Heiko Sommer (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-12699) Excessive use of "hidden" Linux page table memory
Date Tue, 27 Sep 2016 09:19:20 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-12699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15525576#comment-15525576 ]

Heiko Sommer commented on CASSANDRA-12699:
------------------------------------------

I think that C* could keep its page tables much smaller than the 3-4 GB that I see, no matter
how large the data files are. For compactions and anticompactions, I believe that C* reads
through large data files, merging or splitting their content in an ordered fashion, but never
jumping around in any of these files. Shouldn't it be possible to "mmap" only a small "running
working section" of these files, and to quickly "munmap" the pages that have already been
processed? Or to "mmap" the whole file at once, trusting that Linux will only populate the page
table upon page access (which would explain the slow ramp-up of PTE memory in my plot), but still
to "munmap" pages as soon as possible?

This immediate unmapping might be difficult to implement, especially if Java libraries do
not support it, but at least http://lanyrd.com/2013/apachecon/scctbq/ talks about C* already
doing unmapping tricks. Could those tricks be extended to the file access done for (anti-)compactions?
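
To make the idea more concrete, here is a minimal sketch of what I have in mind (not actual
Cassandra code; the 64 MB window size is arbitrary, and the reflection-based unmap is just the
Java 8 sun.misc.Cleaner trick that the talk above seems to refer to):

import java.io.IOException;
import java.lang.reflect.Method;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch: stream through a large file by mapping only a bounded window at a
// time and unmapping it as soon as it has been processed, so the page table
// never holds entries for the whole file at once.
public class WindowedMmapSketch {

    private static final long WINDOW_SIZE = 64L * 1024 * 1024; // illustrative

    public static void scan(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            for (long offset = 0; offset < size; offset += WINDOW_SIZE) {
                long length = Math.min(WINDOW_SIZE, size - offset);
                MappedByteBuffer window = channel.map(FileChannel.MapMode.READ_ONLY, offset, length);
                try {
                    process(window);
                } finally {
                    unmap(window); // release the mapping immediately, not at GC time
                }
            }
        }
    }

    private static void process(MappedByteBuffer window) {
        // Stand-in for the sequential merge/split work of an (anti-)compaction.
        while (window.hasRemaining()) {
            window.get();
        }
    }

    // Java 8 only: invokes the internal cleaner of the mapped buffer via
    // reflection (the sun.misc.Cleaner "unmapping trick").
    private static void unmap(MappedByteBuffer buffer) {
        try {
            Method cleanerMethod = buffer.getClass().getMethod("cleaner");
            cleanerMethod.setAccessible(true);
            Object cleaner = cleanerMethod.invoke(buffer);
            Method cleanMethod = cleaner.getClass().getMethod("clean");
            cleanMethod.setAccessible(true);
            cleanMethod.invoke(cleaner);
        } catch (Exception e) {
            // If the trick is not available, fall back to letting the GC unmap it.
        }
    }
}

Obviously the real code would have to respect sstable and compression block boundaries; this is
only meant to illustrate the windowed map/unmap pattern.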

I agree that buying 4 GB more RAM to work around the PTE memory issue could be an easy option
for our cluster, but I still suspect that there could be "freak wave" cases where much more
than 4 GB of PTE memory is needed, so fixing this should be of general interest.

I will also look into issues with VmData and anonymously mapped memory that Ariel pointed
out, to see how our cluster is doing with these. 

> Excessive use of "hidden" Linux page table memory
> -------------------------------------------------
>
>                 Key: CASSANDRA-12699
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12699
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: Cassandra 2.2.7 on Red Hat 6.7, with Java 1.8.0_73. Probably others.

>            Reporter: Heiko Sommer
>         Attachments: PageTableMemoryExample.png, cassandra-env.sh, cassandra.yaml, cassandraMemoryLog.sh
>
>
> The Cassandra JVM process uses many gigabytes of page table memory during certain activities,
which can lead to oom-killer action with "java.lang.OutOfMemoryError: null" logs.
> Page table memory is not reported by Linux tools such as "top" or "ps" and might therefore
also be responsible for other spurious Cassandra issues with "memory eating" or crashes,
e.g. CASSANDRA-8723.
> The problem happens especially (or only?) during large compactions and anticompactions.

> Eventually all memory gets released, which means there is no real leak. Still I suspect
that the memory mappings that fill the page table could be released much sooner, to keep the
page table size at a small fraction of the total Cassandra process memory. 
> How to reproduce: Record the memory use on a Cassandra node, including page table memory,
for example using the attached script cassandraMemoryLog.sh. Even when there is no crash,
the ramping up and sudden release of page table memory is visible. 
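> (For reference, the page table figure presumably comes from the VmPTE field in /proc/$PID/status,
quoted further below; a minimal, hypothetical Java sketch of that lookup, purely for illustration:)

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper (the attached cassandraMemoryLog.sh is the real tool):
// read the VmPTE value, in kB, from /proc/<pid>/status.
public class PteSizeSketch {
    public static long vmPteKb(long pid) throws IOException {
        for (String line : Files.readAllLines(Paths.get("/proc/" + pid + "/status"))) {
            if (line.startsWith("VmPTE:")) {
                // Line looks like: "VmPTE:   3965856 kB"
                return Long.parseLong(line.replaceAll("[^0-9]", ""));
            }
        }
        return -1; // field not present
    }
}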
> A stacked area plot for the memory on one of our crashed nodes is attached (PageTableMemoryExample.png).
The page table memory used by Cassandra is shown in red ("VmPTE").
> (In the plot we also see that the sum of measured memory portions sometimes exceeds the
total memory. This is probably an issue of how RSS memory is measured, perhaps including some
buffers/cache memory that also counts toward available memory. It does not invalidate the
finding that page table memory is growing to enormous sizes.) 
> Shortly before the crash, /proc/$PID/status reported 
>                 VmPeak: 6989760944 kB
>                 VmSize: 5742400572 kB
>                 VmLck:   4735036 kB
>                 VmHWM:   8589972 kB
>                 VmRSS:   7022036 kB
>                 VmData: 10019732 kB
>                 VmStk:        92 kB
>                 VmExe:         4 kB
>                 VmLib:     17584 kB
>                 VmPTE:   3965856 kB
>                 VmSwap:        0 kB
> The files cassandra.yaml and cassandra-env.sh used on the node where the data was taken
are attached. 
> Please let me know if I should provide any other data or descriptions to help with this
ticket. 
> Known workarounds: Use more RAM, or limit the amount of Java heap memory. In the above
crash, MAX_HEAP_SIZE was not set, so that the default heap size for 12 GB RAM was used (-Xms2976M,
-Xmx2976M). 
> We have not yet tried whether variations of heap vs. offheap config choices make a difference.

> Perhaps there are other workarounds using -XX:+UseLargePages or related Linux settings
to reduce the size of the process page table?
> I believe that we see these crashes more often than other projects do because we have a
test system with little RAM but a lot of data (~3 TB compressed per node), and slow CPUs,
so that anti-/compactions overlap a lot. 
> Ideally Cassandra (native) code should be changed to release memory in smaller chunks,
so that page table size cannot cause an otherwise stable system to crash.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
