cassandra-commits mailing list archives

From "graham sanderson (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-6275) 2.0.x leaks file handles
Date Tue, 19 Nov 2013 20:43:21 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-6275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13826912#comment-13826912 ]

graham sanderson edited comment on CASSANDRA-6275 at 11/19/13 8:42 PM:
-----------------------------------------------------------------------

Yes, I believe we can mitigate the problem in the OpsCenter case; however, it is a good test
bed since it makes the problem easy to spot. Note that the leak seems to be worse under high
read/write activity on tracked keyspaces/CFs, which makes sense.

Note that I was poking (somewhat blindly) through the 2.0.2 code (partly out of interest) looking
for what might be leaking these file handles, and I also took a heap dump. I discovered what
turned out to be CASSANDRA-6358, which leaks FileDescriptors, though their refCounts all seemed
to be 0. In any case there weren't enough FileDescriptors in total in the heap dump to account
for the problem. They were also for memory-mapped files (the ifile in SSTableReader), and none
of the leaked deleted file handles were memory-mapped (since they were compressed data files).

That said, CASSANDRA-6358 was pinning the SSTableReaders in memory (since the Runnable was
an anonymous inner class), so someone with more knowledge of the code might have a better
idea of whether this could be a problem beyond the memory leak.
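
For illustration, here is a minimal, self-contained sketch (hypothetical class names, not the actual Cassandra code) of why an anonymous inner class Runnable pins its enclosing instance: the Runnable carries an implicit reference to the outer object, so anything that keeps the Runnable reachable also keeps the enclosing reader, and everything it references, alive.

{noformat}
// Sketch with hypothetical names: each anonymous Runnable captures an implicit
// reference to its enclosing Reader, so every queued Runnable keeps a whole
// Reader (and its buffer) reachable until the queue is drained.
import java.util.ArrayList;
import java.util.List;

public class PinningExample {
    static final List<Runnable> pendingTasks = new ArrayList<>();

    static class Reader {
        final byte[] buffer = new byte[1_000_000]; // stands in for SSTableReader state

        void scheduleTask() {
            pendingTasks.add(new Runnable() {
                @Override
                public void run() {
                    // Referencing Reader.this guarantees the enclosing instance is captured.
                    System.out.println("buffer size: " + Reader.this.buffer.length);
                }
            });
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 100; i++) {
            new Reader().scheduleTask();
        }
        // No Reader is referenced directly any more, yet all 100 Readers (and
        // ~100 MB of buffers) stay reachable through the queued Runnables.
        System.out.println("pending tasks: " + pendingTasks.size());
    }
}
{noformat}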

I don't yet have an environment where I can easily build and install code changes, though
we could downgrade our system test environment to 2.0.0 to see if we can reproduce the problem
there. I'm not sure we can downgrade to 1.2.x easily given our current testing.

Note that while I was looking at the code I came across CASSANDRA-5555... What caught my eye was
the interaction between FileCacheService and RAR.deallocate, and more specifically the fact
that this change added a concurrent structure inside another, separate concurrent structure.
It seemed like there might be a case where a RAR was recycled into a concurrent queue that had
already been removed and drained, in which case it would get GCed without being closed,
presumably causing a file handle leak on the native side. However, I couldn't come up with any
convincing interaction that would cause this without some very unlucky timing (and my knowledge
of the Google cache implementation is even more limited!), so this is unlikely to be the cause
of this issue (especially if the issue doesn't happen in the 1.2.7+ branch), both because I
think nearly all deleted data files are being leaked, and because there is no particular
correlation with TTL.
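
To make the suspected interleaving concrete, here is a deliberately simplified sketch (hypothetical names, not the actual FileCacheService/RandomAccessReader code) of the kind of race I had in mind: a per-file queue nested inside an outer concurrent map, where a reader recycled after its queue has already been removed and drained never gets closed.

{noformat}
// Simplified sketch with hypothetical names. If the recycler looks up the
// per-file queue, eviction then removes and drains that queue, and only
// afterwards does the recycler enqueue the reader, the reader sits in an
// orphaned queue and is never closed -- it would eventually be GCed without
// releasing its underlying file handle.
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ConcurrentMap;

public class RecycleRaceSketch {
    static class PooledReader {
        boolean closed = false;
        void close() { closed = true; } // stands in for releasing the native file handle
    }

    static final ConcurrentMap<String, Queue<PooledReader>> cache = new ConcurrentHashMap<>();

    // Eviction path: remove the per-file queue and close everything queued so far.
    static void evict(String path) {
        Queue<PooledReader> q = cache.remove(path);
        if (q == null) return;
        PooledReader r;
        while ((r = q.poll()) != null)
            r.close();
    }

    // Recycle path: return a reader to the pool using a previously looked-up queue.
    static void recycle(Queue<PooledReader> q, PooledReader reader) {
        if (q != null)
            q.offer(reader); // may target a queue that eviction has already drained
        else
            reader.close();
    }

    public static void main(String[] args) {
        String path = "big-Data.db";
        cache.put(path, new ConcurrentLinkedQueue<PooledReader>());
        PooledReader reader = new PooledReader();

        Queue<PooledReader> q = cache.get(path); // recycler looked up the queue...
        evict(path);                             // ...eviction removed and drained it...
        recycle(q, reader);                      // ...then the recycle completed.

        System.out.println("reader closed? " + reader.closed); // false -> handle leaks
    }
}
{noformat}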


> 2.0.x leaks file handles
> ------------------------
>
>                 Key: CASSANDRA-6275
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6275
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: java version "1.7.0_25"
> Java(TM) SE Runtime Environment (build 1.7.0_25-b15)
> Java HotSpot(TM) 64-Bit Server VM (build 23.25-b01, mixed mode)
> Linux cassandra-test1 2.6.32-279.el6.x86_64 #1 SMP Thu Jun 21 15:00:18 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
>            Reporter: Mikhail Mazursky
>            Assignee: Michael Shuler
>         Attachments: c_file-descriptors_strace.tbz, cassandra_jstack.txt, leak.log, position_hints.tgz, slog.gz
>
>
> Looks like C* is leaking file descriptors when doing lots of CAS operations.
> {noformat}
> $ sudo cat /proc/15455/limits
> Limit                     Soft Limit           Hard Limit           Units    
> Max cpu time              unlimited            unlimited            seconds  
> Max file size             unlimited            unlimited            bytes    
> Max data size             unlimited            unlimited            bytes    
> Max stack size            10485760             unlimited            bytes    
> Max core file size        0                    0                    bytes    
> Max resident set          unlimited            unlimited            bytes    
> Max processes             1024                 unlimited            processes
> Max open files            4096                 4096                 files    
> Max locked memory         unlimited            unlimited            bytes    
> Max address space         unlimited            unlimited            bytes    
> Max file locks            unlimited            unlimited            locks    
> Max pending signals       14633                14633                signals  
> Max msgqueue size         819200               819200               bytes    
> Max nice priority         0                    0                   
> Max realtime priority     0                    0                   
> Max realtime timeout      unlimited            unlimited            us 
> {noformat}
> Looks like the problem is not in limits.
> Before load test:
> {noformat}
> cassandra-test0 ~]$ lsof -n | grep java | wc -l
> 166
> cassandra-test1 ~]$ lsof -n | grep java | wc -l
> 164
> cassandra-test2 ~]$ lsof -n | grep java | wc -l
> 180
> {noformat}
> After load test:
> {noformat}
> cassandra-test0 ~]$ lsof -n | grep java | wc -l
> 967
> cassandra-test1 ~]$ lsof -n | grep java | wc -l
> 1766
> cassandra-test2 ~]$ lsof -n | grep java | wc -l
> 2578
> {noformat}
> Most opened files have names like:
> {noformat}
> java      16890 cassandra 1636r      REG             202,17  88724987     655520 /var/lib/cassandra/data/system/paxos/system-paxos-jb-644-Data.db
> java      16890 cassandra 1637r      REG             202,17 161158485     655420 /var/lib/cassandra/data/system/paxos/system-paxos-jb-255-Data.db
> java      16890 cassandra 1638r      REG             202,17  88724987     655520 /var/lib/cassandra/data/system/paxos/system-paxos-jb-644-Data.db
> java      16890 cassandra 1639r      REG             202,17 161158485     655420 /var/lib/cassandra/data/system/paxos/system-paxos-jb-255-Data.db
> java      16890 cassandra 1640r      REG             202,17  88724987     655520 /var/lib/cassandra/data/system/paxos/system-paxos-jb-644-Data.db
> java      16890 cassandra 1641r      REG             202,17 161158485     655420 /var/lib/cassandra/data/system/paxos/system-paxos-jb-255-Data.db
> java      16890 cassandra 1642r      REG             202,17  88724987     655520 /var/lib/cassandra/data/system/paxos/system-paxos-jb-644-Data.db
> java      16890 cassandra 1643r      REG             202,17 161158485     655420 /var/lib/cassandra/data/system/paxos/system-paxos-jb-255-Data.db
> java      16890 cassandra 1644r      REG             202,17  88724987     655520 /var/lib/cassandra/data/system/paxos/system-paxos-jb-644-Data.db
> java      16890 cassandra 1645r      REG             202,17 161158485     655420 /var/lib/cassandra/data/system/paxos/system-paxos-jb-255-Data.db
> java      16890 cassandra 1646r      REG             202,17  88724987     655520 /var/lib/cassandra/data/system/paxos/system-paxos-jb-644-Data.db
> java      16890 cassandra 1647r      REG             202,17 161158485     655420 /var/lib/cassandra/data/system/paxos/system-paxos-jb-255-Data.db
> java      16890 cassandra 1648r      REG             202,17  88724987     655520 /var/lib/cassandra/data/system/paxos/system-paxos-jb-644-Data.db
> java      16890 cassandra 1649r      REG             202,17 161158485     655420 /var/lib/cassandra/data/system/paxos/system-paxos-jb-255-Data.db
> java      16890 cassandra 1650r      REG             202,17  88724987     655520 /var/lib/cassandra/data/system/paxos/system-paxos-jb-644-Data.db
> java      16890 cassandra 1651r      REG             202,17 161158485     655420 /var/lib/cassandra/data/system/paxos/system-paxos-jb-255-Data.db
> java      16890 cassandra 1652r      REG             202,17  88724987     655520 /var/lib/cassandra/data/system/paxos/system-paxos-jb-644-Data.db
> java      16890 cassandra 1653r      REG             202,17 161158485     655420 /var/lib/cassandra/data/system/paxos/system-paxos-jb-255-Data.db
> java      16890 cassandra 1654r      REG             202,17  88724987     655520 /var/lib/cassandra/data/system/paxos/system-paxos-jb-644-Data.db
> java      16890 cassandra 1655r      REG             202,17 161158485     655420 /var/lib/cassandra/data/system/paxos/system-paxos-jb-255-Data.db
> java      16890 cassandra 1656r      REG             202,17  88724987     655520 /var/lib/cassandra/data/system/paxos/system-paxos-jb-644-Data.db
> {noformat}
> Also, when that happens it's not always possible to shut down the server process via SIGTERM; we have to use SIGKILL.
> P.S. See the mailing list thread for more context: https://www.mail-archive.com/user@cassandra.apache.org/msg33035.html



--
This message was sent by Atlassian JIRA
(v6.1#6144)
