cassandra-commits mailing list archives

From "Stefania (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-11594) Too many open files on directories
Date Thu, 25 Aug 2016 04:14:20 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-11594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15436259#comment-15436259
] 

Stefania commented on CASSANDRA-11594:
--------------------------------------

I've reproduced a leak of directory descriptors with a [test|https://github.com/stef1927/cassandra-dtest/commit/6ca5ae864676589132e656105da4c204621692be#diff-60812631a43b8e1f0c9fb53d9f7487ebR209].
The bug mentioned above would leak file descriptors for directories if:

# A transaction log file is present, which means there is an ongoing transaction such as a flush
or a compaction (note that repair uses many validation compactions)
# A request to list the files in a table directory is issued in parallel, which would occur
in the following cases:
## when loading sstables at startup or via nodetool
## when calculating the snapshots size via nodetool listsnapshots or reading the SnapshotSize
metric
## when adding new tables to the schema or updating the keyspace
## when creating indexes or materialized views, or when they get rebuilt
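
For illustration only, here is a minimal sketch of one common way a JVM process ends up holding directory descriptors: a {{DirectoryStream}} that is opened for a listing and never closed. This is not the Cassandra code, and the class and method names ({{DirectoryListing}}, {{listLeaky}}, {{listSafe}}) are hypothetical; whether the actual bug has exactly this shape is an assumption.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Cassandra.
public class DirectoryListing {

    // Leaky variant: the stream keeps a descriptor open on the directory
    // until close() is called, which never happens here.
    static List<Path> listLeaky(Path dir) throws IOException {
        DirectoryStream<Path> stream = Files.newDirectoryStream(dir);
        List<Path> entries = new ArrayList<>();
        for (Path entry : stream) {
            entries.add(entry);
        }
        return entries; // stream is not closed: one directory descriptor stays open
    }

    // Safe variant: try-with-resources closes the stream, and with it the
    // directory descriptor, even if iteration throws.
    static List<Path> listSafe(Path dir) throws IOException {
        List<Path> entries = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path entry : stream) {
                entries.add(entry);
            }
        }
        return entries;
    }
}
```

If a listing like {{listLeaky}} were called on every metric read, each call would leave one more descriptor open on the same table directory, which would be consistent with many identical DIR entries in lsof.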

Of all the points above, I think the most likely is the SnapshotSize metric being monitored.
That is how I reproduced the leak in my test: by inserting data in parallel with a repair
and with nodetool listsnapshots (which exercises the same code as reading SnapshotSize does).

So [~n0rad] do you monitor SnapshotSize by any chance, or can you think of any other of the
points above that might apply? If you want to test the patch let us know, I can apply it to
your release version.
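
To check whether a node is accumulating directory descriptors, something along the lines of the sketch below could be used. It assumes a Linux host where {{/proc/<pid>/fd}} contains one symbolic link per open descriptor; the class name {{OpenDirectoryCounter}} and its method are hypothetical and not part of Cassandra or nodetool.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Hypothetical diagnostic helper, assuming a Linux-style /proc/<pid>/fd layout.
public class OpenDirectoryCounter {

    // Returns, for each directory, how many descriptor links in fdDir point at it.
    static Map<Path, Integer> countOpenDirectories(Path fdDir) throws IOException {
        Map<Path, Integer> counts = new HashMap<>();
        try (DirectoryStream<Path> links = Files.newDirectoryStream(fdDir)) {
            for (Path link : links) {
                // isDirectory follows the link; sockets, pipes and files are skipped.
                if (!Files.isSymbolicLink(link) || !Files.isDirectory(link)) {
                    continue;
                }
                try {
                    Path target = Files.readSymbolicLink(link);
                    counts.merge(target, 1, Integer::sum);
                } catch (IOException e) {
                    // The descriptor was closed between listing and reading; ignore it.
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java OpenDirectoryCounter <pid>
        Path fdDir = Paths.get("/proc", args[0], "fd");
        countOpenDirectories(fdDir)
            .forEach((dir, n) -> System.out.println(n + "\t" + dir));
    }
}
```

A directory whose count keeps growing across runs while the node is otherwise steady would point at a leak like the one described above. The process running this needs permission to read the target process's {{fd}} directory.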

> Too many open files on directories
> ----------------------------------
>
>                 Key: CASSANDRA-11594
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11594
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: n0rad
>            Assignee: Stefania
>            Priority: Critical
>         Attachments: Grafana   Cassandra   Cluster.png, openfiles.zip, screenshot.png
>
>
> I have a 6 nodes cluster in prod in 3 racks.
> each node :
> - 4Gb commitlogs on 343 files
> - 275Gb data on 504 files 
> On Saturday, 1 node in each rack crashed with too many open files (it seems to be the
similar node in each rack).
> {code}
> lsof -n -p $PID give me 66899 out of 65826 max
> {code}
> it contains 64527 open directories (2371 uniq)
> a part of the list :
> {code}
> java    19076 root 2140r      DIR   8,17      143360 4386718705 /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java    19076 root 2141r      DIR   8,17      143360 4386718705 /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java    19076 root 2142r      DIR   8,17      143360 4386718705 /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java    19076 root 2143r      DIR   8,17      143360 4386718705 /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java    19076 root 2144r      DIR   8,17      143360 4386718705 /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java    19076 root 2145r      DIR   8,17      143360 4386718705 /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java    19076 root 2146r      DIR   8,17      143360 4386718705 /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java    19076 root 2147r      DIR   8,17      143360 4386718705 /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java    19076 root 2148r      DIR   8,17      143360 4386718705 /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java    19076 root 2149r      DIR   8,17      143360 4386718705 /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java    19076 root 2150r      DIR   8,17      143360 4386718705 /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java    19076 root 2151r      DIR   8,17      143360 4386718705 /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java    19076 root 2152r      DIR   8,17      143360 4386718705 /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java    19076 root 2153r      DIR   8,17      143360 4386718705 /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java    19076 root 2154r      DIR   8,17      143360 4386718705 /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> java    19076 root 2155r      DIR   8,17      143360 4386718705 /opt/stage2/pod-cassandra-aci-cassandra/rootfs/data/keyspaces/email_logs_query/emails-2d4abd00e9ea11e591199d740e07bd95
> {code}
> The 3 other nodes crashed 4 hours later.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
