cassandra-commits mailing list archives

From "Serg Shnerson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-4571) Strange permament socket descriptors increasing leads to "Too many open files"
Date Thu, 23 Aug 2012 18:54:42 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-4571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13440552#comment-13440552
] 

Serg Shnerson commented on CASSANDRA-4571:
------------------------------------------

It seems that the bug is related to Java NIO internals (and possibly to the Thrift framework).
Please read https://forums.oracle.com/forums/thread.jspa?threadID=1146235 for more details
and share your thoughts.
From the topic: "I am submitting this post to highlight a possible NIO "gotcha" in multithreaded
applications and pose a couple of questions. We have observed file descriptor resource leakage
(eventually leading to server failure) in a server process using NIO within the excellent
framework written by Ronny Standtke (http://nioframework.sourceforge.net). Platform is JDK1.6.0_05
on RHEL4. I don't think that this is the same issue as that in connection with TCP CLOSED
sockets reported elsewhere - What leaks here are descriptors connected to Unix domain sockets.

In the framework, SelectableChannels registered in a selector are select()-ed in a single
thread that handles data transfer to clients of the selector channels, executing in different
threads. When a client shuts down its connection (invoking key.cancel() and key.channel.close())
eventually we get to JRE AbstractInterruptibleChannel::close() and SocketChannelImpl::implCloseSelectableChannel()
which does the preClose() - via JNI this dup2()s a statically maintained descriptor (attached
to a dummy Unix domain socket) onto the underlying file descriptor (as discussed by Alan Bateman
(http://mail.openjdk.java.net/pipermail/core-libs-dev/2008-January/000219.html)). The problem
occurs when the select() thread runs at the same time and the cancelled key is seen by SelectorImpl::processDeregisterQueue().
Eventually (in our case) EPollSelectorImpl::implDereg() tests the "channel closed" flag set
by AbstractInterruptibleChannel::close() (this is not read-protected by a lock) and executes
channel.kill() which closes the underlying file descriptor. If this happens before the preClose()
in the other thread, the out-of-sequence dup2() leaks the file descriptor, attached to the
UNIX domain socket.

In the framework mentioned, we don't particularly want to add locking in the select() thread
as this would impact other clients of the selector - alternatively a fix is to simply comment
out the key.cancel(). channel.close() does the cancel() for us anyway, but after the close()/preClose()
has completed, so the select() processing then occurs in the right sequence. (I am notifying
Ronny Standtke of this issue independently)."
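
The workaround described in the quoted post (drop the explicit key.cancel() and let channel.close() cancel the key) can be sketched in Java roughly as follows. This is only a minimal illustration, not code from nioframework, Thrift, or Cassandra: the class name CloseWithoutCancel and the helpers closeChannelOnly and remainingKeysAfterClose are hypothetical, and the loopback server exists only to give the example a connected channel. It shows the documented behaviour that closing a channel invalidates its keys; it does not reproduce the race itself.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class CloseWithoutCancel {

    // Hypothetical helper: close the channel only and skip key.cancel().
    // Closing a selectable channel cancels all of its keys, so the
    // cancellation happens after the close has run, not before it.
    static void closeChannelOnly(SelectionKey key) {
        try {
            key.channel().close();
        } catch (IOException e) {
            // Nothing useful to do in this sketch; real code would log it.
        }
    }

    // Registers one client channel, closes it without key.cancel(), and
    // returns how many keys the selector still holds afterwards.
    static int remainingKeysAfterClose() {
        try (Selector selector = Selector.open();
             ServerSocketChannel server = ServerSocketChannel.open()) {
            // Loopback listener on an ephemeral port (illustration only).
            server.bind(new InetSocketAddress("127.0.0.1", 0));

            SocketChannel client = SocketChannel.open(server.getLocalAddress());
            client.configureBlocking(false);
            SelectionKey key = client.register(selector, SelectionKey.OP_READ);

            closeChannelOnly(key);
            if (key.isValid()) {
                throw new IllegalStateException("key should be invalid after close()");
            }

            // The next selection operation removes cancelled keys from the key set.
            selector.selectNow();
            return selector.keys().size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("keys still registered: " + remainingKeysAfterClose());
    }
}
```

Whether this ordering avoids the leak on a given JDK build is exactly what the quoted post asks about, so treat it as a starting point for testing rather than a confirmed fix.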

See also the following links for more information:
http://stackoverflow.com/questions/7038688/java-nio-causes-file-descriptor-leak
http://mail-archives.apache.org/mod_mbox/tomcat-users/201201.mbox/%3CCAJkSUv-DDKTCQ-pD7W=QOVmPH1dXeXOvcr+3mCgu05cqpT7Zjg@mail.gmail.com%3E
http://www.apacheserver.net/HBase-Thrift-for-CDH3U3-leaking-file-descriptors-socket-at1580921.htm

                
> Strange permament socket descriptors increasing leads to "Too many open files"
> ------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4571
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4571
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.2
>         Environment: CentOS 5.8 Linux 2.6.18-308.13.1.el5 #1 SMP Tue Aug 21 17:10:18 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux.
> java version "1.6.0_33"
> Java(TM) SE Runtime Environment (build 1.6.0_33-b03)
> Java HotSpot(TM) 64-Bit Server VM (build 20.8-b03, mixed mode)
>            Reporter: Serg Shnerson
>            Priority: Critical
>
> On a two-node cluster we found a strange, steady increase in socket descriptors.
> lsof -n | grep java shows many rows like "
> java       8380 cassandra  113r     unix 0xffff8101a374a080            938348482 socket
> java       8380 cassandra  114r     unix 0xffff8101a374a080            938348482 socket
> java       8380 cassandra  115r     unix 0xffff8101a374a080            938348482 socket
> java       8380 cassandra  116r     unix 0xffff8101a374a080            938348482 socket
> java       8380 cassandra  117r     unix 0xffff8101a374a080            938348482 socket
> java       8380 cassandra  118r     unix 0xffff8101a374a080            938348482 socket
> java       8380 cassandra  119r     unix 0xffff8101a374a080            938348482 socket
> java       8380 cassandra  120r     unix 0xffff8101a374a080            938348482 socket
> " The number of these rows increases constantly. After about 24 hours this leads
> to the "Too many open files" error.
> We use the PHPCassa client. The load is not high (around 50kb/s on write).
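
To watch the descriptor count from inside the JVM instead of running lsof by hand, a sketch like the following may help. It assumes a HotSpot/OpenJDK JVM on a Unix-like system, where the platform OperatingSystemMXBean also implements com.sun.management.UnixOperatingSystemMXBean; the class name FdCount and the method openFds are made up for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import com.sun.management.UnixOperatingSystemMXBean;

public class FdCount {

    // Returns the number of file descriptors currently open in this JVM,
    // or -1 if the platform bean does not expose that value.
    static long openFds() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            return ((UnixOperatingSystemMXBean) os).getOpenFileDescriptorCount();
        }
        return -1L;
    }

    public static void main(String[] args) {
        System.out.println("open file descriptors: " + openFds());
    }
}
```

Sampling this value periodically and comparing it with the process limit (ulimit -n) would show the same growth as the lsof listing above, without telling which descriptors are leaking.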

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       
