Greetings.

I'm operating several two-node clusters (version 0.6.5) on VMs in our development and test environments.

After about a week of operation under similar conditions, one of them started throwing this:

WARN [main] 2010-10-12 08:08:31,245 CustomTThreadPoolServer.java (line 104) Transport error occurred during acceptance of message.
org.apache.thrift.transport.TTransportException: java.net.SocketException: Too many open files
        at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:124)
        at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:35)
        at org.apache.thrift.transport.TServerTransport.accept(TServerTransport.java:31)
        at org.apache.cassandra.thrift.CustomTThreadPoolServer.serve(CustomTThreadPoolServer.java:98)
        at org.apache.cassandra.thrift.CassandraDaemon.start(CassandraDaemon.java:186)
        at org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:227)
Caused by: java.net.SocketException: Too many open files
        at java.net.PlainSocketImpl.socketAccept(Native Method)
        at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:384)
        at java.net.ServerSocket.implAccept(ServerSocket.java:453)
        at java.net.ServerSocket.accept(ServerSocket.java:421)
        at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:119)
        ... 5 more

I found that the offending node had hundreds of sockets (on the StoragePort, between the two nodes) in CLOSE_WAIT state, which was causing new connections to hit the file descriptor limit. It looks similar to what was described (but never resolved) several months ago in this thread:

http://www.mail-archive.com/user@cassandra.apache.org/msg01381.html
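
In case anyone wants to compare against their own nodes, here is a rough sketch of how the socket states could be tallied from saved `netstat -tan` output. It is only an illustration: the function name count_states is made up, the parsing assumes the Linux netstat column layout, and port 7000 is an assumption (the stock StoragePort), so adjust it to match your storage-conf.xml.

```python
from collections import Counter

# Assumption: 7000 is the configured StoragePort; change it if yours differs.
STORAGE_PORT = 7000


def count_states(netstat_output, port=STORAGE_PORT):
    """Tally TCP states for sockets whose local or remote port is `port`.

    Expects text in the Linux `netstat -tan` layout:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip banner and header lines, and anything that is not a TCP row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, remote, state = fields[3], fields[4], fields[5]
        # The port is whatever follows the last ':' in each address.
        ports = {local.rsplit(":", 1)[-1], remote.rsplit(":", 1)[-1]}
        if str(port) in ports:
            counts[state] += 1
    return counts
```

Feeding it the text of `netstat -tan` captured on each node, and comparing the CLOSE_WAIT count with the limit reported by `ulimit -n` for the Cassandra process, should show how close a node is to running out of descriptors.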

Has anyone else encountered this problem? I am curious about what might trigger this in one cluster and not on the others (which operate in the same environment, and are configured similarly).

Any insight would be appreciated.

Thanks,
Adam