I'm operating several two-node clusters (version 0.6.5) on VMs in our development and test environments.
After about a week of operation under similar conditions, one of them started throwing this:
WARN [main] 2010-10-12 08:08:31,245 CustomTThreadPoolServer.java (line 104) Transport error occurred during acceptance of message.
org.apache.thrift.transport.TTransportException: java.net.SocketException: Too many open files
Caused by: java.net.SocketException: Too many open files
at java.net.PlainSocketImpl.socketAccept(Native Method)
... 5 more
I found that the offending node had hundreds of sockets in CLOSE_WAIT state (on the StoragePort, between the two nodes), which was pushing new connections into the file-descriptor limit. It seems similar to what was originally described (but never resolved) several months ago in this thread:
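For anyone who wants to reproduce the check, something like the following rough sketch is what I used to count the stuck sockets. It assumes a Linux node and the default StoragePort of 7000 (substitute whatever your storage-conf.xml specifies), and just parses /proc/net/tcp, where the state field 08 means CLOSE_WAIT:

    # count_close_wait.py -- rough sketch; assumes Linux and the
    # default StoragePort of 7000 (an assumption -- use the value
    # from your storage-conf.xml).
    STORAGE_PORT = 7000

    def close_wait_count(port):
        count = 0
        with open('/proc/net/tcp') as f:
            next(f)  # skip the header line
            for line in f:
                fields = line.split()
                # fields[1]/fields[2] are hex "addr:port" for the local
                # and remote ends; fields[3] is the hex socket state,
                # where 08 is CLOSE_WAIT.
                local = int(fields[1].split(':')[1], 16)
                remote = int(fields[2].split(':')[1], 16)
                if fields[3] == '08' and port in (local, remote):
                    count += 1
        return count

    if __name__ == '__main__':
        print('CLOSE_WAIT sockets on port %d: %d'
              % (STORAGE_PORT, close_wait_count(STORAGE_PORT)))

If you'd rather not script it, netstat -tan | grep CLOSE_WAIT gives the same picture.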
Has anyone else encountered this problem? I'm curious what might trigger this in one cluster but not in the others, which operate in the same environment and are configured similarly.
Any insight would be appreciated.