incubator-cassandra-user mailing list archives

From graham sanderson <gra...@vast.com>
Subject Re: binary protocol server side sockets
Date Wed, 09 Apr 2014 20:17:22 GMT
Thanks Michael,

Yup, keepalive is not the default. It is possible they are going away after nf_conntrack_tcp_timeout_established;
I will have to do more digging (it is hard to tell how old a connection is - there are no visible
timers (via netstat) on an ESTABLISHED connection)…

This is actually low on my priority list; I was just spending a bit of time trying to track
down the source of

ERROR [Native-Transport-Requests:3833603] 2014-04-09 17:46:48,833 ErrorMessage.java (line
222) Unexpected exception during request
java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
	at sun.nio.ch.IOUtil.read(IOUtil.java:192)
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
	at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64)
	at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:109)
	at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
	at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:90)
	at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:744)

errors, which are spamming our server logs quite a lot (I originally thought these might be
caused by KEEPALIVE, which is when I realized that the connections didn’t have keepalive set
and were building up) - it would be nice if netty would tell us a little about the socket
channel in the error message (maybe there is a way to do this by changing log levels, but
as I say I haven’t had time to go digging there)
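
For what it’s worth, it looks like a handler near the top of the pipeline could log the peer
itself. Here is a rough, untested sketch against the Netty 3.x (org.jboss.netty) API the native
transport uses - the class name is made up and this is not Cassandra’s actual handler, just an
illustration of attaching the remote address to these log lines:

import java.io.IOException;

import org.jboss.netty.channel.ChannelHandlerContext;
import org.jboss.netty.channel.ExceptionEvent;
import org.jboss.netty.channel.SimpleChannelUpstreamHandler;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical handler, not part of Cassandra: logs which client socket the
// IOException came from instead of a bare "Connection reset by peer".
public class PeerAwareExceptionHandler extends SimpleChannelUpstreamHandler
{
    private static final Logger logger = LoggerFactory.getLogger(PeerAwareExceptionHandler.class);

    @Override
    public void exceptionCaught(ChannelHandlerContext ctx, ExceptionEvent e)
    {
        if (e.getCause() instanceof IOException)
        {
            // getRemoteAddress() identifies the peer whose connection reset
            logger.warn("IO error on connection from {}: {}",
                        e.getChannel().getRemoteAddress(), e.getCause().getMessage());
        }
        else
        {
            // let anything unexpected keep flowing up the pipeline
            ctx.sendUpstream(e);
        }
    }
}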

I will probably file a JIRA issue to add the setting (since I can’t see any particular harm
in setting keepalive).
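
At the Netty level the change itself should be small - with Netty 3.x it is just a child option
on the ServerBootstrap, i.e. the server-side analogue of thrift’s rpc_keepalive. A minimal
standalone sketch (not Cassandra’s actual Server class; the class name is made up and the
port/empty pipeline are only there to keep it self-contained):

import java.net.InetSocketAddress;
import java.util.concurrent.Executors;

import org.jboss.netty.bootstrap.ServerBootstrap;
import org.jboss.netty.channel.ChannelPipeline;
import org.jboss.netty.channel.ChannelPipelineFactory;
import org.jboss.netty.channel.Channels;
import org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory;

public class KeepAliveServerSketch
{
    public static void main(String[] args)
    {
        ServerBootstrap bootstrap = new ServerBootstrap(
                new NioServerSocketChannelFactory(Executors.newCachedThreadPool(),
                                                  Executors.newCachedThreadPool()));

        // Empty pipeline just so the sketch runs; a real server would install
        // its frame decoder and message dispatcher here.
        bootstrap.setPipelineFactory(new ChannelPipelineFactory()
        {
            public ChannelPipeline getPipeline()
            {
                return Channels.pipeline();
            }
        });

        // "child.*" options apply to the accepted per-client sockets, so this
        // turns on SO_KEEPALIVE for every server-side connection.
        bootstrap.setOption("child.keepAlive", true);

        bootstrap.bind(new InetSocketAddress(9042));
    }
}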

On Apr 9, 2014, at 1:34 PM, Michael Shuler <michael@pbandjelly.org> wrote:

> On 04/09/2014 12:41 PM, graham sanderson wrote:
>> Michael, it is not that the connections are being dropped, it is that
>> the connections are not being dropped.
> 
> Thanks for the clarification.
> 
>> These server side sockets are ESTABLISHED, even though the client
>> connection on the other side of the network device is long gone. This
>> may well be an issue with the network device (it is valiantly trying
>> to keep the connection alive it seems).
> 
> Have you tested if they *ever* time out on their own, or do they just keep sticking around
> forever? (maybe 432000 sec (120 hours), which is the default for nf_conntrack_tcp_timeout_established?)
> Trying out all the usage scenarios is really the way to track it down - directly on switch,
> behind/in front of firewall, on/off the VPN.
> 
>> That said KEEPALIVE on the server side would not be a bad idea. At
>> least then the OS on the server would eventually (probably after 2
>> hours of inactivity) attempt to ping the client. At that point
>> hopefully something interesting would happen perhaps causing an error
>> and destroying the server side socket (note KEEPALIVE is also good
>> for preventing idle connections from being dropped by other network
>> devices along the way)
> 
> Tuning net.ipv4.tcp_keepalive_* could be helpful, if you know they time out after 2 hours,
> which is the default.
> 
>> rpc_keepalive on the server sets keep alive on the server side
>> sockets for thrift, and is true by default
>> 
>> There doesn’t seem to be a setting for the native protocol
>> 
>> Note this isn’t a huge issue for us, they can be cleaned up by a
>> rolling restart, and this particular case is not production, but
>> related to development/testing against alpha by people working
>> remotely over VPN - and it may well be the VPN’s fault in this case…
>> that said (and maybe this is a dev list question), it seems like the
>> option to set keepalive should exist.
> 
> Yeah, but I agree you shouldn't have to restart to clean up connections - that's why
> I think it is lower in the network stack, and that a bit of troubleshooting and tuning might
> be helpful. That setting sounds like a good Jira request - keepalive may be the default, I'm
> not sure. :)
> 
> -- 
> Michael
> 
>> On Apr 9, 2014, at 12:25 PM, Michael Shuler <michael@pbandjelly.org>
>> wrote:
>> 
>>> On 04/09/2014 11:39 AM, graham sanderson wrote:
>>>> Thanks, but I would think that just sets keep alive from the
>>>> client end; I’m talking about the server end… this is one of
>>>> those issues where there is something (e.g. switch, firewall, VPN
>>>> in between the client and the server) and we get left with
>>>> orphaned established connections to the server when the client is
>>>> gone.
>>> 
>>> There would be no server setting for any service, not just c*, that
>>> would correct mis-configured connection-assassinating network gear
>>> between the client and server. Fix the gear to allow persistent
>>> connections.
>>> 
>>> Digging through the various timeouts in c*.yaml didn't lead me to a
>>> simple answer for something tunable, but I think this may be more
>>> basic networking related. I believe it's up to the client to keep
>>> the connection open as Duy indicated. I don't think c* will
>>> arbitrarily sever connections - something that disconnects the
>>> client may happen. In that case, the TCP connection on the server
>>> should drop to TIME_WAIT. Is this what you are seeing in `netstat
>>> -a` on the server - a bunch of TIME_WAIT connections hanging
>>> around? Those should eventually be recycled, but that's tunable in
>>> the network stack, if they are being generated at a high rate.
>>> 
>>> -- Michael
>> 
> 

