incubator-cassandra-user mailing list archives

From: Eric Plowe <eric.pl...@gmail.com>
Subject: Re: binary protocol server side sockets
Date: Thu, 10 Apr 2014 19:44:30 GMT
I am having the exact same issue. I see the connections pile up and pile
up, but they never seem to come down. Any insight into this would be
amazing.


Eric Plowe


On Wed, Apr 9, 2014 at 4:17 PM, graham sanderson <graham@vast.com> wrote:

> Thanks Michael,
>
> Yup, keepalive is not the default. It is possible they are going away after
> nf_conntrack_tcp_timeout_established; I will have to do more digging (it is
> hard to tell how old a connection is - there are no visible timers (thru
> netstat) on an ESTABLISHED connection)...
>
> This is actually low on my priority list, I was just spending a bit of
> time trying to track down the source of
>
> ERROR [Native-Transport-Requests:3833603] 2014-04-09 17:46:48,833 ErrorMessage.java (line 222) Unexpected exception during request
> java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>         at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64)
>         at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:109)
>         at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
>         at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:90)
>         at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:744)
>
> errors, which are spamming our server logs quite a lot (I originally
> thought these might be caused by KEEPALIVE, which is when I realized that
> the connections weren't using keepalive and were building up) - it would
> be nice if netty would tell us a little about the Socket channel in the
> error message (maybe there is a way to do this by changing log levels, but
> as I say I haven't had time to go digging there)
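>
> (For illustration, here is roughly the kind of thing I mean - this is not
> Cassandra's actual exception handler, just a sketch against the stock
> Netty 3 API, so the class name is mine:
>
>     import org.jboss.netty.channel.ChannelHandlerContext;
>     import org.jboss.netty.channel.ExceptionEvent;
>     import org.jboss.netty.channel.SimpleChannelUpstreamHandler;
>
>     public class LoggingExceptionHandler extends SimpleChannelUpstreamHandler {
>         @Override
>         public void exceptionCaught(ChannelHandlerContext ctx, ExceptionEvent e) {
>             // Include the remote peer so the log line identifies which
>             // client connection failed, not just the stack trace.
>             System.err.println("Unexpected exception from "
>                     + e.getChannel().getRemoteAddress() + ": " + e.getCause());
>         }
>     }
>
> something along those lines in the pipeline would make these log lines
> much easier to correlate with a particular client.)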
>
> I will probably file a JIRA issue to add the setting (since I can't see
> any particular harm to setting keepalive)
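>
> (If it helps the discussion, the change I have in mind is tiny - roughly
> the following, assuming the native transport exposes its Netty 3
> ServerBootstrap, which the stack trace above suggests; the method name is
> just for illustration:
>
>     import org.jboss.netty.bootstrap.ServerBootstrap;
>
>     // Native-protocol equivalent of rpc_keepalive: set SO_KEEPALIVE on
>     // every accepted client socket so the OS will eventually probe and
>     // tear down dead peers.
>     void enableNativeKeepAlive(ServerBootstrap bootstrap) {
>         bootstrap.setOption("child.keepAlive", true);
>     }
>
> so I can't see it being controversial.)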
>
> On Apr 9, 2014, at 1:34 PM, Michael Shuler <michael@pbandjelly.org> wrote:
>
> > On 04/09/2014 12:41 PM, graham sanderson wrote:
> >> Michael, it is not that the connections are being dropped, it is that
> >> the connections are not being dropped.
> >
> > Thanks for the clarification.
> >
> >> These server side sockets are ESTABLISHED, even though the client
> >> connection on the other side of the network device is long gone. This
> >> may well be an issue with the network device (it is valiantly trying
> >> to keep the connection alive it seems).
> >
> > Have you tested if they *ever* time out on their own, or do they just
> keep sticking around forever? (maybe 432000 sec (120 hours), which is the
> default for nf_conntrack_tcp_timeout_established?) Trying out all the usage
> scenarios is really the way to track it down - directly on switch,
> behind/in front of firewall, on/off the VPN.
> >
> >> That said KEEPALIVE on the server side would not be a bad idea. At
> >> least then the OS on the server would eventually (probably after 2
> >> hours of inactivity) attempt to ping the client. At that point
> >> hopefully something interesting would happen perhaps causing an error
> >> and destroying the server side socket (note KEEPALIVE is also good
> >> for preventing idle connections from being dropped by other network
> >> devices along the way)
> >
> > Tuning net.ipv4.tcp_keepalive_* could be helpful, if you know they
> timeout after 2 hours, which is the default.
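> >
> > (To make the mechanics concrete - a minimal sketch with plain java.net,
> > not Cassandra's transport code: once SO_KEEPALIVE is set on an accepted
> > socket, the kernel starts probing an idle peer after
> > net.ipv4.tcp_keepalive_time seconds (7200 by default) and drops the
> > connection if the probes go unanswered.
> >
> >     import java.io.IOException;
> >     import java.net.ServerSocket;
> >     import java.net.Socket;
> >
> >     void acceptWithKeepAlive(ServerSocket server) throws IOException {
> >         Socket client = server.accept();
> >         client.setKeepAlive(true);  // SO_KEEPALIVE on the server-side socket
> >     }
> >
> > Lowering tcp_keepalive_time/intvl/probes just shortens that window.)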
> >
> >> rpc_keepalive on the server sets keep alive on the server side
> >> sockets for thrift, and is true by default
> >>
> >> There doesn't seem to be a setting for the native protocol
> >>
> >> Note this isn't a huge issue for us; they can be cleaned up by a
> >> rolling restart, and this particular case is not production, but
> >> related to development/testing against alpha by people working
> >> remotely over VPN - and it may well be the VPN's fault in this case...
> >> that said, and maybe this is a dev list question, it seems like the
> >> option to set keepalive should exist.
> >
> > Yeah, but I agree you shouldn't have to restart to clean up connections
> - that's why I think it is lower in the network stack, and that a bit of
> troubleshooting and tuning might be helpful. That setting sounds like a
> good Jira request - keepalive may be the default, I'm not sure. :)
> >
> > --
> > Michael
> >
> >> On Apr 9, 2014, at 12:25 PM, Michael Shuler <michael@pbandjelly.org>
> >> wrote:
> >>
> >>> On 04/09/2014 11:39 AM, graham sanderson wrote:
> >>>> Thanks, but I would think that just sets keep alive from the
> >>>> client end; I'm talking about the server end... this is one of
> >>>> those issues where there is something (e.g. switch, firewall, VPN
> >>>> in between the client and the server) and we get left with
> >>>> orphaned established connections to the server when the client is
> >>>> gone.
> >>>
> >>> There would be no server setting for any service, not just c*, that
> >>> would correct mis-configured connection-assassinating network gear
> >>> between the client and server. Fix the gear to allow persistent
> >>> connections.
> >>>
> >>> Digging through the various timeouts in c*.yaml didn't lead me to a
> >>> simple answer for something tunable, but I think this may be more
> >>> basic networking related. I believe it's up to the client to keep
> >>> the connection open as Duy indicated. I don't think c* will
> >>> arbitrarily sever connections - something that disconnects the
> >>> client may happen. In that case, the TCP connection on the server
> >>> should drop to TIME_WAIT. Is this what you are seeing in `netstat
> >>> -a` on the server - a bunch of TIME_WAIT connections hanging
> >>> around? Those should eventually be recycled, but that's tunable in
> >>> the network stack, if they are being generated at a high rate.
> >>>
> >>> -- Michael
> >>
> >
>
>
