Subject: Re: binary protocol server side sockets
From: Eric Plowe <eric.plowe@gmail.com>
To: user@cassandra.apache.org
Date: Thu, 10 Apr 2014 15:44:30 -0400

I am having the exact same issue. I see the connections pile up and pile up, but they never seem to come down. Any insight into this would be amazing.

Eric Plowe

On Wed, Apr 9, 2014 at 4:17 PM, graham sanderson <graham@vast.com> wrote:

> Thanks Michael,
>
> Yup, keepalive is not the default. It is possible they are going away after nf_conntrack_tcp_timeout_established; I will have to do more digging (it is hard to tell how old a connection is - there are no visible timers (through netstat) on an ESTABLISHED connection)...
>
> This is actually low on my priority list; I was just spending a bit of time trying to track down the source of
>
> ERROR [Native-Transport-Requests:3833603] 2014-04-09 17:46:48,833 ErrorMessage.java (line 222) Unexpected exception during request
> java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>         at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64)
>         at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:109)
>         at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
>         at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:90)
>         at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:744)
>
> errors, which are spamming our server logs quite a lot (I originally thought this might be caused by KEEPALIVE, which is when I realized that the connections weren't in keepalive and were building up). It would be nice if netty would tell us a little about the socket channel in the error message (maybe there is a way to do this by changing log levels, but as I say I haven't had time to go digging there).
>
> I will probably file a JIRA issue to add the setting (since I can't see any particular harm to setting keepalive).
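A setting like the one graham describes would presumably come down to enabling SO_KEEPALIVE on the child sockets accepted by the native transport's Netty server. As a rough, illustrative sketch only (this is not Cassandra's actual bootstrap code; the class name, port, and empty pipeline are placeholders), using the Netty 3.x API that appears in the stack trace above:

    import java.net.InetSocketAddress;
    import java.util.concurrent.Executors;

    import org.jboss.netty.bootstrap.ServerBootstrap;
    import org.jboss.netty.channel.ChannelPipeline;
    import org.jboss.netty.channel.ChannelPipelineFactory;
    import org.jboss.netty.channel.Channels;
    import org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory;

    public class KeepAliveServerSketch {
        public static void main(String[] args) {
            ServerBootstrap bootstrap = new ServerBootstrap(
                    new NioServerSocketChannelFactory(
                            Executors.newCachedThreadPool(),    // boss threads
                            Executors.newCachedThreadPool()));  // worker threads

            // "child.*" options apply to the per-client sockets the server accepts;
            // this is the SO_KEEPALIVE flag the proposed setting would toggle.
            bootstrap.setOption("child.keepAlive", true);

            bootstrap.setPipelineFactory(new ChannelPipelineFactory() {
                public ChannelPipeline getPipeline() {
                    return Channels.pipeline(); // protocol handlers would go here
                }
            });

            bootstrap.bind(new InetSocketAddress(9042)); // 9042 = native transport port
        }
    }

With keepalive set on the accepted sockets, the kernel will eventually probe a peer that has silently gone away, the read will fail, and the server can drop the connection instead of leaving it in ESTABLISHED indefinitely.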
> On Apr 9, 2014, at 1:34 PM, Michael Shuler <michael@pbandjelly.org> wrote:
>
> > On 04/09/2014 12:41 PM, graham sanderson wrote:
> >> Michael, it is not that the connections are being dropped, it is that the connections are not being dropped.
> >
> > Thanks for the clarification.
> >
> >> These server side sockets are ESTABLISHED, even though the client connection on the other side of the network device is long gone. This may well be an issue with the network device (it is valiantly trying to keep the connection alive, it seems).
> >
> > Have you tested whether they *ever* time out on their own, or do they just keep sticking around forever? (Maybe 432000 sec (120 hours), which is the default for nf_conntrack_tcp_timeout_established?) Trying out all the usage scenarios is really the way to track it down - directly on the switch, behind/in front of the firewall, on/off the VPN.
> >
> >> That said, KEEPALIVE on the server side would not be a bad idea. At least then the OS on the server would eventually (probably after 2 hours of inactivity) attempt to ping the client. At that point hopefully something interesting would happen, perhaps causing an error and destroying the server-side socket. (Note KEEPALIVE is also good for preventing idle connections from being dropped by other network devices along the way.)
> >
> > Tuning net.ipv4.tcp_keepalive_* could be helpful, if you know they time out after 2 hours, which is the default.
> >
> >> rpc_keepalive on the server sets keepalive on the server-side sockets for Thrift, and is true by default.
> >>
> >> There doesn't seem to be a setting for the native protocol.
> >>
> >> Note this isn't a huge issue for us; they can be cleaned up by a rolling restart, and this particular case is not production, but related to development/testing against alpha by people working remotely over VPN - and it may well be the VPN's fault in this case... That said, and maybe this is a dev list question, it seems like the option to set keepalive should exist.
> >
> > Yeah, but I agree you shouldn't have to restart to clean up connections - that's why I think it is lower in the network stack, and that a bit of troubleshooting and tuning might be helpful. That setting sounds like a good Jira request - keepalive may be the default, I'm not sure. :)
> >
> > --
> > Michael
> >
> >> On Apr 9, 2014, at 12:25 PM, Michael Shuler <michael@pbandjelly.org> wrote:
> >>
> >>> On 04/09/2014 11:39 AM, graham sanderson wrote:
> >>>> Thanks, but I would think that just sets keepalive from the client end; I'm talking about the server end... This is one of those issues where there is something (e.g. switch, firewall, VPN) between the client and the server, and we get left with orphaned established connections to the server when the client is gone.
> >>>
> >>> There would be no server setting for any service, not just c*, that would correct mis-configured connection-assassinating network gear between the client and server. Fix the gear to allow persistent connections.
> >>>
> >>> Digging through the various timeouts in c*.yaml didn't lead me to a simple answer for something tunable, but I think this may be more basic networking related. I believe it's up to the client to keep the connection open, as Duy indicated. I don't think c* will arbitrarily sever connections - something that disconnects the client may happen. In that case, the TCP connection on the server should drop to TIME_WAIT. Is this what you are seeing in `netstat -a` on the server - a bunch of TIME_WAIT connections hanging around? Those should eventually be recycled, but that's tunable in the network stack, if they are being generated at a high rate.
> >>>
> >>> -- Michael
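On the client end - which, as the quoted discussion notes, is the only end the application driver controls - TCP keepalive can be requested through the DataStax Java driver's SocketOptions. A small sketch under that assumption (the contact point is a placeholder, and this only affects the client's own sockets; it does not change anything about the server-side sockets discussed above):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SocketOptions;

    public class ClientKeepAliveSketch {
        public static void main(String[] args) {
            // Ask the driver to set SO_KEEPALIVE on its connections, so idle but
            // live connections are less likely to be silently dropped by network
            // gear (firewall/VPN) sitting between the client and the cluster.
            SocketOptions socketOptions = new SocketOptions().setKeepAlive(true);

            Cluster cluster = Cluster.builder()
                    .addContactPoint("10.0.0.1")      // placeholder contact point
                    .withSocketOptions(socketOptions)
                    .build();

            Session session = cluster.connect();
            // ... run queries with session ...
            cluster.close();
        }
    }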