Subject: Re: binary protocol server side sockets
From: Eric Plowe <eric.plowe@gmail.com>
To: user@cassandra.apache.org
Date: Thu, 10 Apr 2014 15:44:30 -0400

I am having the exact same issue. I see the connections pile up and pile up, but they never seem to come down. Any insight into this would be amazing.

Eric Plowe

On Wed, Apr 9, 2014 at 4:17 PM, graham sanderson <graham@vast.com> wrote:

> Thanks Michael,
>
> Yup, keepalive is not the default. It is possible they are going away after nf_conntrack_tcp_timeout_established; I will have to do more digging (it is hard to tell how old a connection is - there are no visible timers (through netstat) on an ESTABLISHED connection)...
>
> This is actually low on my priority list; I was just spending a bit of time trying to track down the source of
>
> ERROR [Native-Transport-Requests:3833603] 2014-04-09 17:46:48,833 ErrorMessage.java (line 222) Unexpected exception during request
> java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>         at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64)
>         at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:109)
>         at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
>         at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:90)
>         at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:744)
>
> errors, which are spamming our server logs quite a lot (I originally thought this might be caused by KEEPALIVE, which is when I realized that the connections weren't in keepalive and were building up). It would be nice if netty would tell us a little about the socket channel in the error message (maybe there is a way to do this by changing log levels, but as I say I haven't had time to go digging there).
>
> I will probably file a JIRA issue to add the setting (since I can't see any particular harm to setting keepalive).
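A setting like the one graham describes would presumably come down to enabling SO_KEEPALIVE on the child sockets accepted by the native transport's Netty server. As a rough, illustrative sketch only (this is not Cassandra's actual bootstrap code; the class name, port, and empty pipeline are placeholders), using the Netty 3.x API that appears in the stack trace above:

    import java.net.InetSocketAddress;
    import java.util.concurrent.Executors;

    import org.jboss.netty.bootstrap.ServerBootstrap;
    import org.jboss.netty.channel.ChannelPipeline;
    import org.jboss.netty.channel.ChannelPipelineFactory;
    import org.jboss.netty.channel.Channels;
    import org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory;

    public class KeepAliveServerSketch {
        public static void main(String[] args) {
            ServerBootstrap bootstrap = new ServerBootstrap(
                    new NioServerSocketChannelFactory(
                            Executors.newCachedThreadPool(),    // boss threads
                            Executors.newCachedThreadPool()));  // worker threads

            // "child.*" options apply to the per-client sockets the server accepts;
            // this is the SO_KEEPALIVE flag the proposed setting would toggle.
            bootstrap.setOption("child.keepAlive", true);

            bootstrap.setPipelineFactory(new ChannelPipelineFactory() {
                public ChannelPipeline getPipeline() {
                    return Channels.pipeline(); // protocol handlers would go here
                }
            });

            bootstrap.bind(new InetSocketAddress(9042)); // 9042 = native transport port
        }
    }

With keepalive set on the accepted sockets, the kernel will eventually probe a peer that has silently gone away, the read will fail, and the server can drop the connection instead of leaving it in ESTABLISHED indefinitely.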
> On Apr 9, 2014, at 1:34 PM, Michael Shuler <michael@pbandjelly.org> wrote:
>
> > On 04/09/2014 12:41 PM, graham sanderson wrote:
> >> Michael, it is not that the connections are being dropped, it is that the connections are not being dropped.
> >
> > Thanks for the clarification.
> >
> >> These server side sockets are ESTABLISHED, even though the client connection on the other side of the network device is long gone. This may well be an issue with the network device (it is valiantly trying to keep the connection alive, it seems).
> >
> > Have you tested whether they *ever* time out on their own, or do they just keep sticking around forever? (Maybe 432000 sec (120 hours), which is the default for nf_conntrack_tcp_timeout_established?) Trying out all the usage scenarios is really the way to track it down - directly on the switch, behind/in front of the firewall, on/off the VPN.
> >
> >> That said, KEEPALIVE on the server side would not be a bad idea. At least then the OS on the server would eventually (probably after 2 hours of inactivity) attempt to ping the client. At that point hopefully something interesting would happen, perhaps causing an error and destroying the server-side socket. (Note KEEPALIVE is also good for preventing idle connections from being dropped by other network devices along the way.)
> >
> > Tuning net.ipv4.tcp_keepalive_* could be helpful, if you know they time out after 2 hours, which is the default.
> >
> >> rpc_keepalive on the server sets keepalive on the server-side sockets for Thrift, and is true by default.
> >>
> >> There doesn't seem to be a setting for the native protocol.
> >>
> >> Note this isn't a huge issue for us; they can be cleaned up by a rolling restart, and this particular case is not production, but related to development/testing against alpha by people working remotely over VPN - and it may well be the VPN's fault in this case... That said, and maybe this is a dev list question, it seems like the option to set keepalive should exist.
> >
> > Yeah, but I agree you shouldn't have to restart to clean up connections - that's why I think it is lower in the network stack, and that a bit of troubleshooting and tuning might be helpful. That setting sounds like a good Jira request - keepalive may be the default, I'm not sure. :)
> >
> > --
> > Michael
> >
> >> On Apr 9, 2014, at 12:25 PM, Michael Shuler <michael@pbandjelly.org> wrote:
> >>
> >>> On 04/09/2014 11:39 AM, graham sanderson wrote:
> >>>> Thanks, but I would think that just sets keepalive from the client end; I'm talking about the server end... This is one of those issues where there is something (e.g. switch, firewall, VPN) between the client and the server, and we get left with orphaned established connections to the server when the client is gone.
> >>>
> >>> There would be no server setting for any service, not just c*, that would correct mis-configured connection-assassinating network gear between the client and server. Fix the gear to allow persistent connections.
> >>>
> >>> Digging through the various timeouts in c*.yaml didn't lead me to a simple answer for something tunable, but I think this may be more basic networking related. I believe it's up to the client to keep the connection open, as Duy indicated. I don't think c* will arbitrarily sever connections - something that disconnects the client may happen. In that case, the TCP connection on the server should drop to TIME_WAIT. Is this what you are seeing in `netstat -a` on the server - a bunch of TIME_WAIT connections hanging around? Those should eventually be recycled, but that's tunable in the network stack, if they are being generated at a high rate.
> >>>
> >>> -- Michael
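On the client end - which, as the quoted discussion notes, is the only end the application driver controls - TCP keepalive can be requested through the DataStax Java driver's SocketOptions. A small sketch under that assumption (the contact point is a placeholder, and this only affects the client's own sockets; it does not change anything about the server-side sockets discussed above):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SocketOptions;

    public class ClientKeepAliveSketch {
        public static void main(String[] args) {
            // Ask the driver to set SO_KEEPALIVE on its connections, so idle but
            // live connections are less likely to be silently dropped by network
            // gear (firewall/VPN) sitting between the client and the cluster.
            SocketOptions socketOptions = new SocketOptions().setKeepAlive(true);

            Cluster cluster = Cluster.builder()
                    .addContactPoint("10.0.0.1")      // placeholder contact point
                    .withSocketOptions(socketOptions)
                    .build();

            Session session = cluster.connect();
            // ... run queries with session ...
            cluster.close();
        }
    }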