Date: Sun, 29 Nov 2015 17:10:57 +0000 (UTC)
From: Anuj Wadehra
Reply-To: Anuj Wadehra
Message-ID: <1116112354.10488363.1448817057930.JavaMail.yahoo@mail.yahoo.com>
Subject: Re: Repair Hangs while requesting Merkle Trees

Yes. I think you are correct: the problem may have been resolved by the Cassandra restart rather than by increasing the request timeout.

We are NOT on EC2. We have 2 interfaces on each node: one private and one public. We have a strange configuration and we need to correct it as per the recommendation at https://docs.datastax.com/en/cassandra/2.0/cassandra/configuration/configMultiNetworks.html.

AS-IS config:
We use broadcast address = listen address = PUBLIC IP address.
In seeds, we put the PUBLIC IP of other nodes but the private IP for the local node. There were some issues if we tried to access the local node via its public IP.

Thanks
Anuj

--------------------------------------------
On Tue, 24/11/15, Paulo Motta wrote:

Subject: Re: Repair Hangs while requesting Merkle Trees
To: "user@cassandra.apache.org", "Anuj Wadehra"
Date: Tuesday, 24 November, 2015, 12:38 AM

The issue might be related to the ESTABLISHED connections on just one end. I don't think it is related to the inter_dc_tcp_nodelay or request_timeout_in_ms options. Did you restart the process when you changed the request_timeout_in_ms option? That might be why the problem got fixed, rather than the option change.

This seems like a network issue or a misconfiguration of this specific node. Are you using EC2? Is listen_address == broadcast_address? Are all nodes using the same configuration? What Java are you using?

You may want to enable TRACE on OutboundTcpConnection and IncomingTcpConnection and compare the output of healthy nodes with that of the faulty node.

2015-11-23 10:04 GMT-08:00 Anuj Wadehra:

Any comments on ESTABLISHED connections at one end?

Moreover, inter_dc_tcp_nodelay is false. Can this be the reason that latency between the two DCs is higher and repair messages are getting dropped?

Can increasing request_timeout_in_ms deal with the latency issue?

I see some hinted handoffs being triggered for cross-DC nodes, and hints replay timing out. Is that an indication of a network issue?

I am getting in touch with the network team to capture netstat output and tcpdump too.

Thanks
Anuj

--------------------------------------------
On Wed, 18/11/15, Anuj Wadehra wrote:

Subject: Re: Repair Hangs while requesting Merkle Trees
To: "user@cassandra.apache.org"
Date: Wednesday, 18 November, 2015, 7:57 AM

Thanks Bryan!!

The connection is in ESTABLISHED state on one end and completely missing at the other end (in another DC).

Yes, we can revisit TCP tuning. But the problem is node-specific, so I am not sure whether tuning is the culprit.

Thanks
Anuj

From: "Bryan Cheng"
Date: Wed, 18 Nov, 2015 at 2:04 am
Subject: Re: Repair Hangs while requesting Merkle Trees

Ah OK, I might have misunderstood you. Streaming sockets should not be in play during merkle tree generation (validation compaction). They may come into play during merkle tree exchange; that I'm not sure about. You can read a bit more here: https://issues.apache.org/jira/browse/CASSANDRA-8611.

Regardless, you should have it set. 1 hr is usually a good conservative estimate, but you can go much lower safely.

What state is the connection in that only shows on one side? Is it ESTABLISHED, or something like CLOSE_WAIT?
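
One way to pin down which connections exist on only one side is to capture `netstat -tan` style output on both nodes and compare the address pairs. The following is a minimal sketch under assumptions that are not from the thread: the function names, the sample addresses (taken from the documentation ranges), and the sample output are all hypothetical, and it assumes both nodes see each other under the same IP addresses, which may not hold when public and private interfaces or NAT are involved.

```python
def parse_connections(netstat_output):
    """Return {(local_addr, remote_addr): state} for the TCP rows of netstat-style text."""
    conns = {}
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: proto, recv-q, send-q, local address, foreign address, state
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            conns[(fields[3], fields[4])] = fields[5]
    return conns


def host(addr):
    """Strip the trailing :port from an address such as 192.0.2.10:7000."""
    return addr.rsplit(":", 1)[0]


def one_sided(conns_a, conns_b, peer_ip):
    """Connections node A holds towards peer_ip whose mirror image is absent on node B."""
    missing = {}
    for (local, remote), state in conns_a.items():
        if host(remote) == peer_ip and (remote, local) not in conns_b:
            missing[(local, remote)] = state
    return missing


# Hypothetical captures, for illustration only (not real output from the cluster).
NODE_A_SAMPLE = """
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 192.0.2.10:51234        198.51.100.20:7000      ESTABLISHED
tcp        0      0 192.0.2.10:51240        198.51.100.20:7000      ESTABLISHED
"""
NODE_B_SAMPLE = """
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 198.51.100.20:7000      192.0.2.10:51234        ESTABLISHED
"""

stale = one_sided(
    parse_connections(NODE_A_SAMPLE),
    parse_connections(NODE_B_SAMPLE),
    peer_ip="198.51.100.20",
)
for (local, remote), state in stale.items():
    print(f"{local} -> {remote} is {state} on node A but missing on node B")
```

The state reported for each one-sided connection is the one asked about above (ESTABLISHED versus something like CLOSE_WAIT).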

Here's a good place to start for tuning, though it doesn't have as much about network tuning: https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html.

More generally, TCP tuning usually revolves around a balance between latency and bandwidth. Over long connections (we're talking 10s of ms, instead of the sub-1 ms you usually see in a good DC network), your expectations will shift greatly. Stuff like NODELAY on TCP is very nice for cutting your latencies when you're inside a DC, but will generate lots of small packets that will hurt your bandwidth over longer connections due to the need to wait for acks. otc_coalescing_strategy is in a similar vein, bundling together nearby messages to trade latency for throughput. You'll also probably want to tune your TCP buffers and window sizes, since that determines how much data can be in flight between acknowledgements, and the default size is pitiful for any decent network size. Google around for TCP tuning/buffer tuning and you should find some good resources.

On Mon, Nov 16, 2015 at 5:23 PM, Anuj Wadehra wrote:

Hi Bryan,

Thanks for the reply!! I didn't mean streaming_socket_timeout_in_ms. I meant that when you run netstat (the Linux command) on node A in DC1, you will notice that there is a connection in ESTABLISHED state with node B in DC2. But when you run netstat on node B, you won't find any connection with node A. Such connections exist across DCs. Is that a problem?

We haven't set streaming_socket_timeout_in_ms, which I know must be set. But I am not sure whether setting this property has any impact on merkle tree requests.
I thought it's only relevant for data streaming, if some mismatch is found and data needs to be streamed. Please confirm. What value do you use for the streaming socket timeout?

Moreover, if the socket timeout is the issue, that should happen on other nodes too. Repair is failing on just one node, as the merkle tree request is getting lost and not transmitted to one or more nodes in the remote DC.

I am not sure about the exact distance, but they are connected by a very high speed 10 Gbps link.

When you say different TCP stack tuning, do you have any document/blog/link describing recommendations for a multi-DC Cassandra setup? Can you elaborate on which settings need to be different?

Thanks
Anuj

From: "Bryan Cheng"
Date: Tue, 17 Nov, 2015 at 5:54 am
Subject: Re: Repair Hangs while requesting Merkle Trees

Hi Anuj,

Did you mean streaming_socket_timeout_in_ms? If not, then you definitely want that set. Even the best network connections will break occasionally, and in Cassandra < 2.1.10 (I believe) this would leave those connections hanging indefinitely on one end.

How far away are your two DCs from a network perspective, out of curiosity? You'll almost certainly be doing different TCP stack tuning for cross-DC, notably your buffer sizes, window params, Cassandra-specific stuff like otc_coalescing_strategy, inter_dc_tcp_nodelay, etc.
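
To give a rough sense of why buffer and window sizes depend on the distance between DCs, the usual rule of thumb is the bandwidth-delay product: the amount of data that must be in flight to keep a link full is its bandwidth multiplied by the round-trip time. The sketch below applies that to the 10 Gbps link mentioned in the thread. The round-trip times are assumptions for illustration only, since the actual cross-DC latency was not reported, and the function name is hypothetical.

```python
def bandwidth_delay_product_bytes(bandwidth_bits_per_s, rtt_ms):
    """Bytes that must be in flight to fill a link of this bandwidth at this round-trip time."""
    # bits/s * ms / 1000 gives bits in flight; dividing by 8 converts to bytes.
    return bandwidth_bits_per_s * rtt_ms / 8000


LINK_BITS_PER_S = 10_000_000_000  # 10 Gbps, as described in the thread

# Hypothetical round-trip times: an in-DC hop versus two cross-DC distances.
for rtt_ms in (1, 10, 50):
    bdp = bandwidth_delay_product_bytes(LINK_BITS_PER_S, rtt_ms)
    print(f"RTT {rtt_ms:>2} ms: about {bdp / 2**20:.1f} MiB in flight to fill the link")
```

A single TCP connection whose window is much smaller than this figure cannot use the full link, which is why socket buffer limits that are fine inside one DC can become the bottleneck across DCs.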

On Sat, Nov 14, 2015 at 10:35 AM, Anuj Wadehra wrote:

One more observation. We observed that there are a few TCP connections which a node shows as ESTABLISHED, but when we go to the node at the other end, the connection is not there. They are called "phantom" connections, I guess. Can this be a possible cause?

Thanks
Anuj

From: "Anuj Wadehra"
Date: Sat, 14 Nov, 2015 at 11:59 pm
Subject: Re: Repair Hangs while requesting Merkle Trees

Thanks Daemeon!!

I will capture the output of netstat and share it in the next few days. We were thinking of taking tcpdumps also. If it's a network issue and increasing the request timeout worked, I am not sure how Cassandra is dropping messages based on a timeout. Repair messages are non-droppable and not supposed to be timed out.

2 of the 3 nodes in the DC are able to complete repair without any issue. Just one node is problematic.

I also observed frequent messages in the logs of other nodes which say that hints replay timed out, and the node where hints were being replayed is always a remote DC node. Is it related somehow?

Thanks
Anuj

From: "daemeon reiydelle"
Date: Thu, 12 Nov, 2015 at 10:34 am
Subject: Re: Repair Hangs while requesting Merkle Trees

Have you checked the network statistics on that machine (netstat -tas) while attempting to repair? If netstat shows ANY issues, you have a problem. Can you put the command in a loop running every 60 seconds for maybe 15 minutes and post back?
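
The sampling loop described above could be sketched as follows. This is a minimal illustration, assuming Python 3.7 or later and that `netstat` is on the PATH of the affected node; the function name and its parameters are hypothetical, and the defaults simply mirror the "every 60 seconds for maybe 15 minutes" suggestion.

```python
import subprocess
import sys
import time
from datetime import datetime, timezone


def sample_command(cmd, interval_s=60, samples=15, sleep=time.sleep, out=sys.stdout):
    """Run cmd `samples` times, `interval_s` seconds apart.

    Each sample is written to `out` as soon as it is taken, and all samples are
    also returned as a list of (UTC timestamp, stdout text) pairs.
    """
    results = []
    for i in range(samples):
        stamp = datetime.now(timezone.utc).isoformat()
        # check=False: a failing command is recorded rather than aborting the run.
        proc = subprocess.run(cmd, capture_output=True, text=True, check=False)
        results.append((stamp, proc.stdout))
        out.write(f"===== {stamp} =====\n{proc.stdout}\n")
        out.flush()
        if i < samples - 1:
            sleep(interval_s)  # no wait after the final sample
    return results
```

Calling `sample_command(["netstat", "-tas"])` while the repair is running would produce 15 timestamped snapshots over roughly 14 minutes, which can be redirected to a file and compared against a run on a healthy node.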

Out of curiosity, how many remote DC nodes are getting successfully repaired?

.......
“Life should not be a journey to the grave with the intention of arriving safely in a pretty and well preserved body, but rather to skid in broadside in a cloud of smoke, thoroughly used up, totally worn out, and loudly proclaiming “Wow! What a Ride!”
- Hunter Thompson

Daemeon C.M. Reiydelle
USA (+1) 415.501.0198
London (+44) (0) 20 8144 9872

On Wed, Nov 11, 2015 at 1:06 PM, Anuj Wadehra wrote:

Hi,

We are using 2.0.14. We have 2 DCs at remote locations with 10 Gbps connectivity. We are able to complete repair (-par -pr) on 5 nodes. On only one node in DC2, we are unable to complete repair, as it always hangs. The node sends Merkle Tree requests, but one or more nodes in DC1 (remote) never show that they sent the merkle tree reply to the requesting node. Repair hangs indefinitely.

After increasing request_timeout_in_ms on the affected node, we were able to successfully run repair on one of two occasions.

Any comments on why this is happening on just one node? In OutboundTcpConnection.java, when the isTimeOut method always returns false for a non-droppable verb such as a Merkle Tree request (verb=REPAIR_MESSAGE), why did increasing the request timeout solve the problem on one occasion?

Thanks
Anuj Wadehra