From: Andras Szerdahelyi
Reply-To: user@cassandra.apache.org
Subject: Re: repair never finishing 1.0.7
Date: Mon, 25 Jun 2012 10:47:53 +0000

The DCs are communicating over a gateway where I do NAT for ports 7000, 9160 and 7199.

Ah, that sounds familiar. You don't mention if you are VPN'd or not. I'll assume you are not.

So, your nodes are behind network address translation. Is that to say they advertise (broadcast) their internal or their translated/forwarded IP to each other? Setting up a Cassandra ring across NAT (without a VPN) is impossible in my experience. Either the nodes on your local network won't be able to communicate with each other, because they broadcast their translated (public) address, which is normally (depending on router configuration) not routable from within the local network, or the nodes broadcast their internal IP, in which case the "outside" nodes are helpless in trying to connect to a local net. On the DC2 nodes, or at least on the node you issue the repair on, check for any sockets being opened to the internal addresses of the nodes in DC1.
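One way to do that socket check, as a rough sketch only: capture the output of something like "netstat -tn" on the DC2 node and filter it for connections aimed at DC1's internal addresses. The helper name, the sample netstat lines and all IP addresses below are made up for illustration; only port 7000 comes from the setup described in this thread. Connections stuck in SYN_SENT towards those addresses would suggest the node is trying to reach an IP it cannot route to.

```python
import ipaddress

def sockets_to_internal(netstat_lines, dc1_internal_ips, ports=(7000,)):
    """Return (remote_ip, remote_port, state) for TCP sockets whose remote
    end is one of the given DC1 internal addresses on one of the given ports."""
    targets = {ipaddress.ip_address(ip) for ip in dc1_internal_ips}
    hits = []
    for line in netstat_lines:
        fields = line.split()
        # Assumed columns: Proto Recv-Q Send-Q Local Foreign State
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        host, _, port = fields[4].rpartition(":")
        if not port.isdigit():
            continue
        try:
            remote = ipaddress.ip_address(host)
        except ValueError:
            continue
        # JVM sockets often show up as IPv4-mapped IPv6 (::ffff:a.b.c.d)
        if remote.version == 6 and remote.ipv4_mapped is not None:
            remote = remote.ipv4_mapped
        if remote in targets and int(port) in ports:
            hits.append((str(remote), int(port), fields[5]))
    return hits

# Hypothetical capture from a DC2 node; every address here is invented.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      1 192.168.5.11:51234      10.0.1.5:7000           SYN_SENT
tcp        0      0 192.168.5.11:7000       192.168.5.12:40112      ESTABLISHED
tcp6       0      1 ::ffff:192.168.5.11:51240 ::ffff:10.0.1.6:7000  SYN_SENT
""".splitlines()

for ip, port, state in sockets_to_internal(SAMPLE, ["10.0.1.5", "10.0.1.6"]):
    print(f"{ip}:{port} {state}")
```

If this prints anything for a node in the middle of the streaming phase, the repair is most likely waiting on a connection that can never be established through the NAT.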


regards,
Andras



On 25 Jun 2012, at 11:57, Alexandru Sicoe wrote:

Hello everyone,

I have a 2 DC (DC1:3 and DC2:6) Cassandra 1.0.7 setup. I have about 300 GB/node in DC2.

The DCs are communicating over a gateway where I do NAT for ports 7000, 9160 and 7199.

I did a "nodetool repair" on a node in DC2 without any external load on the system.

It took 5 hrs to finish the Merkle tree calculations (which is fine for me), but then in the streaming phase nothing happens (0% seen in "nodetool netstats") and it stays like that forever. Note: it has to stream to/from nodes in DC1!

I tried a second time, with the same result.

 Looking around I found this thread 
             http://www.mail-archive.com/user@cassandra.apache.org/msg22167.html
 which seems to describe the same problem.

The thread gives 2 suggestions:
- a full cluster restart allows the first attempted repair to complete (haven't tested yet; this is not practical even if it works)
- issue https://issues.apache.org/jira/browse/CASSANDRA-4223 can be the problem

Questions:
1) How can I make sure that the JIRA issue above is my real problem? (I see no errors or warnings in the logs, and no other activity.)
2) What should I do to make the repairs work? (If the JIRA issue is the problem, then I see there is a fix for it in version 1.0.11, which is not released yet.)

Thanks,
Alex
