Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
From: aaron morton <aaron@thelastpickle.com>
Mime-Version: 1.0 (Apple Message framework v1278)
Content-Type: multipart/alternative;
 boundary="Apple-Mail=_48A7EC27-D28E-4FCA-A318-BE6160006770"
Subject: Re: repair never finishing 1.0.7
Date: Wed, 27 Jun 2012 21:29:35 +1200
In-Reply-To: 
 <CAADnm_cDNTY+mw8cqqwA8BV0WEyznZ9kRe3iiXKd_UGkJHqmVQ@mail.gmail.com>
To: user@cassandra.apache.org
References: 
 <CAADnm_cy6b+MSA3FNUcn95ywojUx0xDkt1YQBFj3impzN0JV5Q@mail.gmail.com>
 <B00B9B03-3F48-4ACE-9E4C-7DDE5A2EF34A@dentsunetwork.com>
 <CAADnm_cDNTY+mw8cqqwA8BV0WEyznZ9kRe3iiXKd_UGkJHqmVQ@mail.gmail.com>
Message-Id: <8C310716-E688-47FD-979D-752236026183@thelastpickle.com>


--Apple-Mail=_48A7EC27-D28E-4FCA-A318-BE6160006770
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
	charset=iso-8859-1

> Setting up a Cassandra ring across NAT ( without a VPN ) is impossible =
in my experience.=20

The broadcast_address setting allows a node to broadcast an address that =
is different from the ones it is bound to on the local interfaces. See =
https://github.com/apache/cassandra/blob/trunk/conf/cassandra.yaml#L270

>> 1) How can I make sure that the JIRA issue above is my real problem? =
(I see no errors or warns in the logs; no other activity)

If those errors are not in the logs, that issue is not your problem.=20

>> - a full cluster restart allows the first attempted repair to =
complete (haven't tested yet; this is not practical even if it works)

A rolling restart of the nodes involved in the repair is sufficient.=20

Double-check the networking, and check the logs on both sides of the =
transfer for errors or warnings. The code around streaming is better at =
failing loudly nowadays.=20
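
A minimal sketch of that network check, in Python. It only tests =
whether a TCP connection can be opened from the node it runs on. The =
helper names and the timeout argument are assumptions for illustration; =
the ports to pass in are the ones mentioned in this thread (7000, 9160 =
and 7199). Run it on both sides of the NAT, against the addresses the =
other nodes actually advertise.

```python
import socket


def can_connect(host, port, timeout):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # Positional timeout (seconds); the socket is closed on exit.
        with socket.create_connection((host, port), timeout):
            return True
    except OSError:
        # Refused, unreachable, timed out or unresolvable host.
        return False


def check_peers(hosts, ports, timeout):
    """Map every (host, port) pair to whether it accepted a connection."""
    return {
        (host, port): can_connect(host, port, timeout)
        for host in hosts
        for port in ports
    }
```

For example, check_peers(["10.0.1.5"], [7000, 9160, 7199], 3.0), where =
10.0.1.5 is a made-up peer address. A False for port 7000 from either =
side is worth chasing before anything else; a True does not prove that =
streaming itself will work.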

If you don't see anything, set DEBUG logging on =
org.apache.cassandra.streaming.FileStreamTask. That will let you know if =
things start and progress.=20

Hope that helps.=20


-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 26/06/2012, at 6:16 PM, Alexandru Sicoe wrote:

> Hi Andras,
>=20
> I am not using a VPN. The system has been running successfully in this =
configuration for a couple of weeks until I noticed the repair is not =
working.
>=20
> What happens is that I configure the IP Tables of the machine on each =
Cassandra node to forward packets that are sent to any of the IPs in the =
other DC (on ports 7000, 9160 and 7199)  to be sent to the gateway IP. =
The gateway does the NAT sending the packets on the other side to the =
real destination IP, having replaced the source IP with the initial =
sender's IP (at least in my understanding of it).=20
>=20
> What might be the problem given the configuration? How to fix this?
>=20
> Cheers,
> Alex
>=20
> On Mon, Jun 25, 2012 at 12:47 PM, Andras Szerdahelyi =
<andras.szerdahelyi@ignitionone.com> wrote:
>=20
>>  The DCs are communicating over a gateway where I do NAT for ports =
7000, 9160 and 7199.
>=20
>=20
> Ah, that sounds familiar. You don't mention if you are VPN'd or not. =
I'll assume you are not.
>=20
> So, your nodes are behind network address translation - is that to say =
they advertise ( broadcast ) their internal or translated/forwarded IP =
to each other? Setting up a Cassandra ring across NAT ( without a VPN ) =
is impossible in my experience. Either the nodes on your local network =
won't be able to communicate with each other, because they broadcast =
their translated ( public ) address which is normally ( router =
configuration ) not routable from within the local network, or the nodes =
broadcast their internal IP, in which case the "outside" nodes are =
helpless in trying to connect to a local net. On DC2 nodes/the node you =
issue the repair on, check for any sockets being opened to the internal =
addresses of the nodes in DC1.
>=20
>=20
> regards,
> Andras
>=20
>=20
>=20
> On 25 Jun 2012, at 11:57, Alexandru Sicoe wrote:
>=20
>> Hello everyone,
>>=20
>>  I have a 2 DC (DC1:3 and DC2:6) Cassandra1.0.7 setup. I have about =
300GB/node in the DC2.=20
>>=20
>>  The DCs are communicating over a gateway where I do NAT for ports =
7000, 9160 and 7199.
>>=20
>>  I did a "nodetool repair" on a node in DC2 without any external load =
on the system.=20
>>=20
>>  It took 5 hrs to finish the Merkle tree calculations (which is fine =
for me) but then in the streaming phase nothing happens (0% seen in =
"nodetool netstats") and stays like that forever. Note: it has to stream =
to/from nodes in DC1!
>>=20
>>  I tried another time and still the same.
>>=20
>>  Looking around I found this thread =20
>>              =
http://www.mail-archive.com/user@cassandra.apache.org/msg22167.html
>>  which seems to describe the same problem.
>>=20
>> The thread gives 2 suggestions:
>> - a full cluster restart allows the first attempted repair to =
complete (haven't tested yet; this is not practical even if it works)
>> - issue https://issues.apache.org/jira/browse/CASSANDRA-4223 can be =
the problem=20
>>=20
>> Questions:
>> 1) How can I make sure that the JIRA issue above is my real problem? =
(I see no errors or warns in the logs; no other activity)
>> 2) What should I do to make the repairs work? (If the JIRA issue is =
the problem, then I see there is a fix for it in Version 1.0.11 which is =
not released yet)
>>=20
>> Thanks,
>> Alex
>=20
>=20


--Apple-Mail=_48A7EC27-D28E-4FCA-A318-BE6160006770--