From: Anuj Wadehra <anujw_2003@yahoo.co.in>
Date: Sun, 16 Oct 2016 00:58:24 +0800
To: user@cassandra.apache.org
Subject: Re: Repair in Multi Datacenter - Should you use -dc Datacenter repair or repair with -pr

Hi Leena,

Do you have a firewall between the two DCs? If yes, "connection reset" can be caused by Cassandra trying to use a TCP connection which has already been closed by the firewall. Please make sure that you set a high connection timeout at the firewall. Also, make sure your servers are not overloaded. Please see https://developer.ibm.com/answers/questions/231996/why-do-we-get-the-error-connection-reset-by-peer-d.html for general causes of connection reset. As I mentioned earlier, the Cassandra troubleshooting documentation explains it well: https://docs.datastax.com/en/cassandra/2.0/cassandra/troubleshooting/trblshootIdleFirewall.html . Make sure the firewall and node TCP settings are in sync, such that nodes close a TCP connection before the firewall does.

With a firewall timeout, we generally see the merkle tree request/response failing between nodes in the two DCs, and then the repair hangs forever. I am not sure how merkle tree creation, which is node specific, would be impacted by a multi-DC setup. Are repairs with the -local option completing without problems?

Thanks
Anuj
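The advice above about keeping node TCP settings in sync with the firewall boils down to making the kernel probe idle connections before the firewall's idle timeout fires. A minimal sketch on Linux, assuming a hypothetical firewall idle timeout of 60 minutes (substitute your firewall's actual value):

```shell
# Assumed firewall idle timeout: 3600 s (hypothetical - use your firewall's value).
# Send TCP keepalive probes well before that, so idle inter-node connections
# either stay open or are torn down by the node before the firewall drops them.
sysctl -w net.ipv4.tcp_keepalive_time=600   # first probe after 10 min idle
sysctl -w net.ipv4.tcp_keepalive_intvl=10   # 10 s between probes
sysctl -w net.ipv4.tcp_keepalive_probes=3   # close after 3 unanswered probes
# Persist across reboots:
echo 'net.ipv4.tcp_keepalive_time = 600' >> /etc/sysctl.conf
```

The exact numbers are illustrative; the point is keepalive_time plus the probe window must stay below the firewall's idle timeout.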
From: Leena Ghatpande <lghatpande@hotmail.com>
To: user@cassandra.apache.org <user@cassandra.apache.org>
Subject: Re: Repair in Multi Datacenter - Should you use -dc Datacenter repair or repair with -pr
Sent: Fri, Oct 14, 2016 2:44:27 PM

Thank you for the update.

The repair fails with the Error 'Failed Creating merkle tree' but does not give any additional details.

With -pr running on all DC nodes, we see a peer connection reset error, which then results in a hung repair process even though the TCP connection settings look good on all nodes.
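For context on why a failed merkle tree exchange stalls the whole repair: repair builds a hash tree over each replica's data and compares the trees to find out-of-sync ranges, so losing the request/response leaves nothing to compare. A toy sketch of the comparison idea (not Cassandra's actual implementation; partition values are made up, and it assumes a power-of-two leaf count for brevity):

```python
import hashlib

def merkle_tree(leaves):
    """Build a merkle tree bottom-up; returns the list of levels, root last."""
    level = [hashlib.sha256(x).digest() for x in leaves]
    levels = [level]
    while len(level) > 1:
        # Pair up adjacent hashes and hash each pair into the parent level.
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def mismatched_leaves(a, b):
    """Compare two equal-shape trees; return indices of differing leaves."""
    if a[-1] == b[-1]:            # equal roots => replicas already in sync
        return []
    return [i for i, (x, y) in enumerate(zip(a[0], b[0])) if x != y]

# Two "replicas" of four partitions; one partition has diverged.
replica1 = [b"p0:v1", b"p1:v1", b"p2:v1", b"p3:v1"]
replica2 = [b"p0:v1", b"p1:v2", b"p2:v1", b"p3:v1"]
t1, t2 = merkle_tree(replica1), merkle_tree(replica2)
print(mismatched_leaves(t1, t2))  # -> [1]: only partition 1 needs streaming
```

Only the mismatched leaf ranges get streamed between replicas, which is what makes repair cheaper than copying everything.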
From: Anuj Wadehra <anujw_2003@yahoo.co.in>
Sent: Wednesday, October 12, 2016 2:41 PM
To: user@cassandra.apache.org
Subject: Re: Repair in Multi Datacenter - Should you use -dc Datacenter repair or repair with -pr
Hi Leena,

The first thing you should be concerned about is: why does the repair -pr operation not complete? Second comes the question: which repair option is best?

One probable cause of stuck repairs: if the firewall between DCs is closing TCP connections and Cassandra is trying to use such connections, repairs will hang. Please refer to https://docs.datastax.com/en/cassandra/2.0/cassandra/troubleshooting/trblshootIdleFirewall.html . We faced that.

Also make sure you comply with the basic bandwidth requirement between DCs. Recommended is 1000 Mb/s (1 gigabit) or greater.

Answers to your specific questions:

1. As per my understanding, not all replicas will participate in DC-local repairs, so the repair would be ineffective. You need to make sure that all replicas of the data in all DCs are in sync.

2. Every DC is not a ring; all DCs together form one token ring. So, I think yes, you should run repair -pr on all nodes.

3. Yes. I don't have experience with incremental repairs, but you can run repair -pr on all nodes of all DCs.

Regarding the best approach to repair, you should see some of the repair presentations from Cassandra Summit 2016. All are online now.

I attended the summit, and people using large clusters generally use sub-range repairs to repair their clusters. But such large deployments are on older Cassandra versions, and these deployments generally don't use vnodes, so people easily know which nodes hold which token range.

Thanks
Anuj
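The sub-range approach mentioned above amounts to splitting a node's token range into chunks and repairing each chunk separately with nodetool's -st/-et flags. A sketch with hypothetical range values and keyspace name:

```python
# Sketch of sub-range repair (names and values are hypothetical): split a
# token range into chunks and repair each chunk with nodetool -st/-et,
# instead of one huge repair session per node.

MIN_TOKEN = -2**63       # Murmur3Partitioner token space lower bound
MAX_TOKEN = 2**63 - 1    # ... and upper bound

def subranges(start, end, chunks):
    """Split (start, end] into `chunks` contiguous (st, et] pieces."""
    step = (end - start) // chunks
    bounds = [start + i * step for i in range(chunks)] + [end]
    return list(zip(bounds[:-1], bounds[1:]))

def repair_commands(start, end, chunks, keyspace="mykeyspace"):
    """Emit one nodetool invocation per sub-range (keyspace is a placeholder)."""
    return [f"nodetool repair -st {st} -et {et} {keyspace}"
            for st, et in subranges(start, end, chunks)]

# Example: split a node's (hypothetical) primary range 0..4000 into 4 chunks.
for cmd in repair_commands(0, 4000, 4):
    print(cmd)
```

Smaller sessions mean a single hung merkle tree exchange only loses one chunk, not the whole node's repair.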
From: Leena Ghatpande <lghatpande@hotmail.com>
To: user@cassandra.apache.org <user@cassandra.apache.org>
Subject: Repair in Multi Datacenter - Should you use -dc Datacenter repair or repair with -pr
Sent: Wed, Oct 12, 2016 2:15:51 PM

Please advise. I cannot find any clear documentation on the best strategy for repairing nodes on a regular basis with multiple datacenters involved.

We are running Cassandra 3.7 in a multi-datacenter setup with 4 nodes in each data center. We are trying to run repairs every other night to keep the nodes in a good state. We currently run repair with the -pr option, but the repair process gets hung and does not complete gracefully. We don't see any errors in the logs either.

What is the best way to perform repairs on multiple data centers on large tables?

1. Can we run a datacenter repair using the -dc option for each data center? Do we need to run repair on each node in that case, or will it repair all nodes within the datacenter?

2. Is running repair with -pr across all nodes required if we perform step 1 every night?

3. Is cross-data-center repair required, and if so, what's the best option?

Thanks

Leena
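For reference, the repair variants discussed in this thread take roughly these forms in Cassandra 3.x nodetool (keyspace and DC names are placeholders, and start/end tokens must be filled in):

```shell
# Run on every node in every DC: each node repairs only its primary ranges.
nodetool repair -pr mykeyspace

# Restrict repair to replicas in the local datacenter.
nodetool repair -local mykeyspace

# Restrict repair to replicas in one named datacenter.
nodetool repair -dc DC1 mykeyspace

# Sub-range repair: repair one token range at a time.
nodetool repair -st <start_token> -et <end_token> mykeyspace
```

These commands assume a running cluster; consult `nodetool help repair` on your version for the exact flag set.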