Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
MIME-Version: 1.0
In-Reply-To: <794486693.5503516.1455905565312.JavaMail.yahoo@mail.yahoo.com>
References: 
 <1242549769.5244726.1455847648582.JavaMail.yahoo.ref@mail.yahoo.com>
 <1242549769.5244726.1455847648582.JavaMail.yahoo@mail.yahoo.com>
 <CA+VSrLqmTc96XrSBBTceBNM-V6zuExcQSdJfRM+7y+DKp7bmhQ@mail.gmail.com>
 <794486693.5503516.1455905565312.JavaMail.yahoo@mail.yahoo.com>
From: daemeon reiydelle <daemeonr@gmail.com>
Date: Fri, 19 Feb 2016 13:46:35 -0800
Message-ID: 
 <CAOUOv0FvJfZutNGhBwqRm=pJ5yxkgSjW9tLvn1imcbv7d1TdFQ@mail.gmail.com>
Subject: Re: Live upgrade 2.0 to 2.1 temporarily increases GC time causing
 timeouts and unavailability
To: user@cassandra.apache.org, Sotirios Delimanolis <sotodel_89@yahoo.com>
Cc: Alain RODRIGUEZ <arodrime@gmail.com>
Content-Type: multipart/alternative; boundary=089e0112d18e90593a052c266cf7

--089e0112d18e90593a052c266cf7
Content-Type: text/plain; charset=UTF-8

FYI, my observations were with the native protocol, not Thrift.


*.......*


*Daemeon C.M. Reiydelle*
*USA (+1) 415.501.0198*
*London (+44) (0) 20 8144 9872*

On Fri, Feb 19, 2016 at 10:12 AM, Sotirios Delimanolis <sotodel_89@yahoo.com> wrote:

> Does your cluster contain 24+ nodes or fewer?
>
> We did the same upgrade on a smaller cluster of 5 nodes and we didn't see
> this behavior. On the 24-node cluster, the timeouts only started once
> roughly 5 to 7 or more nodes had been upgraded.
>
> We're doing some more upgrades next week, trying different deployment
> plans. I'll report back with the results.
>
> Thanks for the reply (we absolutely want to move to CQL).
>
>
> On Friday, February 19, 2016 1:10 AM, Alain RODRIGUEZ <arodrime@gmail.com>
> wrote:
>
>
> I performed this exact upgrade a few days ago, except that clients were
> using the native protocol, and it went smoothly. So I think this might be
> Thrift related. I have no idea what is causing it though; I just wanted to
> share the info, FWIW.
>
> As a side note, unrelated to the issue, performance with the native
> protocol is a lot better than with Thrift starting in C* 2.1. Drivers using
> the native protocol are also more modern, allowing you to do very
> interesting things. Moving to native now that you are on 2.1 is something
> you might want to do soon :-).
>
> C*heers,
> -----------------
> Alain Rodriguez
> France
>
> The Last Pickle
> http://www.thelastpickle.com
>
> 2016-02-19 3:07 GMT+01:00 Sotirios Delimanolis <sotodel_89@yahoo.com>:
>
> We have a Cassandra cluster with 24 nodes. These nodes were running
> 2.0.16.
>
> While the nodes are in the ring and handling queries, we perform the
> upgrade to 2.1.12 as follows (more or less) one node at a time:
>
>
>    1. Stop the Cassandra process
>    2. Deploy jars, scripts, binaries, etc.
>    3. Start the Cassandra process
>
>
> A few nodes into the upgrade, we start noticing that the majority of
> queries (mostly through Thrift) time out or report unavailable. Looking at
> system information, Cassandra GC time goes through the roof, which is what
> we assume causes the timeouts.
>
> Once all nodes are upgraded, the cluster stabilizes and the timeouts stop
> almost entirely.
>
> What could explain this? Does it have anything to do with how a 2.0 node
> communicates with a 2.1 node?
>
> Our Cassandra consumers haven't changed.
>

--089e0112d18e90593a052c266cf7--