Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (nike.apache.org: domain of pauloricardomg@gmail.com
 designates 209.85.220.53 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAEDUwd2+awuKJOC-qu9qxec-FgUHZ20_K8GRwqMK7kMazOTDpQ@mail.gmail.com>
References: 
 <CAKaZCX6BKQMntDPEGq-WaixGmD6BX+X_QV-+GgKOtKpUWA2+CQ@mail.gmail.com>
 <CAEDUwd2+awuKJOC-qu9qxec-FgUHZ20_K8GRwqMK7kMazOTDpQ@mail.gmail.com>
From: Paulo Motta <pauloricardomg@gmail.com>
Date: Fri, 4 Oct 2013 15:25:27 -0300
Message-ID: 
 <CAKaZCX4Vztz_jqn8f0B5iPJyyNk51y2WksrMke93goG6a+z0zA@mail.gmail.com>
Subject: Re: Increased read timeouts during rolling upgrade to C* 1.2
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Content-Type: multipart/alternative; boundary=047d7bd6aa5c6817d804e7ee6e63

--047d7bd6aa5c6817d804e7ee6e63
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

One more piece of information to help troubleshoot the issue:

During the "nodetool drain" operation just before the upgrade, instead of
only ceasing to accept new writes, the node actually shuts itself down.
This bug was also reported in another thread:
http://mail-archives.apache.org/mod_mbox/cassandra-user/201303.mbox/%3CCAFDWQMTrYm7hBxXKoW8+eVKfNE6zvjW2h8_BSVGmOL7=gRDtLw@mail.gmail.com%3E

Since I started Cassandra 1.2 only a few seconds before Cassandra 1.1 died
(after the nodetool drain), I suspect there was not enough time for the
remaining nodes to update their metadata about the "downed" node. So when
the upgraded node came back up, the metadata on the other nodes still
referred to the previous version of that same node, which may have caused
the handshake problem and, consequently, the read timeouts. Does that theory
make sense?
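If that theory holds, one mitigation would be to start 1.2 on a node only after every remaining node has actually marked it as down. The following is a minimal Python sketch of such a check, not a tested procedure and not part of Cassandra. It assumes that `nodetool -h <peer> status` can be run from the upgrading machine and that its output lists one node per line beginning with a two-letter code such as `UN` (Up/Normal) or `DN` (Down/Normal) followed by the node's address; both the subcommand and that layout should be checked against the nodetool version actually installed, especially while peers still run 1.1. The helper names `parse_status`, `nodetool_status` and `wait_until_seen_down` are hypothetical.

```python
import subprocess
import time
from typing import Callable, Dict, Iterable


def parse_status(output: str) -> Dict[str, str]:
    """Map node address -> two-letter code such as 'UN' or 'DN'.

    ASSUMPTION: node lines in `nodetool status` output start with the
    status/state code followed by the address; header lines are skipped.
    """
    states: Dict[str, str] = {}
    for line in output.splitlines():
        parts = line.split()
        # Node lines look like "UN  10.0.0.1  ..."; headers such as
        # "--  Address ..." or "|/ State=..." do not start with U or D.
        if len(parts) >= 2 and len(parts[0]) == 2 and parts[0][0] in "UD":
            states[parts[1]] = parts[0]
    return states


def nodetool_status(peer: str) -> str:
    """Hypothetical default fetcher: run `nodetool -h <peer> status` locally."""
    result = subprocess.run(
        ["nodetool", "-h", peer, "status"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def wait_until_seen_down(
    node: str,
    peers: Iterable[str],
    fetch: Callable[[str], str] = nodetool_status,
    timeout: float = 120.0,
    interval: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once every peer reports `node` as Down, False on timeout."""
    peers = list(peers)
    deadline = clock() + timeout
    while True:
        # A peer that does not list the node at all is treated as "not yet down".
        if all(parse_status(fetch(p)).get(node, "").startswith("D") for p in peers):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

For example, after draining 10.0.0.3 (a made-up address), one would call `wait_until_seen_down("10.0.0.3", ["10.0.0.1", "10.0.0.2"])` and start 1.2 only if it returns True. The fetcher, clock and sleep are injectable so the polling logic can be exercised without a live cluster, or pointed at a different nodetool command.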


2013/10/4 Robert Coli <rcoli@eventbrite.com>

> On Fri, Oct 4, 2013 at 9:09 AM, Paulo Motta <pauloricardomg@gmail.com> wrote:
>
>> I manually tried to insert data into, and retrieve it from, both the newly
>> upgraded nodes and the old nodes, and the behavior was very unstable:
>> sometimes it worked, sometimes it didn't (TimedOutException), so I don't
>> think it was a network problem.
>>
>> The number of read timeouts decreased as the number of upgraded nodes
>> increased, until the cluster stabilized. The logs periodically showed the
>> following messages:
>>
>> ...
>
>> Two similar issues were reported, but without satisfactory responses:
>>
>> - http://stackoverflow.com/questions/15355115/rolling-upgrade-for-cassandra-1-0-9-cluster-to-1-2-1
>> - https://issues.apache.org/jira/browse/CASSANDRA-5740
>>
>
> Both of these issues relate to upgrading from 1._0_.x to 1.2.x, which is
> not supported.
>
> Were I you, I would summarize the above experience in a JIRA ticket, as
> 1.1.x to 1.2.x should be a supported operation and should not unexpectedly
> result in decreased availability during the upgrade.
>
> =Rob
>
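To check the trend described in the quoted message (timeouts becoming rarer as more nodes were upgraded), one could measure the fraction of timed-out reads at each step of the rolling upgrade. Below is a minimal Python sketch; `read_once` is a hypothetical callable that performs one read through whatever client is in use, and the exception class(es) to count are passed in because the concrete class corresponding to `TimedOutException` depends on that client library.

```python
from typing import Callable, Tuple, Type


def timeout_rate(
    read_once: Callable[[], object],
    attempts: int,
    timeout_errors: Tuple[Type[BaseException], ...],
) -> float:
    """Call `read_once` `attempts` times; return the fraction that timed out.

    Only exceptions listed in `timeout_errors` count as timeouts; anything
    else propagates, so unrelated failures are not silently folded in.
    """
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    failures = 0
    for _ in range(attempts):
        try:
            read_once()
        except timeout_errors:
            failures += 1
    return failures / attempts
```

Recording this rate after each node is upgraded would show whether the timeouts really track the number of 1.1 nodes left in the cluster, rather than relying on occasional manual reads.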


-- 
Paulo Ricardo

-- 
European Master in Distributed Computing
Royal Institute of Technology - KTH
Instituto Superior Técnico - IST
http://paulormg.com

--047d7bd6aa5c6817d804e7ee6e63--