Subject: Re: New node unable to stream (0.8.5)
From: Ethan Rowe
To: user@cassandra.apache.org
Date: Thu, 15 Sep 2011 08:13:30 -0400

Here's a typical log slice (not terribly informative, I fear):

     INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,106 AntiEntropyService.java (line 884) Performing streaming repair of 1003 ranges with /10.34.90.8 for (29990798416657667504332586989223299634,54296681768153272037430773234349600451]
     INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,427 StreamOut.java (line 181) Stream context metadata [/mnt/cassandra/data/events_production/FitsByShip-g-10-Data.db sections=88 progress=0/11707163 - 0%, /mnt/cassandra/data/events_production/FitsByShip-g-11-Data.db sections=169 progress=0/6133240 - 0%, /mnt/cassandra/data/events_production/FitsByShip-g-6-Data.db sections=1 progress=0/6918814 - 0%, /mnt/cassandra/data/events_production/FitsByShip-g-12-Data.db sections=260 progress=0/9091780 - 0%], 4 sstables.
     INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,428 StreamOutSession.java (line 174) Streaming to /10.34.90.8
    ERROR [Thread-56] 2011-09-15 05:41:38,515 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[Thread-56,5,main]
    java.lang.NullPointerException
            at org.apache.cassandra.net.IncomingTcpConnection.stream(IncomingTcpConnection.java:174)
            at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:114)

I'm not sure whether the exception is related to the outbound streaming above; other
nodes are actively trying to stream to this node, so it may come from one of those,
and the temporal adjacency to the outbound stream is just coincidental.  I have other
snippets that look basically identical to the above, except that when I look at the
logs of the node this node is trying to stream to, I see it has concurrently opened a
stream in the other direction, which could be the one the exception pertains to.
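To help correlate the two views, I've been comparing what each side reports via
nodetool.  A rough sketch of what I'm running (the first host is just a placeholder
for whichever node logged the NPE; the second is the peer from the log above):

    # What the node that logged the NPE thinks it is streaming, in or out:
    nodetool -h <node-that-logged-the-npe> netstats

    # What the peer named in the log above thinks:
    nodetool -h 10.34.90.8 netstats

Comparing the streaming sections of the two outputs is how I'm deciding which
direction a given stream, and presumably the NPE, belongs to.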
On Thu, Sep 15, 2011 at 7:41 AM, Sylvain Lebresne <sylvain@datastax.com> wrote:
> On Thu, Sep 15, 2011 at 1:16 PM, Ethan Rowe <ethan@the-rowes.com> wrote:
> > Hi.
> >
> > We've been running a 7-node cluster with RF 3, QUORUM reads/writes in our
> > production environment for a few months.  It's been consistently stable
> > during this period, particularly once we got our maintenance strategy fully
> > worked out (per node, one repair a week and one major compaction a week, the
> > latter due to the nature of our data model and usage).  While this cluster
> > started, back in June or so, on the 0.7 series, it's been running 0.8.3 for
> > a while now with no issues.  We upgraded to 0.8.5 two days ago, having
> > previously tested the upgrade in our staging cluster (with an otherwise
> > identical configuration) and verified that our application's various use
> > cases appeared successful.
> >
> > One of our nodes suffered a disk failure yesterday.  We attempted to replace
> > the dead node by placing a new node at OldNode.initial_token - 1 with
> > auto_bootstrap on.  A few things went awry from there:
> >
> > 1. We never saw the new node in bootstrap mode; it became available pretty
> > much immediately upon joining the ring, and never reported a "joining"
> > state.  I did verify that auto_bootstrap was on.
> >
> > 2. I mistakenly ran repair on the new node rather than removetoken on the
> > old node, due to a delightful mental error.  The repair got nowhere fast, as
> > it attempts to repair against the down node, which throws an exception.  So I
> > interrupted the repair, restarted the node to clear any pending validation
> > compactions, and...
> >
> > 3. Ran removetoken for the old node.
> >
> > 4. We let this run for some time and eventually saw that all the nodes
> > appeared to be done with their various compactions and were stuck at
> > streaming.  Many streams were listed as open, none making any progress.
> >
> > 5. I observed an RPC-related exception on the new node (where the
> > removetoken was launched) and concluded that the streams were broken, so the
> > process wouldn't ever finish.
> >
> > 6. Ran a "removetoken force" to get the dead node out of the mix.  No
> > problems.
> >
> > 7. Ran a repair on the new node.
> >
> > 8. Validations ran, streams opened up, and again things got stuck in
> > streaming, hanging for over an hour with no progress.
> >
> > 9. Musing that lingering tasks from the removetoken could be a factor, I
> > performed a rolling restart and attempted a repair again.
> >
> > 10. Same problem.  Did another rolling restart and attempted a fresh repair
> > on the most important column family alone.
> >
> > 11. Same problem.  Streams included CFs not specified in the repair, so I
> > guess they must be for hinted handoff.
> >
> > In concluding that streaming is stuck, I've observed:
> > - streams will be open to the new node from other nodes, but the new node
> > doesn't list them
> > - streams will be open to the other nodes from the new node, but the other
> > nodes don't list them
> > - the streams reported may make some initial progress, but then they hang at
> > a particular point and do not move on for an hour or more
> > - the logs report repair-related activity, until NPEs on incoming TCP
> > connections show up, which appear likely to be the culprit
>
> Can you send the stack trace from those NPEs?
>
> > I can provide more exact details when I'm done commuting.
> >
> > With streaming broken on this node, I'm unable to run repairs, which is
> > obviously problematic.  The application didn't suffer any operational issues
> > as a consequence of this, but I need to review the overnight results to
> > verify we're not suffering data loss (I doubt we are).
> >
> > At this point, I'm considering a couple of options:
> > 1. Remove the new node and let the adjacent node take over its range.
> > 2. Bring the new node down, add a new one in front of it, and properly
> > removetoken the problematic one.
> > 3. Bring the new node down, remove all its data except for the system
> > keyspace, then bring it back up and repair it.
> > 4. Revert to 0.8.3 and see if that helps.
> >
> > Recommendations?
> >
> > Thanks.
> > - Ethan
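For reference, if we end up going with option 3 above, the sequence I have in mind
is roughly the following sketch.  It's untested; the keyspace, column family, and
data path are just the ones from our logs, and I'm leaving commit log handling aside:

    # On the new node, with Cassandra stopped:
    # remove application keyspace data, keeping the system keyspace intact
    rm -rf /mnt/cassandra/data/events_production
    # (repeat for any other non-system keyspaces)

    # Bring the node back up and confirm it still owns the same token:
    nodetool -h localhost ring

    # Then repair, starting with the most important column family:
    nodetool -h localhost repair events_production FitsByShip

If the streaming NPE is still biting at that point, the repair presumably hangs the
same way, so this only makes sense once we understand the exception.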