Subject: Re: New node unable to stream (0.8.5)
From: Ethan Rowe
To: user@cassandra.apache.org
Date: Thu, 15 Sep 2011 08:13:30 -0400

Here's a typical log slice (not terribly informative, I fear):

     INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,106 AntiEntropyService.java (line 884) Performing streaming repair of 1003 ranges with /10.34.90.8 for (29990798416657667504332586989223299634,54296681768153272037430773234349600451]
     INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,427 StreamOut.java (line 181) Stream context metadata [/mnt/cassandra/data/events_production/FitsByShip-g-10-Data.db sections=88 progress=0/11707163 - 0%, /mnt/cassandra/data/events_production/FitsByShip-g-11-Data.db sections=169 progress=0/6133240 - 0%, /mnt/cassandra/data/events_production/FitsByShip-g-6-Data.db sections=1 progress=0/6918814 - 0%, /mnt/cassandra/data/events_production/FitsByShip-g-12-Data.db sections=260 progress=0/9091780 - 0%], 4 sstables.
     INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,428 StreamOutSession.java (line 174) Streaming to /10.34.90.8
    ERROR [Thread-56] 2011-09-15 05:41:38,515 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[Thread-56,5,main]
    java.lang.NullPointerException
            at org.apache.cassandra.net.IncomingTcpConnection.stream(IncomingTcpConnection.java:174)
            at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:114)

I'm not sure whether the exception is related to the outbound streaming above; other
nodes are actively trying to stream to this node, so it may come from one of those,
and the temporal adjacency to the outbound stream is just coincidental.  I have other
snippets that look basically identical to the above, except that when I look at the
logs of the node this node is trying to stream to, I see it has concurrently opened a
stream in the other direction, which could be the one the exception pertains to.
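To help correlate the two views, I've been comparing what each side reports via
nodetool.  A rough sketch of what I'm running (the first host is just a placeholder
for whichever node logged the NPE; the second is the peer from the log above):

    # What the node that logged the NPE thinks it is streaming, in or out:
    nodetool -h <node-that-logged-the-npe> netstats

    # What the peer named in the log above thinks:
    nodetool -h 10.34.90.8 netstats

Comparing the streaming sections of the two outputs is how I'm deciding which
direction a given stream, and presumably the NPE, belongs to.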
On Thu, Sep 15, 2011 at 7:41 AM, Sylvain Lebresne <sylvain@datastax.com> wrote:
> On Thu, Sep 15, 2011 at 1:16 PM, Ethan Rowe <ethan@the-rowes.com> wrote:
> > Hi.
> >
> > We've been running a 7-node cluster with RF 3, QUORUM reads/writes in our
> > production environment for a few months.  It's been consistently stable
> > during this period, particularly once we got our maintenance strategy fully
> > worked out (per node, one repair a week and one major compaction a week, the
> > latter due to the nature of our data model and usage).  While this cluster
> > started, back in June or so, on the 0.7 series, it's been running 0.8.3 for
> > a while now with no issues.  We upgraded to 0.8.5 two days ago, having
> > previously tested the upgrade in our staging cluster (with an otherwise
> > identical configuration) and verified that our application's various use
> > cases appeared successful.
> >
> > One of our nodes suffered a disk failure yesterday.  We attempted to replace
> > the dead node by placing a new node at OldNode.initial_token - 1 with
> > auto_bootstrap on.  A few things went awry from there:
> >
> > 1. We never saw the new node in bootstrap mode; it became available pretty
> > much immediately upon joining the ring, and never reported a "joining"
> > state.  I did verify that auto_bootstrap was on.
> >
> > 2. I mistakenly ran repair on the new node rather than removetoken on the
> > old node, due to a delightful mental error.  The repair got nowhere fast, as
> > it attempts to repair against the down node, which throws an exception.  So I
> > interrupted the repair, restarted the node to clear any pending validation
> > compactions, and...
> >
> > 3. Ran removetoken for the old node.
> >
> > 4. We let this run for some time and eventually saw that all the nodes
> > appeared to be done with their various compactions and were stuck at
> > streaming.  Many streams were listed as open, none making any progress.
> >
> > 5. I observed an RPC-related exception on the new node (where the
> > removetoken was launched) and concluded that the streams were broken, so the
> > process wouldn't ever finish.
> >
> > 6. Ran a "removetoken force" to get the dead node out of the mix.  No
> > problems.
> >
> > 7. Ran a repair on the new node.
> >
> > 8. Validations ran, streams opened up, and again things got stuck in
> > streaming, hanging for over an hour with no progress.
> >
> > 9. Musing that lingering tasks from the removetoken could be a factor, I
> > performed a rolling restart and attempted a repair again.
> >
> > 10. Same problem.  Did another rolling restart and attempted a fresh repair
> > on the most important column family alone.
> >
> > 11. Same problem.  Streams included CFs not specified in the repair, so I
> > guess they must be for hinted handoff.
> >
> > In concluding that streaming is stuck, I've observed:
> > - streams will be open to the new node from other nodes, but the new node
> > doesn't list them
> > - streams will be open to the other nodes from the new node, but the other
> > nodes don't list them
> > - the streams reported may make some initial progress, but then they hang at
> > a particular point and do not move on for an hour or more
> > - the logs report repair-related activity, until NPEs on incoming TCP
> > connections show up, which appear likely to be the culprit
>
> Can you send the stack trace from those NPEs?
>
> > I can provide more exact details when I'm done commuting.
> >
> > With streaming broken on this node, I'm unable to run repairs, which is
> > obviously problematic.  The application didn't suffer any operational issues
> > as a consequence of this, but I need to review the overnight results to
> > verify we're not suffering data loss (I doubt we are).
> >
> > At this point, I'm considering a couple of options:
> > 1. Remove the new node and let the adjacent node take over its range.
> > 2. Bring the new node down, add a new one in front of it, and properly
> > removetoken the problematic one.
> > 3. Bring the new node down, remove all its data except for the system
> > keyspace, then bring it back up and repair it.
> > 4. Revert to 0.8.3 and see if that helps.
> >
> > Recommendations?
> >
> > Thanks.
> > - Ethan
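For reference, if we end up going with option 3 above, the sequence I have in mind
is roughly the following sketch.  It's untested; the keyspace, column family, and
data path are just the ones from our logs, and I'm leaving commit log handling aside:

    # On the new node, with Cassandra stopped:
    # remove application keyspace data, keeping the system keyspace intact
    rm -rf /mnt/cassandra/data/events_production
    # (repeat for any other non-system keyspaces)

    # Bring the node back up and confirm it still owns the same token:
    nodetool -h localhost ring

    # Then repair, starting with the most important column family:
    nodetool -h localhost repair events_production FitsByShip

If the streaming NPE is still biting at that point, the repair presumably hangs the
same way, so this only makes sense once we understand the exception.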