Date: Fri, 10 Apr 2015 23:41:13 +0000 (UTC)
From: "Brandon Williams (JIRA)" <jira@apache.org>
To: commits@cassandra.apache.org
Subject: [jira] [Updated] (CASSANDRA-8072) Exception during startup: Unable
 to gossip with any seeds


     [ https://issues.apache.org/jira/browse/CASSANDRA-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brandon Williams updated CASSANDRA-8072:
----------------------------------------
    Attachment: 8072.txt

Now we're getting somewhere.  It starts here, after the seed receives the dead state for the decommissioned node:

{noformat}
DEBUG [GossipStage:1] 2015-04-10 22:05:10,147 ReconnectableSnitchHelper.java (line 70) Intiated reconnect to an Internal IP /10.2.1.139 for the /54.219.189.161
{noformat}

Later, the seed receives the SYN and tries to send the ACK, but it tries to send over the previous internal IP:

{noformat}
DEBUG [ACCEPT-/10.2.0.71] 2015-04-10 22:06:45,576 MessagingService.java (line 917) Connection version 7 from /54.219.189.161
DEBUG [Thread-11] 2015-04-10 22:06:45,621 MessagingService.java (line 780) Setting version 7 for /54.219.189.161
DEBUG [Thread-11] 2015-04-10 22:06:45,621 IncomingTcpConnection.java (line 107) Set version for /54.219.189.161 to 7 (will use 7)
TRACE [GossipStage:1] 2015-04-10 22:06:45,658 GossipDigestSynVerbHandler.java (line 40) Received a GossipDigestSynMessage from /54.219.189.161
TRACE [GossipStage:1] 2015-04-10 22:06:45,660 Gossiper.java (line 768) local heartbeat version 179776 greater than 0 for /54.219.189.161
TRACE [GossipStage:1] 2015-04-10 22:06:45,666 GossipDigestSynVerbHandler.java (line 84) Sending a GossipDigestAckMessage to /54.219.189.161
TRACE [GossipStage:1] 2015-04-10 22:06:45,666 MessagingService.java (line 660) /54.219.189.162 sending GOSSIP_DIGEST_ACK to 399@/54.219.189.161
DEBUG [WRITE-/54.219.189.161] 2015-04-10 22:06:45,666 OutboundTcpConnection.java (line 290) attempting to connect to /10.2.1.139
{noformat}

It seems like the 'new' 161 isn't binding this IP, which is fine depending on your circumstances.  But at least one problem on our side is that we shouldn't be sending the onJoin event for a dead state, since that is what triggers the initial reconnect.  I can't think of any reason we'd want to send that event upon discovery of any dead state, so the attached patch only sends it for live states.
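
To make the effect concrete, here is a minimal sketch in Java.  It is not the code in 8072.txt and it does not use Cassandra's classes: GossipJoinSketch, EndpointState, ReconnectHelper, handleNewState and connectAddress are all hypothetical names made up for illustration, and the model assumes the reconnect helper simply remembers an internal IP whenever it sees a join event.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: every type and method here is hypothetical,
// not Cassandra's actual Gossiper or ReconnectableSnitchHelper.
class GossipJoinSketch {

    /** Minimal stand-in for the gossip state learned about a remote endpoint. */
    static final class EndpointState {
        final boolean dead;       // true for e.g. a decommissioned node
        final String internalIp;  // internal address carried in the state (may be stale)

        EndpointState(boolean dead, String internalIp) {
            this.dead = dead;
            this.internalIp = internalIp;
        }
    }

    /** Stand-in for a reconnecting snitch helper: remembers which address to dial. */
    static final class ReconnectHelper {
        final Map<String, String> preferredIp = new HashMap<>();

        void onJoin(String endpoint, EndpointState state) {
            // Assumption: a join event makes us prefer the advertised internal IP.
            preferredIp.put(endpoint, state.internalIp);
        }
    }

    /** Handles a newly discovered state; the flag models the proposed guard. */
    static void handleNewState(String endpoint, EndpointState state,
                               ReconnectHelper helper, boolean onlyNotifyForLiveStates) {
        if (onlyNotifyForLiveStates && state.dead) {
            return; // dead state: do not fire onJoin, so no reconnect is recorded
        }
        helper.onJoin(endpoint, state);
    }

    /** Address an outbound connection would use for the endpoint. */
    static String connectAddress(String endpoint, ReconnectHelper helper) {
        return helper.preferredIp.getOrDefault(endpoint, endpoint);
    }

    public static void main(String[] args) {
        EndpointState deadState = new EndpointState(true, "10.2.1.139");

        ReconnectHelper unguarded = new ReconnectHelper();
        handleNewState("54.219.189.161", deadState, unguarded, false);
        System.out.println(connectAddress("54.219.189.161", unguarded)); // 10.2.1.139

        ReconnectHelper guarded = new ReconnectHelper();
        handleNewState("54.219.189.161", deadState, guarded, true);
        System.out.println(connectAddress("54.219.189.161", guarded));   // 54.219.189.161
    }
}
```

Without the guard, the dead state leaves a stale internal IP behind, so a later ACK to the replacement node is dialed at the old address, which matches what the logs above show.  With the guard, the ACK goes to the address the SYN came from.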

That said, I don't think this is the original cause, because when I've seen it I wasn't using INTERNAL_IP or a reconnecting snitch.

> Exception during startup: Unable to gossip with any seeds
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-8072
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8072
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Ryan Springer
>            Assignee: Brandon Williams
>             Fix For: 2.0.15, 2.1.5
>
>         Attachments: 8072.txt, cas-dev-dt-01-uw1-cassandra-seed01_logs.tar.bz2, cas-dev-dt-01-uw1-cassandra-seed02_logs.tar.bz2, cas-dev-dt-01-uw1-cassandra02_logs.tar.bz2, casandra-system-log-with-assert-patch.log, trace_logs.tar.bz2
>
>
> When Opscenter 4.1.4 or 5.0.1 tries to provision a 2-node DSC 2.0.10 cluster, either in EC2 or locally, one of the nodes sometimes refuses to start C*.  The error in /var/log/cassandra/system.log is:
> ERROR [main] 2014-10-06 15:54:52,292 CassandraDaemon.java (line 513) Exception encountered during startup
> java.lang.RuntimeException: Unable to gossip with any seeds
>         at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1200)
>         at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:444)
>         at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:655)
>         at org.apache.cassandra.service.StorageService.initServer(StorageService.java:609)
>         at org.apache.cassandra.service.StorageService.initServer(StorageService.java:502)
>         at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:378)
>         at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:496)
>         at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:585)
>  INFO [StorageServiceShutdownHook] 2014-10-06 15:54:52,326 Gossiper.java (line 1279) Announcing shutdown
>  INFO [StorageServiceShutdownHook] 2014-10-06 15:54:54,326 MessagingService.java (line 701) Waiting for messaging service to quiesce
>  INFO [ACCEPT-localhost/127.0.0.1] 2014-10-06 15:54:54,327 MessagingService.java (line 941) MessagingService has terminated the accept() thread
> This error does not always occur when provisioning a 2-node cluster, but probably around half of the time, and on only one of the nodes.  I haven't been able to reproduce this error with DSC 2.0.9, and there have been no code or definition file changes in Opscenter.
> I can reproduce locally with the above steps.  I'm happy to test any proposed fixes since I'm the only person able to reproduce reliably so far.

