cassandra-commits mailing list archives

From "Brandon Williams (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CASSANDRA-8072) Exception during startup: Unable to gossip with any seeds
Date Fri, 10 Apr 2015 23:41:13 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brandon Williams updated CASSANDRA-8072:
----------------------------------------
    Attachment: 8072.txt

Now we're getting somewhere.  It starts here, after the seed receives the dead state for the
decommissioned node:

{noformat}
DEBUG [GossipStage:1] 2015-04-10 22:05:10,147 ReconnectableSnitchHelper.java (line 70) Intiated reconnect to an Internal IP /10.2.1.139 for the /54.219.189.161
{noformat}

Later, the seed receives the SYN and tries to send the ACK, but it sends it to the
previous internal IP:

{noformat}
DEBUG [ACCEPT-/10.2.0.71] 2015-04-10 22:06:45,576 MessagingService.java (line 917) Connection version 7 from /54.219.189.161
DEBUG [Thread-11] 2015-04-10 22:06:45,621 MessagingService.java (line 780) Setting version 7 for /54.219.189.161
DEBUG [Thread-11] 2015-04-10 22:06:45,621 IncomingTcpConnection.java (line 107) Set version for /54.219.189.161 to 7 (will use 7)
TRACE [GossipStage:1] 2015-04-10 22:06:45,658 GossipDigestSynVerbHandler.java (line 40) Received a GossipDigestSynMessage from /54.219.189.161
TRACE [GossipStage:1] 2015-04-10 22:06:45,660 Gossiper.java (line 768) local heartbeat version 179776 greater than 0 for /54.219.189.161
TRACE [GossipStage:1] 2015-04-10 22:06:45,666 GossipDigestSynVerbHandler.java (line 84) Sending a GossipDigestAckMessage to /54.219.189.161
TRACE [GossipStage:1] 2015-04-10 22:06:45,666 MessagingService.java (line 660) /54.219.189.162 sending GOSSIP_DIGEST_ACK to 399@/54.219.189.161
DEBUG [WRITE-/54.219.189.161] 2015-04-10 22:06:45,666 OutboundTcpConnection.java (line 290) attempting to connect to /10.2.1.139
{noformat}
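
To make the sequence in these two log excerpts concrete, here is a minimal sketch of how a remembered internal IP can redirect a later outbound connection. The class and method names (PreferredIpSketch, reconnect, connectTarget) are assumptions for illustration only and are not Cassandra's actual classes; only the two addresses come from the logs above.

```java
import java.util.HashMap;
import java.util.Map;

public class PreferredIpSketch {
    // Hypothetical cache: public address -> internal address picked by an earlier reconnect.
    private final Map<String, String> preferred = new HashMap<>();

    // Models the "reconnect to an Internal IP" step: remember the internal address.
    void reconnect(String publicAddr, String internalAddr) {
        preferred.put(publicAddr, internalAddr);
    }

    // Models choosing where an outbound connection actually goes.
    // Falls back to the public address when nothing has been remembered.
    String connectTarget(String publicAddr) {
        return preferred.getOrDefault(publicAddr, publicAddr);
    }

    public static void main(String[] args) {
        PreferredIpSketch seed = new PreferredIpSketch();
        // Triggered while handling the dead state of the decommissioned node.
        seed.reconnect("54.219.189.161", "10.2.1.139");
        // Later, the ACK for the new node at the same public address
        // is still routed to the remembered internal IP.
        System.out.println(seed.connectTarget("54.219.189.161")); // prints 10.2.1.139
    }
}
```

In this simplified model nothing ever clears the remembered entry, so a replacement node that reuses the public address but does not bind the old internal IP never receives the ACK.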

It seems like the 'new' 161 isn't binding this IP, which may be fine depending on your
circumstances. But at least one problem on our side is that we send the onJoin event for a
dead state, and that event is what triggers the initial reconnect.  I can't think of any
reason we'd want to send onJoin upon discovering a dead state, so the attached patch only
sends it for live states.
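
The idea of notifying subscribers only for live states can be sketched roughly as follows. This is not the contents of 8072.txt: PeerState, Subscriber, RecordingSubscriber and handleNewState are hypothetical stand-ins used for illustration, not Cassandra's gossip classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class OnJoinSketch {
    // Hypothetical stand-in for a gossiped endpoint state.
    static final class PeerState {
        final String address;
        final boolean dead;

        PeerState(String address, boolean dead) {
            this.address = address;
            this.dead = dead;
        }
    }

    // Hypothetical listener; a reconnecting snitch would react to onJoin.
    interface Subscriber {
        void onJoin(String address);
    }

    // Records every address it was notified about, for demonstration.
    static final class RecordingSubscriber implements Subscriber {
        final List<String> joined = new ArrayList<>();

        @Override
        public void onJoin(String address) {
            joined.add(address);
        }
    }

    // Guard: a newly discovered dead state never fires onJoin.
    static void handleNewState(PeerState state, List<Subscriber> subscribers) {
        if (state.dead) {
            return;
        }
        for (Subscriber s : subscribers) {
            s.onJoin(state.address);
        }
    }

    public static void main(String[] args) {
        RecordingSubscriber sub = new RecordingSubscriber();
        List<Subscriber> subs = Collections.<Subscriber>singletonList(sub);
        handleNewState(new PeerState("54.219.189.161", true), subs);  // dead: skipped
        handleNewState(new PeerState("10.2.0.71", false), subs);      // live: notified
        System.out.println(sub.joined); // prints [10.2.0.71]
    }
}
```

With a guard like this, learning about the decommissioned node's dead state would no longer cause a subscriber to initiate a reconnect to its old internal IP.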

That said, I don't think this is the original cause, because when I've seen it I wasn't using
INTERNAL_IP or a reconnecting snitch.

> Exception during startup: Unable to gossip with any seeds
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-8072
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8072
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Ryan Springer
>            Assignee: Brandon Williams
>             Fix For: 2.0.15, 2.1.5
>
>         Attachments: 8072.txt, cas-dev-dt-01-uw1-cassandra-seed01_logs.tar.bz2, cas-dev-dt-01-uw1-cassandra-seed02_logs.tar.bz2, cas-dev-dt-01-uw1-cassandra02_logs.tar.bz2, casandra-system-log-with-assert-patch.log, trace_logs.tar.bz2
>
>
> When Opscenter 4.1.4 or 5.0.1 tries to provision a 2-node DSC 2.0.10 cluster in either ec2 or locally, an error occurs sometimes with one of the nodes refusing to start C*.  The error in the /var/log/cassandra/system.log is:
> ERROR [main] 2014-10-06 15:54:52,292 CassandraDaemon.java (line 513) Exception encountered during startup
> java.lang.RuntimeException: Unable to gossip with any seeds
>         at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1200)
>         at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:444)
>         at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:655)
>         at org.apache.cassandra.service.StorageService.initServer(StorageService.java:609)
>         at org.apache.cassandra.service.StorageService.initServer(StorageService.java:502)
>         at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:378)
>         at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:496)
>         at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:585)
>  INFO [StorageServiceShutdownHook] 2014-10-06 15:54:52,326 Gossiper.java (line 1279) Announcing shutdown
>  INFO [StorageServiceShutdownHook] 2014-10-06 15:54:54,326 MessagingService.java (line 701) Waiting for messaging service to quiesce
>  INFO [ACCEPT-localhost/127.0.0.1] 2014-10-06 15:54:54,327 MessagingService.java (line 941) MessagingService has terminated the accept() thread
> This error does not always occur when provisioning a 2-node cluster, but probably around half of the time on only one of the nodes.  I haven't been able to reproduce this error with DSC 2.0.9, and there have been no code or definition file changes in Opscenter.
> I can reproduce locally with the above steps.  I'm happy to test any proposed fixes since I'm the only person able to reproduce reliably so far.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
