cassandra-commits mailing list archives

From "Stefania (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-8072) Exception during startup: Unable to gossip with any seeds
Date Mon, 21 Dec 2015 15:02:46 GMT


Stefania commented on CASSANDRA-8072:

Building on [~brandon.williams]'s previous analysis, but taking into account more recent changes
where we do close sockets, the problem is still that the seed node sends the ACK to the
old socket even after it has been closed by the decommissioned node. This is because we only
send on these sockets, so we cannot know when they are closed until the send buffers are exceeded
or unless we also try to read from them. However, the problem should now persist only until
the node is convicted, approximately 10 seconds with a {{phi_convict_threshold}} of 8. I verified
this by adding a 15-second sleep in my test before restarting the node, and it restarted
without problems. [~slowenthal], would you be able to confirm this with your tests?

If we cannot detect when an outgoing socket is closed by its peer, then we need an out-of-band
notification. This could come from the departing node announcing its shutdown at the end of
its decommission, but the existing logic in {{Gossiper.stop()}} prevents this for the dead
states (*removing, removed, left and hibernate*) and for *bootstrapping*. This was introduced
by CASSANDRA-8336, and the same problem has already been raised in CASSANDRA-9630. Even if
we undo CASSANDRA-8336, there is another issue: since CASSANDRA-9765 we can no longer
join a cluster in status SHUTDOWN, and I believe this is correct. So the answer cannot be to
announce a shutdown after decommission, not without significant changes to the Gossip protocol.
Closing the socket earlier, say when we receive the status LEFT notification, is not sufficient
because during the RING_DELAY sleep period we may re-establish the connection to the node
before it dies, typically for a Gossip update.

So I think we only have two options:

* read from outgoing sockets purely to detect when they are closed
* send a new GOSSIP flag indicating it is time to close the sockets to a node
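The first option relies on standard TCP semantics: a blocking read on a socket whose peer has closed its side returns -1 (EOF), even on a connection we otherwise only write to. A minimal, self-contained sketch (plain Java, not Cassandra code; the class and method names here are hypothetical) of what such a check could look like:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical illustration: detecting a peer-initiated close on an
// otherwise send-only connection by also reading from it. A read that
// returns -1 signals EOF, i.e. the peer has closed the socket.
public class PeerCloseDetection {
    public static boolean peerClosed(Socket socket) throws IOException {
        // On a connection where the peer never sends application data,
        // the only thing read() can return is -1, once the peer's FIN
        // has been processed.
        InputStream in = socket.getInputStream();
        return in.read() == -1;
    }

    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {
            Socket outgoing = new Socket();
            outgoing.connect(new InetSocketAddress("127.0.0.1", server.getLocalPort()));
            Socket accepted = server.accept();
            accepted.close(); // the peer (the decommissioned node) closes
            // The sending side only learns of the close by reading:
            System.out.println("peer closed: " + peerClosed(outgoing));
            outgoing.close();
        }
    }
}
```

In practice the read would have to be non-blocking or run on a dedicated thread, since the outgoing connection threads otherwise never read, but the underlying mechanism is just this EOF check.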

> Exception during startup: Unable to gossip with any seeds
> ---------------------------------------------------------
>                 Key: CASSANDRA-8072
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Lifecycle
>            Reporter: Ryan Springer
>            Assignee: Stefania
>             Fix For: 2.1.x
>         Attachments: cas-dev-dt-01-uw1-cassandra-seed01_logs.tar.bz2, cas-dev-dt-01-uw1-cassandra-seed02_logs.tar.bz2,
cas-dev-dt-01-uw1-cassandra02_logs.tar.bz2, casandra-system-log-with-assert-patch.log, screenshot-1.png,
> When Opscenter 4.1.4 or 5.0.1 tries to provision a 2-node DSC 2.0.10 cluster in either
ec2 or locally, an error occurs sometimes with one of the nodes refusing to start C*.  The
error in the /var/log/cassandra/system.log is:
> ERROR [main] 2014-10-06 15:54:52,292 (line 513) Exception encountered
during startup
> java.lang.RuntimeException: Unable to gossip with any seeds
>         at org.apache.cassandra.gms.Gossiper.doShadowRound(
>         at org.apache.cassandra.service.StorageService.checkForEndpointCollision(
>         at org.apache.cassandra.service.StorageService.prepareToJoin(
>         at org.apache.cassandra.service.StorageService.initServer(
>         at org.apache.cassandra.service.StorageService.initServer(
>         at org.apache.cassandra.service.CassandraDaemon.setup(
>         at org.apache.cassandra.service.CassandraDaemon.activate(
>         at org.apache.cassandra.service.CassandraDaemon.main(
>  INFO [StorageServiceShutdownHook] 2014-10-06 15:54:52,326 (line 1279)
Announcing shutdown
>  INFO [StorageServiceShutdownHook] 2014-10-06 15:54:54,326 (line
701) Waiting for messaging service to quiesce
>  INFO [ACCEPT-localhost/] 2014-10-06 15:54:54,327 (line
941) MessagingService has terminated the accept() thread
> This error does not always occur when provisioning a 2-node cluster, but it happens roughly
half of the time, on only one of the nodes.  I haven't been able to reproduce this error with
DSC 2.0.9, and there have been no code or definition file changes in Opscenter.
> I can reproduce locally with the above steps.  I'm happy to test any proposed fixes
since I'm the only person able to reproduce reliably so far.

This message was sent by Atlassian JIRA
