Return-Path:
X-Original-To: apmail-cassandra-commits-archive@www.apache.org
Delivered-To: apmail-cassandra-commits-archive@www.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3])
    by minotaur.apache.org (Postfix) with SMTP id 4E4CB9D60
    for ; Tue, 20 Mar 2012 18:26:02 +0000 (UTC)
Received: (qmail 83787 invoked by uid 500); 20 Mar 2012 18:26:02 -0000
Delivered-To: apmail-cassandra-commits-archive@cassandra.apache.org
Received: (qmail 83749 invoked by uid 500); 20 Mar 2012 18:26:02 -0000
Mailing-List: contact commits-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: dev@cassandra.apache.org
Delivered-To: mailing list commits@cassandra.apache.org
Received: (qmail 83675 invoked by uid 99); 20 Mar 2012 18:26:01 -0000
Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136)
    by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 20 Mar 2012 18:26:01 +0000
X-ASF-Spam-Status: No, hits=-2000.0 required=5.0
    tests=ALL_TRUSTED,T_RP_MATCHES_RCVD
X-Spam-Check-By: apache.org
Received: from [140.211.11.116] (HELO hel.zones.apache.org) (140.211.11.116)
    by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 20 Mar 2012 18:26:00 +0000
Received: from hel.zones.apache.org (hel.zones.apache.org [140.211.11.116])
    by hel.zones.apache.org (Postfix) with ESMTP id 8FE39BDAA0
    for ; Tue, 20 Mar 2012 18:25:40 +0000 (UTC)
Date: Tue, 20 Mar 2012 18:25:40 +0000 (UTC)
From: "Brandon Williams (Commented) (JIRA)"
To: commits@cassandra.apache.org
Message-ID: <461851478.37623.1332267940590.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <208710556.36758.1332258339573.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (CASSANDRA-4066) Cassandra cluster stops responding on time change (scheduling not using monotonic time?)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394
X-Virus-Checked: Checked by ClamAV on apache.org

    [ https://issues.apache.org/jira/browse/CASSANDRA-4066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233648#comment-13233648 ]

Brandon Williams commented on CASSANDRA-4066:
---------------------------------------------

I can confirm that this is indeed the GossipTask no longer running when the clock is pushed backward far enough. As you mention, we use SES extensively, and all of those timed tasks have likely quit firing as well. Special-casing gossip could therefore lead to an untold amount of confusion, since there would be no immediate red flag to indicate a problem. Starting a node far in the future has other consequences too, such as CASSANDRA-3654.

I think I would rather see the UAE and know that my machines have connectivity, so that I can identify this problem and fix it correctly. Even if we do special-case the GossipTask, we'll also need to fix LoadBroadcaster so we don't end up with a broken view of the load on the ring. At that point it feels like a slippery slope: either we fix everything, or we fail as quickly as possible, which is what the current behavior does.

It's also interesting to note that the node still replies to gossip syn messages with a gossip ack. However, because we only update the FD on a version/generation change, and because LoadBroadcaster is also broken, the node has no reason to generate new versions, so the other nodes continue to see it as down. If LB did happen to work, we'd see the node flap every 90 seconds.

> Cassandra cluster stops responding on time change (scheduling not using monotonic time?)
> -----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4066
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4066
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: Linux; CentOS6 2.6.32-220.4.2.el6.x86_64
>            Reporter: David Daeschler
>            Assignee: Brandon Williams
>            Priority: Minor
>              Labels: gossip
>             Fix For: 1.1.1
>
>
> The server installation I set up did not have ntpd installed in the base installation. When I noticed that the clocks were skewing, I installed ntp and set the date on all the servers in the cluster. A short time later, I started getting UnavailableExceptions on the clients.
> Also, one server seemed to be unaffected by the time change. That server happened to have its time pushed forward, not backward like the other 3 in the cluster. This leads me to believe something is running on a timer/schedule that is not monotonic.
> I'm posting this as a bug, but I suppose it might just be part of the communication protocols, etc., for the cluster and part of the design. But I think the devs should be aware of what I saw.
> Otherwise, thank you for a fantastic product. Even after restarting 75% of the cluster, things seem to have recovered nicely.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
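
The reporter's suspicion (a timer that is not monotonic) and the observation that only the nodes whose clocks moved backward were affected can be illustrated with a small simulation. The sketch below is purely illustrative and makes no claim about how Cassandra or the JDK scheduler is implemented: the class ClockStepSketch, the PeriodicTask helper, and the simulated clocks are all hypothetical names introduced here. It runs two periodic tasks, one tracking its deadline on a wall clock that can be stepped and one on a monotonic clock, and counts how many times each fires.

```java
import java.util.function.LongSupplier;

/** Hypothetical illustration only; not Cassandra's or the JDK's scheduler code. */
class ClockStepSketch {

    /** A periodic task that tracks its next run as an absolute time on the given clock. */
    static final class PeriodicTask {
        private final LongSupplier clockMillis;
        private final long periodMillis;
        private long nextRunMillis;
        int runs = 0;

        PeriodicTask(LongSupplier clockMillis, long periodMillis) {
            this.clockMillis = clockMillis;
            this.periodMillis = periodMillis;
            this.nextRunMillis = clockMillis.getAsLong() + periodMillis;
        }

        /** Runs the task once if its deadline has been reached on its own clock. */
        void poll() {
            long now = clockMillis.getAsLong();
            if (now >= nextRunMillis) {
                runs++;
                nextRunMillis = now + periodMillis;
            }
        }
    }

    /**
     * Simulates `ticks` one-second ticks with a one-second task period.
     * At tick 6 the wall clock is stepped back by stepBackMillis
     * (a negative value steps it forward instead).
     * Returns {runs of the wall-clock task, runs of the monotonic task}.
     */
    static int[] simulate(long stepBackMillis, int ticks) {
        final long[] monotonic = {0L};          // simulated monotonic time in ms
        final long[] wallOffset = {1_000_000L}; // wall clock = monotonic + offset

        PeriodicTask wallTask = new PeriodicTask(() -> monotonic[0] + wallOffset[0], 1_000L);
        PeriodicTask monoTask = new PeriodicTask(() -> monotonic[0], 1_000L);

        for (int tick = 1; tick <= ticks; tick++) {
            monotonic[0] += 1_000L;
            if (tick == 6) {
                wallOffset[0] -= stepBackMillis; // the administrator resets the date
            }
            wallTask.poll();
            monoTask.poll();
        }
        return new int[] {wallTask.runs, monoTask.runs};
    }

    public static void main(String[] args) {
        int[] back = simulate(3_600_000L, 10);
        System.out.println("stepped back 1h:    wall=" + back[0] + " monotonic=" + back[1]);
        int[] forward = simulate(-3_600_000L, 10);
        System.out.println("stepped forward 1h: wall=" + forward[0] + " monotonic=" + forward[1]);
    }
}
```

In this simulation, stepping the wall clock back by an hour leaves the wall-clock task stuck at 5 runs while the monotonic task reaches 10; stepping it forward leaves both at 10. That mirrors the asymmetry described in the report, under the assumption that the affected tasks track deadlines against a clock that can move backward.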